Error in arrow file creation (.addGeneScoreMat): "HDF5. Dataset. Unable to initialize object"

dagarfield commented 3 years ago

Hi,

I have an issue that I suspect may be related to the recently fixed error (https://github.com/GreenleafLab/ArchR/issues/395) in saving tile matrices when there is no coverage (short scaffold). During arrow file creation, I get the following error:

2021-02-18 12:14:00 : ERROR Found in .addGeneScoreMat for (earlyDev : 1 of 1) LogFile = ArchRLogs/ArchR-createArrows-2078d60dd0313-Date-2021-02-18_Time-11-12-15.log <simpleError in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...): HDF5. Dataset. Unable to initialize object.>

The genome is pretty rough, and many of the scaffolds are quite small -- small enough that they have no reads in some datasets (fixed in 1.0.1). Alas, these small scaffolds contain some important genes, and so I'd prefer not to get rid of them.

Do you have any suggestions what I might do to either fix this or better nail down the issue?

Session info follows:

sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /fast/work/users/dgarfie_m/miniconda3/envs/ArchR/lib/libopenblasp-r0.3.12.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] ArchR_1.0.1 magrittr_2.0.1
[3] rhdf5_2.34.0 Matrix_1.3-2
[5] data.table_1.13.6 SummarizedExperiment_1.20.0 [7] Biobase_2.50.0 GenomicRanges_1.42.0
[9] GenomeInfoDb_1.26.2 IRanges_2.24.1
[11] S4Vectors_0.28.1 BiocGenerics_0.36.0
[13] MatrixGenerics_1.2.0 matrixStats_0.57.0
[15] ggplot2_3.3.3

loaded via a namespace (and not attached): [1] Rcpp_1.0.6 pillar_1.4.7 compiler_4.0.3
[4] XVector_0.30.0 rhdf5filters_1.2.0 bitops_1.0-6
[7] tools_4.0.3 zlibbioc_1.36.0 lattice_0.20-41
[10] lifecycle_0.2.0 tibble_3.0.5 gtable_0.3.0
[13] pkgconfig_2.0.3 rlang_0.4.10 DelayedArray_0.16.1
[16] DBI_1.1.1 GenomeInfoDbData_1.2.4 withr_2.4.1
[19] dplyr_1.0.3 generics_0.1.0 vctrs_0.3.6
[22] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2
[25] R6_2.5.0 Rhdf5lib_1.12.0 purrr_0.3.4
[28] scales_1.1.1 ellipsis_0.3.1 assertthat_0.2.1
[31] colorspace_2.0-0 RCurl_1.98-1.2 munsell_0.5.0
[34] crayon_1.3.4 Cairo_1.5-12.2
ArchR-createArrows-2078d60dd0313-Date-2021-02-18_Time-11-12-15.log

dagarfield commented 3 years ago

And if you set addGeneScoreMat = FALSE the arrow file appears to be created successfully (though the [empty] tmp directory remains). So there does indeed seem to be something "special" about that step in my highly fragmented genome.

One note that might (?) help diagnosis this -- because some scaffolds are small, there are a number of genes right at the edge of a scaffold such that extensions of the TSS region (and presumably extensions to the gene models too) end up out range (triggering a note about trimming). Could this contribute?

jgranja24 commented 3 years ago

Hi @dagarfield,

Something is a bit screwed up with the featureDF (nCols should be equal to 5)

2021-02-18 12:13:40 : earlyDev .addGeneScoreMat FeatureDF, Class = data.frame
earlyDev .addGeneScoreMat FeatureDF: nRows = 30892, nCols = 30897
        seqnames start   end strand name.LOC115919120 name.LOC115923689
1 NW_022144748.1 36453 54759      1      LOC115919120      LOC115923689
2 NW_022144751.1  1560  1066      2      LOC115919120      LOC115923689
3 NW_022144751.1  2712  2344      2      LOC115919120      LOC115923689

This error would cause your issue because there is no longer a "name" column to be written to the hdf5 file.

The error is because the symbol column in your Genes is a list see below

2021-02-18 12:04:43 : earlyDev .addGeneScoreMat geneRegions, Class = GRanges
GRanges object with 30892 ranges and 5 metadata columns:
                seqnames            ranges strand |      gene_id       symbol
                   <Rle>         <IRanges>  <Rle> |  <character>       **<list>**
      [1] NW_022144748.1       31453-54759      + | LOC115919120 LOC115919120
      [2] NW_022144751.1         1066-6560      - | LOC115923689 LOC115923689
      [3] NW_022144751.1         2344-7712      - | LOC115921875 LOC115921875
      [4] NW_022144751.1         3446-8853      - | LOC115922805 LOC115922805
      [5] NW_022144751.1        4540-14415      - | LOC115919121 LOC115919121
      ...            ...               ...    ... .          ...          ...
  [30888] NW_022145615.1 35137653-35153317      - | LOC115920560 LOC115920560
  [30889] NW_022145615.1 35189209-35227587      - | LOC115920599 LOC115920599
  [30890] NW_022145615.1 35197417-35218628      + | LOC115920600 LOC115920600
  [30891] NW_022145615.1 35209826-35222716      - | LOC105446019 LOC105446019
  [30892] NW_022145615.1 35217770-35232477      + |    LOC586979    LOC586979

This will cause a lot of issues since this needs to be a character vector. I added new safeguards to prevent this. See getGenes or whatever genes GRanges you are supplying to fix this.

dagarfield commented 3 years ago

(Facepalm) I’m surprised that list error made it through the TxDb construction. Thanks for the catch.

Do you expect any issues due to some genes being close to chromosome/scaffold ends (and this negative coordinates if you take a window around, say, the TSS)

On Mon 22. Feb 2021 at 05:20, jgranja24 notifications@github.com wrote:

Hi @dagarfield https://github.com/dagarfield,

Something is a bit screwed up with the featureDF (nCols should be equal to 5)

2021-02-18 12:13:40 : earlyDev .addGeneScoreMat FeatureDF, Class = data.frame earlyDev .addGeneScoreMat FeatureDF: nRows = 30892, nCols = 30897 seqnames start end strand name.LOC115919120 name.LOC115923689 1 NW_022144748.1 36453 54759 1 LOC115919120 LOC115923689 2 NW_022144751.1 1560 1066 2 LOC115919120 LOC115923689 3 NW_022144751.1 2712 2344 2 LOC115919120 LOC115923689

This error would cause your issue because there is no longer a "name" column to be written to the hdf5 file.

The error is because the symbol column in your Genes is a list see below

2021-02-18 12:04:43 : earlyDev .addGeneScoreMat geneRegions, Class = GRanges GRanges object with 30892 ranges and 5 metadata columns: seqnames ranges strand | gene_id symbol
| **** [1] NW_022144748.1 31453-54759 + | LOC115919120 LOC115919120 [2] NW_022144751.1 1066-6560 - | LOC115923689 LOC115923689 [3] NW_022144751.1 2344-7712 - | LOC115921875 LOC115921875 [4] NW_022144751.1 3446-8853 - | LOC115922805 LOC115922805 [5] NW_022144751.1 4540-14415 - | LOC115919121 LOC115919121 ... ... ... ... . ... ... [30888] NW_022145615.1 35137653-35153317 - | LOC115920560 LOC115920560 [30889] NW_022145615.1 35189209-35227587 - | LOC115920599 LOC115920599 [30890] NW_022145615.1 35197417-35218628 + | LOC115920600 LOC115920600 [30891] NW_022145615.1 35209826-35222716 - | LOC105446019 LOC105446019 [30892] NW_022145615.1 35217770-35232477 + | LOC586979 LOC586979 This will cause a lot of issues since this needs to be a character vector. I added new safeguards to prevent this. See getGenes or whatever genes GRanges you are supplying to fix this. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub , or unsubscribe . -- Sent from Gmail Mobile

rcorces commented 3 years ago

Do you expect any issues due to some genes being close to chromosome/scaffold ends (and this negative coordinates if you take a window around, say, the TSS)

There have been other instances where genes within 2 kb of scaffold ends had to be excluded from the transcript database (#288

Closing this since it seems resolved. Comment again if you need this re-opened.

dagarfield commented 3 years ago

There have been other instances where genes within 2 kb of scaffold ends had to be excluded from the transcript database (#288

Great, thanks for the tip. So far it seems to have worked fine (at least the arrow file with GeneScore matrix was created). But there do appear to be a few out of range features in there, presumably as a result of feature expansion on short scaffolds, e.g.

Warning message in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): “GRanges object contains 5 out-of-bound ranges located on sequences NW_022145461.1, NW_022145376.1, NW_022145380.1, NW_022145397.1, and NW_022144911.1. Note that ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not considered out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags of the underlying sequences). You can use trim() to trim these ranges. See ?trim,GenomicRanges-method for more information.”

If I run into trouble with this, I know what to do (drop those troublesome scaffolds), but I do wonder -- would it be possible to request as a feature that a trim step on the GenomoicsRange objects is added to the validity check? Or an option for the trimming to actually happen (rather than just printing a notice).

Thanks!

dagarfield commented 2 years ago

The status of this error seems to be solved (not getting the error in the current ArchR version), but rather getting warnings that can be ignored.

GreenleafLab / ArchR

Error in arrow file creation (.addGeneScoreMat): "HDF5. Dataset. Unable to initialize object" #538