charles-plessy / CAGEr

Mirror of Bioconductor's CAGEr package repository
https://bioconductor.org/packages/CAGEr
6 stars 4 forks source link

getCTSS: BiocParallel errors element index: 1 first error: too many positions in 'pos' #51

Closed pna059 closed 2 years ago

pna059 commented 2 years ago

Hi, I am running CAGEr using a (large) custom genome. Starting with the bam files, I am getting BiocParallel error already with the getCTSSes function. Could you help me to identify the problem? THANKS!

> getCTSS(ce, sequencingQualityThreshold = 10,
+         mappingQualityThreshold = 20, removeFirstG = FALSE,
+         correctSystematicG = FALSE, useMulticore = TRUE, nrCores = 8)

Reading in file: D4.A.bam...
    -> Filtering out low quality reads...
Error: BiocParallel errors
  element index: 1
  first error: too many positions in 'pos'
In addition: Warning message:
stop worker failed:
  attempt to select less than one element in OneIndex 

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /cvmfs/software.metacentrum.cz/spack1/software/intel-parallel-studio/linux-debian10-x86_64/cluster.2019.4-intel-iifs5g/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=cs_CZ.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=cs_CZ.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.44.2           AnnotationDbi_1.54.1             BSgenome.MorexV3.Gatersleben_3.0 BSgenome_1.60.0                 
 [5] rtracklayer_1.52.1               Biostrings_2.60.2                XVector_0.32.0                   CAGEr_1.34.0                    
 [9] MultiAssayExperiment_1.18.0      SummarizedExperiment_1.22.0      Biobase_2.52.0                   GenomicRanges_1.44.0            
[13] GenomeInfoDb_1.28.4              IRanges_2.26.0                   S4Vectors_0.30.2                 BiocGenerics_0.38.0  
[17] MatrixGenerics_1.4.3             matrixStats_0.61.0              

loaded via a namespace (and not attached):
 [1] nlme_3.1-153             bitops_1.0-7             bit64_4.0.5              filelock_1.0.2           progress_1.2.2          
 [6] httr_1.4.2               tools_4.1.1              utf8_1.2.2               R6_2.5.1                 vegan_2.5-7             
[11] KernSmooth_2.23-20       DBI_1.1.1                mgcv_1.8-38              colorspace_2.0-2         permute_0.9-5           
[16] tidyselect_1.1.1         prettyunits_1.1.1        curl_4.3.2               bit_4.0.4                compiler_4.1.1          
[21] xml2_1.3.2               DelayedArray_0.18.0      scales_1.1.1             rappdirs_0.3.3           digest_0.6.28           
[26] stringr_1.4.0            Rsamtools_2.8.0          stringdist_0.9.8         pkgconfig_2.0.3          dbplyr_2.1.1            
[31] fastmap_1.1.0            rlang_0.4.12             rstudioapi_0.13          RSQLite_2.2.8            VGAM_1.1-5              
[36] BiocIO_1.2.0             generics_0.1.0           BiocParallel_1.26.2      gtools_3.9.2             dplyr_1.0.7             
[41] RCurl_1.98-1.5           magrittr_2.0.1           GenomeInfoDbData_1.2.6   Matrix_1.3-4             Rcpp_1.0.7              
[46] munsell_0.5.0            fansi_0.5.0              lifecycle_1.0.1          stringi_1.7.5            yaml_2.2.1              
[51] MASS_7.3-54              zlibbioc_1.38.0          BiocFileCache_2.0.0      plyr_1.8.6               grid_4.1.1              
[56] formula.tools_1.7.1      blob_1.2.2               crayon_1.4.1             lattice_0.20-45          splines_4.1.1           
[61] hms_1.1.1                KEGGREST_1.32.0          beanplot_1.2             pillar_1.6.4             rjson_0.2.20            
[66] biomaRt_2.48.3           XML_3.99-0.8             glue_1.4.2               data.table_1.14.2        operator.tools_1.6.3    
[71] vctrs_0.3.8              png_0.1-7                gtable_0.3.0             purrr_0.3.4              reshape_0.8.8           
[76] assertthat_0.2.1         cachem_1.0.6             ggplot2_3.3.5            restfulr_0.0.13          tibble_3.1.5            
[81] som_0.3-5.1              GenomicAlignments_1.28.0 memoise_2.0.0            cluster_2.1.2            ellipsis_0.3.2   
charles-plessy commented 2 years ago

Thanks for your report; can you try the latest stable version ? We fixed a lot of bugs there... (v 2.0.2)

pna059 commented 2 years ago

The version 2.0.2 still does not allow correctSystematicG=TRUE

Error: BiocParallel errors
  element index: 1
  first error: correctSystematicG not supported yet for CAGEexp objects

but when FALSE, the bam files are read in, but the following functions complain.

A CAGEexp object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] tagCountMatrix: RangedSummarizedExperiment with 6473697 rows and 6 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save all data to files

> CTSStagCountSE(ce)
Error in CTSStagCountSE(ce) : 
  Could not find CTSS tag counts, see ‘?getCTSS’.
> CTSScoordinatesGR(ce)
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'rowRanges': Could not find CTSS tag counts, see ‘?getCTSS’.

sessionInfo()

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /cvmfs/software.metacentrum.cz/spack1/software/intel-parallel-studio/linux-debian10-x86_64/cluster.2019.4-intel-iifs5g/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=cs_CZ.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CAGEr_2.0.2                 MultiAssayExperiment_1.20.0 SummarizedExperiment_1.24.0
 [4] Biobase_2.54.0              GenomicRanges_1.46.1        GenomeInfoDb_1.30.0        
 [7] IRanges_2.28.0              S4Vectors_0.32.3            BiocGenerics_0.40.0        
[10] MatrixGenerics_1.6.0        matrixStats_0.61.0         

loaded via a namespace (and not attached):
 [1] VGAM_1.1-5                splines_4.1.1             DelayedMatrixStats_1.16.0 gtools_3.9.2             
 [5] assertthat_0.2.1          BSgenome_1.62.0           GenomeInfoDbData_1.2.7    Rsamtools_2.10.0         
 [9] yaml_2.2.1                pillar_1.6.4              lattice_0.20-45           glue_1.5.1               
[13] XVector_0.34.0            colorspace_2.0-2          Matrix_1.3-4              plyr_1.8.6               
[17] XML_3.99-0.8              pkgconfig_2.0.3           zlibbioc_1.40.0           purrr_0.3.4              
[21] scales_1.1.1              stringdist_0.9.8          BiocParallel_1.28.2       tibble_3.1.6             
[25] mgcv_1.8-38               generics_0.1.1            ggplot2_3.3.5             ellipsis_0.3.2           
[29] cachem_1.0.6              formula.tools_1.7.1       magrittr_2.0.1            crayon_1.4.2             
[33] memoise_2.0.1             fansi_0.5.0               operator.tools_1.6.3      nlme_3.1-153             
[37] MASS_7.3-54               vegan_2.5-7               tools_4.1.1               data.table_1.14.2        
[41] BiocIO_1.4.0              lifecycle_1.0.1           stringr_1.4.0             munsell_0.5.0            
[45] cluster_2.1.2             DelayedArray_0.20.0       Biostrings_2.62.0         som_0.3-5.1              
[49] compiler_4.1.1            rlang_0.4.12              grid_4.1.1                RCurl_1.98-1.5           
[53] rstudioapi_0.13           rjson_0.2.20              bitops_1.0-7              restfulr_0.0.13          
[57] gtable_0.3.0              DBI_1.1.1                 reshape2_1.4.4            R6_2.5.1                 
[61] GenomicAlignments_1.30.0  dplyr_1.0.7               rtracklayer_1.54.0        fastmap_1.1.0            
[65] utf8_1.2.2                KernSmooth_2.23-20        permute_0.9-5             stringi_1.7.6            
[69] parallel_4.1.1            Rcpp_1.0.7                vctrs_0.3.8               tidyselect_1.1.1         
[73] sparseMatrixStats_1.6.0  
charles-plessy commented 2 years ago

Version 2.1.2 is from the devel branch; I tested it when I uploaded, but it can break because of changes in the API of other packages. The release version (2.0.2) is stable and intended for broad use. Can you try with that one? If it also fails, can you prepare a reproducible example?

pna059 commented 2 years ago

Sorry, my mistake in the last message. I updated the comment to correct the version I am currently using and added a complete session Info.

pna059 commented 2 years ago

I think there might be a problem with the CAGEexp object

> str(ce)
Formal class 'CAGEexp' [package "CAGEr"] with 5 slots
  ..@ ExperimentList:Formal class 'ExperimentList' [package "MultiAssayExperiment"] with 4 slots
  .. .. ..@ listData       : list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ colData       :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : chr [1:6] "D4.A" "D4.B" "D24.A" "D24.B" ...
  .. .. ..@ nrows          : int 6
  .. .. ..@ listData       :List of 3
  .. .. .. ..$ inputFiles    : chr [1:6] "Result/BWA_HISAT2_mapped/IMorexV3//DAG4/D4.A.bam" "Result/BWA_HISAT2_mapped/IMorexV3//DAG4/D4.B.bam" "Result/BWA_HISAT2_mapped/IMorexV3//DAP24/D24.A.bam" "Result/BWA_HISAT2_mapped/IMorexV3//DAP24/D24.B.bam" ...
  .. .. .. ..$ inputFilesType: chr [1:6] "bam" "bam" "bam" "bam" ...
  .. .. .. ..$ sampleLabels  : chr [1:6] "D4.A" "D4.B" "D24.A" "D24.B" ...
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ sampleMap     :Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 0
  .. .. ..@ listData       : Named list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ drops         : list()
  ..@ metadata      :List of 1
  .. ..$ genomeName: chr "BSgenome.MorexV3.Gatersleben"

What would be the reproducible example? Our BSgenome and a subset of one of the BAM files?

charles-plessy commented 2 years ago

Yes, the best for me is one or two BAM files subsetted so that they contain only a few megabytes of data, plus your BSgenome package if it is needed to reproduce the bug (that is: if the bug happen only if genomeName is not NULL), plus the R commands to trigger the bug.

By the way, please note that in CAGEr 2.0 the output of a command needs to be explicitly assigned, like in: ce <- getCTSS(ce).

pna059 commented 2 years ago

Thank you for the "BTW"! Not explicitly assigning the output was my fault. I can move forward in the pipeline now :)

charles-plessy commented 2 years ago

Excellent; you can close this issue if the original problem is fixed. (Feel free to open new issues for new problems)