Aligner parameter doesn't stay assigned when calling qAlign on aligned bam files

GuidoBarzaghi commented 3 years ago

I'm not sure whether this is a desired behaviour or not, but I've noticed that calling qAlign as follows on aligned data results on a QuasRprj object which doesn't retain the aligner information

prj <- qAlign(sampleFile=QuasR_pointer, genome="BSgenome.Mmusculus.UCSC.mm10", aligner = "Rbowtie", paired="fr", bisulfite="undir", projectName="prj", checkOnly=F)

Later reassignment as prj@aligner = "Rbowtie" readily solves the issue anyway.

mbstadler commented 3 years ago

@GuidoBarzaghi I have tried to repruduce this, but could not:

library(QuasR)

td <- tempdir()
file.copy(system.file("extdata", package = "QuasR"), td, recursive=TRUE)

samplefile <- file.path(td, "extdata", "samples_chip_single.txt")
genomefile <- file.path(td, "extdata", "hg19sub.fa")

p1 <- qAlign(samplefile, genomefile)
p2 <- qAlign(samplefile, genomefile, bisulfite = "undir")

samplefile2 <- file.path(td, "extdata", "samples_rna_paired.txt")

p3 <- qAlign(samplefile2, genomefile, bisulfite = "undir")
p4 <- qAlign(samplefile2, genomefile, bisulfite = "undir", paired = "fr")

p1@aligner
p2@aligner
p3@aligner
p4@aligner

All four projects have the aligner slot set correctly. Could you provide more details what you mean by "doesn't retain the aligner information", ideally with a reproducible example code and session information?

My tests were done using the most recent release of QuasR:

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /tungstenfs/groups/gbioinfo/Appz/easybuild/software/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas_skylakex-r0.3.7.so

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] QuasR_1.30.0         Rbowtie_1.30.0       GenomicRanges_1.42.0 GenomeInfoDb_1.26.2  IRanges_2.24.1       S4Vectors_0.28.1
[7] BiocGenerics_0.36.0  RColorBrewer_1.1-2

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.2.0        Biobase_2.50.0              httr_1.4.2                  bit64_4.0.5
 [5] GenomicFiles_1.26.0         assertthat_0.2.1            askpass_1.1                 BiocManager_1.30.10
 [9] BiocFileCache_1.14.0        latticeExtra_0.6-29         blob_1.2.1                  BSgenome_1.58.0
[13] GenomeInfoDbData_1.2.4      Rsamtools_2.6.0             progress_1.2.2              pillar_1.4.7
[17] RSQLite_2.2.3               lattice_0.20-41             glue_1.4.2                  XVector_0.30.0
[21] Matrix_1.2-18               XML_3.99-0.5                pkgconfig_2.0.3             ShortRead_1.48.0
[25] biomaRt_2.46.1              zlibbioc_1.36.0             purrr_0.3.4                 jpeg_0.1-8.1
[29] BiocParallel_1.24.1         tibble_3.0.5                openssl_1.4.3               generics_0.1.0
[33] ellipsis_0.3.1              cachem_1.0.1                SummarizedExperiment_1.20.0 GenomicFeatures_1.42.1
[37] fortunes_1.5-4              magrittr_2.0.1              crayon_1.3.4                memoise_2.0.0
[41] xml2_1.3.2                  hwriter_1.3.2               tools_4.0.3                 prettyunits_1.1.1
[45] hms_1.0.0                   lifecycle_0.2.0             matrixStats_0.57.0          stringr_1.4.0
[49] DelayedArray_0.16.1         AnnotationDbi_1.52.0        Biostrings_2.58.0           compiler_4.0.3
[53] rlang_0.4.10                grid_4.0.3                  RCurl_1.98-1.2              VariantAnnotation_1.36.0
[57] rappdirs_0.3.1              bitops_1.0-6                DBI_1.1.1                   curl_4.3
[61] R6_2.5.0                    GenomicAlignments_1.26.0    dplyr_1.0.3                 rtracklayer_1.50.0
[65] fastmap_1.1.0               bit_4.0.4                   stringi_1.5.3               Rcpp_1.0.6
[69] vctrs_0.3.6                 png_0.1-7                   dbplyr_2.0.0                tidyselect_1.1.0

GuidoBarzaghi commented 3 years ago

Thanks a lot for your exceptionally quick reply. I did try to update QuasR to its latest version but that did not solve it.

This is how I align the data

prj <- qAlign(sampleFile=INPUT_FILE_NAME,
              genome="BSgenome.Mmusculus.UCSC.mm10",
              aligner = "Rbowtie",
              paired="fr",
              bisulfite="undir",
              projectName="prj",
              alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
              alignmentsDir="./",
              snpFile = SNPs,
              clObj = cl)

And this is how I define a prj for downstream analysis

prj <- qAlign(sampleFile=INPUT_FILE_NAME,
              genome="BSgenome.Mmusculus.UCSC.mm10",
              aligner = "Rbowtie",
              paired="fr",
              bisulfite="undir",
              projectName="prj",
              checkOnly=F)

Left like this prj@aligner equals to NA

sessionInfo()
R Under development (unstable) (2021-01-17 r79842)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /g/funcgen/gbcs/public/software/rstudio-server/R-devel-2021-01-19/lib64/R/lib/libRblas.so
LAPACK: /g/funcgen/gbcs/public/software/rstudio-server/R-devel-2021-01-19/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm10_1.4.0 BSgenome_1.59.2                    rtracklayer_1.51.4                 Biostrings_2.59.2                  XVector_0.31.1                     QuasR_1.31.0                       Rbowtie_1.31.6                    
 [8] GenomicRanges_1.43.3               GenomeInfoDb_1.27.3                IRanges_2.25.6                     S4Vectors_0.29.6                   BiocGenerics_0.37.0               

loaded via a namespace (and not attached):
 [1] fs_1.5.0                    bitops_1.0-6                matrixStats_0.57.0          usethis_2.0.0               devtools_2.3.2              bit64_4.0.5                 filelock_1.0.2              RColorBrewer_1.1-2         
 [9] progress_1.2.2              httr_1.4.2                  rprojroot_2.0.2             GenomicFiles_1.27.0         tools_4.1.0                 R6_2.5.0                    DBI_1.1.1                   withr_2.3.0                
[17] processx_3.4.5              tidyselect_1.1.0            prettyunits_1.1.1           bit_4.0.4                   curl_4.3                    compiler_4.1.0              cli_2.2.0                   Biobase_2.51.0             
[25] xml2_1.3.2                  desc_1.2.0                  DelayedArray_0.17.7         callr_3.5.1                 askpass_1.1                 rappdirs_0.3.1              stringr_1.4.0               digest_0.6.27              
[33] Rsamtools_2.7.0             rmarkdown_2.6               jpeg_0.1-8.1                pkgconfig_2.0.3             htmltools_0.5.0             sessioninfo_1.1.1           MatrixGenerics_1.3.0        dbplyr_2.0.0               
[41] rlang_0.4.10                rstudioapi_0.13             RSQLite_2.2.2               BiocIO_1.1.2                generics_0.1.0              hwriter_1.3.2               BiocParallel_1.25.2         dplyr_1.0.3                
[49] VariantAnnotation_1.37.1    RCurl_1.98-1.2              magrittr_2.0.1              GenomeInfoDbData_1.2.4      Matrix_1.3-2                Rcpp_1.0.5                  fansi_0.4.2                 lifecycle_0.2.0            
[57] stringi_1.5.3               yaml_2.2.1                  SummarizedExperiment_1.21.1 zlibbioc_1.37.0             pkgbuild_1.2.0              BiocFileCache_1.15.1        grid_4.1.0                  blob_1.2.1                 
[65] crayon_1.3.4                lattice_0.20-41             GenomicFeatures_1.43.3      hms_1.0.0                   ps_1.5.0                    knitr_1.30                  pillar_1.4.7                rjson_0.2.20               
[73] pkgload_1.1.0               biomaRt_2.47.1              XML_3.99-0.5                glue_1.4.2                  evaluate_0.14               ShortRead_1.49.1            latticeExtra_0.6-29         remotes_2.2.0              
[81] BiocManager_1.30.10         vctrs_0.3.6                 png_0.1-7                   testthat_3.0.1              openssl_1.4.3               purrr_0.3.4                 assertthat_0.2.1            xfun_0.20                  
[89] restfulr_0.0.13             tibble_3.0.5                GenomicAlignments_1.27.2    AnnotationDbi_1.53.0        memoise_1.1.0               ellipsis_0.3.1

mbstadler commented 3 years ago

Ok, I think I understand now. I think this is simply because you use two different ways to define prf. The idea is that you use the exact same qAlign-call also for downstream analysis. QuasR should identify pre-existing bam files and would not re-create them.

Could you check if prj@aligner is set correctly in the first case?

GuidoBarzaghi commented 3 years ago

Ah I see. You are exactly right, that worked. Thanks a lot for your support :) In that case though, when coming to QuasR directly for downstream analyses without knowing what exact call was used for alignments, the aligner must still be specified separately

mbstadler commented 3 years ago

I cannot run your code examples, for example I don't have the file INPUT_FILE_NAME. Could you provide a reproducible example?

My guess is that maybe the samples file contains bam files, and thus the aligner slot is not set.

GuidoBarzaghi commented 3 years ago

Exactly, sorry for not being clear.

When starting a QuasR project with aligned bam files for which I do not have the original fastqs, the aligner slot must be filled separately (that's all I wanted to understand).

Is allowing to define the aligner directly in the qAlign call in this case as well something you might contemplate?

mbstadler commented 3 years ago

Actually, what you experience is expected behaviour: For a "bamfile-project", QuasR has no way to validate the aligner, and therefore chooses not to set that field. The field is only set if QuasR can guarantee its correctness.

I think setting it from the qAlign(..., aligner) argument would often do the wrong thing - aligner defaults to "Rbowtie" and would thus set it to that value even if the bam files were generated using another aligner.

You can of course set the @aligner slot as you have done above, or preferentially, create your project using the original qAlign call.

GuidoBarzaghi commented 3 years ago

Thank you very much for the clarification

fmicompbio / QuasR

Aligner parameter doesn't stay assigned when calling qAlign on aligned bam files #22