Input genome file not readable => bambu does not support gzip'd input

nick-youngblut commented 1 month ago

My code:

bamba_ret = bambu(
    reads = bam_file, 
    annotations = gtf_file, 
    genome = fna_file, 
    quant = FALSE
)

The error:

Error in value[[3L]](cond): Input genome file not readable.Requires a FASTA or BSgenome name
Traceback:

1. bambu(reads = bam_file, annotations = gtf_file, genome = fna_file, 
 .     quant = FALSE)
2. bambu.processReads(reads, annotations, genomeSequence = genome, 
 .     readClass.outputDir = rcOutDir, yieldSize, bpParameters, 
 .     stranded, verbose, isoreParameters, trackReads = trackReads, 
 .     fusionMode = fusionMode, lowMemory = lowMemory)
3. checkInputSequence(genomeSequence)
4. tryCatch({
 .     if (.Platform$OS.type == "windows") {
 .         genomeSequence <- Biostrings::readDNAStringSet(genomeSequence)
 .         newlevels <- unlist(lapply(strsplit(names(genomeSequence), 
 .             " "), "[[", 1))
 .         names(genomeSequence) <- newlevels
 .     }
 .     else {
 .         indexFileExists <- file.exists(paste0(genomeSequence, 
 .             ".fai"))
 .         if (!indexFileExists) 
 .             indexFa(genomeSequence)
 .         genomeSequence <- FaFile(genomeSequence)
 .     }
 . }, error = function(cond) {
 .     stop("Input genome file not readable.", "Requires a FASTA or BSgenome name")
 . })
5. tryCatchList(expr, classes, parentenv, handlers)
6. tryCatchOne(expr, names, parentenv, handlers[[1L]])
7. value[[3L]](cond)
8. stop("Input genome file not readable.", "Requires a FASTA or BSgenome name")

If I uncompress the genome fasta file, there is no error. It would be helpful if bambu supported gzip'd input, given the potentially large size of the input files.

Also, the space (or line return) is missing in:

Error in value[[3L]](cond): Input genome file not readable.Requires a FASTA or BSgenome name

sessionInfo

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS/LAPACK: /home/nickyoungblut/miniforge3/envs/ont_10x/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] bambu_3.4.0                 BSgenome_1.70.1
 [3] rtracklayer_1.62.0          BiocIO_1.12.0
 [5] Biostrings_2.70.1           XVector_0.42.0
 [7] SummarizedExperiment_1.32.0 Biobase_2.62.0
 [9] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1
[11] IRanges_2.36.0              S4Vectors_0.40.2
[13] BiocGenerics_0.48.1         MatrixGenerics_1.14.0
[15] matrixStats_1.3.0

loaded via a namespace (and not attached):
 [1] KEGGREST_1.42.0          rjson_0.2.21             lattice_0.22-6
 [4] vctrs_0.6.5              tools_4.3.3              bitops_1.0-7
 [7] generics_0.1.3           curl_5.1.0               parallel_4.3.3
[10] tibble_3.2.1             fansi_1.0.6              AnnotationDbi_1.64.1
[13] RSQLite_2.3.7            blob_1.2.4               pkgconfig_2.0.3
[16] Matrix_1.6-5             data.table_1.15.2        dbplyr_2.5.0
[19] lifecycle_1.0.4          GenomeInfoDbData_1.2.11  compiler_4.3.3
[22] stringr_1.5.1            Rsamtools_2.18.0         progress_1.2.3
[25] codetools_0.2-20         RCurl_1.98-1.14          yaml_2.3.8
[28] tidyr_1.3.1              pillar_1.9.0             crayon_1.5.2
[31] BiocParallel_1.36.0      DelayedArray_0.28.0      cachem_1.0.8
[34] abind_1.4-5              tidyselect_1.2.1         digest_0.6.35
[37] stringi_1.8.4            purrr_1.0.2              dplyr_1.1.4
[40] restfulr_0.0.15          biomaRt_2.58.0           fastmap_1.1.1
[43] grid_4.3.3               cli_3.6.2                SparseArray_1.2.2
[46] magrittr_2.0.3           S4Arrays_1.2.0           GenomicFeatures_1.54.1
[49] utf8_1.2.4               XML_3.99-0.16.1          rappdirs_0.3.3
[52] filelock_1.0.3           prettyunits_1.2.0        xgboost_2.1.1.1
[55] bit64_4.0.5              httr_1.4.7               bit_4.0.5
[58] png_0.1-8                hms_1.1.3                memoise_2.0.1
[61] BiocFileCache_2.10.1     rlang_1.1.3              Rcpp_1.0.12
[64] glue_1.7.0               DBI_1.2.3                xml2_1.3.6
[67] jsonlite_1.8.8           R6_2.5.1                 GenomicAlignments_1.38.0
[70] zlibbioc_1.48.0

nick-youngblut commented 1 month ago

A minor typo in the README: annotations <- prepareAnnotation(gtf.file) should be annotations <- prepareAnnotations(gtf.file)

andredsim commented 1 month ago

Hi,

Thank you for reporting the typos in the documentation and error messages. I will have that fixed when we do our next update.

You should be able to provide fa.gz files for the genome, but on non-Windows machines you also need both the index (.fai) and the compressed index (.gzi) files. Unfortunately this is not yet written in the documentation, but I will add it. Could you let me know whether you had these files and, if not, try again with them and let me know if that works?

Kind Regards, Andre Sim

nick-youngblut commented 1 month ago

You should be able to provide fa.gz files for the genome

As you see from my post above, I can't use gzip-compressed fasta input on my Ubuntu 22.04.4 system.

You specifically stated "fa.gz files". Bambu doesn't support alternative (gzip'd) fasta file extensions (e.g., .fastq.gz or .fna.gz)?

andredsim commented 1 month ago

Bambu doesn't check the file extension, and since for our purposes .fa.gz, .fastq.gz and .fna.gz are all the same format, they should all work so long as the respective index files are in the same directory. So if you are using .fna.gz, there should also be a .fna.gz.fai and a .fna.gz.gzi in the directory. If you compressed your genome with bgzip, you can generate the index files with samtools faidx.

Below are the script I used to test it, the console output (warnings removed for clarity), and a listing of the directory where the .fna.gz is stored, so you can compare.

sample <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu")
fa.file <- "./Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fna.gz"
annotations <- readRDS(system.file("extdata", "annotationGranges_txdbGrch38_91_chr9_1_1000000.rds", package = "bambu"))
se = bambu(reads = sample, annotations = annotations, genome = fa.file)

--- Start generating read class files ---
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories",
package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cloud.r-project.org
Detected 3 warnings across the samples during read class construction. Access warnings with metadata(bambuOutput)$warnings
--- Start extending annotations ---
WARNING - Less than 50 TRUE or FALSE read classes for NDR precision stabilization.
NDR will be approximated as: (1 - Transcript Model Prediction Score)
Using a novel discovery rate (NDR) of: 0
WARNING - No novel transcripts meet the given thresholds. Try a higher NDR.
--- Start isoform quantification ---
--- Finished running Bambu ---

> list.files()
[1] "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fna.gz"    
[2] "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fna.gz.fai"
[3] "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fna.gz.gzi"
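
For readers who prefer to stay in R, the .fai and .gzi files in the listing above could probably also be produced with Rsamtools, which already appears in the traceback (indexFa, FaFile) and in the sessionInfo. The following is only a sketch under stated assumptions: prepare_bgzip_fasta and index_paths are hypothetical helper names, not bambu functions; the input is assumed to be an uncompressed FASTA; and Rsamtools::indexFa is expected, though not confirmed in this thread, to write the .gzi next to the .fai when given bgzip-compressed input. A file compressed with plain gzip rather than bgzip is not expected to be indexable this way.

```r
# Hypothetical helpers (not part of bambu): compress a FASTA with bgzip and
# build the index files that the comment above says must sit next to it.

# Paths of the index files expected alongside a compressed FASTA.
index_paths <- function(fasta_gz) {
    c(fai = paste0(fasta_gz, ".fai"), gzi = paste0(fasta_gz, ".gzi"))
}

# Assumes `fasta` is an UNCOMPRESSED FASTA file and Rsamtools is installed.
prepare_bgzip_fasta <- function(fasta, dest = paste0(fasta, ".gz")) {
    stopifnot(requireNamespace("Rsamtools", quietly = TRUE))
    # bgzip (block gzip) compression, which allows random access via an index
    Rsamtools::bgzip(fasta, dest = dest, overwrite = TRUE)
    # Build the FASTA index; assumed to also create the .gzi for bgzip'd input
    Rsamtools::indexFa(dest)
    index_paths(dest)
}

# Example call (file name is a placeholder):
# prepare_bgzip_fasta("genome.fna")
# list.files(pattern = "genome.fna")
```

After running it, the directory should be compared against the listing above; if the .gzi file is missing, samtools faidx on the bgzip'd file remains the route described in this comment.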

nick-youngblut commented 1 month ago

Thanks for the explanation. Maybe it would be best to add some input file checks to provide a more informative error message than "Input genome file not readable.Requires a FASTA or BSgenome name"? For instance: "Your input genome appears to be compressed; you must then provide the corresponding .gz.fai and .gz.gzi files".
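
A rough sketch of what such a check might look like, in base R only. check_genome_input is a hypothetical name and this is not how bambu currently behaves; the message wording is illustrative. It relies on two facts about the file formats: gzip files start with the bytes 1f 8b, and bgzip (BGZF) files additionally set the FEXTRA flag and carry a "BC" subfield at bytes 13-14 of the header.

```r
# Hypothetical pre-flight check (not a bambu function): returns "ok" or a
# message describing what is wrong with a genome FASTA path.
check_genome_input <- function(path) {
    if (!file.exists(path)) {
        return(paste0("Genome file not found: ", path))
    }
    # Read the first bytes without any automatic decompression
    con <- file(path, open = "rb", raw = TRUE)
    on.exit(close(con))
    header <- readBin(con, what = "raw", n = 14L)

    is_gzip <- length(header) >= 2L &&
        identical(header[1:2], as.raw(c(0x1f, 0x8b)))
    if (!is_gzip) {
        return("ok")  # treated as an uncompressed FASTA
    }

    # BGZF: FEXTRA flag set (bit 2 of byte 4) and 'B','C' at bytes 13-14
    is_bgzf <- length(header) >= 14L &&
        bitwAnd(as.integer(header[4]), 4L) != 0L &&
        identical(header[13:14], as.raw(c(0x42, 0x43)))
    if (!is_bgzf) {
        return(paste0("Input genome is gzip-compressed but not ",
                      "bgzip-compressed; recompress it with bgzip."))
    }

    index_files <- paste0(path, c(".fai", ".gzi"))
    missing <- index_files[!file.exists(index_files)]
    if (length(missing) > 0L) {
        return(paste0("Input genome is compressed; missing index file(s): ",
                      paste(missing, collapse = ", ")))
    }
    "ok"
}
```

Running check_genome_input(fna_file) before calling bambu() would distinguish the three cases discussed in this thread: a plain gzip file, a bgzip file without its index files, and a fully indexed bgzip file.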

GoekeLab / bambu, issue #446