Error between bam and genome before extending annotations in precise locations

BerrocalRubio333 commented 7 months ago

Hi,

I am trying to run bambu using (genome that was used to map the bams). I get this error:

[E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?) Error: BiocParallel errors 1 remote errors, element index: 1 0 unevaluated and other errors first remote error: Error in value[3L]: record 153004 (chr13:40333583-40333584) failed

Things I have checked: Running bambu with annotations = NULL gives me the same error The chr sizes are the same in the bams and the genome .fa The error location changes from bam to bam (there needs to be reads in the gene for there to be a problem). It is only in some chromosomes (chromosomes 13 or 14 give me similar errors, while other like chr 1 or 8 don't).

I was wondering if you could provide some indications of what the problem might be given the error and suggestions on how to fix it.

Best.

Miguel

andredsim commented 7 months ago

Hi Miguel,

I havn't encountered this error before so unfortunately I cannot give you a direct solution, but I am happy to try and troubleshoot it with you. Would you be able to post the script you used to run Bambu as well as the full console output when running it with verbose = TRUE. Are you running each chromosome as a different bam? How are the bams different from each other, are they replicates or something different? Judging from the error I believe it is occurring during the bam to granges conversation. Could you please also try grlist = bambu:::prepareDataFromBam("path/to/bam/file", verbose = TRUE) and see if you get the same error message?

Kind Regards, Andre Sim

BerrocalRubio333 commented 7 months ago

Hi Andre,

Thank you for your very fast and kind response.

I have run the original bam, which gave me the error. What I have done them is filter certain chromosomes from the bam using samtools view. In doing that, the chromosome that gave errors in the unfiltered bam still gives the exact same error, but other chromosomes don't and run perfectly. It seems like difference samples (cell types) struggle in similar chromosomes, and if a chromosome works, works for all ( I haven't tried every possible combination).

I have run the command you sent, and it run without mistakes, generating a Large CompressedGRangeList (4691147 elements, 233.2 MB).

Best.

Miguel

andredsim commented 7 months ago

Hi Miguel,

Interesting that that function ran smoothly, that suggests the error is elsewhere. Did you use the original bam while running that function? Given it isn't occurring there, I will really need the verbose console output to get a better idea of where the error is happening in the code. Let me know once you have that and I can continue trying to troubleshoot.

Kind Regards, Andre Sim

BerrocalRubio333 commented 7 months ago

Hi Andre,

Yes I am running the bam with all the chromosomes.

This was the mapping:

minimap2 -t 10 -2 -ax splice -uf -k 15 --MD --junc-bed ~/tools/minimap2-2.26_x64-linux/reference/human/gencode.v39.annotation.bed --secondary=no ~/tools/minimap2-2.26_x64-linux/reference/human/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta S01_R1.fq | samtools view -Sb | samtools sort - -o S01_R1.bam

And this is my console right now running with verbose: se_prog1_filt <- bambu(reads = bam_filtered, annotations = bambuAnnotations, genome = fa.file, verbose = TRUE) --- Start generating read class files --- 'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package = "BiocManager")' for details. Replacement repositories: CRAN: https://cran.rstudio.com/ S1_Q10_MAPQ60_filtered [E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?) Error: BiocParallel errors 1 remote errors, element index: 1 0 unevaluated and other errors first remote error: Error in value[3L]: record 138462 (chr13:40559035-40559036) failed file: C:/Users/MBERROCAL/OneDrive - The University of Melbourne/Desktop/Paper NRG1/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa In addition: Warning messages: 1: In bambu.processReadsByFile(bam.file = reads[bamFileName], genomeSequence = genomeSequence, : not all chromosomes present in reference annotations, annotations might be incomplete. Please compare objects on the same reference 2: In bambu.processReadsByFile(bam.file = reads[bamFileName], genomeSequence = genomeSequence, : 26 reads are mapped outside the provided genomic regions. These reads will be dropped. Check you are using the same genome used for the alignment

Thanks for the help

Best.

Miguel

andredsim commented 7 months ago

Hi Miguel,

Thanks for that. The error seems to be happening during the junction identification. I am currently guessing there must be an alignment that breaks an assumption we have made, specifically when it goes to get the bases of the junction.

I propose the following. Depending on the privacy of your data, are you able to send me one of the bam files from the chromosome where the error arises (such as chromosome 13?) and I can try replicate it on my end and hopefully debug it.

If you are unable to send me the data, or would like to try troubleshoot it simultaneously, would you be able to view this region in a genome browser along with the reads (chr13:40559035-40559036) and see if there are any strange alignments present. Also have you tried running minimap without providing the bed file. Depending on how long the mapping takes, it may be worth trying it without it and seeing if the error goes away.

Let me know what is possible on your end, Kind Regards, Andre Sim

BerrocalRubio333 commented 7 months ago

Hi Andre,

You are correct, the location is always in a junction, normally from what I have seen in the most 3' exon of the read (not sure if that info helps at all).

I have tried mapping without the bed file and I got the same error. However, if I remove the reads that map to the gene that gives the error (it was 2 genes for me in chr 13) the command runs. It seems to be oddly specific to certain genes/reads. Any ideas?

Best.

Miguel

andredsim commented 7 months ago

Hi Miguel,

This is very strange indeed. Were you able to send the bam file at all? Given the complexity of this issue it will probably be the quickest way for me to solve it. I think just the reads that map to those 2 genes should be sufficient to replicate the bug. If you need to remove sequence information for privacy reasons, you can convert it into a granges object first using prepareDataFromBam() on that subsetted bam and saving it as an .rds before sending it, which will remove any sequence information and just keep the alignment positions. However the bam would be preferred if there is no privacy issues. You can contact me directly at andre_sim[at]gis.a-star.edu.sg.

Ultimately I want to see the read structure of the read triggering this issue, are there many reads which have either a junction start or end next to chr13:40559035-40559036?

Kind Regards, Andre Sim

BerrocalRubio333 commented 7 months ago

Hi Andre, happy to send you a bam with one of the chromosomes that give me trouble so we can try and troubleshoot it there. Would you let me know what is the best way to transfer the file? My email is : m.berrocal@unimelb.edu.au

I have tried mapping the bam with a different genome version, and that changes the location of the error (kind of tells me the error comes from the interaction of how a read was mapped in the bam file with the reference genome given). If I use the same bam but provide a different genome to the one used to map, somehow it can bypass the error but finds a different error in a subsequent record and different location.

I am very intrigued, it is very odd. Look forward to hearing from you.

Best.

Miguel

BerrocalRubio333 commented 7 months ago

Hi Andre,

it seems like the error was coming from a problem with the versions of co-dependent packages. I solved the problem by creating a new environment from scratch, and it seems to be running ok now. Thank you for all the help!

Best.

Miguel

GoekeLab / bambu

Error between bam and genome before extending annotations in precise locations #423