Exception: Cannot read past the end of the alignment's sequence

100kristine commented 10 years ago

Hi,

I've been hitting this error on a couple of bam files that I have tried. I attached a screenshot of the area causing the error in igv viewer. Input and error are below. The file already has read groups added. Thank you for any help that you can provide!

Kristine

freebayes -b ../../vcftesting/bamaddrg_LCC23.bam -v ../../vcftesting/freebayes_with_lastexons.vcf -f ../../hg19param/ucsc.hg19.fasta -t ../../hg19param/last_exons.bed Exception: Cannot read past the end of the alignment's sequence. CTTTAAAGATACAAAAAACACGTGTCTTCTGTGGAGCTCTGAGAACAGGACTCCAGCAAAGCACTTTTCAGCCTTGTGGTCTTCAAGCATTTCCAAGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCT GTGACATTGGAAAGAAAAAAAATTAAGACAGATTTGGGTCTCCATGAGCAGAGAGTATTCTTTCCCTTAGTTCTGGGTAGCATAGTGCCCAAGGAAACCTAATCAGTACAGTGGGAAACCAAATCTATTCCAAATCATTACCCATAAGCAATCCTATGGACTCTCTGAGGTTTGACAGAGTCAGCGGCCCTCTCCAATAAATGTGTTTTTCTATAATAATGTGCAGTGTGAGTAATTTGATGTTTTTGACAAATATTTACCCAGCGTTTTATGAAGGAGAAGGTTTGGGGGGCAAGCTGAAGGAGCTCTGACTGCCTTTTGGAGGAGTGTGAGTGGCTTACCTT chr12:9220338 Abort trap: 6

screen shot 2013-12-04 at 1 24 43 pm

ekg commented 10 years ago

Which version of freebayes are you running? Which aligner did you use?

On Wed, Dec 4, 2013 at 9:27 PM, 100kristine notifications@github.comwrote:

Hi,

I've been hitting this error on a couple of bam files that I have tried. I attached a screenshot of the area causing the error in igv viewer. Input and error are below. The file already has read groups added. Thank you for any help that you can provide!

Kristine

freebayes -b ../../vcftesting/bamaddrg_LCC23.bam -v ../../vcftesting/freebayes_with_lastexons.vcf -f ../../hg19param/ucsc.hg19.fasta -t ../../hg19param/last_exons.bed Exception: Cannot read past the end of the alignment's sequence.

CTTTAAAGATACAAAAAACACGTGTCTTCTGTGGAGCTCTGAGAACAGGACTCCAGCAAAGCACTTTTCAGCCTTGTGGTCTTCAAGCATTTCCAAGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCT

GTGACATTGGAAAGAAAAAAAATTAAGACAGATTTGGGTCTCCATGAGCAGAGAGTATTCTTTCCCTTAGTTCTGGGTAGCATAGTGCCCAAGGAAACCTAATCAGTACAGTGGGAAACCAAATCTATTCCAAATCATTACCCATAAGCAATCCTATGGACTCTCTGAGGTTTGACAGAGTCAGCGGCCCTCTCCAATAAATGTGTTTTTCTATAATAATGTGCAGTGTGAGTAATTTGATGTTTTTGACAAATATTTACCCAGCGTTTTATGAAGGAGAAGGTTTGGGGGGCAAGCTGAAGGAGCTCTGACTGCCTTTTGGAGGAGTGTGAGTGGCTTACCTT chr12:9220338 Abort trap: 6

[image: screen shot 2013-12-04 at 1 24 43 pm]https://f.cloud.github.com/assets/5431793/1677614/8a8a171c-5d2a-11e3-933b-22f957f065fe.png

— Reply to this email directly or view it on GitHubhttps://github.com/ekg/freebayes/issues/58 .

100kristine commented 10 years ago

I'm using version v9.9.2-27-g5d5b8ac. Someone else created the bam that I am looking at, but I believe that they used Tophat.

mjafin commented 10 years ago

I'm getting this error with all of my RNA-seq samples too. I believe FreeBayes chokes on 'N's in the CIGAR string (splice junctions). I wonder how these are interpreted in the code?

ekg commented 10 years ago

I try to handle them pet the spec but I should find some test data to cause the error here to fix the problem. If you can share your failing case please email me! On Mar 10, 2014 6:39 AM, "mjafin" notifications@github.com wrote:

I'm getting this error with all of my RNA-seq samples too. I believe FreeBayes chokes on 'N's in the CIGAR string (splice junctions). I wonder how these are interpreted in the code?

Reply to this email directly or view it on GitHubhttps://github.com/ekg/freebayes/issues/58#issuecomment-37169709 .

mjafin commented 10 years ago

Hi Erik, Many thanks for the prompt reply - I'm just throwing guesses around here obviously! I'll email you an example sam file with syntax to how I was attempting to use FreeBayes..

mjafin commented 10 years ago

@freemao FreeBayes doesn't dig splice junctions. You could split the reads with splice junctions (N in CIGAR) using either Gatk 3.0 or, if you're adventurous, try my python alternative: https://github.com/mjafin/splitNreads

ekg commented 10 years ago

Hi Miika,

Would you elaborate a bit. If possible I'd like to resolve this within freebayes itself.

Erik

On Wed, Jun 11, 2014 at 1:47 PM, Miika Ahdesmaki notifications@github.com wrote:

@freemao https://github.com/freemao FreeBayes doesn't dig splice junctions. You could split the reads with splice junctionsc (N in CIGAR) using either Gatk 3.0 or, if you're adventurous, try my python alternative: https://github.com/mjafin/splitNreads

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-45775411.

mjafin commented 10 years ago

@ekg Hi Erik, As mentioned above, I sent you an example .sam file in email on March 10th that you can use to reproduce this error. It seems to be a more complex CIGAR string actually than just a splice junction in 'v0.9.14-15-gc6f49c0' that causes the error. Here's one: 16M1D6M115N28M

Let me know if I can be of further help.

Cheers, Miika

mjafin commented 9 years ago

@ekg I wonder if you have any updates on this by any chance?

holtgrewe commented 9 years ago

:+1: I'm getting this problem as well.

inti commented 9 years ago

I am getting this as well with spliced RNA-seq alignments

visze commented 9 years ago

+1

ekg commented 9 years ago

Would one of you please send me a test case? Ideally something small I could add to the tests. Sorry for the trouble but this is the easiest way to resolve things! On Mar 27, 2015 1:45 PM, "Max" notifications@github.com wrote:

+1

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-86944040.

mjafin commented 9 years ago

@ekg I sent you an example file about a year ago (see above). If it went missing can you please contact me and I'll resend it?

ekg commented 9 years ago

Ah! please resend! On Mar 27, 2015 2:19 PM, "Miika Ahdesmaki" notifications@github.com wrote:

@ekg https://github.com/ekg I sent you an example file about a year ago (see above). If it went missing can you please contact me and I'll resend it?

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-86954182.

mjafin commented 9 years ago

@ekg I've just emailed you a 5MB bam file to your bc.edu email address (from my AZ email). Can you please let me know if you're not in receipt of the email in the next few hours. I couldn't reproduce the same exact error with the latest freebayes I'm using but it's probably related (different error output slightly).

inti commented 9 years ago

Digging a bit more. On my case it was related to spliced alignments produced by HISAT. I do not think the software is an issue but the fact that the cigar string had Ns. I had to use the GATK tool to split reads with Ns on the cigar string and it is now working fine.

I reckon I had this before when working with very small contigs ….

regards, Inti

On Mar 27, 2015, at 11:15, Erik Garrison notifications@github.com wrote:

Would one of you please send me a test case? Ideally something small I could add to the tests. Sorry for the trouble but this is the easiest way to resolve things! On Mar 27, 2015 1:45 PM, "Max" notifications@github.com wrote:

+1

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-86944040.

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-86953160.

inti commented 9 years ago

Is this solved? The GATK tool SplitNCigarReads is slow and duplicating a bam file for large projects is not very fun on top of the time to write the file.

On my case I am calling twice from the bam file, so streaming does not help a great deal since I would need to reformat twice the original bam file to call with freebayes.

:(

inti commented 9 years ago

Hi, This still bugs me ... sorry

Would it be safe to replace the Ns for a deletion, that is code introns as deletions? for example 24S8M8401N18M into 24S8M8401D18M and then filter out the deletions matching exactly the intro coordinates? when working with RNA-seq one could restrict calling to intervals with coverage > 1 as calculated with the bedtools -bg -split command and potentially avoid the calls of introns as deletions.

Would FreeBayes behave as expected after recording the cigar string like this? for example for variants calling and haplotyping exonic variants with RNA-seq?

@mjafin @visze have you solve this?

mjafin commented 9 years ago

@inti I've personally given up on anything and everything except VarDict, which happily processes any RNA-seq data I throw at it.

inti commented 9 years ago

The suggestion above actually seems to work fine, for example using

sambamba view -L covereged_regions.bed -h sorted.bam | awk -v OFS='\t' '{{ gsub(/N/,"D", $6); print }}' | sambamba view --format bam -S /dev/stdin | freebayes -f ref_genome.fa --stdin

which will produce variant calls and indels for the introns by knowing which region where covered by the reads, as with bedtools -bg -split, that it should be possible to get rid of the intros=indels quite easily.

I do not see why this would affect the variant calling on bases just next to the exon-intro boundary more than the GATK tool, but I would be great if @ekg could confirm if he thinks the same.

Thanks a lot for pointing me to VarDict, i will have a look right now.

EDIT: it seems VarDict does not do multisample variant calling. I tend to work with very polymorphic genomes and with little previous genetic variation resources, so I feel I need multiple sample calling to improve sensitivity and chances to detect variation ...

travc commented 9 years ago

Yep, freebayes is choking on my tophat bams as well... RNA-seq friendly freebayes would be a great thing.

If you only need SNPs...

Hacking the bam to change 'N' to 'D' in the CIGAR strings appears to work for me too.

using @inti suggested sambamba+awk line took 198 sec (without the -L covered_regions.bed part, since I'll deal with that sort of filtering at the vcf level): ./sambamba view -h F1.bam | awk -v OFS='\t' '{{ gsub(/N/,"D", $6); print }}' | ./sambamba view --format bam -S /dev/stdin > F1_hack1.bam
I also tried using samtools (v1.0.1) + awk, which took 773 sec (sambamba really is faster): ./samtools view -h F1.bam | awk -v OFS='\t' '{{ gsub(/N/,"D", $6); print }}' | ./samtools view -hb - > F1_hack2.bam
splitNreads from @mjafin is rather slow (still running), but I'm guessing it is a bit more robust. calmd appears to be the bottleneck. ./samtools view -h F1.bam | ./splitNreads.py | sed 's/RG:A:/RG:Z:/g' | ./samtools calmd -Su - ref.fa 2> /dev/null | ./samtools sort -T _sorted_ -o F1_splitNreads.bam (samtools calmd produces a lot of warnings, so redirecting them to /dev/null)

This is definitely a hackish workaround since splices will be reported as deletions, but that is OK is some cases (eg: only care about snps). After filtering with vcfallelicprimitives, vt normalize -r <ref>, vcfsnps it looks fine as far as I can tell.

Note: I've only tried this on a single sample so far, but I'm going to do multisample and will report back if I see any problems.

UPDATE: I've hit only one minor snag. Out of my entire dataset, freebayes generated one variant which causes vcfallelicprimitives problems (it hangs). A 19kb long intron with a variant right on the edge... so pretty understandable (and not difficult to work around). https://github.com/ekg/vcflib/issues/94

inti commented 9 years ago

I have been working with VarDict and it seems to do a good job so far. I have been planning a better workaround. What I have in mind is to write a script that changes the N for D on the cigar string, it stores the information on the genomic position of the Ns and then uses this position to filter out the introns called and deletions of the final VCF. I think this would make the approach cleaner in the sense that I reduce the chances of confounding downstream a true deletion with intron marked as a deletion.

I am not sure when I will get time for this and hopefully soon ... any comments on the idea are welcomed.

travc commented 9 years ago

@inti ... The difficulty I see is dealing with when freebayes combines "CIGAR N to D recoded" introns into adjacent variations. It might be easier to just fixup freebayes itself.

Since freebayes doesn't appear to have any problem handling introns as deletions, maybe it can be made to just treat Ns as Ds except with respect to merging them with other variations. Introns would then come out as deletions where the CIGAR is all Ns (or N#). That is trivial to filter and makes conceptual sense to me. I've never looked at the freebayes code myself though... so I have no idea how hard it would be to implement.

PS: VarDict looks like it is pretty use-case specific. It doesn't appear to be appropriate for the sort of multisample population genomics I work on.

inti commented 9 years ago

Yes, a suitable change on the freebayes code would be way better. However, it may be easier to work on filtering out the freebayes output and potentially faster to develop a good practice than tinkering with the freebayes code. I have not looked at it myself and I am not good at C so steep uphills to me to contribute on that.

Really, my idea was to keep track of the deletions in the output of the VCF that are supported by Ns on the original CIGAR strings. It is not much of an insight, I recognise.

On Jun 1, 2015, at 15:58, Travis Collier notifications@github.com wrote:

@inti https://github.com/inti ... The difficulty I see is dealing with when freebayes combines "CIGAR N to D recoded" introns into adjacent variations. It might be easier to just fixup freebayes itself.

Since freebayes doesn't appear to have any problem handling introns as deletions, maybe it can be made to just treat Ns as Ds except with respect to merging them with other variations. Introns would then come out as deletions where the CIGAR is all Ns (or N#). That is trivial to filter and makes conceptual sense to me. I've never looked at the freebayes code myself though... so I have no idea how hard it would be to implement.

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-107671406.

ekg commented 9 years ago

I think this would be a very easy change to freebayes. There is a single loop in the alleleparser that handles the cigar. Apparently splices aren't handled correctly... But my problem is that I've not ever worked with RNAseq data. So I have no way to test. On Jun 1, 2015 9:28 PM, "Inti Pedroso" notifications@github.com wrote:

Yes, a suitable change on the freebayes code would be way better. However, it may be easier to work on filtering out the freebayes output and potentially faster to develop a good practice than tinkering with the freebayes code. I have not looked at it myself and I am not good at C so steep uphills to me to contribute on that.

Really, my idea was to keep track of the deletions in the output of the VCF that are supported by Ns on the original CIGAR strings. It is not much of an insight, I recognise.

On Jun 1, 2015, at 15:58, Travis Collier notifications@github.com wrote:

@inti https://github.com/inti ... The difficulty I see is dealing with when freebayes combines "CIGAR N to D recoded" introns into adjacent variations. It might be easier to just fixup freebayes itself.

Since freebayes doesn't appear to have any problem handling introns as deletions, maybe it can be made to just treat Ns as Ds except with respect to merging them with other variations. Introns would then come out as deletions where the CIGAR is all Ns (or N#). That is trivial to filter and makes conceptual sense to me. I've never looked at the freebayes code myself though... so I have no idea how hard it would be to implement.

— Reply to this email directly or view it on GitHub < https://github.com/ekg/freebayes/issues/58#issuecomment-107671406>.

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-107678471.

travc commented 9 years ago

@ekg I'm no RNAseq expert either, but do have some data. Here is a little (7Kbp) example: https://popi.ucdavis.edu/~travc/tmp/RNAseq_bams_for_freebayes/ There is a bam (extracted from a tophat mapping) and the relevant part of the ref. I also included a gtf for that little region.

ETA: I also included a vcf of that region produced by running freebayes on that sample (+14 others) with the "N to D" hack. The introns are pretty obvious.

peterjc commented 8 years ago

Could this issue be retitled to something like "SAM/BAM CIGAR N operator for skipping over introns not supported"?

agata-sm commented 8 years ago

Hi, I am aware that this is a couple of months old thread, have you found any solution? I have just seen this warning, followed by a core dump when analysing RNA-seq data. I am using version v0.9.21-26-gbfd9832-dirty

ekg commented 8 years ago

Would you confirm this to be an issue on the current build? On Feb 2, 2016 8:54 AM, "agata-sm" notifications@github.com wrote:

Hi, I am aware that this is a couple of months old thread, have you found any solution? I have just seen this warning, followed by a core dump when analysing RNA-seq data. I am using version v0.9.21-26-gbfd9832-dirty

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-178454150.

agata-sm commented 8 years ago

Hey, thanks for the quick reply! I have no idea, haven't used any older version. I can test 0.8.9, 0.9.21, 0.9.4, 0.9.8 (these are installed on the cluster I am using). Depending on the job overload, I should be ready within a few hours.

ekg commented 8 years ago

I think the problem you are encountering is fixed in a very recent version. Would you please ask for a recent version to be installed and let me know if it is resolved? On Feb 2, 2016 9:00 AM, "agata-sm" notifications@github.com wrote:

Hey, thanks for the quick reply! I have no idea, haven't used any older version. I can test 0.8.9, 0.9.21, 0.9.4, 0.9.8 (these are installed on the cluster I am using). Depending on the job overload, I should be ready within a few hours.

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/58#issuecomment-178455799.

agata-sm commented 8 years ago

Thanks! I've requested installation of the newest version, it may take some time though. I will post the result of my test here.

colindaven commented 8 years ago

The current version v1.0.2-15-g357f175 seems to be working just fine with RNA-seq BAMs created with the aligner STAR.

Thanks to all for reporting, discussing and Erik for fixing this.

winni2k commented 3 years ago

It looks like this issue can be closed?

freebayes / freebayes

Exception: Cannot read past the end of the alignment's sequence #58