Open AlexanderDilthey opened 7 years ago
Thanks for submitting this. I've been able to replicate the issue with the sample code you have provided, and will investigate further.
Hi @rishidev, any news on that bug by any chance? Solving it would really help dealing with long, e.g. Nanopore, reads.
Hi @AlexanderDilthey, apologies, I have moved on to a different project now, but this is the last thing I have to fix. The SAM and CRAM behaviour are quite different in the code. I think I've identified where the issue is occurring, but need further clarification to see what the fix is supposed to be.
This seems to be around the CIGAR value. The SAM/BAM contain the sequence themselves so don't have the issue. The CRAM reads reference a CIGAR string with a P in it which is 'a silent deletion from the padded reference'. The padded_alignment call looks like it is returning a padded value but the string is being compared to something that hasn't analysed the P when it's chomping out letters, but I want to go through the spec to confirm that making it do so is the correct decision.
Unfortunately, I'm not able to replicate this, I get (v2.9 with htslib 1.7):
[E::cram_get_ref] Failed to populate reference for id 20 [E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761 [E::cram_next_slice] Failure to decode slice
with 2.7 and/or htslib 1.5 is the same.
@AlexanderDilthey, @rishidev can you confirm you still get the same invalid output and warnings? @AlexanderDilthey is this still relevant for you?
@avullo Definitely still relevant - I think the pairwise alignment feature is one of the most useful of all! I won't be able to check whether it reproduces with 1.7 immediately, but it's on my list!
Thanks @AlexanderDilthey for letting me know. I'm gonna try with 1.5, but can you tell me your set up (OS/Bio::DB::HTS/htslib version)?
Forgot to mention, https://github.com/Ensembl/Bio-DB-HTS/issues/54 is resolved.
From the module's README:
This is probably an indication that the CRAM Reference archive is unavailable.
@avullo, I'm sure you're already aware of this but check REF_PATH and REF_CACHE env values as well as any proxy settings. If they are correct the last thing it could be is the ref server is down at ensembl.
http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES
@keiranmraine thanks for the trust but you shouldn't be sure: I'm not (yet) well versed with these tools; I was asked to contribute since I have XS experience ;)
ENA you mean @keiranmraine. Ensembl doesn't host the CRAM reference registry
On 7 Mar 2018, at 12:08, Keiran Raine notifications@github.com wrote:
@avullo, I'm sure you're already aware of this but check REF_PATH and REF_CACHE env values as well as any proxy settings. If they are correct the last thing it could be is the ref server is down at ensembl.
http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
Sorry yes, ENA at EBI.
Hi @AlexanderDilthey, little but not negligible progress so far.
I and @rishidev managed to isolate the problem and we think fixing would require a reimplementation of the way the reference sequence is reconstructed using the CIGAR and MD strings together (borrowed from the Bio::DB::Sam ancestor module). This is not entirely trivial so bear with me until I properly attack the problem.
But you have a workaround: just instantiate the HTS instance passing the -force_refseq option. This way, it will use the indexed fasta file so you get the correct reference properly aligned to the read.
Let me know if that suits you for your immediate needs. I'm going to address the problem anyway.
Hi @avullo,
It seems that I now have the opposite problem when using -force_refseq.
I instantiate with
my $sam = Bio::DB::HTS->new(-fasta => $reference, -bam => $BAM, -force_refseq => 1);
... and then later retrieve the alignment via:
my ($ref, $matches, $query) = $inputAlignment->padded_alignment;
... then $query is string of Ns.
Hello @avullo, @rishidev
I'm noticing the same issue discussed here by the OP, @AlexanderDilthey
This function doesn't work always. It's particularly a problem with secondary/supplementary alignments
(1) Has this issue been resolved? I haven't tried the code posted above in April 2017, but it appears to be a reproducible example---does this code still not work?
(2) It's somewhat urgent---if you're willing to walk me through the issue at hand, I could try to write up a patch.
If it's easier, we could discuss this via e-mail. Thanks, Evan
HI Evan,
it's been a pretty intense period doing other things under deadlines. The newest code doesn't address this issue, so it's not resolved but is on the spotlight.
Before walking through this, I need to recollect thoughts that I had when I first looked at it. I will try to allocate some time next week.
When would you absolutely need it solved?
cheers
Alessandro
Hi @avullo
I'm working on using this code now. If it's easier, please collect your notes on this issue, and we can discuss this via e-mail. I will CC @AlexanderDilthey
I'm very happy to help patching this up.
Thanks, Evan
Ok @evanbiederstedt
from what I vaguely recollect, the problem is that the code doesn't correctly reconstruct the sequence in the absence of the reference, that's why the problem occurs for CRAM but not in the other cases.
Hi @avullo
I've reproduced the example from @AlexanderDilthey as follows:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Bio::DB::HTS;
my $referenceFasta = 'GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa';
my $sam = Bio::DB::HTS->new(-fasta => $referenceFasta, -bam => 'testCase.cram');
my $alignment_iterator = $sam->features(-iterator => 1);
while(my $alignment = $alignment_iterator->next_seq)
{
my ($ref,$matches,$query) = $alignment->padded_alignment;
unless(length($ref) == length($query))
{
warn Dumper("Alignment length mismatch", length($ref), length($query), $alignment->query->name);
next;
}
}
It is true that the issue exists for the CRAM @AlexanderDilthey provided, i.e. here's the error:
[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice
However, the bug is not CRAM-dependent; it exists for BAM as well.
That's the error I was getting when initially trying to reproduce the error, but then I've learned you need to set REF_PATH env variable, as pointed out by @keiranmraine, see http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES. When you properly set the above, you'll get the warnings on the CRAM file, but not on BAM.
@avullo
Thanks for the reply.
I'm a bit confused: in order to resolve the issue raised above, you set the environment variable REF_PATH
to the reference FASTA?
that is
export REF_PATH=GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
Hi @avullo
Apologies if you're busy. Just to clarify, by this statement
When you properly set the above, you'll get the warnings on the CRAM file, but not on BAM.
do you mean that if you set the REF_PATH
variable, the operation on the CRAM throws warnings but works correctly, or throws warnings and doesn't work?
Not a problem @evanbiederstedt
I meant, you don't get the error: [E::cram_get_ref] Failed to populate reference for id 20 [E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761 [E::cram_next_slice] Failure to decode slice
but you do get the warning originally reported by @AlexanderDilthey, which point to the existence of bugs.
Hi @avullo
CC @AlexanderDilthey
I see what you mean now. Just to clarify for future reference:
When the variable REF_PATH
is not defined, here is the output I see, using the script testCram.pl
shown here: https://github.com/Ensembl/Bio-DB-HTS/issues/40#issuecomment-415804123
$ perl testCram.pl
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
$VAR1 = 'Alignment length mismatch';
$VAR2 = 3071526;
$VAR3 = 3074241;
$VAR4 = 'NA19240.gi|939191008|gb|LKPB01001610.1|';
When you do set the variable, here is the output:
$ export REF_PATH=GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
$ echo $REF_PATH
## GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
$ perl testCram.pl
[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice
It's the other way around for me, but I use this: $ echo $REF_PATH /home/avullo/.cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s
Do you still get the error on not CRAM?
Hi @avullo
Thanks for the help. Perhaps this is an elementary question (and somewhat outside this github issue), but I'm still deeply confused how MD5 checksums solve this issue. Is this is fundamental property of the CRAM format?
I'm used to MD5 checksums in order to check whether something has downloaded correctly...
The md5 of each chr seq is used to store a reference to it in the header of the BAM/CRAM file. This ensures that the sequence being used to reconstruct the read seq is actually the correct one, and not a different build or something similar.
The /.../.cache/%2s/%2s/%s
version is a pre-populated cache. My understanding is that you cannot use a file for REF_PATH/REF_CACHE. A file has to be provided as a parameter to the hts library.
http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES
Hi @keiranmraine
Thanks for the help! I appreciate the explanation, and I think I understand now how this works.
Hi,
The padded_alignment() function sometimes produces invalid output, in that the strings that specify the pairwise alignment are not of equal length.
Interestingly, this only happens (on the same alignment) when using CRAM format -- BAM and SAM are fine.
I also get these warnings:
substr outside of string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307. Use of uninitialized value in concatenation (.) or string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307.
Files to reproduce this error:
Reference genome GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa: https://gembox.cbcb.umd.edu/shared/graph_hackathon/reference/GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
CRAM/SAM with one test alignment: https://www.dropbox.com/s/2dbb413ulfrhpma/testCase.zip?dl=0
Code:
use strict; use warnings; use Data::Dumper; use Bio::DB::HTS;
my $referenceFasta = 'GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa'; my $sam = Bio::DB::HTS->new(-fasta => $referenceFasta, -bam => 'testCase.cram');
my $alignment_iterator = $sam->features(-iterator => 1);
while(my $alignment = $alignment_iterator->next_seq) { my ($ref,$matches,$query) = $alignment->padded_alignment; unless(length($ref) == length($query)) { warn Dumper("Alignment length mismatch", length($ref), length($query), $alignment->query->name); next; }
}