Ensembl / Bio-DB-HTS

Git repo for Bio::DB::HTS module on CPAN, providing Perl links into HTSlib
Apache License 2.0
24 stars 16 forks source link

padded_alignment() produces invalid output #40

Open AlexanderDilthey opened 7 years ago

AlexanderDilthey commented 7 years ago

Hi,

The padded_alignment() function sometimes produces invalid output, in that the strings that specify the pairwise alignment are not of equal length.

Interestingly, this only happens (on the same alignment) when using CRAM format -- BAM and SAM are fine.

I also get these warnings:

substr outside of string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307. Use of uninitialized value in concatenation (.) or string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307.

Files to reproduce this error:

Reference genome GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa: https://gembox.cbcb.umd.edu/shared/graph_hackathon/reference/GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa

CRAM/SAM with one test alignment: https://www.dropbox.com/s/2dbb413ulfrhpma/testCase.zip?dl=0

Code:

use strict; use warnings; use Data::Dumper; use Bio::DB::HTS;

my $referenceFasta = 'GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa'; my $sam = Bio::DB::HTS->new(-fasta => $referenceFasta, -bam => 'testCase.cram');

my $alignment_iterator = $sam->features(-iterator => 1);
while(my $alignment = $alignment_iterator->next_seq) { my ($ref,$matches,$query) = $alignment->padded_alignment; unless(length($ref) == length($query)) { warn Dumper("Alignment length mismatch", length($ref), length($query), $alignment->query->name); next; }
}

rishidev commented 7 years ago

Thanks for submitting this. I've been able to replicate the issue with the sample code you have provided, and will investigate further.

AlexanderDilthey commented 7 years ago

Hi @rishidev, any news on that bug by any chance? Solving it would really help dealing with long, e.g. Nanopore, reads.

rishidev commented 7 years ago

Hi @AlexanderDilthey, apologies, I have moved on to a different project now, but this is the last thing I have to fix. The SAM and CRAM behaviour are quite different in the code. I think I've identified where the issue is occurring, but need further clarification to see what the fix is supposed to be.

This seems to be around the CIGAR value. The SAM/BAM contain the sequence themselves so don't have the issue. The CRAM reads reference a CIGAR string with a P in it which is 'a silent deletion from the padded reference'. The padded_alignment call looks like it is returning a padded value but the string is being compared to something that hasn't analysed the P when it's chomping out letters, but I want to go through the spec to confirm that making it do so is the correct decision.

avullo commented 6 years ago

Unfortunately, I'm not able to replicate this, I get (v2.9 with htslib 1.7):

[E::cram_get_ref] Failed to populate reference for id 20 [E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761 [E::cram_next_slice] Failure to decode slice

with 2.7 and/or htslib 1.5 is the same.

@AlexanderDilthey, @rishidev can you confirm you still get the same invalid output and warnings? @AlexanderDilthey is this still relevant for you?

AlexanderDilthey commented 6 years ago

@avullo Definitely still relevant - I think the pairwise alignment feature is one of the most useful of all! I won't be able to check whether it reproduces with 1.7 immediately, but it's on my list!

avullo commented 6 years ago

Thanks @AlexanderDilthey for letting me know. I'm gonna try with 1.5, but can you tell me your set up (OS/Bio::DB::HTS/htslib version)?

Forgot to mention, https://github.com/Ensembl/Bio-DB-HTS/issues/54 is resolved.

avullo commented 6 years ago

From the module's README:

  1. "Failed to populate reference for id X" error message

This is probably an indication that the CRAM Reference archive is unavailable.

keiranmraine commented 6 years ago

@avullo, I'm sure you're already aware of this but check REF_PATH and REF_CACHE env values as well as any proxy settings. If they are correct the last thing it could be is the ref server is down at ensembl.

http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES

avullo commented 6 years ago

@keiranmraine thanks for the trust but you shouldn't be sure: I'm not (yet) well versed with these tools; I was asked to contribute since I have XS experience ;)

andrewyatz commented 6 years ago

ENA you mean @keiranmraine. Ensembl doesn't host the CRAM reference registry

On 7 Mar 2018, at 12:08, Keiran Raine notifications@github.com wrote:

@avullo, I'm sure you're already aware of this but check REF_PATH and REF_CACHE env values as well as any proxy settings. If they are correct the last thing it could be is the ref server is down at ensembl.

http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

keiranmraine commented 6 years ago

Sorry yes, ENA at EBI.

avullo commented 6 years ago

Hi @AlexanderDilthey, little but not negligible progress so far.

I and @rishidev managed to isolate the problem and we think fixing would require a reimplementation of the way the reference sequence is reconstructed using the CIGAR and MD strings together (borrowed from the Bio::DB::Sam ancestor module). This is not entirely trivial so bear with me until I properly attack the problem.

But you have a workaround: just instantiate the HTS instance passing the -force_refseq option. This way, it will use the indexed fasta file so you get the correct reference properly aligned to the read.

Let me know if that suits you for your immediate needs. I'm going to address the problem anyway.

AlexanderDilthey commented 6 years ago

Hi @avullo,

It seems that I now have the opposite problem when using -force_refseq.

I instantiate with

my $sam = Bio::DB::HTS->new(-fasta => $reference, -bam => $BAM, -force_refseq => 1);

... and then later retrieve the alignment via:

my ($ref, $matches, $query) = $inputAlignment->padded_alignment;

... then $query is string of Ns.

evanbiederstedt commented 6 years ago

Hello @avullo, @rishidev

I'm noticing the same issue discussed here by the OP, @AlexanderDilthey

This function doesn't work always. It's particularly a problem with secondary/supplementary alignments

(1) Has this issue been resolved? I haven't tried the code posted above in April 2017, but it appears to be a reproducible example---does this code still not work?

(2) It's somewhat urgent---if you're willing to walk me through the issue at hand, I could try to write up a patch.

If it's easier, we could discuss this via e-mail. Thanks, Evan

avullo commented 6 years ago

HI Evan,

it's been a pretty intense period doing other things under deadlines. The newest code doesn't address this issue, so it's not resolved but is on the spotlight.

Before walking through this, I need to recollect thoughts that I had when I first looked at it. I will try to allocate some time next week.

When would you absolutely need it solved?

cheers

Alessandro

evanbiederstedt commented 6 years ago

Hi @avullo

I'm working on using this code now. If it's easier, please collect your notes on this issue, and we can discuss this via e-mail. I will CC @AlexanderDilthey

I'm very happy to help patching this up.

Thanks, Evan

avullo commented 6 years ago

Ok @evanbiederstedt

from what I vaguely recollect, the problem is that the code doesn't correctly reconstruct the sequence in the absence of the reference, that's why the problem occurs for CRAM but not in the other cases.

evanbiederstedt commented 6 years ago

Hi @avullo

I've reproduced the example from @AlexanderDilthey as follows:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use Bio::DB::HTS;

my $referenceFasta = 'GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa';
my $sam = Bio::DB::HTS->new(-fasta => $referenceFasta, -bam => 'testCase.cram');

my $alignment_iterator = $sam->features(-iterator => 1);
while(my $alignment = $alignment_iterator->next_seq)
{
    my ($ref,$matches,$query) = $alignment->padded_alignment;
    unless(length($ref) == length($query))
    {
        warn Dumper("Alignment length mismatch", length($ref), length($query), $alignment->query->name);
        next;
    }
}

It is true that the issue exists for the CRAM @AlexanderDilthey provided, i.e. here's the error:

[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice

However, the bug is not CRAM-dependent; it exists for BAM as well.

avullo commented 6 years ago

That's the error I was getting when initially trying to reproduce the error, but then I've learned you need to set REF_PATH env variable, as pointed out by @keiranmraine, see http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES. When you properly set the above, you'll get the warnings on the CRAM file, but not on BAM.

evanbiederstedt commented 6 years ago

@avullo

Thanks for the reply.

I'm a bit confused: in order to resolve the issue raised above, you set the environment variable REF_PATH to the reference FASTA?

that is

export REF_PATH=GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
evanbiederstedt commented 6 years ago

Hi @avullo

Apologies if you're busy. Just to clarify, by this statement

When you properly set the above, you'll get the warnings on the CRAM file, but not on BAM.

do you mean that if you set the REF_PATH variable, the operation on the CRAM throws warnings but works correctly, or throws warnings and doesn't work?

avullo commented 6 years ago

Not a problem @evanbiederstedt

I meant, you don't get the error: [E::cram_get_ref] Failed to populate reference for id 20 [E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761 [E::cram_next_slice] Failure to decode slice

but you do get the warning originally reported by @AlexanderDilthey, which point to the existence of bugs.

evanbiederstedt commented 6 years ago

Hi @avullo
CC @AlexanderDilthey

I see what you mean now. Just to clarify for future reference:

When the variable REF_PATH is not defined, here is the output I see, using the script testCram.pl shown here: https://github.com/Ensembl/Bio-DB-HTS/issues/40#issuecomment-415804123

$ perl testCram.pl
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
$VAR1 = 'Alignment length mismatch';
$VAR2 = 3071526;
$VAR3 = 3074241;
$VAR4 = 'NA19240.gi|939191008|gb|LKPB01001610.1|';

When you do set the variable, here is the output:

$ export REF_PATH=GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
$ echo $REF_PATH
## GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
$ perl testCram.pl
[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice
avullo commented 6 years ago

It's the other way around for me, but I use this: $ echo $REF_PATH /home/avullo/.cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s

Do you still get the error on not CRAM?

evanbiederstedt commented 6 years ago

Hi @avullo

Thanks for the help. Perhaps this is an elementary question (and somewhat outside this github issue), but I'm still deeply confused how MD5 checksums solve this issue. Is this is fundamental property of the CRAM format?

I'm used to MD5 checksums in order to check whether something has downloaded correctly...

keiranmraine commented 6 years ago

The md5 of each chr seq is used to store a reference to it in the header of the BAM/CRAM file. This ensures that the sequence being used to reconstruct the read seq is actually the correct one, and not a different build or something similar.

The /.../.cache/%2s/%2s/%s version is a pre-populated cache. My understanding is that you cannot use a file for REF_PATH/REF_CACHE. A file has to be provided as a parameter to the hts library.

http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES

evanbiederstedt commented 6 years ago

Hi @keiranmraine

Thanks for the help! I appreciate the explanation, and I think I understand now how this works.