bxlab / bx-python

Tools for manipulating biological data, particularly multiple sequence alignments
MIT License
145 stars 53 forks source link

get on bx.align.maf.MAFIndexedAccess results in wrong alignment blocks #61

Closed dandaman closed 4 years ago

dandaman commented 4 years ago

Hi, either I'm doing something wrong or I've stumbled over a peculiar bug in the MAF indexing code. My setup: Linux/Python 3.7.1/Anaconda bx-python=0.8.8 installed from -c bioconda

I use a multi-species, multi sequence maf file indexed via: maf_build_index.py test.maf test.idx

I want to study SNPs locations on the MSA. I have created a small example repo that contains a test set and an ipynb to demonstrate the bug: https://github.com/dandaman/bxmafissue/blob/master/BugTracer.ipynb

As you can see in the notebook, get (or probably the index) is somehow mixing up positions. In my full data set some positions are wrong some are not?!

Am I doing something wrong or is this a bug?

This old report https://www.biostars.org/p/122420/ seems to have encountered the same issue.

All the best, Daniel

rsharris commented 4 years ago

Looking at the resulting blocks, it appears that they are consistent with the fact that the reference sequence, PT.PT1S09221, appears on the reverse strand.

In other words, for the first block, I believe it does contain position 570 along the forward strand of PT.PT1S09221.

I was initially confused about why it reports 127 as component.start. But working this through: s PT.PT1S09221 117 12 - 698 GCTGAAGTTGGC is interval (117,129) on the reverse strand (using zero-based half-open coordinates). That corresponds to GCCAACTTCAGC on the forward strand in (569,581). The request for position 570 would be the second position in that interval. Converting that position as a starting position on the reverse strand is 127, right? I.e. the sub-interval starting at 127 is (127,129), which is the final two characters, e.g. the aligned sequence is split into GCTGAAGTTG/GC and GC is reported.

Bob H

On Apr 17, 2020, at 3:56 AM, Daniel Lang notifications@github.com wrote:

Hi, either I'm doing something wrong or I've stumbled over a peculiar bug in the MAF indexing code. My setup: Linux/Python 3.7.1/Anaconda bx-python=0.8.8 installed from -c bioconda

I use a multi-species, multi sequence maf file indexed via: maf_build_index.py test.maf test.idx

I want to study SNPs locations on the MSA. I have created a small example repo that contains a test set and an ipynb to demonstrate the bug: https://github.com/dandaman/bxmafissue/blob/master/BugTracer.ipynb https://github.com/dandaman/bxmafissue/blob/master/BugTracer.ipynb As you can see in the notebook, get (or probably the index) is somehow mixing up positions. In my full data set some positions are wrong some are not?!

Am I doing something wrong or is this a bug?

This old report https://www.biostars.org/p/122420/ https://www.biostars.org/p/122420/ seems to have encountered the same issue.

All the best, Daniel

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bxlab/bx-python/issues/61, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFDOI6Y6O3H6COBOIT6GB3RNADUFANCNFSM4MKRBIFQ.

rsharris commented 4 years ago

Continuing that thought ... from what you've reported it appears that the lookup always considers the position to be on the forward strand, and the resulting start position is reported along whatever strand the alignment is on.

I wonder if that is documented anywhere.

dandaman commented 4 years ago

You're right. I did forget about how MAF represents alignments on the negative strand :-(

It's mentioned in the documentation of slice_by_component and I overread it.

I solved my issue by checking the returned alignment strand for my target sequence and the reverse complementing it.

I've updated my example repo and keeping it in case anyone falls into the same trap I have. Updated the master of my repo accordingly. https://github.com/dandaman/bxmafissue/blob/master/BugTracer.ipynb

If you're not very familiar with MAF it might be a common misconception. Dealing with GFF3 or VCF you'd expect different. Maybe also add this to the doc of get() and point to the MAF standard description at ucsc or so.

From my side we can close this issue, Bob. Thanks for your explanation and for your work on bx-python!

rsharris commented 4 years ago

Reverse-strand treatment, as well as origin-zero vs origin-one, pop up over and over on forums. Reverse strand in MAF bit me early on, probably in 2004.

It would be helpful if the situation you encountered was in some bx-python doc, e.g. if there is an example of how to use maf index lookup it ought to include an example where the 'hit' is on the reverse strand.

I haven't been a contributor to bx-python in over a decade. And my contribution was pretty small to begin with. I haven't been active in the project as it has matured and improved, and with James' death I don't know any of the people who have done that work. But I do occasionally respond to issues when I think I can understand them.