bxlab / bx-python

Tools for manipulating biological data, particularly multiple sequence alignments
MIT License
145 stars 53 forks source link

Got wrong positions on chomped query maf subset #68

Closed songtaogui closed 3 years ago

songtaogui commented 3 years ago

Hi,

Thank you for developing this nice package.

However, I have encountered issues when subset maf with maf_extract_ranges_indexed.py, the query position of chomped maf seems incorrect:

For example, I have a REF.maf like:

##maf version=1
a score=1643
s Ref.3                            219977717 1643 + 235667834 CCTGTAAAATATATAATCTTGAGCAAACTAGTTAGTCCAATTATTTGTGTTGGGCATTTCAACCACCAAAATCATTTAGGAAAAAGTTTGACCCTATTTCCCTTTCAGGTATCAATATAATATAAAATTTGTTAATGTTTATATGTTTGTCATGAAATAAATATATGAATGAATAAGATATATTGTGTCTATAAATTTTTATTTTCATGAATGATGAAGGTGTGTCCTTCATGACCTTCGTCTGAAGATCGTTATATCCGAAGGAGATAATGCTTCGAAGAACAAAGGTCCTTAATACTTAACGATTGTGTTGTCTTGTTCTTGATTCACAACATTTGAGAACAAGTGACCAACAGCTCTTTTGAGCCTAACTCCTCTAACTTTTTTGGCATTTCTCGAGGATTCCTTACGACTTAGACAAACATAGTTAGCTTGTAAAACAATTGACTAAGCCTAGGAAACGTACCTTTTTTCACATAGTTTTTATAGGATTTGCAATATGCTCAAAACTAAGTCTAAAAGTCAATTTTCTTGCACAAAAGAGTCAGTAAATCAAGGTGTAGGAAAGATAGTATAAGGTGTGTGC-AACTTATTCAAACACTTAAATCTCATGTTTTATAGTTTTACCCAAAGTTGCACTTTTGGCCTTTATTTCCATTCTTTTGAACTTGTTGCCTTTAATTTGCTTTTGCTAGAAGCCTAGAACGTGTGCTCAAGCTGGCATCAAAGTAATTAGTCTGATGTGCTATCACTATGGGTGACAATAAGCTCTAAATTTTACGCTATAAGATTTAAGAATCAGATCAGATTAGAATCAGACCATATTTTTATTCATTTTTGAACTAAAATTTATTTAGGGCACTACCATTTTGTGAAGAAACATTTGGATCGTGATCCATTACCACTCCTAACTGTCACTTGAATGTGAATCATTACCATTGTCATTTGCGTTACAATTCGTAAATCAAACAACTTGTTTGAACCAAATAACTTTCTACGTCATCCTCTACCGTGCACAACTTACATCAACTTTTTCACAAGCCACGACTTAAGGGGTGTTTGAATGCAATAGAGCTAATAGTTAGCTGTTAAAATTAGCTGAAGACATCTAAACAATATAGCTAATAGTTCAGCTATTAGCTATTTTTAGCAAATTAGCTAATAGTTAGCTAGCTATTTGTTAGCTAGCTAATTTCACCGGCAA-TTTTTTGTCAACTAACTATTAGCTCTAATGCATTGAAACACCATCTTTATTAAGGCCCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAAAAGAGCTAATGATGGTGTGATAAAATTTAGCACATATTATCTCATAAACTGCTCAAGAGATGCTAAAAGTAGCTAAACGTGGGCCAGGAACCCATTTTTCATCCATACCCTCCAACATCATTCCCTCT--CTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTACTAAACTTTATTAGAGTAGTCCAAATACCATAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTGAAGTTTAGCAGGAAGCTAAAATTTATCCTGTGAGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC
s query                                 2314 1632 -      3946 CCTGTAGAATATATAATCTTGAGCAAACTAGTTAGTCCAATTATTTGTGTTGGGCATTTCAACCACCAAAATCATTTAGGAAAAGGTTTGACCCTATTTCCCTTTCAGATATCAATATAATATAAACTTTGTTAATGTTTATATGTTTGTCATGAAATAAATATATGAATGAATAAGATATATTGTGTGTATAAATTTTTATTTTCATGAATGATGAAGGTGTGTCCTTCATGACCTTCGTCTGAAGATCGTTATATCCGAAGGAGATAATGCTTCGAAGGACGAAGGTCCTTAATACTTAACGATTGTGTTGTCTTGTTCTTGATTCACAACATTTGAGAACAAGTGACCAACAGCTCTTTTGAGCCTAACTCCTCTAACTTTTTTGGCATTTCTCGAGGGTTCCTTACGACTTAGACAAACATAGTTAGCTTGTAAAACAATTGACTAAGCCTAGGAAACATACC-TTTTTCACATAGTTTTTATAGGATTTGCAATATGCTCAAAACTAAGTCCAAAAGTCAATTTTCTTGCACAAAAGAGTTAGTAAATCAAGGTGCAGGAAAGATAGTATAAGATGTGTGCTAACTTATTCAAACACTTAAATCTCATGTTTTATAGTTTTACCCAAAGTTGCACTTTTGGCCTTCATTTCCATTCTTTTGAACTTGTTGCCTTTAATTTGCTTTTGCTAGAAGCCTAGAACGTGTGCTCAAGCTGACATCAAAGTAATTAGTCTGATGTGCTATCACTATGGGTGACAATAAGCTCTAAATTTTACAATATAAGATTTAATGATCAGATCAGATTAGAATCGGACCATATTTTTATTCATTTTTGAACTAAAATTTATTTAGGGCACTACCATTTTGTGAAGAAACATTTGGATCGTGATCCATTACCACTCCTAGCTGTCACTTGAGTGTGAATCATTATCATTGTCATTTGCGTTACAATTCGTAAATCAAACAACTTGTTTGAACCAAATAATTTTCTACGTCATCCTCTACCGTGCACAACTTACATCAACTTTTTCACAAGCCACGACTTAAGGGGTGTTTGAATGCAATAGAGCTAATAGTTAGCTGTTAAAATTAGCTGAAGACATCTAAACAATATAGCTAATAGTTCAGCTATTAGCTATTTTTAGCAAATTAGCTAATA-------------GTTAGCTAGCTAATTCCACCGACAATTTTTTTGTCAACTAACTATTAGCTCTAATGCATTTAAACACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATA-CCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC

And I was trying to subset the maf based on a region with command:

# >cat test.bed
# Ref.3   219978958       219979958

maf_extract_ranges_indexed.py -c REF.maf < test.bed > test.maf

The output test.maf was like:

##maf version=1
a score=1643
s Ref.3 219978958 402 + 235667834 ACACCATCTTTATTAAGGCCCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAAAAGAGCTAATGATGGTGTGATAAAATTTAGCACATATTATCTCATAAACTGCTCAAGAGATGCTAAAAGTAGCTAAACGTGGGCCAGGAACCCATTTTTCATCCATACCCTCCAACATCATTCCCTCT--CTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTACTAAACTTTATTAGAGTAGTCCAAATACCATAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTGAAGTTTAGCAGGAAGCTAAAATTTATCCTGTGAGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC
s query      3543 403 -      3946 ACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATA-CCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC

However, the exact position of the aligned query sequence was nothing like the position reported in the test.maf:

# the query whole sequence was stored in query.fa, locating the aligned block in the test.maf with seqkit locate function:
# > seqkit locate --gtf -p "ACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATACCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC" query.fa -o test_location.gtf
# > cat test_location.gtf
query   SeqKit  location        2315    2717    0       -       .       gene_id "ACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATACCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC";

It seems to me that maf_extract_ranges_indexed.py takes the left part of the chomped maf block as the right part when the query strand is "-".

Did I misunderstand something ? How could I get the correct postions when chomping maf ?

Please find attached the aforementioned files to reproduce the result: chomp_test.tar.gz

Any suggestion would be grateful.

Best wishes,

Songtao Gui

rsharris commented 3 years ago

Usually reports like this are due to a user misunderstanding of the MAF format, for which treatment of the reverse strand is not necessarily obvious. See http://genome.ucsc.edu/FAQ/FAQformat#format5 . I'm pretty sure MAF originated at UCSC, so this should be considered as the formal specification of the format.

I haven't looked at the details of your post, so I'm not 100% positive this is what is happening. And I do appreciate that your post seems to be thorough. But your conclusion is consistent with the usual misunderstanding of MAF. I had similar problems in my own early encounters with MAF. Could you check to see whether this explains what you are seeing?

songtaogui commented 3 years ago

@rsharris

Yes, you are right.

After another look at my MAF file, it turns out that the start postion of the query was always based on the positive line regardless of the strand. After changing to the reverse complement line based start position, everything just works fine !

Sorry for the bothering, and thank you for the help.

Best,

Songtao Gui