JLSteenwyk / ClipKIT

a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference
https://jlsteenwyk.com/ClipKIT/
MIT License
63 stars 4 forks source link

Output sequence as digits #14

Closed andersgs closed 3 years ago

andersgs commented 3 years ago

Hello.

I am having the following issue, and below I post how I fixed it. Thank you for the great tool.

The issue occurred on Python 3.8 and clipkit version 1.1.3, BioPython 1.79 and Numpy 1.20.3, and using the following command line:

clipkit clean.aln -m kpic-smart-gap -l -o filtered.aln -if fasta -of fasta

Which in the past has worked well, but today my output from a FASTA alignment (requesting a FASTA output alignment) is returning numerical data:

>MN908947.3
676888886866667666888666686878787768786686776876687688678766
686667667868668866866686688668786788766676666768666867886686
876676876886678886786677887666768686676666868677888678667778
787666766677867687767676688786687788866676766666666678666686
678887668788886667788676766787687866787768887767668667877677

I have tracked the issue down to this line of code:

https://github.com/JLSteenwyk/ClipKIT/blob/cccc8bfcb6fd70e3fe0fa033c317ea015fc07b49/clipkit/modes.py#L83

Essentially, that is returning an int because the code is indexing a bytes object (https://docs.python.org/3/library/stdtypes.html#bytes-objects). I am unsure if there has been some change to BioPython or Numpy here that might be causing the issue.

Changing the line to:

            keepD[entry.id][alignment_position] = entry.seq._data.decode()[alignment_position]

Makes thing work, but at a significant cost to speed.

So, I tried a different approach that works at a good speed but requires the following changes:

  1. Instantiate np.zeros rather than np.empty, with dtype=bytes to store the state of locations to keep and discard:

https://github.com/JLSteenwyk/ClipKIT/blob/a43975dbf89dc09168e1e19c8c58d5f10d88909e/clipkit/helpers.py#L109-L114

So, it looks like this:

keepD[entry.id] = np.zeros([alignment_length], dtype=bytes)

It has to be zeros so that we get an array of b'', which will crucial below when joining things (if you use np.empty, you get random bytes that don't decode back to strings and can't be joined).

  1. These two lines need to change to:

https://github.com/JLSteenwyk/ClipKIT/blob/a43975dbf89dc09168e1e19c8c58d5f10d88909e/clipkit/modes.py#L83

https://github.com/JLSteenwyk/ClipKIT/blob/a43975dbf89dc09168e1e19c8c58d5f10d88909e/clipkit/modes.py#L96

trimD[entry.id][alignment_position] = entry.seq._data[alignment_position:alignment_position+1]

This ensures you get a bytes object of length 1 back rather than an int.

  1. Finally, this bit of code needs to decode the bytes arrays to strings before joining:

https://github.com/JLSteenwyk/ClipKIT/blob/a43975dbf89dc09168e1e19c8c58d5f10d88909e/clipkit/helpers.py#L133-L136

    for k, v in keepD.items():
        keepD[k] = "".join(np.char.decode(v))
    for k, v in trimD.items():
        trimD[k] = "".join(np.char.decode(v))

Thanks again.

JLSteenwyk commented 3 years ago

Hi Anders!

Thank you for notifying us of this issue. Also, thank you for taking an in-depth look into the driver of the problem. I will consult with the other major developer and we will resolve this issue in ~1-2 days.

Thank you again for pointing this out to us!

All the best,

Jacob L. Steenwyk

JLSteenwyk commented 3 years ago

Hi Anders,

Thank you again for notifying us of this issue and taking the time to write us a well-thought-out comment. We identified the issue stems from using biopython, v1.79, instead of ClipKIT's pinned version, v1.76.

We will begin working on the next release of ClipKIT, which will use biopython, v1.79. Given the developer team has their hands full (work, vacation, etc.) and I want to make sure nothing else breaks, I anticipate the next release of ClipKIT will come out in about a week. Sorry if this causes any inconvenience for you.

Again, thank you for your insight, Anders. Please feel free to message us with any other comments, concerns, or suggestions you may have (no matter how big or small!). We hope that ClipKIT continues to fulfill your research needs.

All the best,

Jacob L. Steenwyk

JLSteenwyk commented 3 years ago

Hi Anders,

Our team was able to update the codebase to work with biopython, v1.79. This new change is implemented in ClipKIT version 1.1.5. Biopython, v1.79, is not available on the Anaconda cloud. At this time, ClipKIT, v1.1.5, is available via GitHub and PyPi.

Thank you again for using ClipKIT! You may be interested in our sister toolkit, PhyKIT https://jlsteenwyk.com/PhyKIT/, a broadly applicable software for analyzing and processing multiple sequence alignments and phylogenies.

All the best,

Jacob

andersgs commented 3 years ago

Nice work @JLSteenwyk. Thank you!

andersgs commented 3 years ago

@JLSteenwyk in case you are curious, this is where I am using clipkit: https://github.com/MDU-PHL/kovid-trees-nf/

I usually use gotree for tree manipulation and goalign for alignment manipulation: https://github.com/evolbioinfo/{gotree,goalign}

Will check phykit out...

JLSteenwyk commented 3 years ago

Thank you @andersgs! It is really humbling that you decided to use ClipKIT in kovid-trees-nf. Thank you for your support!