JLSteenwyk / ClipKIT

a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference
https://jlsteenwyk.com/ClipKIT/
MIT License
61 stars 4 forks source link

Change to trimming behavior with gappyness threshold? #38

Closed chtsai0105 closed 1 year ago

chtsai0105 commented 1 year ago

Bug Summary

Not a bug actually. But I found the trimming behavior is different from the previous version.

In the previous version, using gappy or smart_gap mode trims sites that the gappyness above a threshold. But in the latest version it trims sites equal or above the threshold. Here are the codes I found before implementing the numpy acceleration: https://github.com/JLSteenwyk/ClipKIT/blob/7c0dee803c8fbf33edd1f018a56527b1da7ac268/clipkit/msa.py#L63-L64

However, in the latest version: https://github.com/JLSteenwyk/ClipKIT/blob/f4cb92a2c2248e21e3ecf161c1a1cd08152d4b16/clipkit/msa.py#L158-L159

Just want to confirm you're expecting this behavioral changes, since it will produce a slightly different results from the previous version.

Steps to Reproduce

from Bio.Align import MultipleSeqAlignment
from clipkit.msa import MSA
from clipkit.settings import DEFAULT_AA_GAP_CHARS

records = ["MG--A", "M-GTC"]
pep_msa = MultipleSeqAlignment([SeqRecord(Seq(rec)) for rec in records])
gaps = 0.5

clipkit_pep_msa = MSA.from_bio_msa(pep_msa, gap_chars=DEFAULT_AA_GAP_CHARS)
clipkit_pep_msa.trim(mode=TrimmingMode.gappy, gap_threshold=0.5)
print(clipkit_pep_msa.to_bio_msa())

The previous codes do not trim any of the sites and will get:

Alignment with 2 rows and 5 columns
MG--A <unknown description>
M-GTC <unknown description>

The current codes will trim the sites that gappyness == 0.5 and get:

Alignment with 2 rows and 2 columns
MA <unknown description>
MC <unknown description>

Technical Details

JLSteenwyk commented 1 year ago

Hi Cheng-Hung,

Thank you for your careful reading of ClipKIT. Attentive users like yourself help ensure ClipKIT performs well (and, in some cases, even better!).

Regarding this, I will refer to the original manuscript — Steenwyk et al. (2020), Current Biology. Specifically, we note in the pseudocode for the execution clipkit <MSA> -g 0.95 that ClipKIT will do the following:

FOR site in alignment:
>IF site has <95% gaps
>>keep the site

Thus, sites with greater than or equal to the specified gaps threshold should be removed.

The code in commit 7c0dee8 was erroneous. This error stemmed from incorrectly implementing the Numpy acceleration. (Note, 7c0dee8 does implement the Numpy acceleration.)

As you mentioned in your comment, the error was later fixed, including in version 2.1.1.

Thank you again for your careful review of our code. We appreciate your insights.

best,

Jacob

chtsai0105 commented 1 year ago

Thank you! I'll update my tests file to reflect the changes.