JLSteenwyk / ClipKIT

a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference
https://jlsteenwyk.com/ClipKIT/
MIT License
61 stars 4 forks source link

2023-08-24: Use numpy to accelerate the trimming process #36

Closed chtsai0105 closed 1 year ago

chtsai0105 commented 1 year ago

Hi - I'm suggesting to use some numpy implementation instead of for loop for trimming.

Below is an example codes that I tried. I slightly modified the original function ClipKIT used for the trimming process but the concept should be the same. Note that I'm using Biopython MultipleSeqAlignment objects as input. I noticed you have already convert them to numpy array so I guess you can simply skip the conversion step. Also the trimD definition might be different from the original one since I just use it to store the index should be trimmed.

def original_clipkit(msa: MultipleSeqAlignment, gaps: int = 0.9):    
    keepD = msa[:, :0]
    trimD = []
    for idx in range(msa.get_alignment_length()):
        _, gappyness = report_column_features(msa, idx, get_gap_chars(SeqType.aa))
        if gappyness <= gaps:
            keepD += msa[:, idx : idx + 1]
        else:
            trimD.append(idx)

    return keepD, trimD

I tested it with jupyter %%timeit function and here is the results: 67.1 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And here is the rewrite with numpy methods.

def numpy_clipkit(msa: MultipleSeqAlignment, gaps: int = 0.9):
    infoList = [{"id": rec.id, "name": rec.name, "description": rec.description} for rec in msa]
    msa_array = np.array([list(rec) for rec in msa])
    gappyness = (msa_array == "-").mean(axis=0)
    trimD = np.where(gappyness > gaps)[0]
    keepD = np.delete(msa_array, trimD, axis=1)

    return MultipleSeqAlignment([SeqRecord(Seq("".join(rec)), **info) for rec, info in zip(keepD.tolist(), infoList)]), trimD.tolist()

And there is a significant speed up: 616 µs ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Since I'm only using the gappy mode to do the trimming, I'm not sure if this implementation is applicable to other trimming mode.

JLSteenwyk commented 1 year ago

Hi Cheng-Hung,

Thanks for this suggestion! At first glance, it looks like a great enhancement!

I will take a close look early next week.

Thanks again!

best,

Jacob

JLSteenwyk commented 1 year ago

Hi Cheng-Hung,

The latest version of ClipKIT takes inspiration from your ticket and now implements a faster version of ClipKIT.

Thank you for your suggestion!

best,

Jacob