Arcadia-Science / peptigate

Peptigate ("peptide" + "investigate") predicts bioactive peptides from transcriptome assemblies or sets of proteins.
MIT License

Autopeptideml efficiency #14

Open taylorreiter opened 8 months ago

taylorreiter commented 8 months ago

As @keithchev pointed out over in #10:

> the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).

I think it would be simple to run all of the autopeptideml models in a single script, which would then generate the ESM embeddings only once.

Similarly, Keith mentioned:

> we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.

This will be something to keep an eye out for.
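If GPU contention does turn out to be a problem, one option is to declare the GPU as a custom Snakemake resource so that GPU-bound rules never run concurrently. A rough sketch (rule name, file names, and script are hypothetical, not from this repo):

```
# Snakefile sketch: each GPU-using rule claims a custom "gpu" resource,
# and the scheduler is capped so only one such job runs at a time.
rule predict_autopeptideml:
    input:
        "peptides.fasta"
    output:
        "predictions/{model}.tsv"
    resources:
        gpu=1  # this job claims the single GPU
    shell:
        "python run_autopeptideml.py {input} {output}"
```

Running with `snakemake --resources gpu=1` then serializes all rules that request the `gpu` resource, while other rules still run in parallel.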

RaulFD-creator commented 2 months ago

Hi @taylorreiter, I just found this project and it seems really cool. I've just released an update to AutoPeptideML (0.3.1) that addresses this issue. Now you can compute the representations once with:

    df_repr = re.compute_representations(df.sequence, average_pooling=True)

and then run the predictions, passing the additional `df_repr` argument:

    predictions = autopeptideml.predict(
        df=df, re=representation_engine, ensemble_path=model_folder, outputdir=tmp_dirname,
        df_repr=df_repr
    )

This should allow you to run the code in a loop of some sort and avoid calculating the embeddings every time.
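The resulting structure — compute the embeddings once, then loop the twelve models over the cached result — can be sketched in pure Python with stand-in functions (the real calls are `re.compute_representations` and `autopeptideml.predict` shown above; everything below is illustrative, not the AutoPeptideML API):

```python
# Sketch of the compute-once / predict-many pattern with stand-in
# functions; counts embedding passes to show the saving.
calls = {"embed": 0}

def compute_representations(sequences):
    """Stand-in for the (expensive) ESM embedding step."""
    calls["embed"] += 1
    return [hash(s) % 97 for s in sequences]  # fake embeddings

def predict(model, representations):
    """Stand-in for one model's prediction over precomputed embeddings."""
    return [(model, r) for r in representations]

sequences = ["MKTAYIAK", "GIGKFLHS", "KWKLFKKI"]
models = [f"model_{i}" for i in range(12)]

# Embeddings are computed exactly once...
repr_cache = compute_representations(sequences)

# ...and reused by every model in the loop.
all_predictions = {m: predict(m, repr_cache) for m in models}

assert calls["embed"] == 1  # one embedding pass instead of twelve
```

Without the cache, each of the 12 models would trigger its own embedding pass over the same input sequences.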

taylorreiter commented 2 months ago

Thank you @RaulFD-creator!