KosinskiLab / AlphaPulldown

https://doi.org/10.1093/bioinformatics/btac749
GNU General Public License v3.0

How to speed up protein screening #467

Open gabrielpan147 opened 1 day ago

gabrielpan147 commented 1 day ago

Hello,

Thank you for developing such a great tool! I am currently doing protein screening based on the pulldown mode. We have several A100 GPUs on a Slurm-based cluster. However, I just found that the inference speed of the tool is slow: for an 827-residue protein, the prediction time on a single A100 card was ~150 s, significantly slower than AlphaFold's suggested prediction speed of ~60-90 s (also measured on an A100).

I just followed your installation tutorial, but I'm not sure if I configured everything correctly. I'm wondering if there is a parameter such as "global_config.subbatch_size" to increase the batch size or speed things up? Could you give me some suggestions?

Thanks, Gabriel

jkosinski commented 1 day ago

Hi Gabriel,

The inference speed depends not only on sequence length but also on other factors such as the size of the MSA. Did you compare AlphaFold and AlphaPulldown run times on the exact same input, including the same input sequence alignments and templates, the same number of recycles, and so on?

Best, Jan

DimaMolod commented 1 day ago

Hi @gabrielpan147, we use the same config file as the original AlphaFold; e.g. you can find the global_config.subbatch_size value here. But I agree that you can't compare speed based on sequence length alone. I'm also not sure how changing the default AF parameters would affect the accuracy of predictions.
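For readers unfamiliar with what a sub-batch size controls: it is a memory/speed knob that splits a large computation (e.g. attention over MSA rows) into fixed-size chunks instead of running it all at once. Below is a minimal, purely illustrative Python sketch of that idea; it is not AlphaPulldown or AlphaFold code, and `expensive_row_op` is a made-up stand-in.

```python
# Illustrative sketch of what a subbatch_size-style knob does (NOT AlphaPulldown code):
# process rows in fixed-size chunks rather than all at once. Smaller chunks lower
# peak memory; larger chunks generally run faster. The result is identical either way.

def expensive_row_op(row):
    """Hypothetical stand-in for a memory-hungry per-row computation."""
    return [x * x for x in row]

def run_full(rows):
    # All rows at once: fastest, but highest peak memory.
    return [expensive_row_op(r) for r in rows]

def run_subbatched(rows, subbatch_size):
    # Same computation in chunks of `subbatch_size` rows.
    out = []
    for start in range(0, len(rows), subbatch_size):
        for row in rows[start:start + subbatch_size]:
            out.append(expensive_row_op(row))
    return out

rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
assert run_full(rows) == run_subbatched(rows, subbatch_size=2)
```

This is why tuning the sub-batch size changes memory use and throughput but not the numerical result of a single forward pass.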

gabrielpan147 commented 5 hours ago

Hello all,

Thank you for your reply! I haven't tested the official AF on the same input yet, and I agree this is what I need to test. I am also wondering, if we focus on AlphaPulldown, do you have any tips & suggestions for accelerating the inference speed?

DimaMolod commented 3 hours ago

Well, you can limit the MSA depth, reduce the number of recycles, or play with some other parameters in config.py, but it's always a trade-off between speed and quality, and I don't think there is a trick that just accelerates inference without affecting accuracy. Otherwise, it would be well known by now :) We recently found out that conversion to modelCIF can take a while for large PAE matrices, so maybe turn that off too.
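One way to limit MSA depth, as suggested above, is to truncate the MSA-shaped entries of the feature dictionary before inference. The sketch below is purely illustrative: the dictionary layout and key names (`msa`, `deletion_matrix`) are assumptions modeled loosely on AlphaFold-style features (which in reality are numpy arrays with many more keys), not the actual AlphaPulldown API.

```python
# Hypothetical sketch: cap the number of MSA rows fed to the model.
# Fewer rows means less compute per recycle, at some cost in accuracy.
# Key names here are assumptions, not guaranteed to match real feature dicts.

def cap_msa_depth(features, max_depth):
    """Return a copy of `features` with MSA-shaped entries truncated to `max_depth` rows."""
    capped = dict(features)
    for key in ("msa", "deletion_matrix"):  # assumed per-MSA-row keys
        if key in capped:
            capped[key] = capped[key][:max_depth]
    return capped

# Toy example: a 2048-row "MSA" for a 10-residue sequence.
features = {
    "msa": [[0] * 10 for _ in range(2048)],
    "deletion_matrix": [[0] * 10 for _ in range(2048)],
    "residue_index": list(range(10)),  # per-residue data stays untouched
}
capped = cap_msa_depth(features, max_depth=512)
assert len(capped["msa"]) == 512
assert len(capped["residue_index"]) == 10
```

The same trade-off applies to reducing recycles: each recycle is roughly another forward pass, so cutting recycles cuts wall time near-linearly, again at some accuracy cost.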

jkosinski commented 3 hours ago

Exactly as Dima said, unfortunately speed always sacrifices sensitivity and accuracy, so it depends on your biological question, e.g. whether you want to find as many interactions as possible in your system, or you are fine with missing some. In addition to the parameters Dima listed above, you can also set --num_predictions_per_model=1 and run only one model (e.g. --model_names=model_2_multimer_v3). @DimaMolod, by the way, since this question keeps coming back, maybe we could add a subpage to the docs listing all these settings for the fastest speed, with a warning about accuracy?
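To see why these two flags matter so much for a screen: the total number of forward passes per protein pair scales with the number of models times the number of predictions per model. A quick back-of-the-envelope sketch (the defaults of 5 models and 5 predictions per model are an assumption about a typical AlphaFold-Multimer setup, so check your own configuration):

```python
# Rough cost model: predictions per pair = n_models * n_predictions_per_model.
# Defaults below are assumed (5 multimer model variants, 5 predictions each);
# verify against your own run configuration.

def total_predictions(n_models, n_predictions_per_model):
    return n_models * n_predictions_per_model

default = total_predictions(5, 5)  # assumed default setup
fast = total_predictions(1, 1)     # --model_names=model_2_multimer_v3 --num_predictions_per_model=1
speedup = default // fast
assert speedup == 25
```

So trimming to one model and one prediction can cut per-pair inference work by an order of magnitude, which usually dominates any per-forward-pass tuning, at the cost of less diverse sampling.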

(we should also occasionally test that AP is as fast as AF with the same input).

DimaMolod commented 3 hours ago

Yes, I like the idea: maybe instead of the current exhaustive manual with all possible functionality we should split it by use cases, e.g.:

0) quick start
1) classical AlphaPulldown: how to run inference as quickly as possible
2) modeling big complexes with the slow-and-accurate mode and multimeric templates
3) how to use crosslinks
etc.

jkosinski commented 2 hours ago

Yes, we need to restructure the front page along these lines; let's discuss that separately.