BSDExabio / PSP

protein structure prediction

Large-scale application #11

Closed FreshAirTonight closed 2 years ago

FreshAirTonight commented 3 years ago

I have features for about 30,000 proteins ready for modeling.

FreshAirTonight commented 2 years ago

To prepare for big runs, I did some test runs and collected statistics on a benchmark set of 559 Desulfovibrio proteins. All runs except casp14 used 36 Summit nodes, each with six 16 GB V100 GPUs.

| Preset | Mean pLDDT | Mean pTMS | Count | Walltime (min) |
|---|---|---|---|---|
| reduced_dbs | 78.55129 | 0.6327694 | 559 | 41 |
| genome | 79.54728 | 0.6439633 | 559 | 45 |
| genome2 | 79.39952 | 0.6434832 | 559 | 62 |
| super | 79.8062 | 0.6471785 | 558 | 76 |
| economy | 78.54494 | 0.6335336 | 559 | 37 |
| casp14 | 78.56151 | 0.6313684 | 551 | 150 (91 nodes) |
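To compare presets by cost rather than by walltime alone, the walltimes can be converted to node-hours. This is only a rough sketch: the node counts (36 for every preset except casp14, 91 for casp14) and walltimes are taken from this comment, and the `node_hours` helper is a name made up for illustration. It ignores queue time and any per-job overhead.

```python
# Walltime in minutes and node count per preset, copied from the table above.
runs = {
    "reduced_dbs": (41, 36),
    "genome": (45, 36),
    "genome2": (62, 36),
    "super": (76, 36),
    "economy": (37, 36),
    "casp14": (150, 91),
}


def node_hours(walltime_min, nodes):
    """Hypothetical helper: node-hours consumed by one benchmark run."""
    return walltime_min * nodes / 60.0


costs = {preset: node_hours(*spec) for preset, spec in runs.items()}

# Print presets from cheapest to most expensive.
for preset, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{preset:12s} {cost:7.1f} node-hours")
```

By this measure casp14 costs roughly five times as much as super on the same benchmark set.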

Notes:

markcoletti commented 2 years ago

Looks like super is the winner! Um, except it's not a fair comparison since the larger proteins were left out.

Where does this leave us for the next large run?

FreshAirTonight commented 2 years ago

As soon as we figure out what causes the core dumps and find a way to avoid them, we are ready for a real run.

FreshAirTonight commented 2 years ago

A test on 3,400 sequences of the Pseudodesulfovibrio proteome finished within 1 hour using 200 Summit nodes. A few long sequences (>1,500 AAs) were run on high-memory nodes. Confident predictions cover 90% of all sequences, and ultra-confident predictions cover 69%.

FreshAirTonight commented 2 years ago

A test on 24,738 Smagellanicum sequences in a 2-hour run on 1k nodes. Each sequence goes through 5 DL models, for a total of 123,690 target/model tasks. 52,452 (42%) of them finished in this run. Since I ordered the longer sequences to run first and the remaining shorter sequences cost far less time, I estimate that at least 63% of the workload has been done. Time spent on GPU is about 79% of total run time, so the overhead is not bad. 14 TB of data were generated, so I need to purge some pickle files of low-ranked models. One issue is that Dask reported only 2.4% of tasks as done, which is significantly lower than the actual number of tasks done.
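For anyone checking the task counts, the arithmetic is short enough to write out. This is a minimal sketch using only the numbers quoted in this comment; the variable names are illustrative, and the Dask figure is just the reported 2.4% converted back to an approximate task count.

```python
sequences = 24_738
models_per_sequence = 5
tasks_done = 52_452
dask_reported_fraction = 0.024  # fraction of tasks Dask reported as done

# One task per (sequence, model) pair.
total_tasks = sequences * models_per_sequence
fraction_done = tasks_done / total_tasks

# Approximate number of tasks Dask believed were finished.
dask_reported_tasks = round(dask_reported_fraction * total_tasks)

print(total_tasks)
print(f"{fraction_done:.1%}")
print(dask_reported_tasks)
```

The gap between the roughly 3,000 tasks Dask reported and the 52,452 actually finished is what makes the Dask number unusable for progress tracking in this run.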

FreshAirTonight commented 2 years ago

A quick update: Smagellanicum is almost done, other than 141 long sequences still waiting in the high-memory queue. When this calculation is done, 99.7% (25,151/25,227) of the Smagellanicum proteome will have been modeled with AlphaFold. For most of the week, the jobs were waiting in the queue. The actual runs took ~5 hrs on 200 regular nodes and ~10 hrs on 36 high-memory nodes.

FreshAirTonight commented 2 years ago

The structure prediction of the Smagellanicum proteome by AlphaFold 2 has been completed. Here is a summary.

Total number of protein sequences modeled: 25,134. This covers 99.7% of the whole Smagellanicum proteome provided, including all sequences < 2500 AAs. Total number of amino acids modeled is 11,448,954.

Since this is a brand-new genome, not a single sequence is found in the current PDB, the database of experimentally determined protein structures. I found only 37 sequences with a PDB match at sequence similarity > 90%, and 974 at > 70%. In other words, all models provided here are genuinely new, good or bad.

Model local quality:

High confidence coverage (pLDDT > 70): 58.2% in terms of AA, 56.9% in terms of sequences

Very high confidence coverage (pLDDT > 90): 36.0% in terms of AA
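The two kinds of coverage above (per amino acid and per sequence) can differ because long and short sequences are weighted differently. A rough sketch of how they might be computed is below. The `coverage` function and the toy scores are made up for illustration, and counting a sequence as covered when its mean pLDDT exceeds the threshold is an assumption, not necessarily the definition used for the numbers in this summary.

```python
from statistics import mean


def coverage(plddt_by_seq, threshold=70.0):
    """Return (residue_fraction, sequence_fraction) above the threshold.

    Assumption: a sequence counts as covered when its mean pLDDT
    is above the threshold.
    """
    total_res = sum(len(scores) for scores in plddt_by_seq.values())
    good_res = sum(
        1 for scores in plddt_by_seq.values() for s in scores if s > threshold
    )
    good_seq = sum(1 for scores in plddt_by_seq.values() if mean(scores) > threshold)
    return good_res / total_res, good_seq / len(plddt_by_seq)


# Toy per-residue pLDDT scores, not real data.
toy = {
    "seqA": [95.0, 88.0, 72.0, 65.0],
    "seqB": [50.0, 60.0, 71.0],
    "seqC": [91.0, 93.0],
}

res_frac, seq_frac = coverage(toy)
print(f"residues: {res_frac:.1%}, sequences: {seq_frac:.1%}")
```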

Model global quality:

Very high confidence (pTMS > 0.8): 20.1%

High confidence (pTMS > 0.6): 52.7%

Mean number of recycles of the top model: 12.0 (versus 3 by default).

Run time:

Feature generation: 7,367 hours (using 8 AMD EPYC 7302 cores)

Model inference: 16,064 hours (using 1 Nvidia V100)

Actual costs are about 2,000 Andes node hours and 3,400 Summit node hours. This includes both overhead and some waste due to the experimental nature of this first large-scale application of its kind on Summit.

Note that in the PDB files, the B-factor column holds the pLDDT score, which is the confidence of the coordinate prediction for each residue. Its maximum value is 100, and values > 70 indicate good confidence.
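For anyone who wants to pull the scores out without a structure library, here is a minimal sketch that reads pLDDT from the B-factor field. It relies on the fixed-column PDB layout, where the atom name sits in columns 13-16 and the B-factor in columns 61-66, and takes one value per residue from the CA atom. The `residue_plddt` and `atom_line` helpers and the two-residue sample are made up for illustration; a real model file would be read from disk instead.

```python
def residue_plddt(pdb_text):
    """Return one pLDDT value per residue, read from CA atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name is in columns 13-16, B-factor in columns 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def atom_line(serial, name, res, resseq, bfac):
    """Hypothetical helper: build a fixed-column ATOM record for the demo."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {res:>3s} A{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfac:6.2f}"
    )


# Synthetic two-residue example, not taken from a real model.
sample = "\n".join(
    [
        atom_line(1, " N  ", "MET", 1, 91.5),
        atom_line(2, " CA ", "MET", 1, 91.5),
        atom_line(3, " CA ", "GLY", 2, 65.25),
    ]
)

scores = residue_plddt(sample)
confident = [s > 70 for s in scores]
print(scores, confident)
```

Since every atom of a residue carries the same pLDDT in these files, reading only the CA atoms avoids counting a residue more than once.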

FreshAirTonight commented 2 years ago

Models are available on Summit here:

Top 1 models in PDB format: /gpfs/alpine/world-shared/bif135/species/smagellanicum/af2_output (7.0GB)

Compressed tarball: /gpfs/alpine/world-shared/bif135/species/smagellanicum/af2_output.tar.gz (1.6GB)