FreshAirTonight closed this issue 2 years ago
To prepare for big runs, I did some test runs and collected statistics on a benchmark set of 559 Desulfovibrio proteins. All presets except casp14 ran on 36 Summit nodes, each with six 16 GB V100 GPUs.
Preset | Mean pLDDT | Mean pTMS | Count | Walltime (min) |
---|---|---|---|---|
reduced_dbs | 78.55129 | 0.6327694 | 559 | 41 |
genome | 79.54728 | 0.6439633 | 559 | 45 |
genome2 | 79.39952 | 0.6434832 | 559 | 62 |
super | 79.8062 | 0.6471785 | 558 | 76 |
economy | 78.54494 | 0.6335336 | 559 | 37 |
casp14 | 78.56151 | 0.6313684 | 551 | 150 (91 nodes) |
Notes:
Looks like `super` is the winner! Except it's not quite a fair comparison, since the largest proteins were left out of that run.
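For reference, per-preset statistics like the table above can be collected with a short script. A minimal sketch, assuming each target directory holds AlphaFold's `ranking_debug.json` (true for the monomer pipeline); the `"ptms"` field name is a guess on my part and may differ by pipeline version:

```python
import json
import statistics
from pathlib import Path

def summarize(run_dir):
    """Collect mean pLDDT / pTM over all targets in one preset's output.

    Assumes each target subdirectory contains a `ranking_debug.json`
    with the ranked model order and per-model pLDDTs; the pTM field
    name ("ptms") is an assumption and may be absent.
    """
    plddts, ptms = [], []
    for ranking in Path(run_dir).glob("*/ranking_debug.json"):
        data = json.loads(ranking.read_text())
        best = data["order"][0]                       # top-ranked model name
        plddts.append(data["plddts"][best])
        ptms.append(data.get("ptms", {}).get(best))   # hypothetical field
    ptms = [p for p in ptms if p is not None]
    return {
        "count": len(plddts),
        "mean_plddt": statistics.mean(plddts) if plddts else None,
        "mean_ptm": statistics.mean(ptms) if ptms else None,
    }
```

Pointing this at each preset's output directory gives the count and mean columns directly.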
Where does this leave us for the next large run?
As soon as we figure out what causes the core dumps and avoid the issue, we are ready for a real run.
A test on 3,400 sequences of the Pseudodesulfovibrio proteome was done within 1 hour using 200 Summit nodes. A few long sequences (>1500 AAs) were done on high-memory nodes. Confident predictions cover 90% of the sequences, and ultra-confident predictions cover 69%.
A test on 24,738 sequences of Smagellanicum ran on 1k nodes for 2 hours. Each sequence has 5 DL models to run, totaling 123,690 target/model tasks; 52,452 (42%) were completed in this run. Since I ordered longer sequences to run first and the remaining shorter sequences cost far less time, I estimate that >63% of the workload, or even more, has been done. Time spent on GPU is about 79% of total run time, so overhead is not bad. 14 TB of data was generated; I need to purge some pickle files of low-ranked models. One issue: Dask reported only 2.4% of tasks done, significantly lower than the number actually completed.
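The ">63% of the workload" figure from longest-first ordering can be sanity-checked with a toy cost model. A sketch, assuming per-sequence cost scales with length; AlphaFold's real cost grows faster than linear in sequence length, so the linear estimate here is conservative:

```python
def workload_fraction_done(lengths, n_done, cost=lambda L: L):
    """Fraction of total workload completed when sequences are
    processed longest-first.

    `cost` is a per-sequence cost model; the linear default
    understates AlphaFold's superlinear scaling, so the returned
    fraction is a lower bound on real progress.
    """
    ordered = sorted(lengths, reverse=True)       # longest-first queue
    total = sum(cost(L) for L in ordered)
    done = sum(cost(L) for L in ordered[:n_done]) # completed prefix
    return done / total
```

With 42% of tasks done under a longest-first ordering, the completed fraction of total cost is necessarily above 42%, which is consistent with the >63% estimate.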
A quick update: Smagellanicum is almost done, other than 141 long sequences still waiting in the high-memory queue. When that calculation finishes, 99.7% (25,151/25,227) of the Smagellanicum proteome will have been modeled with AlphaFold. For most of the week the jobs were waiting in the queue; the actual runs took ~5 hrs on 200 regular nodes and ~10 hrs on 36 high-memory nodes.
The structure prediction of the Smagellanicum proteome by AlphaFold 2 has been completed. Here is a summary.
Total number of protein sequences modeled: 25,134. This covers 99.7% of the Smagellanicum proteome provided, including all sequences < 2500 AAs. The total number of amino acids modeled is 11,448,954.
Since this is a brand-new genome, not a single one of these sequences has a match in the current PDB, the portal for experimentally determined protein structures. I found only 37 sequences with PDB hits at sequence similarity > 90%, and 974 at > 70%. In other words, all models provided here are genuinely new, good or bad.
Model local quality: high-confidence coverage (pLDDT > 70): 58.2% in terms of AAs, 56.9% in terms of sequences; very-high-confidence coverage (pLDDT > 90): 36.0% in terms of AAs.
Model global quality: very high confidence (pTMS > 0.8): 20.1%; high confidence (pTMS > 0.6): 52.7%.
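A sketch of how the local-quality coverage numbers can be computed from per-residue pLDDTs. The per-sequence criterion used here (mean pLDDT over the sequence exceeding the threshold) is an assumption on my part:

```python
def confidence_coverage(per_residue_plddt, threshold=70.0):
    """Confidence coverage two ways, as in the summary above:
    per amino acid (fraction of all residues with pLDDT > threshold)
    and per sequence (fraction of sequences whose mean pLDDT exceeds
    the threshold -- the per-sequence criterion is assumed here).

    `per_residue_plddt` maps target id -> list of residue pLDDTs.
    """
    all_res = [p for scores in per_residue_plddt.values() for p in scores]
    aa_frac = sum(p > threshold for p in all_res) / len(all_res)
    seq_frac = sum(
        (sum(s) / len(s)) > threshold for s in per_residue_plddt.values()
    ) / len(per_residue_plddt)
    return aa_frac, seq_frac
```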
Mean number of recycles of the top model: 12.0 (versus 3 by default).
Run time: feature generation 7,367 hours (using 8 AMD EPYC 7302 cores each); model inference 16,064 hours (using 1 NVIDIA V100 each).
Actual costs are about 2,000 Andes node-hours and 3,400 Summit node-hours. This includes both overhead and some waste due to the experimental nature of this first large-scale application of its kind on Summit.
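A rough consistency check on these costs. The 6 V100s per Summit node is from the setup described earlier in this thread; 32 cores per Andes node is an assumption on my part:

```python
# Minimum node-hours if every GPU/core were busy the whole time;
# the gap to the billed figures is queueing overhead and waste.
gpu_hours = 16_064
summit_min = gpu_hours / 6               # 6 V100s per Summit node
cpu_core_hours = 7_367 * 8               # each feature task used 8 cores
andes_min = cpu_core_hours / 32          # 32 cores per Andes node (assumed)
summit_utilization = summit_min / 3_400  # vs. billed 3,400 node-hours
```

That puts Summit utilization close to 79%, in line with the GPU-time fraction reported for the earlier partial run.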
Note that in the PDB files, the B-factor column holds the pLDDT score, the confidence of the coordinate prediction for each residue. Its maximum value is 100; > 70 indicates good confidence.
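For downstream filtering, the per-residue pLDDTs can be read straight from that column. A minimal sketch using the fixed-width PDB ATOM record, with no external dependencies; AlphaFold writes the same pLDDT on every atom of a residue, so reading only the CA atom per residue is enough:

```python
def plddt_from_pdb(path):
    """Per-residue pLDDT from the B-factor column of an AlphaFold PDB
    file. Uses the fixed-width ATOM record: atom name in columns 13-16
    (0-indexed slice 12:16) and B-factor in columns 61-66 (slice 60:66).
    """
    scores = []
    with open(path) as fh:
        for line in fh:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                scores.append(float(line[60:66]))
    return scores
```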
Models are available on Summit here:
Top 1 models in PDB format
/gpfs/alpine/world-shared/bif135/species/smagellanicum/af2_output
(7.0GB)
Compressed tarball: /gpfs/alpine/world-shared/bif135/species/smagellanicum/af2_output.tar.gz
(1.6GB)
I have features for about 30,000 proteins ready for modeling.