FreshAirTonight closed this issue 2 years ago
To prepare for big runs, I did some test runs and collected statistics on a benchmark set of 559 Desulfovibrio proteins. All presets except casp14 ran on 36 Summit nodes, each with six 16 GB V100 GPUs.
Preset | Mean pLDDT | Mean pTMS | Count | Walltime (min) |
---|---|---|---|---|
reduced_dbs | 78.55129 | 0.6327694 | 559 | 41 |
genome | 79.54728 | 0.6439633 | 559 | 45 |
genome2 | 79.39952 | 0.6434832 | 559 | 62 |
super | 79.8062 | 0.6471785 | 558 | 76 |
economy | 78.54494 | 0.6335336 | 559 | 37 |
casp14 | 78.56151 | 0.6313684 | 551 | 150 (91 nodes) |
Notes:
Looks like `super` is the winner! Except it's not quite a fair comparison, since the largest proteins were left out of that run.
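For reference, per-preset statistics like the table above can be collected with a short script. A minimal sketch, assuming each target directory holds AlphaFold's `ranking_debug.json` (true for the monomer pipeline); the `"ptms"` field name is a guess on my part and may differ by pipeline version:

```python
import json
import statistics
from pathlib import Path

def summarize(run_dir):
    """Collect mean pLDDT / pTM over all targets in one preset's output.

    Assumes each target subdirectory contains a `ranking_debug.json`
    with the ranked model order and per-model pLDDTs; the pTM field
    name ("ptms") is an assumption and may be absent.
    """
    plddts, ptms = [], []
    for ranking in Path(run_dir).glob("*/ranking_debug.json"):
        data = json.loads(ranking.read_text())
        best = data["order"][0]                       # top-ranked model name
        plddts.append(data["plddts"][best])
        ptms.append(data.get("ptms", {}).get(best))   # hypothetical field
    ptms = [p for p in ptms if p is not None]
    return {
        "count": len(plddts),
        "mean_plddt": statistics.mean(plddts) if plddts else None,
        "mean_ptm": statistics.mean(ptms) if ptms else None,
    }
```

Pointing this at each preset's output directory gives the count and mean columns directly.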
Where does this leave us for the next large run?
As soon as we figure out what causes the core dumps and avoid the issue, we are ready for a real run.
A test on 3,400 sequences of the Pseudodesulfovibrio proteome was done within 1 hour using 200 Summit nodes. A few long sequences (>1500 AAs) were done on high-memory nodes. Confident predictions cover 90% of the sequences, and ultra-confident predictions cover 69%.
A test on 24,738 sequences of Smagellanicum ran on 1k nodes for 2 hours. Each sequence has 5 DL models to run, totaling 123,690 target/model tasks; 52,452 (42%) were completed in this run. Since I ordered longer sequences to run first and the remaining shorter sequences cost far less time, I estimate that >63% of the workload, or even more, has been done. Time spent on GPU is about 79% of total run time, so overhead is not bad. 14 TB of data was generated; I need to purge some pickle files of low-ranked models. One issue: Dask reported only 2.4% of tasks done, significantly lower than the number actually completed.
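The ">63% of the workload" figure from longest-first ordering can be sanity-checked with a toy cost model. A sketch, assuming per-sequence cost scales with length; AlphaFold's real cost grows faster than linear in sequence length, so the linear estimate here is conservative:

```python
def workload_fraction_done(lengths, n_done, cost=lambda L: L):
    """Fraction of total workload completed when sequences are
    processed longest-first.

    `cost` is a per-sequence cost model; the linear default
    understates AlphaFold's superlinear scaling, so the returned
    fraction is a lower bound on real progress.
    """
    ordered = sorted(lengths, reverse=True)       # longest-first queue
    total = sum(cost(L) for L in ordered)
    done = sum(cost(L) for L in ordered[:n_done]) # completed prefix
    return done / total
```

With 42% of tasks done under a longest-first ordering, the completed fraction of total cost is necessarily above 42%, which is consistent with the >63% estimate.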
A quick update: Smagellanicum is almost done, other than 141 long sequences still waiting in the high-memory queue. When that calculation finishes, 99.7% (25,151/25,227) of the Smagellanicum proteome will have been modeled with AlphaFold. For most of the week the jobs were waiting in the queue; the actual runs took ~5 hrs on 200 regular nodes and ~10 hrs on 36 high-memory nodes.
The structure prediction of the Smagellanicum proteome by AlphaFold 2 has been completed. Here is a summary.
Total number of protein sequences modeled: 25,134. This covers 99.7% of the Smagellanicum proteome provided, including all sequences < 2500 AAs. The total number of amino acids modeled is 11,448,954.
Since this is a brand-new genome, not a single one of these sequences has a match in the current PDB, the portal for experimentally determined protein structures. I found only 37 sequences with PDB hits at sequence similarity > 90%, and 974 at > 70%. In other words, all models provided here are genuinely new, good or bad.
Model local quality: high-confidence coverage (pLDDT > 70): 58.2% in terms of AAs, 56.9% in terms of sequences; very-high-confidence coverage (pLDDT > 90): 36.0% in terms of AAs.
Model global quality: very high confidence (pTMS > 0.8): 20.1%; high confidence (pTMS > 0.6): 52.7%.
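A sketch of how the local-quality coverage numbers can be computed from per-residue pLDDTs. The per-sequence criterion used here (mean pLDDT over the sequence exceeding the threshold) is an assumption on my part:

```python
def confidence_coverage(per_residue_plddt, threshold=70.0):
    """Confidence coverage two ways, as in the summary above:
    per amino acid (fraction of all residues with pLDDT > threshold)
    and per sequence (fraction of sequences whose mean pLDDT exceeds
    the threshold -- the per-sequence criterion is assumed here).

    `per_residue_plddt` maps target id -> list of residue pLDDTs.
    """
    all_res = [p for scores in per_residue_plddt.values() for p in scores]
    aa_frac = sum(p > threshold for p in all_res) / len(all_res)
    seq_frac = sum(
        (sum(s) / len(s)) > threshold for s in per_residue_plddt.values()
    ) / len(per_residue_plddt)
    return aa_frac, seq_frac
```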
Mean number of recycles of the top model: 12.0 (versus 3 by default).
Run time: feature generation 7,367 hours (using 8 AMD EPYC 7302 cores each); model inference 16,064 hours (using 1 NVIDIA V100 each).
Actual costs are about 2,000 Andes node-hours and 3,400 Summit node-hours. This includes both overhead and some waste due to the experimental nature of this first large-scale application of its kind on Summit.
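A rough consistency check on these costs. The 6 V100s per Summit node is from the setup described earlier in this thread; 32 cores per Andes node is an assumption on my part:

```python
# Minimum node-hours if every GPU/core were busy the whole time;
# the gap to the billed figures is queueing overhead and waste.
gpu_hours = 16_064
summit_min = gpu_hours / 6               # 6 V100s per Summit node
cpu_core_hours = 7_367 * 8               # each feature task used 8 cores
andes_min = cpu_core_hours / 32          # 32 cores per Andes node (assumed)
summit_utilization = summit_min / 3_400  # vs. billed 3,400 node-hours
```

That puts Summit utilization close to 79%, in line with the GPU-time fraction reported for the earlier partial run.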
Note that in the PDB files, the B-factor column holds the pLDDT score, the confidence of the coordinate prediction for each residue. Its maximum value is 100; > 70 indicates good confidence.
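For downstream filtering, the per-residue pLDDTs can be read straight from that column. A minimal sketch using the fixed-width PDB ATOM record, with no external dependencies; AlphaFold writes the same pLDDT on every atom of a residue, so reading only the CA atom per residue is enough:

```python
def plddt_from_pdb(path):
    """Per-residue pLDDT from the B-factor column of an AlphaFold PDB
    file. Uses the fixed-width ATOM record: atom name in columns 13-16
    (0-indexed slice 12:16) and B-factor in columns 61-66 (slice 60:66).
    """
    scores = []
    with open(path) as fh:
        for line in fh:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                scores.append(float(line[60:66]))
    return scores
```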
Models are available on Summit here:
Top 1 models in PDB format
/gpfs/alpine/world-shared/bif135/species/smagellanicum/af2_output
(7.0GB)
Compressed tarball: /gpfs/alpine/world-shared/bif135/species/smagellanicum/af2_output.tar.gz
(1.6GB)
I have features for about 30,000 proteins ready for modeling.