Closed danielparton closed 9 years ago
On Wed, Feb 11, 2015 at 3:30 PM, Daniel Parton notifications@github.com wrote:
The TK modeling is complete (up to the implicit-solvent MD stage), so let's discuss whether we should start modeling the remaining human kinases. These are the considerations for data usage and time taken.
Data usage: The model files (up to the implicit-solvent MD stage) for each TK target take up about 3 GB. This totals 270 GB for the 90 human TKs, and would total 1.5 TB for all 504 human kinases.
Time taken to complete modeling: It took about 17 days to complete the modeling (from alignment to implicit-solvent MD) for the 90 TKs. Assuming a similar rate, it would take about 80 days (~2.5 months) to complete the modeling for the remaining 414 kinases.
I think these considerations seem reasonable, and so it would be worth doing the modeling?
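As a quick check of the figures quoted above, the linear extrapolation can be sketched as follows. This is only an illustrative calculation, not part of Ensembler; the variable and function names are made up for the example, and it assumes storage and time scale linearly with the number of targets.

```python
# Figures quoted in the message above (all approximate).
GB_PER_TARGET = 3.0   # storage per TK target, up to implicit-solvent MD
TK_TARGETS = 90       # human TKs already modeled
ALL_KINASES = 504     # all human kinases
DAYS_FOR_TKS = 17.0   # wall-clock days taken for the 90 TKs


def linear_extrapolation(per_unit, n_units):
    """Scale a per-target cost to n_units targets (assumes linear scaling)."""
    return per_unit * n_units


remaining = ALL_KINASES - TK_TARGETS
storage_tk_gb = linear_extrapolation(GB_PER_TARGET, TK_TARGETS)
storage_all_tb = linear_extrapolation(GB_PER_TARGET, ALL_KINASES) / 1000.0
days_remaining = linear_extrapolation(DAYS_FOR_TKS / TK_TARGETS, remaining)

print(f"remaining kinases: {remaining}")
print(f"storage for TKs: {storage_tk_gb:.0f} GB")
print(f"storage for all kinases: {storage_all_tb:.1f} TB")
print(f"time for remaining kinases: {days_remaining:.0f} days")
```

The exact result for the remaining kinases is about 78 days, which the message rounds to 80.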
Thanks! Let's discuss with Team Kinase on Thu or Fri. I'll find a timeslot.
We should first look through some of the models to see if we spot any issues with this approach now.
For the computation, how many thread-days was this? I'm not sure how many threads you were using.
Also, do you have a breakdown of the timing and storage requirements by stage? Is the storage size uncompressed, or compressed?
J
John D. Chodera
Assistant Faculty Member, Computational Biology
Memorial Sloan-Kettering Cancer Center
email: john.chodera@choderalab.org (choderaj@mskcc.org)
office: 646.888.3400
fax: 510.280.3760
mobile: 415.867.7384
url: http://www.choderalab.org
Ok, sounds good. I'll try to prepare some statistics to aid this discussion. Things are a bit more complicated than thread-days, since some stages require GPUs and others don't. But I can put together some relevant statistics. Total wall-clock computation time will also depend on how busy the cluster is, of course.
Can do the same for storage size. Relatively large files, such as PDB files and Modeller restraints files, are stored compressed.
> Things are a bit more complicated than thread-days, since some stages require GPUs and others don't.
Maybe the stages could be broken down into thread-days and GPU-days?
> But I can put together some relevant statistics.
Thanks! This should probably go in the manuscript too.
> Total wall-clock computation time will also depend on how busy the cluster is, of course.
Of course!
504 human kinases
90 TKs
4433 kinase templates
Walltime per target with 1 process: 3.9 days
Walltime per target for 1 process with 1 GPU: 3.0 days
filename                  storage_single_model  storage_one_target  storage_90_targets  storage_504_targets
alignment.pir             736                   3M                  306M                1.7G
model.pdb.gz              38K                   157M                14G                 79G
sequence-identity.txt     6                     21K                 2M                  11M
restraints.rsr.gz         397K                  1.2G                108G                605G
modeling-log.yaml         102                   742K                67M                 374M
unique_by_clustering      0                     0                   0                   0
MODELING TOTAL            436K                  1.4G                123G                686G
-------------------------------------------------------------------------------------------------------
implicit-refined.pdb.gz   72K                   280M                25G                 141G
implicit-energies.txt     5K                    19M                 1.7G                9.6G
implicit-log.yaml         172                   785K                71M                 396M
REFINEMENT TOTAL          77K                   300M                27G                 151G
-------------------------------------------------------------------------------------------------------
TOTAL                     513K                  1.7G                149G                837G
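The column totals can be cross-checked by parsing the human-readable sizes and summing them. The sketch below does this for the storage_504_targets column; the helper name `parse_size` is hypothetical, and it assumes that the K/M/G suffixes are binary multiples (1024-based) and that bare numbers are bytes, neither of which is stated in the thread.

```python
# Assumption: suffixes are binary multiples; bare numbers are bytes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a size such as '38K' or '1.2G' to a number of bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


# storage_504_targets column, copied from the statistics above.
storage_504 = {
    "alignment.pir": "1.7G",
    "model.pdb.gz": "79G",
    "sequence-identity.txt": "11M",
    "restraints.rsr.gz": "605G",
    "modeling-log.yaml": "374M",
    "unique_by_clustering": "0",
    "implicit-refined.pdb.gz": "141G",
    "implicit-energies.txt": "9.6G",
    "implicit-log.yaml": "396M",
}

total_bytes = sum(parse_size(size) for size in storage_504.values())
total_g = total_bytes / UNITS["G"]
print(f"estimated total for 504 targets: {total_g:.0f}G")
```

Because each entry is already rounded, a sum computed this way can differ slightly from totals computed on the raw byte counts.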
Estimated timing for modeling remaining 414 kinases:
((3.9 * 414) / 128.) + ((3 * 414) / 60.) = 33.3 days
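The estimate above can be written out as a small function. This is a sketch under assumptions: it reads 128 as the number of CPU processes and 60 as the number of GPUs available, treats the 3.9 and 3.0 day figures as the per-target walltimes listed earlier, and assumes perfect parallel scaling with the CPU and GPU phases running one after the other. The function and parameter names are invented for illustration.

```python
def estimated_walltime_days(n_targets, cpu_days_per_target, gpu_days_per_target,
                            n_processes, n_gpus):
    """Idealized walltime: each phase is divided evenly over its workers."""
    cpu_phase = cpu_days_per_target * n_targets / n_processes
    gpu_phase = gpu_days_per_target * n_targets / n_gpus
    return cpu_phase + gpu_phase


estimate = estimated_walltime_days(
    n_targets=414,            # remaining human kinases
    cpu_days_per_target=3.9,  # walltime per target with 1 process
    gpu_days_per_target=3.0,  # walltime per target for 1 process with 1 GPU
    n_processes=128,          # assumed number of CPU processes
    n_gpus=60,                # assumed number of GPUs
)
print(f"{estimate:.1f} days")
```

In practice the wall-clock time will also depend on how busy the cluster is, so this is a lower bound rather than a prediction.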
This sounds great, by the way!
Is this running now?
Some new nodes have been added to the cluster that add many more threadslots.
I was running a series of chained jobs until yesterday - I think I was just reluctant to restart in light of the storage issues which were ongoing at the time, but which seem to be ok now. Anyway, this is a good opportunity to increase the number of cores (to 128).
On Fri, Mar 6, 2015 at 2:14 PM, John Chodera notifications@github.com wrote:
> Is this running now?
> Some new nodes have been added to the cluster that add many more threadslots.
— Reply to this email directly or view it on GitHub https://github.com/choderalab/ensembler-manuscripts/issues/4#issuecomment-77617650 .
Total usage is supposed to be <1TB, so you should be totally fine.
Given the difficulties we've been having with the cluster, do we want to postpone whole-kinome modeling to the Ensembler 2 paper with @pgrinaway?
I don't have a strong opinion - it's not difficult for me to keep the modeling running along. Fault tolerance in the code is better than it used to be, so cluster problems generally at worst mean resubmitting the job scripts.
On Mar 12, 2015 12:42 AM, "John Chodera" notifications@github.com wrote:
> Given the difficulties we've been having with the cluster, do we want to postpone whole-kinome modeling to the Ensembler 2 paper with @pgrinaway https://github.com/pgrinaway?
This was eventually postponed due to various problems with the cluster and with the Ensembler code. This can be left for the Ensembler 2 paper.