dmzuckerman / Sampling-Uncertainty

Best Practices article intended for LiveCoMS

Qualitative section - formerly 'quick and dirty' #23

Closed agrossfield closed 6 years ago

agrossfield commented 6 years ago

Moreover, if I understand correctly, it's literally the same thing as Lyman and Zuckerman cluster population analysis, with the exception that there they do it as a function of block size instead of just breaking it into 2 pieces.

agrossfield commented 6 years ago

Actually, I think the whole section should be merged with the global sampling assessment.

dmzuckerman commented 6 years ago

Hey, @drroe, would you please see what you think about @agrossfield's suggestions?

Please check out my 2006 BJ paper with Ed Lyman which does seem similar to the 'combined clustering' analysis.

As to whether this is 'quick and dirty' or a quantification of global sampling, I think I agree with @drroe that this is quick-and-dirty. After all, all the methods in quick-and-dirty are semi-global. The distinction is whether they're easy and can rule out good sampling ... and don't really quantify the sampling if it is good. So I would favor leaving it in place but updating as needed.

drroe commented 6 years ago

I've been traveling a lot lately (currently at SC17); sorry for the delay in responding here.

> Moreover, if I understand correctly, it's literally the same thing as Lyman and Zuckerman cluster population analysis

It is similar, but it is certainly not the same. I've only had a chance to skim the 2006 paper in question, but from what I gather the procedure there is to generate a set of reference structures from a single trajectory using what appears to be a density-based method (similar in spirit to the DBSCAN clustering algorithm but without the minimum points requirement), then use that set of reference structures to assign populations using the same density-based method (normalized by the number of frames in the trajectory). These populations can then be directly compared to ascertain convergence.
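For readers following along, here is a loose sketch of the reference-structure procedure as paraphrased in the comment above. It is not taken from the 2006 paper itself. Assumptions: frames are plain feature vectors compared by Euclidean distance (a real analysis would more likely use RMSD), the toy trajectory and the `cutoff` value are made up, and the function names are hypothetical.

```python
import numpy as np

def pick_reference_structures(frames, cutoff, rng):
    """Pick references until every frame lies within `cutoff` of some reference."""
    remaining = np.arange(len(frames))
    refs = []
    while remaining.size > 0:
        pick = rng.choice(remaining)          # random unclassified frame
        refs.append(pick)
        dist = np.linalg.norm(frames[remaining] - frames[pick], axis=1)
        remaining = remaining[dist > cutoff]  # drop everything close to the pick
    return frames[np.array(refs)]

def populations(frames, refs):
    """Fraction of frames whose nearest reference is each reference structure."""
    dist = np.linalg.norm(frames[:, None, :] - refs[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    return np.bincount(nearest, minlength=len(refs)) / len(frames)

rng = np.random.default_rng(0)
# Toy "trajectory" (an assumption): the first half sits in one basin,
# the second half in another, so the two halves should disagree strongly.
first = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(500, 2))
second = rng.normal(loc=[3.0, 0.0], scale=0.3, size=(500, 2))
traj = np.vstack([first, second])

refs = pick_reference_structures(traj, cutoff=1.0, rng=rng)
pop_first = populations(traj[:500], refs)
pop_second = populations(traj[500:], refs)

# Half the summed absolute difference: 0 means identical, 1 means disjoint.
tv_distance = 0.5 * np.abs(pop_first - pop_second).sum()
```

Here the two halves occupy different basins, so `tv_distance` comes out close to 1, which is the kind of disagreement that rules out good sampling.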

The combined clustering method uses all trajectories to be analyzed up front (also note that although the example is two trajectories, it is generalizable to N trajectories; see e.g. here where we compared 10 trajectories). So the resulting cluster representatives/centroids (somewhat analogous to the "reference structures" from the Lyman/Zuckerman paper) are determined from the entire ensemble of structures, not just a reference trajectory. This can be particularly useful if the trajectories are sampling very different regions of phase space. Also, the combined-clustering method can use any available clustering algorithm (e.g. hierarchical agglomerative, K-means, DBSCAN, etc.) as well as any distance metric. The fact that you're using a clustering algorithm also means you can use traditional clustering health metrics as a further assessment of the quality of the sampling, which is something I didn't really get into.
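A minimal sketch of the combined-clustering idea described above: pool all trajectories, cluster once, then compare each trajectory's share of every cluster. Assumptions: a plain NumPy k-means stands in for whichever clustering algorithm one prefers, frames are 2-D toy points rather than molecular coordinates, and all function names are hypothetical.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center for every point."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; a stand-in for any clustering algorithm."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:      # leave an empty cluster's center in place
                centers[j] = members.mean(axis=0)
    return nearest_center(points, centers), centers

def combined_cluster_populations(trajs, k):
    """Cluster the pooled frames, then report per-trajectory cluster fractions."""
    combined = np.vstack(trajs)
    labels, _ = kmeans(combined, k)
    traj_id = np.repeat(np.arange(len(trajs)), [len(t) for t in trajs])
    pops = np.zeros((len(trajs), k))
    for i, traj in enumerate(trajs):
        pops[i] = np.bincount(labels[traj_id == i], minlength=k) / len(traj)
    return pops

rng = np.random.default_rng(42)

def toy_traj(n_a, n_b):
    """Toy trajectory visiting two well-separated basins n_a and n_b times."""
    a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n_a, 2))
    b = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(n_b, 2))
    return np.vstack([a, b])

traj_1 = toy_traj(500, 500)   # visits the basins 50/50
traj_2 = toy_traj(900, 100)   # visits the basins 90/10
pops = combined_cluster_populations([traj_1, traj_2], k=2)
max_deviation = np.abs(pops[0] - pops[1]).max()
```

Because the clusters are defined on the pooled data, the rows of `pops` are directly comparable, and the same function accepts any number of trajectories.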

So I think that the spirit of the two methods is definitely similar (compare different trajectories or parts of a trajectory to a well-defined reference), but the actual details are quite different. Could be interesting to do a side-by-side comparison of the two but that's a different paper. I'd be happy to discuss further via phone/skype once I'm back in MD if anyone is interested.

dmzuckerman commented 6 years ago

@drroe - thanks for the explanation. I suggest you just put in as much (i.e., as little) commentary as needed: something like "this approach extends the work of ..." or whatever.

Another thing that @agrossfield and I discussed offline is that some of the analyses we (collectively) are putting in the paper are good (or great!) ideas but may not be true 'best practices' in the sense of having community acceptance. Because of this, I put a disclaimer to that effect at the beginning of the global section. You and Alan may want to consider similar language for the 'quick-and-dirty' section (which I think is going to get renamed to 'qualitative analysis' or similar) or subsections thereof as needed.

Of course, we hope that approaches we include in this article will become accepted best practices, but I think we should be transparent about the current status of methods.

dmzuckerman commented 6 years ago

@drroe I was just reading through Sec 4.3, now called 'Assessing Convergence', which opens with, "Convergence in the context of biomolecular simulations typically refers to the overlap of two independent measurements of the same property."

I know what you mean but I don't think this is quite what we want to say because, for example, two 1 ps simulations started from the same highly atypical configuration will both be similar, but the agreement between them wouldn't indicate convergence. Of course, later you go on to say that one should really compare runs from significantly differing start states, which is a key point for the type of comparison you propose.
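To make the point concrete, here is a toy illustration. It is an assumption for this thread, not something from the article: a two-state model with slow transitions, where the expected state populations are propagated exactly rather than sampled stochastically.

```python
import numpy as np

# Hypothetical two-state model with slow transitions.
# Rows sum to 1; the stationary distribution is (0.5, 0.5).
T = np.array([[0.999, 0.001],
              [0.001, 0.999]])

def state_probabilities(p0, n_steps):
    """Expected state populations after n_steps, starting from distribution p0."""
    return np.asarray(p0, dtype=float) @ np.linalg.matrix_power(T, n_steps)

short = 10  # far shorter than the ~1000-step transition time
run_a = state_probabilities([1.0, 0.0], short)  # starts in state 0
run_b = state_probabilities([1.0, 0.0], short)  # same atypical start
run_c = state_probabilities([0.0, 1.0], short)  # different start state

same_start_gap = np.abs(run_a - run_b).max()      # agreement between A and B
diff_start_gap = np.abs(run_a - run_c).max()      # disagreement between A and C
error_vs_equilibrium = np.abs(run_a - 0.5).max()  # distance from the true answer
```

Runs A and B agree perfectly yet sit far from the equilibrium populations. Only the comparison against run C, started elsewhere, exposes the problem.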

That said, I think the real point of this subsection is 'Comparing multiple trajectories' and the subsection should be titled that way or similar - which also is consistent with how we wrote the checklist. As you know, the general issue of 'assessing convergence' in an overall sense is extremely challenging and the phrase probably should be avoided in a section which is only intended to provide necessary-but-not-sufficient checks on sampling.

Another question: Is there any reason why the combined clustering approach couldn't be applied to two halves of a single long trajectory? If so, perhaps a comment to that effect would be helpful. If not, let me know, and I'll want to revise a sentence I added to the checklist.

As a minor point, is there anything readers should know about uncertainty in cluster populations? When are they similar enough, since they never will match exactly? Perhaps you can just mention the issue and refer to your paper if appropriate.

Would you mind revising the subsection with these points in mind? Thanks a lot.

@agrossfield let us know if you have other suggestions for Sec. 4.3

dmzuckerman commented 6 years ago

I guess @agrossfield doesn't have further comments here. @drroe please go ahead and revise when you can. Thank you.

drroe commented 6 years ago

Sorry for the delay on this.

> I know what you mean but I don't think this is quite what we want to say because, for example, two 1 ps simulations started from the same highly atypical configuration will both be similar but the agreement between them wouldn't indicate convergence.

What I really should say here is we want convergence of each simulation to the "right" answer (technically in the scenario you describe the two simulations are converged with respect to each other, just not to that "right" answer). So I will clarify that we want multiple sims from different starting points to head to that "right" answer.

> Is there any reason why the combined clustering approach couldn't be applied to two halves of a single long trajectory?

It absolutely can, it's just recommended to have independent trajectories to avoid being too strongly biased by your initial conditions. I can add a comment about this.

> As a minor point, is there anything readers should know about uncertainty in cluster populations? When are they similar enough, since they never will match exactly?

A good question. My personal feeling is that once your cluster populations deviate by less than 5% you're in reasonable shape, and once you're in the 1-2% range that's pretty well converged for MD sims. In terms of population that's on the order of 0.1 kcal/mol free energy difference. However, we never formalized this so I'm not sure if it's a good idea to recommend that as a "best practice". Any thoughts?
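As a rough check on the free-energy figure, one can convert a population mismatch into an apparent free-energy shift with ΔG = kT ln(p_a/p_b). The sketch below assumes T = 300 K and reads "5%" as an absolute difference in population, which the comment does not specify. The example populations are made up.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def free_energy_shift(p_a, p_b, temperature=300.0):
    """Apparent free-energy difference |kT ln(p_a / p_b)| in kcal/mol
    between two estimates of the same cluster's population."""
    return abs(K_B * temperature * math.log(p_a / p_b))

# The same cluster estimated from two runs, 5 percentage points apart.
shift_half = free_energy_shift(0.50, 0.45)     # about 0.06 kcal/mol
shift_quarter = free_energy_shift(0.30, 0.25)  # about 0.11 kcal/mol
```

The shift depends on the population itself: the same absolute gap corresponds to a larger free-energy difference for a less populated cluster.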

agrossfield commented 6 years ago

> As a minor point, is there anything readers should know about uncertainty in cluster populations? When are they similar enough, since they never will match exactly?

> A good question. My personal feeling is that once your cluster populations deviate by less than 5% you're in reasonable shape, and once you're in the 1-2% range that's pretty well converged for MD sims. In terms of population that's on the order of 0.1 kcal/mol free energy difference. However, we never formalized this so I'm not sure if it's a good idea to recommend that as a "best practice". Any thoughts?

I’d avoid mentioning a specific threshold, since you could always meet it by increasing the number of bins (if no bin goes above 1%, you’re not going to miss the bin percentage by much).
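The binning argument can be checked numerically. This sketch compares two fixed, genuinely different distributions, which are an assumption chosen for illustration: unit-variance Gaussians with means 0 and 0.5. It computes the largest per-bin population difference exactly from the normal CDF as the number of bins grows.

```python
import math

def normal_cdf(x, mean):
    """CDF of a unit-variance normal distribution centered at `mean`."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def max_bin_deviation(n_bins, lo=-6.0, hi=6.0, shift=0.5):
    """Largest absolute per-bin population difference between N(0,1) and N(shift,1)."""
    width = (hi - lo) / n_bins
    worst = 0.0
    for i in range(n_bins):
        a, b = lo + i * width, lo + (i + 1) * width
        p = normal_cdf(b, 0.0) - normal_cdf(a, 0.0)      # bin population, run 1
        q = normal_cdf(b, shift) - normal_cdf(a, shift)  # bin population, run 2
        worst = max(worst, abs(p - q))
    return worst

# Same pair of distributions every time; only the bin count changes.
deviations = {n: max_bin_deviation(n) for n in (2, 10, 100, 1000)}
```

The mismatch between the two distributions never changes, yet the worst per-bin deviation drops below 1% once the bins are fine enough. That is why a fixed percentage threshold on its own says little.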

Alan


Dr. Alan Grossfield Associate Professor Department of Biochemistry and Biophysics University of Rochester Medical Center 610 Elmwood Ave, Box 712 Rochester, NY 14642 Phone: 585 276 4193 http://membrane.urmc.rochester.edu

dmzuckerman commented 6 years ago

Hi @drroe - just a reminder on getting this cleaned up. Will you be able to get to it this week?

drroe commented 6 years ago

Done. Let me know if more expansion or clarification is needed.

dmzuckerman commented 6 years ago

Thanks! @drroe . I'll try to look soon.

dmzuckerman commented 6 years ago

@mangiapasta check out my changes to quick-and-dirty (aka qualitative) section.

@drroe would you mind adding a ref for clustering where I've indicated?

dmzuckerman commented 6 years ago

@drroe some quite minor things for you in the clustering section (4.4)

drroe commented 6 years ago

Took a crack at adding equation for RMSD - my latex math formatting is a bit rusty though...

drroe commented 6 years ago

Added brief explanation for clustering (and reference), and additional reference for combined clustering.

dmzuckerman commented 6 years ago

Thanks a lot for all the edits @drroe !

dmzuckerman commented 6 years ago

@mangiapasta previously I wrote a comment questioning this sentence from Sec. 4: "Importantly, this technique [block autocorrelation analysis] can help to distinguish long-timescale trends in otherwise equilibrated data from truly non-equilibrated systems." To clarify, I think it's confusing to imply there is a difference between a 'truly' non-equil system and a system with long-timescale trends. Equilibrium is uniquely defined and anything that's not equil is non-equil. Thus, unless you object, I am changing the sentence to, "Importantly, this technique can help to distinguish long-timescale trends in apparently equilibrated data."

mangiapasta commented 6 years ago

(Not sure who to address this to, perhaps @dmzuckerman ?)

I updated Eq. 9. Previously there was no argument associated with the \min operator, although the argument was discussed briefly in the paragraph following the equation. But the text still left me a bit confused as to how I would actually do the minimization. So I introduced a rotation matrix and translation vector into Eq. 9 and then define them in the paragraph below. I find the revised equation to be more precise. Is my revision consistent with what the original authors intended?
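The revised Eq. 9 itself is not reproduced in this thread. One plausible form consistent with the description above is sketched here. The symbol names are assumptions: r and s are the two configurations, N is the number of atoms, R is the rotation matrix and d is the translation vector.

```latex
\mathrm{RMSD}(\mathbf{r}, \mathbf{s})
  = \min_{\mathbf{R},\,\mathbf{d}}
    \sqrt{\frac{1}{N} \sum_{i=1}^{N}
      \left\lVert \mathbf{r}_i - \left( \mathbf{R}\,\mathbf{s}_i + \mathbf{d} \right) \right\rVert^{2}}
```

Written this way, the arguments of the minimization are explicit: the minimum is taken over all proper rotations R and all translations d applied to s.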

dwsideriusNIST commented 6 years ago

@mangiapasta, I couldn't compile the paper with your revision, so I made a slight fix to the equation and associated text (basically, making the "d" vector a bolded, nonitalic symbol). Please confirm that my change is what you intended.

dmzuckerman commented 6 years ago

What @mangiapasta did looks correct to me, even though our community is usually much sloppier about the notation. @agrossfield should we leave it as is or do you have another idea?

agrossfield commented 6 years ago

Not to be a pain in the ass, but I think it’s unnecessarily confusing, particularly since alignment is not done by minimization — the Kabsch algorithm is an “analytical” approach that finds the best translation-rotation matrix in one shot.

I would just give the equation without the “min” prefix and say

“…where r and s are the coordinates of two distinct configurations. In general, the two configurations are first optimally aligned (cite Kabsch), so that the RMSD is the minimum “distance” between the molecules. It is not uncommon to use only a subset of the atoms (e.g. protein backbone, only secondary structure elements) when computing the RMSD, in order to filter the higher-frequency fluctuations.”

Here’s the relevant paper: http://scripts.iucr.org/cgi-bin/paper?S0567739476001873
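For concreteness, here is a minimal NumPy sketch of RMSD after optimal superposition in the Kabsch style, where the rotation comes from one singular value decomposition rather than an iterative search. The function name and the toy coordinates are assumptions for illustration.

```python
import numpy as np

def kabsch_rmsd(r, s):
    """RMSD between (N, 3) coordinate arrays after optimally superimposing s onto r."""
    r_c = r - r.mean(axis=0)   # removing centroids handles the translation
    s_c = s - s.mean(axis=0)
    h = s_c.T @ r_c            # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Flip one axis if needed so the result is a proper rotation (det = +1).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    s_aligned = s_c @ rot.T
    return np.sqrt(((r_c - s_aligned) ** 2).sum(axis=1).mean())

# Toy check: s is r rotated by 40 degrees about z and then translated.
rng = np.random.default_rng(1)
r = rng.normal(size=(20, 3))
theta = np.deg2rad(40.0)
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
s = r @ rot_z.T + np.array([1.0, -2.0, 0.5])

rmsd_aligned = kabsch_rmsd(r, s)                        # essentially zero
rmsd_raw = np.sqrt(((r - s) ** 2).sum(axis=1).mean())   # no alignment applied
```

Restricting `r` and `s` to a subset of atoms (for example, the backbone) before the call gives the filtered RMSD mentioned in the suggested wording.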

Sorry I’ve been so checked out — I just sent in one grant (NSF), and am completely slammed writing the next one (NIH). I’ve basically done nothing but write grants for the last month, and I won’t get to do anything else until June. Remind me why I went into academia again?

Alan

On Apr 27, 2018, at 4:27 PM, dmzuckerman wrote:

> What @mangiapasta did looks correct to me, even though our community is usually much sloppier about the notation. @agrossfield should we leave it as is or do you have another idea?


mangiapasta commented 6 years ago

I'm fine with either approach (including min with its arguments or dropping min altogether), but the statement needs to be self-consistent and unambiguous. \min without arguments is not meaningful in this context.



dmzuckerman commented 6 years ago

I updated to follow Alan's wording with a couple of clarifications.