choderalab / yank

An open, extensible Python framework for GPU-accelerated alchemical free energy calculations.
http://getyank.org
MIT License

Better handling of corrupted chunks #977

Open Lnaden opened 6 years ago

Lnaden commented 6 years ago

When a simulation gets interrupted, some chunks can be corrupted.

We need to better handle corrupted chunks on resume by going progressively further back.

For analysis, we need to implement a way to detect bad chunks and only analyze on the good chunks.

Lnaden commented 6 years ago

Here are the symptoms we have observed:

  1. Corruption of data happens when a simulation is ungracefully interrupted, e.g. cluster job times out
  2. Not all simulations get their data corrupted
  3. Only the positions variable in the analysis file is corrupted; this is the analysis_only_particles positions
  4. Only the last chunk of the data is corrupted. "Chunk" here refers to HDF5 chunking (see the short sketch after this list)
  5. We only seem to observe corruption in protein/ligand binding simulations, not host-guest or hydration (thank you @andrrizzi for confirming)
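For concreteness on what a "chunk" means here, this is a minimal netCDF4 sketch of a chunked, iteration-indexed positions variable; the dimension names and sizes are illustrative, not the actual reporter layout:

import netCDF4 as nc

ncfile = nc.Dataset('analysis.nc', 'w')
ncfile.createDimension('iteration', None)   # unlimited, grows with every write
ncfile.createDimension('replica', 4)
ncfile.createDimension('atom', 2000)
ncfile.createDimension('spatial', 3)

# HDF5 stores this variable on disk in fixed-size chunks. An interrupted
# write can leave the chunk currently being written unreadable, which takes
# out every iteration that chunk spans (Symptom 4), so a resume has to back
# up to the last iteration that lives in a fully written chunk.
positions = ncfile.createVariable(
    'positions', 'f4', ('iteration', 'replica', 'atom', 'spatial'),
    chunksizes=(1, 4, 2000, 3))
ncfile.close()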

These symptoms are misleading though, since Symptom 1 is what produces Symptoms 2 and 3.

Symptom 2 is observed because, if the interruption from Symptom 1 happens during propagation, mixing, or any other time when no MultiStateReporter.write_* action is in progress, no corruption occurs.

Now let's make some assumptions:

Whatever variable is being written at the moment of the halt is the one that gets corrupted. The analysis_only_particles positions array is often the 2nd largest variable we store, with shape [N_analysis_particles, 3]; the largest is the full checkpoint positions. The remaining variables are all relatively small, assuming a protein/ligand binding simulation.

We only observe corruption of the analysis_only_particles positions because they are written far more often than the full checkpoint, so the likelihood that the halt lands exactly on a checkpoint interval is much smaller than on any other iteration. So we can assume the halt happens on a non-checkpoint interval. Because of its size, the analysis_only_particles positions array is the most time-consuming variable to write, so it is the most likely variable to be corrupted. All of this is why we observe Symptom 3 and why no other variable appears to corrupt: it's a matter of how much time is spent writing each variable. I believe that if we saw this often enough, eventually another variable would be corrupted, but with low probability.

Symptom 4 is a function of how HDF5 handles its on-disk layout: a corrupted entry corrupts the whole chunk, which is why we can back up to the previous chunk to get good data.

Symptom 5 comes down to the relative size of the positions variable. The protein/ligand binding simulations have the largest one of these variables because often N_analysis_particles * 3 >> {n_states, n_replicas, n_states * n_replicas}. However, if we relax the assumption that this is a protein/ligand simulation to include solvation or even host-guest systems, N_analysis_particles * 3 can be on the same order as n_states * n_replicas or even n_states (e.g. alanine dipeptide hydration). I think the reason we don't see the corruption in host-guest or hydration simulations is that the individual Reporter.write_* steps are much faster, so the simulation spends relatively more time propagating than writing, meaning the halts are far more likely to happen in the propagate step, avoiding the corruption. However, it could still happen.
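To put rough numbers on that size argument (the particle, replica, and state counts below are illustrative guesses, not measured from real systems):

# Per-iteration element counts for the analysis positions vs. the
# replica/state energy matrix; all counts here are illustrative.
def per_iteration_elements(n_analysis_particles, n_replicas, n_states):
    positions = n_replicas * n_analysis_particles * 3   # [replica, atom, xyz]
    energies = n_replicas * n_states                     # [replica, state]
    return positions, energies

# Protein/ligand binding: the positions write dwarfs everything else.
print(per_iteration_elements(n_analysis_particles=5000, n_replicas=40, n_states=40))
# (600000, 1600)

# Small-solute hydration: positions are on the same order as the energies.
print(per_iteration_elements(n_analysis_particles=20, n_replicas=40, n_states=40))
# (2400, 1600)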

Now then, with a full understanding of why and when this happens, here is where we would need to fix the problem.

I think this is just finding a way to generalize what I did in #972. My thought is to make Reporter.read_last_iteration() actually do a corruption check as well on first call, then cache the result until that variable is written again or the reporter is closed. Something like this:

# Sketch only: the helper names below are placeholders for the real reporter internals.
def read_last_iteration(self, last_checkpoint=True):
    # Return the cached value if we have already validated one.
    if self._cached_last_iter is not None:
        return self._cached_last_iter
    last_iter = self._read_last_iteration_from_file(on_checkpoint=last_checkpoint)
    # Corruption check: try to read every stored variable at that iteration.
    try:
        for variable in self._storage_variables():
            self._read_variable_at_iteration(variable, last_iter)
        is_corrupt = False
    except HDFError:  # placeholder for whatever netCDF4/HDF5 raises on a bad chunk
        is_corrupt = True
    if is_corrupt:
        # Fall back to the last checkpoint before the corrupted chunk.
        last_iter = self._last_good_checkpoint_iteration()
    self._cached_last_iter = last_iter
    return last_iter

def write_last_iteration(self, iteration):
    # Any new write invalidates the cache so the next read re-validates from disk.
    self._cached_last_iter = None
    self._do_current_stuff_to_write_last_iter(iteration)

This would slow down the function a bit, but I think it might be the most stable answer, unless we add a secondary function called something like Reporter.read_last_noncorrupted_iteration. That seems redundant, though, unless we think the speed hit from my proposed changes to read_last_iteration will be really bad.

Do you think this is the correct diagnosis of the problem and the right set of fixes, @andrrizzi and @jchodera? Also cc #983 since I think this will fully resolve it as well, although it looks like #972 already did.

Lnaden commented 6 years ago

tl;dr version: We can't stop the corruption, but we can check for it and truncate our results.

And apologies for the wall of text. I've been thinking about this for a while now, and because it's been such a pain to pin down and reliably reproduce, this is the conclusion about why it happens that makes the most logical sense to me. I want to avoid putting in a "fix" and then having it crop up again because we did not really fix it.

Lnaden commented 6 years ago

I also think the reason we never saw this with the repex simulations is that propagation was the long step, we did not have a large number of replicas, and the chunk size was 1.

andrrizzi commented 6 years ago

Sounds like it's worth trying to me!

I read very quickly so apologies in advance if I missed something.

If we're giving a notion of corruption to the Reporter, we might also merge this modification with what you did in #972 and let read_last_iteration(..., last_healthy_iteration=True) try previous iterations until it finds the right one, to reduce the overall code complexity. We could use the same thing when resuming the samplers or during analysis.
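Roughly what I have in mind, as a sketch (the private helpers and checkpoint bookkeeping below are placeholders, not the current Reporter API):

def read_last_iteration(self, last_checkpoint=True, last_healthy_iteration=True):
    # Walk backwards one checkpoint interval at a time until every
    # iteration-indexed variable at that iteration reads cleanly.
    iteration = self._read_last_iteration_from_file(on_checkpoint=last_checkpoint)
    if not last_healthy_iteration:
        return iteration
    while iteration >= 0:
        try:
            for name in self._iteration_variables():
                self._read_variable_at_iteration(name, iteration)
            return iteration          # everything read cleanly: healthy
        except Exception:             # e.g. an HDF5 error from a corrupted chunk
            iteration -= self._checkpoint_interval
    raise RuntimeError('No healthy iteration found in the storage file.')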

For the cached value, I'd say let's use it only if we actually read it multiple times during the simulation and/or if it is an actual bottleneck, otherwise it may not be worth taking the risk.

Lnaden commented 6 years ago

let read_last_iteration(..., last_healthy_iteration=True) try previous iterations until it finds the right one, to reduce the overall code complexity

My plan was to do something like this, but obscure the concept of corruption from the user. When the user wants the last iteration, I'm assuming they want the last iteration where everything is readable. I can expose that as a kwarg I suppose if you can think of a non-developer reason to do so.

Although thinking about it, it may be good to have so people can access the written, non-corrupted data as well.

For the cached value, I'd say let's use it only if we actually read it multiple times during the simulation and/or if it is an actual bottleneck, otherwise it may not be worth taking the risk.

My thought was to invalidate the cache any time write_last_iteration is called, open_mode is set, or close() is called, but that may be overkill. The time sink I worry about is that checking every single variable written at the last iteration for corruption may be slow, but it might also not be so bad. It may even be possible to automate the sweep by looking for any NetCDF variables which rely on the iteration dimension and checking iterations in the order [-1, last_checkpoint, last_checkpoint - chunk_interval, last_checkpoint - 2 * chunk_interval, ...] so we don't have to worry about missing variables.
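A rough sketch of what that automated sweep could look like (the file layout, dimension name, and function/argument names are assumptions for illustration, not the actual reporter storage):

import netCDF4 as nc

def find_last_readable_iteration(path, last_checkpoint, chunk_interval):
    # Candidate iterations, newest first: the very last write, then the last
    # checkpoint, then earlier checkpoints one chunk interval at a time.
    candidates = [-1, last_checkpoint]
    candidates += [last_checkpoint - k * chunk_interval
                   for k in range(1, last_checkpoint // chunk_interval + 1)]
    with nc.Dataset(path, 'r') as ds:
        # Automatically discover every variable that depends on the
        # 'iteration' dimension, so nothing gets missed by hand.
        iteration_variables = [v for v in ds.variables.values()
                               if 'iteration' in v.dimensions]
        for iteration in candidates:
            try:
                for variable in iteration_variables:
                    variable[iteration]   # force a read of that iteration's slice
                return iteration          # every variable read cleanly
            except Exception:             # corrupted chunk: try an older iteration
                continue
    return None                           # nothing readable was found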