Error handling - Githubissues

proteneer commented 10 years ago

I'd like some feedback on how to handle errors when a core encounters a NaN, insane forces, or a major force discrepancy with the Reference platform. These are all symptoms of one or more of the following:

1) Hardware overclocks
2) Driver bugs
3) Bad system topologies/force fields
4) Bugs in OpenMM

For reference, the State check code is here (I'd love some more feedback on whether the checks are reasonable):

https://github.com/proteneer/backend/blob/master/core/openmm_core/StateTests.h
https://github.com/proteneer/backend/blob/master/core/openmm_core/StateTests.cpp
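To make the kinds of checks under discussion concrete, here is a minimal sketch of a NaN and force-magnitude test. It is not the code in StateTests.h or StateTests.cpp; the names `Vec3`, `StateCheck`, and `check_forces`, and the `max_force` limit, are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

// Hypothetical result codes; not taken from StateTests.h.
enum class StateCheck { Ok, NotFinite, ForceTooLarge };

// Returns the first problem found in a list of per-atom forces.
// max_force is an assumed, caller-chosen limit (e.g. in kJ/mol/nm).
inline StateCheck check_forces(const std::vector<Vec3>& forces, double max_force) {
    assert(max_force > 0.0);
    for (const Vec3& f : forces) {
        double norm2 = 0.0;
        for (double c : f) {
            // Catches NaN as well as +/- infinity.
            if (!std::isfinite(c)) return StateCheck::NotFinite;
            norm2 += c * c;
        }
        // Flags any single atom whose force magnitude exceeds the limit.
        if (std::sqrt(norm2) > max_force) return StateCheck::ForceTooLarge;
    }
    return StateCheck::Ok;
}
```

A core could run a check like this on each state it is about to checkpoint and report anything other than `StateCheck::Ok` as a stream error.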

Option 1, currently implemented: When a stream dies, the error_count for the stream is increased by 1. When a single stream has had more than 10 errors, the stream is stopped and needs to be manually restarted (which resets error_count to 0).

Option 2: same as above, but reset the error_count to zero when a checkpoint is successfully sent (indicating the stream has recovered).

Option 3: implement a fancier rollback system that falls back to older checkpoints.
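The difference between Options 1 and 2 can be sketched as a small per-stream counter. This is an illustration only, not the backend's implementation: the struct `StreamErrors` and its member names are hypothetical, and only `error_count` and the threshold of 10 come from the description above.

```cpp
#include <cassert>

// Hypothetical per-stream bookkeeping; names are illustrative only.
struct StreamErrors {
    int error_count = 0;
    int max_errors;            // 10 in the description above
    bool stopped = false;
    bool reset_on_checkpoint;  // false: Option 1, true: Option 2

    explicit StreamErrors(bool reset, int max = 10)
        : max_errors(max), reset_on_checkpoint(reset) {
        assert(max_errors >= 0);
    }

    // Called when the stream dies; stops it once the count exceeds max_errors.
    void on_failure() {
        ++error_count;
        if (error_count > max_errors) stopped = true;
    }

    // Called when a checkpoint has been sent successfully.
    void on_checkpoint() {
        if (reset_on_checkpoint && !stopped) error_count = 0;
    }

    // A manual restart resets the count under both options.
    void restart() {
        error_count = 0;
        stopped = false;
    }
};
```

Under Option 1 a stream that fails occasionally over a long run will eventually stop, while under Option 2 only more than 10 failures without an intervening checkpoint stop it.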

rmcgibbo commented 10 years ago

These seem like reasonable things to do. What about having the ability for project managers to register some kind of callback / hook on exceptional events like this? If the manager could configure an email notification or something, that might be good, especially if the root cause is bad system / topologies.

jchodera commented 10 years ago

I like @rmcgibbo's suggestion about allowing user-definable handlers. Together with a sensible default (maybe with a user-tunable error_count for the default handler), this would be very flexible.
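One possible shape for such handlers, sketched under assumptions: the types `StreamEvent` and `ErrorHooks` and their members are hypothetical and do not correspond to any existing backend API. User handlers (an email notifier, say) run first, and a default policy with a tunable threshold then decides whether the stream should stop.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event description; the fields are illustrative.
struct StreamEvent {
    std::string stream_id;
    std::string reason;  // e.g. "NaN" or "force discrepancy"
    int error_count;
};

using Handler = std::function<void(const StreamEvent&)>;

class ErrorHooks {
public:
    // max_errors is the user-tunable threshold of the default policy.
    explicit ErrorHooks(int max_errors = 10) : max_errors_(max_errors) {
        assert(max_errors_ >= 0);
    }

    // Registers a user-defined callback, e.g. one that sends a notification.
    void add_handler(Handler h) { handlers_.push_back(std::move(h)); }

    // Runs every registered handler, then applies the default policy.
    // Returns true if the stream should be stopped.
    bool report(const StreamEvent& ev) const {
        for (const Handler& h : handlers_) h(ev);
        return ev.error_count > max_errors_;
    }

private:
    int max_errors_;
    std::vector<Handler> handlers_;
};
```

A project manager would call `add_handler` once at setup, and the server would call `report` each time a stream fails.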

Some other scenarios to think through:

What if something is screwed up and all streams associated with a project will fail?
Can a user with a crazily overclocked GPU end up being assigned multiple attempts for the same stream to recover?
If there is a driver bug, a huge fraction of users may encounter problems.

Note also that checking for discrepancies with the Reference platform may be very slow. For example, large systems with GBSA could take many, many minutes to run. Perhaps using the CPU platform for reference is sensible?

proteneer commented 10 years ago

Adding handlers and hooks is certainly something I have planned (and for much more than simply reporting errors).

> What if something is screwed up and all streams associated with a project will fail?

The streams will stop individually until all streams in the target have stopped.

> Can a user with a crazily overclocked GPU end up being assigned multiple attempts for the same stream to recover?

Very, very unlikely: the assignment algorithm first picks a random manager based on the managers' weights, and then a random target based on that manager's target weights.
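The two-stage weighted draw described here could look roughly like the following. This is a sketch under assumptions, not the actual assignment algorithm: `Manager`, `Assignment`, `pick_weighted`, and `assign` are hypothetical names, and how a stream is then chosen within the target is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Picks an index with probability proportional to its weight.
inline std::size_t pick_weighted(const std::vector<double>& weights, std::mt19937& rng) {
    assert(!weights.empty());
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return dist(rng);
}

// Hypothetical data layout: each manager has a weight and weighted targets.
struct Manager {
    double weight;
    std::vector<double> target_weights;
};

struct Assignment {
    std::size_t manager;
    std::size_t target;
};

// Stage 1: choose a manager by weight. Stage 2: choose one of its targets by weight.
inline Assignment assign(const std::vector<Manager>& managers, std::mt19937& rng) {
    std::vector<double> manager_weights;
    for (const Manager& m : managers) manager_weights.push_back(m.weight);
    const std::size_t mi = pick_weighted(manager_weights, rng);
    const std::size_t ti = pick_weighted(managers[mi].target_weights, rng);
    return {mi, ti};
}
```

Because each assignment is an independent random draw, a donor landing on the same stream again becomes less likely the more managers, targets, and streams there are.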

> If there is a driver bug, a huge fraction of users may encounter problems.

Yep.

> For example, large systems with GBSA could take many, many minutes to run.

A few minutes for a single step? I also don't trust the CPU platform enough (it's still fairly new)

jchodera commented 10 years ago

> A few minutes for a single step? I also don't trust the CPU platform enough (it's still fairly new)

On my laptop, yes!

proteneer commented 10 years ago

OK, I'll ask Peter tomorrow about using the CPU platform. I'm also going to change the error formula for force comparisons between the Reference and OpenCL platforms to the following:

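// Root-mean-square difference between the forces from the two platforms.
double mse = 0;  // accumulates the summed squared error over all atoms
for(int i=0; i<nAtoms; i++) {
    double ex = forcesA[i][0] - forcesB[i][0];
    double ey = forcesA[i][1] - forcesB[i][1];
    double ez = forcesA[i][2] - forcesB[i][2];
    mse += ex*ex+ey*ey+ez*ez;
}
// After the square root this is the RMS error, despite the variable name.
mse = sqrt(mse/nAtoms);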

The tolerance is 5 kJ/mol/nm, though I should probably make this a configurable setting in the future.
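A self-contained version of this comparison with the tolerance exposed as a parameter might look like the sketch below. The function names `rms_force_error` and `forces_agree` and the `Vec3` alias are hypothetical; only the formula and the 5 kJ/mol/nm default come from the comment above, and this is not the code in the referenced commit.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Root-mean-square force difference over all atoms, in the units of the inputs.
inline double rms_force_error(const std::vector<Vec3>& a, const std::vector<Vec3>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (int k = 0; k < 3; ++k) {
            const double e = a[i][k] - b[i][k];
            sum += e * e;
        }
    }
    return std::sqrt(sum / static_cast<double>(a.size()));
}

// The default of 5.0 mirrors the tolerance quoted above (kJ/mol/nm);
// making it a parameter is an assumption about how it could be configured.
inline bool forces_agree(const std::vector<Vec3>& a, const std::vector<Vec3>& b,
                         double tolerance = 5.0) {
    // A NaN error compares false here, so NaN forces count as a disagreement.
    return rms_force_error(a, b) <= tolerance;
}
```

For two atoms where one force differs by (3, 4, 0), the RMS error is sqrt(25/2), about 3.54, which passes the default tolerance but fails a tolerance of 3.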

Edit: implemented via 5c755cd771cb65b88921cfd31027a52d20e9a687

FoldingAtHome / siegetank-backend

Error handling #5