JelleAalbers / blueice

Build Likelihoods Using Efficient Interpolations and monte-Carlo generated Events
BSD 3-Clause "New" or "Revised" License

Caching encounters race conditions during parallel jobs #29

Closed kdund closed 4 years ago

kdund commented 4 years ago

When starting multiple batch jobs on Midway, it is necessary to do a "burn-in" run beforehand; otherwise several jobs may attempt to write to the same cache file at once and corrupt it. The corrupted file then has to be deleted and the jobs re-run.

kdund commented 4 years ago

I think this is a better approach than the one I first had in mind. Short summary: each concurrent job writes to a temporary file, which is then renamed to the final name, so that only complete files ever appear at the cache path and no race condition occurs. (The only remaining fault source is then a faulty write by the last job.) https://stackoverflow.com/questions/12003805/threadsafe-and-fault-tolerant-file-writes
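A minimal sketch of the write-to-temp-then-rename idea described above, using only the Python standard library. The function name `write_cache_atomically` and the use of `pickle` are assumptions for illustration; this is not blueice's actual caching code.

```python
import os
import pickle
import tempfile


def write_cache_atomically(path, obj):
    """Pickle obj to path so that readers never see a partially written file.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target: the rename is only atomic
    # when source and destination live on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # Swap the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because every job writes to its own uniquely named temporary file, two jobs never write into the same file; they only compete over which complete file ends up at `path`.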

kdund commented 4 years ago

Trying to reproduce the race conditions is (perhaps predictably) challenging, so I will start running with this fix and no "burn-in". Changing the "open" to an atomic write (df60ea6f6fe44a617d019f3e90a9ab421794f96d) will ensure that no two jobs write into the same cache file at the same time. The file might still be overwritten if two jobs both realise they need a cache file that does not exist yet, but in the end the slower of them will simply replace the entire file.
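The overwrite scenario above can be sketched with threads standing in for batch jobs. Everything here is a hypothetical illustration: `load_or_compute` and `run_concurrent_jobs` are made-up names, the standard-library temp file plus `os.replace` stands in for the atomic write in the commit, and the atomicity of the rename is assumed to hold as on POSIX filesystems.

```python
import os
import pickle
import tempfile
import threading


def load_or_compute(path, compute):
    """Return the cached object at path, computing and caching it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    # Several jobs may reach this point together and all compute the result.
    result = compute()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(result, f)
    # Each job swaps in a complete file; the slowest job's copy wins.
    os.replace(tmp_path, path)
    return result


def run_concurrent_jobs(path, n_jobs=8):
    """Start n_jobs threads that all request the same cache file at once."""
    results = [None] * n_jobs

    def job(i):
        results[i] = load_or_compute(path, lambda: list(range(100_000)))

    threads = [threading.Thread(target=job, args=(i,)) for i in range(n_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The redundant computation is wasted work but harmless: whichever job finishes last leaves a complete, readable file behind.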

JelleAalbers commented 4 years ago

Sounds good! Thanks for pointing to the atomicwrites package; it seems a lot better than making a custom temporary file + renaming solution.

kdund commented 4 years ago

Proposed solution in #30

kdund commented 4 years ago

Closed with #30.