LLNL / merlin

Machine Learning for HPC Workflows
MIT License
118 stars 26 forks source link

bugfix/initial-status-issues #471

Closed bgunnar5 closed 4 months ago

bgunnar5 commented 5 months ago

This branch should fix the problems discussed in my first comment of #469. However, I was never able to reproduce the initial bug that was posted about in that issue thread.

This branch should also allow for safer handling of the JSONDecodeError that users were seeing as I added a try/except for it. I think this error may have been arising due to a race condition that I failed to recognize initially. I tried to safeguard around this race condition by creating the write_status function. Hopefully this makes it so the JSONDecodeError is gone now.

lucpeterson commented 5 months ago

Can you briefly explain the race condition and how this would fix it?

bgunnar5 commented 5 months ago

In step.py where you last commented, the status was being dumped using:

with open(status_filepath, "w") as status_file:
     json.dump(status_info, status_file)

In this case, the status is being dumped to the status file and not using a file lock object to ensure nothing else is reading/writing to it at the same time. I changed this to call:

write_status(status_info, status_filepath, f"{self.workspace.value}/status.lock")

With this change, it ensures that a file lock is being used every time the status file is being written to here.

bgunnar5 commented 4 months ago

@lucpeterson just added two more things to this:

  1. A variable in status constants that's used in merlin v1.12.0 but won't be used in v1.12.1 when this is released. This was added in case someone uses merlin status from v1.12.1 on a study that was ran in v1.12.0.
  2. An additional if clause to check for a missing status file while condensing. This should help keep the workflow from crashing if there's a problem with condensing the status files.

Once I get the merge conflicts resolved this should be good to go