HDFGroup / hermes

Extending the HDF5 library to support intelligent I/O buffering for deep memory and storage hierarchy systems
https://grc.iit.edu/research/projects/hermes

Feature Request: Persistent file storage from Hermes #683

Open cubrink opened 2 weeks ago

cubrink commented 2 weeks ago

Requested feature:

Persistent file storage from Hermes

Problem:

We are using Hermes as the I/O engine for ADIOS2 in scientific workflows. When a workflow finishes, Hermes has no way to persist the collected data to disk, so the data is lost. Because the raw data from the workflow is not saved, post-hoc analysis is not possible.

Because these workflows can be computationally expensive, re-running experiments to regenerate the data is not practical. Further, if the workflow is stochastic or chaotic, it may not be possible to reproduce previous runs.

Proposed solution:

Add an option so that, when Hermes is preparing to clear data, that data can be written to disk first.

For example, the default engine for ADIOS2 is the BP5 engine. Hermes could be configured to write data it is no longer using to disk in the BP5 format. When Hermes exits at the end of the workflow, the end user would have access to the raw data in BP5 format. This way the user gets the benefit of the Hermes engine during the workflow and can still access the results of the experiment after the run.

lukemartinlogan commented 2 weeks ago

This will require adding ADIOS2 to the data stager in Hermes. I expect implementation and debugging to take me 2 days, since I'll have to revisit how ADIOS2 works. I can have a version of this for the Monday meeting.

lukemartinlogan commented 2 weeks ago

This is more involved than I originally anticipated. ADIOS2 requires BeginStep and EndStep, which are difficult to combine with asynchronous data staging. By the time the stager is activated, multiple steps could be present, which makes this more complicated. There are two main questions I'm experimenting with:

  1. Can BeginStep/EndStep be called across processes when different processes are at different steps?
  2. Can BeginStep/EndStep be called out of order, e.g., BeginStep(16), BeginStep(15), BeginStep(17)?

If the answer to either of these questions is no, it will require some changes to the Hermes staging system, which I suspect will take at least a week.
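
To make the out-of-order question concrete, here is a minimal sketch of one possible workaround if steps must be written in ascending order: buffer staged steps and flush only the contiguous run. This is an illustration under stated assumptions, not how the Hermes stager works. `StepReorderBuffer`, `stage`, and the `write_step` callback are hypothetical names; in a real stager the callback would wrap the BeginStep/EndStep sequence for one step.

```python
from typing import Callable, Dict, List


class StepReorderBuffer:
    """Holds staged steps until they can be written in ascending order."""

    def __init__(self, write_step: Callable[[int, bytes], None], first_step: int = 0):
        # Hypothetical sink: would wrap BeginStep / writes / EndStep for one step.
        self._write_step = write_step
        # Lowest step number that has not been written yet.
        self._next_step = first_step
        # Steps that arrived early and are waiting for their predecessors.
        self._pending: Dict[int, bytes] = {}

    def stage(self, step: int, payload: bytes) -> None:
        """Accept a step in any order, then flush every step that is now contiguous."""
        if step < self._next_step or step in self._pending:
            raise ValueError(f"step {step} was already staged")
        self._pending[step] = payload
        while self._next_step in self._pending:
            self._write_step(self._next_step, self._pending.pop(self._next_step))
            self._next_step += 1

    def pending_steps(self) -> List[int]:
        """Steps still waiting on a missing predecessor."""
        return sorted(self._pending)


# Steps arrive in the order from question 2: 16, 15, 17.
written: List[int] = []
buf = StepReorderBuffer(lambda step, payload: written.append(step), first_step=15)
for step in (16, 15, 17):
    buf.stage(step, b"data")
print(written)  # [15, 16, 17]
```

The trade-off is that an early step stays in memory until every earlier step has arrived, so a single delayed step can hold back everything after it.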