ammarhakim / gkyl

This is the main source repo for the Gkeyll 2.0 code. Please see gkeyll.rtfd.io for details.
https://gkeyll.readthedocs.io/en/latest/
62 stars 18 forks source link

Implement adios2 #150

Closed manauref closed 1 year ago

manauref commented 1 year ago

This PR requires some careful review & discussion. It replaces our previous use of ADIOS with a wrap of ADIOS2's C API.

The implementation of ADIOS2 is done as close to our previous implementation of ADIOS as possible. That is, if you run a simulation with the version of the code that used adios1 and with this new version, you produce the same files and the changes required throughout the app are relatively minimal. Some notable differences are:

  1. ADIOS2 files appear to be folders.
  2. Previously ADIOs was initialized in gkyl.cxx. We now remove that and initialize adios2 in PlasmaOnCartGrid (or in each unit test). Likewise we explicitly terminate adios2 in PlasmaOnCartGrid/unit tests.
  3. It seems that to read string attributes I have to preallocate a string sufficiently big. Since I don't know how big to make this to read inputfile, I have disabled writing of input files as an attribute. This primarily affects regression tests and people who used this feature, but I think very few people do. Maybe @ammarhakim has some suggestions on how to handle this (I tried passing a void * as we did with ADIOS, but didn't work).
  4. Because we only initialize one adios2 object (in PlasmaOnCartGrid) we need to pass that object to all CartFields and DynVectors one wishes to write. The minimal intervention strategy to do that for CartFields is to give it to the grids, and CartFields can retrieve it from there. But for DynVectors we have to pass the adios2 object to every DynVector we wish to write.

Despite these differences,

Some possible future improvements to be discussed:

ammarhakim commented 1 year ago

I am closing this PR pending further review. We need to understand this better and it is best to close the PR and understand the issues better. In particular:

This work is very good but let's close the PR till we figure the above out.

liangwang0734 commented 1 year ago

The default behavior of adios2 is folder based and the actual number of heavy files under the folder can be flexible (say, equal to the number of nodes). This makes the Adios2 interface more flexible and allows benefits of single file, such as changing the number of processors for restarting.

On Sun, Oct 22, 2023, 11:16 AM Ammar @.***> wrote:

Closed #150 https://github.com/ammarhakim/gkyl/pull/150.

— Reply to this email directly, view it on GitHub https://github.com/ammarhakim/gkyl/pull/150#event-10733480573, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHUIY6VERYI4CEH7J4BYY3YAU2ENAVCNFSM6AAAAAA6KFB66WVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJQG4ZTGNBYGA2TOMY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ammarhakim commented 1 year ago

Changing the number of files for restarting was always possible even in Adios1. As I said above I ok with a directory instead of a file. That is not the problem here. We will meet to discuss issues next week.

manauref commented 1 year ago

Responses to @ammarhakim's points:

By the way, these ideas of consolidating I/O into fewer files, and whether they gain anything or not, could be tested in a unit test before changing the App.

ammarhakim commented 1 year ago

We do not need to pass anything to anybody. We should create the object in the main gkyl.cxx and then set a global Lua variable which holds this. Then this can be queried and used if it is set properly. If it is not set then the write should fail.

Allocating strings of 100 bytes is a very bad idea. We need to see why this is failing in ADIOS and fix. Obviously we can't take a risk of failing randomly and taking hours of wasted effort to figure out why things crashed.

We are not IO bound. 10%-15% is ok for large file IO. No one has yet shown that this will be significantly reduce by writing 1 monster file. Is there any data to back up that this is actually faster? I do not want assurances from ADIOS people that it is so, but proper numbers with our IO pattern to show the gains. A 5-10% gain, or even 50%, will not help us at all.

JunoRavin commented 1 year ago

I just want to chime in that I am not opposed to writing out a single file for sets of diagnostics. But I do not think it is feasible to combine distribution functions and other grid diagnostics into a single ADIOS2 output. We might consider experimenting with a file that has all the distribution functions and a file that has all the other grid diagnostics (so 9 files in Mana's example). Combining distribution function outputs and other grid diagnostics would lead to huge headaches in my opinion for data sharing/transfer, as it is often easiest to most quickly analyze data in the moments but if we start lugging around 1 TB files because the distribution functions and the density are packaged together, I think we will regret it.

In terms of I/O vs. compute bound, I think we should be more precise about the problem: for most mid-scale simulations I/O is 10-15 percent of the compute time. However, because of how we do I/O with ADIOS1, we literally cannot run simulations on 10k+ MPI processes. We have only ever done performance analysis at scale, not science, because our ADIOS1 implementation gets us kicked off cluster resources for crashing the file system.

Whatever the ADIOS folks recommend for 1000+ node simulations is what we should be doing, or at least supporting, if we want to use leadership computing clusters. I don't want to completely revamp my work flows for these folder-based outputs and I certainly don't like the idea of all the distribution functions all being in a single file, but if we have to do this to run on 1000+ nodes then we should support the option to run the way ADIOS recommends.

manauref commented 1 year ago

thanks for the extra insight @JunoRavin

Yes, i forgot about distfs. We won't group them with gridDiagnostics, they'll have their own separate files.