Closed manauref closed 1 year ago
I am closing this PR pending further review. We need to understand this better and it is best to close the PR and understand the issues better. In particular:
This work is very good but let's close the PR till we figure the above out.
The default behavior of adios2 is folder based and the actual number of heavy files under the folder can be flexible (say, equal to the number of nodes). This makes the Adios2 interface more flexible and allows benefits of single file, such as changing the number of processors for restarting.
On Sun, Oct 22, 2023, 11:16 AM Ammar @.***> wrote:
Closed #150 https://github.com/ammarhakim/gkyl/pull/150.
— Reply to this email directly, view it on GitHub https://github.com/ammarhakim/gkyl/pull/150#event-10733480573, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHUIY6VERYI4CEH7J4BYY3YAU2ENAVCNFSM6AAAAAA6KFB66WVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJQG4ZTGNBYGA2TOMY . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Changing the number of files for restarting was always possible even in Adios1. As I said above I ok with a directory instead of a file. That is not the problem here. We will meet to discuss issues next week.
Responses to @ammarhakim's points:
save
to write it to a bp or hdf5 file.By the way, these ideas of consolidating I/O into fewer files, and whether they gain anything or not, could be tested in a unit test before changing the App.
We do not need to pass anything to anybody. We should create the object in the main gkyl.cxx and then set a global Lua variable which holds this. Then this can be queried and used if it is set properly. If it is not set then the write should fail.
Allocating strings of 100 bytes is a very bad idea. We need to see why this is failing in ADIOS and fix. Obviously we can't take a risk of failing randomly and taking hours of wasted effort to figure out why things crashed.
We are not IO bound. 10%-15% is ok for large file IO. No one has yet shown that this will be significantly reduce by writing 1 monster file. Is there any data to back up that this is actually faster? I do not want assurances from ADIOS people that it is so, but proper numbers with our IO pattern to show the gains. A 5-10% gain, or even 50%, will not help us at all.
I just want to chime in that I am not opposed to writing out a single file for sets of diagnostics. But I do not think it is feasible to combine distribution functions and other grid diagnostics into a single ADIOS2 output. We might consider experimenting with a file that has all the distribution functions and a file that has all the other grid diagnostics (so 9 files in Mana's example). Combining distribution function outputs and other grid diagnostics would lead to huge headaches in my opinion for data sharing/transfer, as it is often easiest to most quickly analyze data in the moments but if we start lugging around 1 TB files because the distribution functions and the density are packaged together, I think we will regret it.
In terms of I/O vs. compute bound, I think we should be more precise about the problem: for most mid-scale simulations I/O is 10-15 percent of the compute time. However, because of how we do I/O with ADIOS1, we literally cannot run simulations on 10k+ MPI processes. We have only ever done performance analysis at scale, not science, because our ADIOS1 implementation gets us kicked off cluster resources for crashing the file system.
Whatever the ADIOS folks recommend for 1000+ node simulations is what we should be doing, or at least supporting, if we want to use leadership computing clusters. I don't want to completely revamp my work flows for these folder-based outputs and I certainly don't like the idea of all the distribution functions all being in a single file, but if we have to do this to run on 1000+ nodes then we should support the option to run the way ADIOS recommends.
thanks for the extra insight @JunoRavin
Yes, i forgot about distfs. We won't group them with gridDiagnostics, they'll have their own separate files.
This PR requires some careful review & discussion. It replaces our previous use of ADIOS with a wrap of ADIOS2's C API.
The implementation of ADIOS2 is done as close to our previous implementation of ADIOS as possible. That is, if you run a simulation with the version of the code that used adios1 and with this new version, you produce the same files and the changes required throughout the app are relatively minimal. Some notable differences are:
gkyl.cxx
. We now remove that and initialize adios2 in PlasmaOnCartGrid (or in each unit test). Likewise we explicitly terminate adios2 in PlasmaOnCartGrid/unit tests.inputfile
, I have disabled writing of input files as an attribute. This primarily affects regression tests and people who used this feature, but I think very few people do. Maybe @ammarhakim has some suggestions on how to handle this (I tried passing a void * as we did with ADIOS, but didn't work).Despite these differences,
Some possible future improvements to be discussed: