LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi
Other
99 stars 36 forks source link

SCR_Config contains pair of scr_param_init / scr_param_finalize #268

Open rhaas80 opened 3 years ago

rhaas80 commented 3 years ago

This pair causes the various parameter hashes and data structures to be created and torn down for each call to SCR_Config. Worse, it also triggers the file .scr/app.conf to be read / written for each SCR_Config call.

Eg in ed7b53214123615a07cb4e961957977b46fad0fe with an extra printf

diff --git a/src/scr_param.c b/src/scr_param.c
index e8e637d..e546c86 100644
--- a/src/scr_param.c
+++ b/src/scr_param.c
@@ -493,6 +493,7 @@ int scr_param_finalize()
     /* store parameters set by app code for use by post-run scripts */
     char* app_file = app_config_path();
     if (app_file != NULL) {
+      printf("writing app config file\n");
       scr_config_write(app_file, scr_app_hash);
     }
     scr_free(&app_file);

executing test_config gives:

./test_config
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file
writing app config file

ie 37 file creations. Files are created / read only on rank 0 but still this is not good behaviour.

This seems to have been introduced in 561b4ea5035ef874197b4a25ae5385152320cb43 part of https://github.com/LLNL/scr/pull/216 , which also seems to remove the possibility for SCR_Config to return a failure to set a parameter (since it always returns NULL when setting values).

adammoody commented 3 years ago

@rhaas80 , I haven't reproduced this yet, but quickly scanning the code, I think this was the abort I was trying to avoid back when I made the change:

https://github.com/LLNL/scr/blob/ed7b53214123615a07cb4e961957977b46fad0fe/src/scr_param.c#L191

Basically if scr_param_get is called before the scr_env_hash has been initialized and if the parameter being queried also happens to be set in the environment, then this code path with the abort gets triggered. Maybe I was trying to query the value, or query a two-level value? I don't remember off hand.

As for reading/writing the file multiple times, I decided that was an acceptable tradeoff since people tend to just set a few params and that only happens once in the run. It's inefficient but not frequent, so this trade off made sense. We can definitely look for a better solution.

Perhaps using a kvtree to cache the environment settings is overkill anyway. Or we can find another way to initialize those kvtrees.

adammoody commented 3 years ago

I think the issue came up with a situation where a program does a query:

>>: prog.c
<snip>
  const char* val = SCR_Config("SCR_DEBUG")
<snip>

And then at runtime, one has set that parameter via environment variable:

export SCR_DEBUG=1
./prog
adammoody commented 3 years ago

At the time, the call to scr_param_get that was leading to the abort might have come from setting a parameter rather than a query, since SCR_Config used to call scr_param_get after setting a parameter around line 2471:

https://github.com/LLNL/scr/pull/216/commits/67d907b579bb45c6cc29a97bfc36acb1dddb7803#diff-b508febc113fb14b5c2a383a5e862532033c58fd8202224e62f24fc1b712c081L2465

So then a program and run like the following would trigger the abort.

>>: prog.c
<snip>
  SCR_Config("SCR_DEBUG=1")
<snip>
export SCR_DEBUG=1
./prog

I mention in the PR that I changed things to always return NULL when setting a parameter, since a strdup was now required in order to return a value. That's true since calling scr_param_finalize before returning from SCR_Config deletes the kvtree that holds the string that we're returning. I decided to return NULL instead so users would not have to free returned strings when setting params. However, if we can avoid calling scr_param_finalize, we could restore the behavior to return the pointer when setting params again.