E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Discuss: Easier workflow to re-create an existing case (create_newcase based) #4179

Open sarats opened 3 years ago

sarats commented 3 years ago

I wish to set up a discussion to scope out what is needed to re-create a previous case.

Assumption: This is specifically for the non-run-script use case. For runs created by a run script, the script technically contains everything required to recreate the experiment.

First, is this functionality desired?

One could potentially use create_clone to recreate on the same platform if the previous case directory exists(?). What are the limitations? Should this be extended, or is a new script required?

In the long run, what kind of tooling is required to re-create an experiment on the same platform given its data from say, PACE or the local performance archive on the platform? e.g., data for context: https://pacefs.ornl.gov/e3sm/exp-ac.golaz-63354.zip

Presently, when a case is set up using create_newcase, the arguments are written to README in CaseDocs. Additionally, all the xmlchange invocations are recorded there. It might be preferable to have an executable version of this data to enable one to re-create and run an experiment.
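A minimal sketch of what an executable version of that record could look like. It assumes, hypothetically, that each recorded command sits on a single line of the README, possibly behind a timestamp or path prefix; the real file layout may differ. `extract_case_commands` is an illustrative helper, not a CIME utility.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a provenance README into a list of commands.
# Assumption: every recorded command is on one line and contains either
# "create_newcase" or "xmlchange" followed by its arguments. Anything before
# the command name (timestamp, directory prefix) is dropped.
extract_case_commands() {
    local readme="$1"
    echo '#!/usr/bin/env bash'
    echo 'set -e'
    # Print only matching lines, rewritten as "./<command> <arguments>".
    sed -n -E 's#^.*(create_newcase|xmlchange)([[:space:]].*)$#./\1\2#p' "$readme"
}
```

The output keeps the commands in their recorded order. It is not yet a complete run script: create_newcase and xmlchange are run from different directories, so a `cd` into the new case directory would still have to be added between them.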

Based on feedback, we can open up issues in CIME as needed to address them.

PeterCaldwell commented 3 years ago

Wouldn't a simpler solution be to just always use a run script to submit any job? Why go to extra work when we already have a solution?

sarats commented 3 years ago

If that's the way forward, we need to get the simplified run script you folks are using into master to serve as a template.

However, I think some developers are using CIME directly to create a case and this capability might prove useful. I would use it but started this thread to gauge broader interest.

oksanaguba commented 3 years ago

Most people probably use some script to run. I do, and then I copy that script into the config folder. I would be more interested in info on the status of the clone, like here: https://github.com/ESMCI/cime/issues/3290 . I see that this issue was closed, but I am not aware whether it was implemented. I run code that saves info about the clone. Also, what I find hard to track is the job IDs associated with *.nc output. It may seem like overkill to save clone info and job IDs, but I use this info all the time.

sarats commented 3 years ago

From a related commit, https://github.com/ESMCI/cime/pull/3696 I see that there is a new 'GIT_SUBMODULE_STATUS' file being created but we need to check if that's being captured in provenance/performance archive.

rljacob commented 3 years ago

Yes GIT_SUBMODULE_STATUS was implemented and is in the $EXEROOT (the build directory)

oksanaguba commented 3 years ago

What about the rest of the info on the clone? It is not saved anywhere. I do this:

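# id labels this run and which names the clone under $HOME; both are assumed to be set earlier.
gitstat="gitstat.${id}"
cd ~/"${which}"
echo " running stats on clone ${which}"
# Record working-tree status, current branch, uncommitted diffs, and recent history.
echo " status is ------------- " >> "${gitstat}"
git status >> "${gitstat}"
echo " branch is ------------- " >> "${gitstat}"
git branch >> "${gitstat}"
echo " diffs are ------------- " >> "${gitstat}"
git diff >> "${gitstat}"
echo " last 10 commits are ------------- " >> "${gitstat}"
# -n 10 also works when the history has fewer than 10 commits.
git log --first-parent --pretty=oneline -n 10 >> "${gitstat}"

rljacob commented 3 years ago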

We do need this capability. The saved run script will have the machine and pe-layouts in it and so will become outdated. But an option like "download script to run this case" would allow you to specify a (currently supported) machine you want to run on and get a working script.

From the provenance we already save, a script can be generated that checks out the right code and submodule hashes, calls create_newcase with the right arguments, applies any xmlchange commands and overwrites the user_nl* files. That should be enough to reproduce the run, provided all the input files are from the input-data directory and those files are on the input data server.
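The steps listed above can be sketched as a small generator. Everything here is illustrative: `generate_rerun_script`, its arguments, and the directory holding the saved user_nl_* files are assumptions, not an existing CIME or PACE interface. It also assumes the generated script is run from the top of the E3SM checkout.

```shell
#!/usr/bin/env bash
# Hypothetical generator: print a rerun script from saved provenance.
# Arguments (all assumed to have been recovered from the archive):
#   $1 code hash, $2 case name, $3 create_newcase arguments,
#   $4 file with one xmlchange setting per line (e.g. STOP_N=5),
#   $5 directory holding the saved user_nl_* files
generate_rerun_script() {
    local hash="$1" casename="$2" newcase_args="$3" xml_file="$4" nl_dir="$5"
    local setting
    # Check out the recorded code and bring submodules to their recorded hashes.
    cat <<EOF
#!/usr/bin/env bash
set -e
git checkout ${hash}
git submodule update --init --recursive
./cime/scripts/create_newcase --case ${casename} ${newcase_args}
cd ${casename}
EOF
    # One xmlchange call per recorded setting, in the original order.
    while IFS= read -r setting; do
        if [ -n "$setting" ]; then
            printf './xmlchange %s\n' "$setting"
        fi
    done < "$xml_file"
    # Overwrite the freshly created namelist files with the saved ones.
    printf 'cp %s/user_nl_* .\n' "$nl_dir"
}
```

Targeting a different machine would then only mean changing the machine given in the create_newcase arguments before the script is generated, which is the portability the "download script" option is after.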

"create_clone" only works from an already existing casedir and I think can only create a clone on the same machine (and maybe for the same user?).

A feature I'd like to see in cime is "export_case", which would pack up the current CASEDIR settings into a single machine-portable object; create_newcase could then take that object as an argument to build a clone more portably.

sarats commented 3 years ago

+1, "download script to run this case" is exactly what I was hoping for.

golaz commented 3 years ago

@sarats : note that something like this was attempted before in E3SM (PROVEN) and proved to be extremely difficult, to the point that it was eventually abandoned. If you decide to go that way, make sure you learn from previous mistakes.

Some users will do all sorts of crazy stuff when they set up their simulations; it's very hard to capture all of that.

Like @PeterCaldwell , I would recommend the simplest solution that has a high chance of working: run script. With a run script, users can still do crazy things, but they need to do it in the run script.

For example, in my last simulation, I needed to change mpaso streams.ocean files. CIME does not provide utilities to do that, but I managed to do it in the script anyway using patch. A bit ugly, but it's reproducible. Without the script discipline, I'm sure users would just edit the file, thus losing provenance for it.

https://github.com/E3SM-Project/SimulationScripts/blob/7079cc59f86d593d2bf6f5f12c1e156155cf0781/archive/v2/beta/coupled/run.20210324.v2beta3GM900_GWDfix.piControl.ne30pg2_EC30to60E2r2.chrysalis.sh

sarats commented 3 years ago

@golaz I completely understand your concerns. We certainly have to learn from the past attempts (PROVEN and old-workflow group activities). At this stage, I'm assessing requirements.

There is a spectrum of desired functionality ranging from

Minor comment: We captured the streams.ocean file because you (kudos!) explicitly copied it to SourceMods in your script. From one of your recent experiments: https://pacefs.ornl.gov/e3sm/exp-ac.golaz-63474.zip exp-ac.golaz-63474.zip/SourceMods.27839.210324-230703.tar.gz/src.mpaso

PeterCaldwell commented 3 years ago

I still don't understand:

  1. who is requesting this? Who would actually use this?
  2. why isn't a run script sufficient?
  3. why isn't this going to become a complicated quagmire of edge cases like PROVEN?

About @rljacob 's comment:

We do need this capability. The saved run script will have the machine and pe-layouts in it and so will become outdated. But an option like "download script to run this case" would allow you to specify a (currently supported) machine you want to run on and get a working script.

It is trivial to take an existing run script for one machine and just change the machine name (and PE layouts if you've customized them, but that's usually not the case). CIME handles the rest. Thus I don't see the example as proving the point.

I hope you know (Sarat and Rob) that I'm a big fan of both of you in particular and of the infrastructure/performance teams in general... so you don't think I'm being rude or dismissive when I say that this seems like a classic case of the computational team being out of touch with the needs of the people actually doing model runs. Chris and I both said we think this is a bad idea but it is still happening? I'm not saying it's a bad idea, I just don't understand it. I also don't understand why E3SM heroes like @jonbob don't use run scripts so maybe I'm just not seeing the bigger picture... Could you clarify why this is a top priority?

rljacob commented 3 years ago

The main reason we're doing this is that it's a test of our provenance-collection system: can we redo the run from the information collected? I think Dave was a big proponent of this feature.

The run script, if used, will be collected so that's easy. (and you're right, renaming a machine isn't hard).

If a run script is not used (won't go into that on this thread), you could do it with the info available, but it would be tedious and error-prone (copy and paste the create_newcase command, download xml and user_nl files, etc.), so we want something more automated. The auto-generated output script should look pretty close to something hand-modified from the run_e3sm template.

PeterCaldwell commented 3 years ago

So you'll add a nightly test that runs a sim, collects its provenance and reruns, then confirms the two runs are BFB? That's a cool idea.
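A rough sketch of the comparison step of such a test, assuming the original run and the rerun each leave their output files in one directory. `compare_runs` is a hypothetical helper. It treats BFB as byte-identical files, which is stricter than needed if output files carry run-specific metadata such as creation timestamps, so a real test would more likely compare field by field with a netCDF-aware tool.

```shell
#!/usr/bin/env bash
# Hypothetical check: compare every file of a reference run against the file
# with the same name from a rerun. cmp -s compares raw bytes and prints
# nothing; only its exit status is used.
compare_runs() {
    local ref_dir="$1" new_dir="$2" status=0 f name
    for f in "$ref_dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        if cmp -s "$f" "$new_dir/$name"; then
            echo "BFB  $name"
        else
            # Also reached when the rerun did not produce this file at all.
            echo "DIFF $name"
            status=1
        fi
    done
    return "$status"
}
```

A nightly test could call `compare_runs original_output rerun_output` (directory names are placeholders) and fail whenever the function returns a non-zero status.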

And is the idea for the provenance collection system that users can sort runs by certain features of a run (e.g. "I want all runs with rhmini=0.8")? If we just want provenance collection to ensure we can re-run a sim, I still don't see why "just use a run script" isn't the best option since if you ran it once, you can always just run it again.

rljacob commented 3 years ago

Yes there will be some kind of test like that.

And yes there will eventually be an extension to PACE (or a different website) that has a more science-focused view on the data to let you search for things like "all runs with rhmini=0.8" or other science features of the runs.
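Until such a view exists, a search of this kind can be approximated on a local archive. The sketch below assumes a hypothetical layout with one directory per case, each containing its user_nl_* files; `find_cases_with_setting` is an illustrative helper, not part of PACE or CIME.

```shell
#!/usr/bin/env bash
# Hypothetical search: list cases whose user_nl_* files set name = value.
# Assumed layout: <root>/<case>/user_nl_<component>.
# Note: name and value are inserted into a regular expression, so a "." in
# the value matches any single character.
find_cases_with_setting() {
    local root="$1" name="$2" value="$3" f
    # The pattern is anchored at the start of the line, so lines commented
    # out with "!" do not match.
    grep -lE "^[[:space:]]*${name}[[:space:]]*=[[:space:]]*${value}[[:space:]]*$" \
        "$root"/*/user_nl_* 2>/dev/null \
        | while IFS= read -r f; do
              basename "$(dirname "$f")"
          done \
        | sort -u
}
```

For example, `find_cases_with_setting /path/to/archive rhmini 0.8` would print the matching case names, one per line. Settings written in another form (such as `0.80`) would be missed, which is one reason a database-backed view is the better long-term answer.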

sarats commented 3 years ago

The primary motivation as Rob pointed out is to identify gaps in existing data collection and address them.

One could imagine a tool that takes an experiment's data from PACE and tries to recreate that run. Right now, this can be done manually in many cases. e.g., ./create_clone_pace https://pacefs.ornl.gov/e3sm/exp-ac.jwolfe-64061.zip

sarats commented 3 years ago

This sort of capability would address one of the Performance group's use-cases during debugging studies (to quickly replicate an exp for performance analysis and optimization). I'm not advocating either way (run-script or direct CIME) but trying to handle the existence of runs without run-scripts.

At a minimum, something has to do a sanity check (besides proper script-discipline) on an existing run-script to look for hardcoded paths that may not exist etc.
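One possible shape for that sanity check, as a sketch only. `find_missing_paths` is a hypothetical helper, and its idea of a path (a token starting with "/" that follows whitespace, "=", or a quote) is a simplifying assumption; paths built from variables are only checked up to the first variable.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: print absolute paths mentioned in a run script
# that do not exist on the current machine.
# Assumption: a path is a token that starts with "/", follows the start of
# the line, whitespace, "=", or a quote, and contains only letters, digits,
# "_", ".", "-" and "/". Comment lines are skipped.
find_missing_paths() {
    local script="$1" path
    grep -v '^[[:space:]]*#' "$script" \
        | grep -oE '(^|[[:space:]="'\''])/[A-Za-z0-9_./-]+' \
        | sed -E 's#^[^/]*##' \
        | sort -u \
        | while IFS= read -r path; do
              # Report only paths that are absent on this machine.
              [ -e "$path" ] || echo "$path"
          done
}
```

Running `find_missing_paths run.sh` (file name is a placeholder) before submitting would flag input or code directories that were valid on the original machine or for the original user but are absent here.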

PeterCaldwell commented 3 years ago

Ok, that performance group use case is a good one. Thanks for humoring me, I'm just trying to understand...

golaz commented 3 years ago

@sarats : if you can pull off the "download script" feature and it actually works reliably, that would be an amazing feature to have. From your comment above, sounds like you already have a feature that works in many cases. That makes me more optimistic about the eventual outcome.