E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

user_nl_mpas* is Ignored #1276

Closed erichlf closed 6 years ago

erichlf commented 7 years ago

When trying to use a graph file which is not in the inputdata, one should be able to change config_block_decomp_file_prefix to the desired value, and then mpas-* should be able to use this data. However, currently changing this value has no effect. I had a work-around whereby I would attempt a setup and build and then change the `Buildconf/mpas-*.input_data_list` variable corresponding to the graph file (say graph96). However, this now seems to change when running `case.submit`. Thus, I am left with no way to specify the location of the graph file.

If more details are needed maybe @jgfouca can fill in the rest.

jonbob commented 7 years ago

@erichlf - I don't think the buildnml is ignoring user_nl_mpas. We use that capability often and have never had a problem. So could you please change the title? And can you point me to a case where this happens? I just ran a test on anvil and the job is still in the queue, but the changes to mpas-o_in are there, even for config_block_decomp_file_prefix, after case.submit.

jonbob commented 7 years ago

@erichlf - can you change other variables with user_nl_mpas*? And what machine is having this problem?

erichlf commented 7 years ago

I have tried changing this on multiple machines including Anvil, Edison, Skybridge, and Cori.

jonbob commented 7 years ago

Can you point me to a case I can look at? I just tried again, and it works for me

erichlf commented 7 years ago

Specifically, I was running A_WCYCL2000 ne4_oQU240 on Anvil. Try using 48 PEs, so that there is no partition file in inputdata. I have partition files for you to use in

/home/elfost/workspace/acme/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.48
/home/elfost/workspace/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.48

Permissions are readable all the way to those files.

jonbob commented 7 years ago

@erichlf - again, it works for me, so I really need to see a specific case where it fails for you.

erichlf commented 7 years ago

@jonbob Weird, that part48 is now in inputdata, but I was pretty sure I hadn't added it. Anyway, I just tried running a case with 80 PEs and, as expected, it fails when looking up the inputdata. You can go directly to it at blues.lcrc.anl.gov:/home/elfost/workspace/ACME/cime/scripts/A_WCYCL2000.ne4_oQU240.MPASFILES and run ./case.setup -c && ./case.setup && ./case.build && ./case.submit to verify the initial failure.

The next failure can be verified by editing Buildconf/mpas-*.input_data_list and changing the location of part80 to point to

/home/elfost/workspace/acme/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80
/home/elfost/workspace/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80

And trying to run things.

jonbob commented 7 years ago

I thought you were having trouble using user_nl_mpas? Why are you testing with changes to Buildconf/mpas-*.input_data_list? That's not a workflow that the system is designed to use.

erichlf commented 7 years ago

user_nl_mpas changes didn't work, so as a work-around I would change Buildconf/mpas-*.input_data_list.

jonbob commented 7 years ago

Well, I'm trying to follow up on this issue with user_nl_mpas -- which is what we should be using. So if you could please point me to a case where that has failed, I'll try to make sense of it

jonbob commented 7 years ago

Changing just the Buildconf/mpas-*.input_data_list is not expected to work. The changes required that way are likely to be more extensive.

jgfouca commented 7 years ago

@jonbob I think @erichlf tried to do things the right way by using user_nl_mpas and, when that didn't work, he fell back to manually editing Buildconf/mpas-*.input_data_list.

jonbob commented 7 years ago

I understand, but we need to spend the time getting the right way to work then. The other way won't solve this.

jgfouca commented 7 years ago

@jonbob yeah, that's fine

jgfouca commented 7 years ago

@jonbob I confirmed the problem with these steps:

% ./create_newcase --compset A_WCYCL2000 --res ne4_oQU240 --case erich
% cd erich
% case.setup
    * edit user_nl_mpaso, add this line:
    config_block_decomp_file_prefix="/home/jgfouca/mpas-o.graph.info.151209.part"

% case.setup
    * open Buildconf/mpas-o.input_data_list
      The graph* value should have the prefix /home/jgfouca/... but it doesn't

jonbob commented 7 years ago

@jgfouca - it may not show up in the input_data_list, but I believe it will successfully run and use the updated graph file.

jgfouca commented 7 years ago

@jonbob @erichlf could confirm that, but I highly suspect having the wrong value in input_data_list will impact check_input_data when it gets called.
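
The mismatch described here can be checked mechanically. The following Python sketch is not CIME code. It assumes that a user_nl file holds `name = value` lines with `!` comments and that an input_data_list file holds `key = path` lines (the build log later in this thread prints entries in that shape). The helper names and the `/inputdata/...` path in the usage note are hypothetical.

```python
def parse_assignments(text, comment="!"):
    """Parse 'name = value' lines into a dict, skipping blanks and comments."""
    result = {}
    for line in text.splitlines():
        line = line.split(comment, 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        result[name.strip()] = value.strip().strip("'\"")
    return result


def stale_entries(user_nl_text, input_list_text, variable, key_prefix="graph"):
    """Return input_data_list entries that do not reflect the user_nl override.

    Only keys starting with `key_prefix` are compared (assumption: graph
    files are listed under keys such as 'graph80').
    """
    override = parse_assignments(user_nl_text).get(variable)
    if override is None:
        return {}  # nothing was overridden, so nothing can be stale
    listed = parse_assignments(input_list_text, comment="#")
    return {
        key: path
        for key, path in listed.items()
        if key.startswith(key_prefix) and not path.startswith(override)
    }
```

Called with the user_nl_mpaso line from the steps above and a list entry such as `graph80 = /inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80`, `stale_entries(..., "config_block_decomp_file_prefix")` returns that `graph80` entry, which is the inconsistency being reported.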

jonbob commented 7 years ago

@jgfouca @erichlf - those files are, I think, just used for a list of files the scripts will make sure are available. If the default one is still in the list, it won't impact anything -- except that you would then be responsible for making sure the user-specified graph file really exists

jgfouca commented 7 years ago

@jonbob yes, I think that's the whole point. @erichlf wants to use a local file without having to put something in the input repo.

jonbob commented 7 years ago

He shouldn't have to touch the input_data_lists -- those are used just for the scripts to make sure all files are downloaded from the svn server. If he changes the graph in the corresponding user_nl file, the code should pick it up and use it when it runs.

jgfouca commented 7 years ago

@jonbob but he wants to use a local file that doesn't exist on the svn server. This will cause check_input_data to fail, which will fail the entire case.

jonbob commented 7 years ago

I don't think so. If he leaves the default graph file in the input_data_list, the scripts won't care -- they'll happily report that the file exists and does not need to be downloaded. And if he points at his own graph file using user_nl, that's what the model will use when it runs.

jgfouca commented 7 years ago

@jonbob the default graph file also won't exist because he's using a new NTASKS for which there is no partition file in the input repo.

jonbob commented 7 years ago

Well, maybe we need a flag for the script that checks data to continue even when files are missing? I think it will be more difficult to add capability to look at the user_nl files when creating the input_data_list, since the user_nl files are used to make runtime changes. And the input_data_list is generated in the setup process.

jgfouca commented 7 years ago

@jonbob I think it would be better for user_nl_mpaso and Buildconf/... to all be consistent with each other, don't you think?

jonbob commented 7 years ago

No, I disagree. The reason the user_nl files are there is to allow users to make runtime changes. The Buildconf directory and input_data_list are part of the setup, so it doesn't make sense to force the defaults and runtime settings to be the same. I think it would be better -- and much easier -- if the script that checks for input_data could report missing files and then have a setting to allow the case to continue despite not finding all files.

jgfouca commented 7 years ago

@jonbob maybe we need to bring in more people to get a consensus on what the correct behavior is. In my opinion, if namelist settings impact setup files like Buildconf/..., then changing those namelist values via user_nl_... should have an impact on those setup files. @rljacob, any thoughts?

gold2718 commented 7 years ago

@jonbob, the whole point of check_input_data is to ensure that a job does not crash when it makes its way through the queue and begins running. Allowing a case to continue with missing files guarantees a job that crashes on startup (possibly after being in the queue for quite a while). Each component is responsible for collecting the files needed for a run by build_namelist time. Abdicating that responsibility (if I understand your proposal) does not seem like a good solution.

rljacob commented 7 years ago

What's the sequence here? When are the Buildconf/*input_data_list files created and how do they get modified? Seems to me like the final list of files you need shouldn't be set at build time. It should be set at case.submit time. So something needs to first read user_nl to look for changes in filenames, then build the file_list, then check that all files are present.
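
The sequence proposed here (apply user_nl overrides, build the file list, then check that everything is present) might look like the following Python sketch. It illustrates the ordering only and is not the CIME implementation. The function names are hypothetical, and the existence test is injectable so the logic can be exercised without touching a real inputdata directory.

```python
import os


def resolve_input_files(defaults, overrides):
    """Merge default input files with user_nl overrides; overrides win."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def missing_files(files, exists=os.path.exists):
    """Return the sorted keys of entries whose path does not exist."""
    return sorted(name for name, path in files.items() if not exists(path))
```

With this ordering, a default graph file that is absent from inputdata no longer matters once the user has pointed the same key at a file that does exist, because only the merged list is checked.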

jonbob commented 7 years ago

@gold2718 - I agree that's the point of check_input_data, and that setting should in no way be the default. But this is a situation where a user wants to test a specific file that does not exist in the input_data repo. There could be a setting that says "I understand that there is a file that's missing and I want to go forward anyway". But I don't feel attached to the outcome -- either way, some of the cime scripts would have to change. Either a modification to check_input_data or a run-time checker that would have to parse all of the user_nl files, modify the input_data_lists, and then rerun check_input_data.

jgfouca commented 7 years ago

@rljacob I know that Buildconf/... files are initially created by case.setup. I think they get created each time preview_namelists is called, but I could be wrong.

rljacob commented 7 years ago

Sounds like check_input_data isn't using fully qualified path names? If the user specifies a complete path to a different file, not in the default input data directory, check_input_data should be ok with that.

gold2718 commented 7 years ago

I thought that it was the responsibility of a model's namelist builder to make sure that the filenames in user_nl are expanded properly so that check_input_data can ensure that those files exist. The input_data repo is a red herring in that we have always allowed files to exist in other places, they just have to be specified properly in the user_nl file and then handled correctly by the model's namelist builder. If these long-standing requirements are met, there is no need for the suggested workaround and no loss of functionality.

mt5555 commented 7 years ago

The MPAS behavior does not match the behavior of the other components. In CAM, if you specify a file in user_nl_cam, then check_input_data will assume you know what you are doing and not look for that file, and let you proceed with the run. It's a great way to test things without having to put data in the inputdata server. I think Erich wants to try 20 different partitions to find the best one, and then put the best one in the inputdata server.

gold2718 commented 7 years ago

@mt5555, What you are saying only holds for local file names (e.g., ncdata = 'foo.nc'). If you use an absolute path (e.g., ncdata = '/path/to/foo.nc'), then that file existence will be checked. If you use a relative path (e.g., ncdata = '../foo.nc'), the file existence will be checked. It sounds like @jgfouca has demonstrated that the MPAS namelist builder is not properly creating the Buildconf/mpas-o.input_data_list file (see above).
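
The rule as stated in this comment can be written down in a few lines. This Python sketch only encodes the description above (a bare file name is trusted, anything with a directory component is checked). It is an assumption that this matches what the CAM namelist builder actually does, and the function name is hypothetical.

```python
import os


def is_existence_checked(value):
    """Return True if a namelist file value would have its existence checked.

    Per the description above: 'foo.nc' is not checked, while
    '/path/to/foo.nc' and '../foo.nc' are.
    """
    return os.path.dirname(value) != ""
```

Under this rule, an absolute path to a local graph file in user_nl_mpaso would still be verified, so the user keeps the protection of check_input_data without needing the file in the inputdata repo.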

jonbob commented 7 years ago

@mt5555 - that's exactly what the MPAS components are doing. If a user puts a file in one of the user_nl's, check_input_data does not look for it and allows the run to proceed. But the default files have to be available for check_input_data to succeed initially -- which is not necessarily true for graph files on all processor counts, because we have only a limited number of them in the inputdata repo.

mt5555 commented 7 years ago

@jonbob - when you wrote "but the default files have to be available...", do you mean that if a user replaces a default file with a new file using the user_nl_* feature, the older default file still has to be available, even though it won't be used?

Would @erichlf be able to do what he wants if, instead of trying to change the partition file, he specifies new versions of all files of the form mpas-o.graph.info.*?

jonbob commented 7 years ago

@mt5555 - The issue with what @erichlf wants to do is, as far as I can tell, this: if he specifies a core count for which there is no default graph file, the check_input_data scripts will fail at setup. So then whether or not the user_nl process works is moot -- the scripts won't let him go forward. If he specified a core count for which there was a default in the inputdata repo, he could use the user_nl capability to point at a different file -- but that default has to exist for the check_input_data script to succeed. As far as I can tell, the options we're discussing are:

  1. make the buildnml files repopulate the input_data_list using information from user_nl files at runtime -- though I'm not sure the initial case.setup or case.build would work due to missing files
  2. add a flag to check_input_data to allow the setup and build scripts to proceed despite missing input_data files

I agree that most users know, by experience, that "if you specify a file in user_nl_*, then check_input_data will assume you know what you are doing and not look for that file", which is why option 2 seems easier to me. Either that or we can remove the graph files from the input_data_list and leave their existence up to the user....

gold2718 commented 7 years ago

@jonbob, check_input_data runs after buildnml so I think the solutions are:

I still don't think any changes to check_input_data are required since it is only checking files specified by buildnml.

mt5555 commented 7 years ago

Note that we do this all the time in CAM: if there is a case with a missing ncdata file (for example), the case won't build. But if the user specifies ncdata in the user_nl_cam file, then buildnml knows not to check for the default (and hence not to fail), and the run can proceed.

IIUC, this capability is broken in the MPAS buildnml scripts?

jonbob commented 7 years ago

@gold2718 - I can look and see if we can make the buildnml script work that way. But MPAS components can't run without a graph file, since it specifies the partitioning. And in general, user_nl files can be modified at any point, and I don't believe the system can get past the build right now if check_input_data fails. So that's requiring changes to a runtime file pre-build, which I do not like. But I'll try to understand it better when I get time.

gold2718 commented 7 years ago

@jonbob, check_input_data is not called until ./case.submit time so perhaps you need to use a warning instead of an error for missing graph files during namelist build. Still, since it is easy to populate the user_nl_mpas file before building, it seems that users can easily develop a workflow that avoids errors and warnings. However, for all this to work, the MPAS namelist builder needs to handle file paths correctly. @jgfouca's example above seems to indicate that there is a bug.

jonbob commented 7 years ago

@gold2718 - I believe it's called earlier than that. I tried a test this morning and it wouldn't even build with a missing file. And I disagree completely about changing workflow to solve a software issue. I also don't believe there's an issue with the namelist builder, other than it doesn't try to parse the user_nl file when determining which files it needs. However, I'm not sure any of the component namelist builders do -- I looked, and as far as I can tell, they all rely on a call to SetupTools::create_namelist_infile.

gold2718 commented 7 years ago

@jonbob, Are you saying that while trying to execute a ./case.build, you got an error from check_input_data? Can you post your demonstration of that?

jonbob commented 7 years ago

@gold2718 - Yes, that's exactly what I'm trying to say. Here's a demonstration using current master:

  1. ./create_newcase -case XYZ -compset GMPAS-IAF -mach wolf -compiler gnu -res T62_oQU240
  2. cd XYZ/
  3. vi env_mach_pes.xml (to change to a processor count that does not have a graph file in inputdata)
  4. ./case.setup (which correctly notes that the graph files cannot be found in inputdata)
  5. ./case.build FAILS as below:

    calling build.case_build with caseroot=/lustre/scratch3/turquoise/jonbob/acme_commit/cime/scripts/XYZ
    sharedlib_only is False
    model_only is False
    Loading input file: 'Buildconf/drof.input_data_list'
    Loading input file: 'Buildconf/mpas-cice.input_data_list'
    Model mpas-cice missing file graph80 = '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80'
    Trying to download file: 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80' to path '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80'
    SUCCESS

    Loading input file: 'Buildconf/datm.input_data_list'
    Loading input file: 'Buildconf/mpas-o.input_data_list'
    Model mpas-o missing file graph80 = '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
    Trying to download file: 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80' to path '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
    FAIL: SVN repo 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata' does not have file 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
    Reason: svn: URL 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80' non-existent in that revision

    Trying to download file: 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80' to path '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
    FAIL: SVN repo 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata' does not have file 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
    Reason: svn: OPTIONS of 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80': authorization failed: Could not authenticate to server: rejected Basic challenge (https://svn-ccsm-inputdata.cgd.ucar.edu)

    Loading input file: 'Buildconf/cpl.input_data_list'
    ERROR: Failed to download input data

jonbob commented 7 years ago

@gold2718 - the build fails at that point and will not continue. This is the point where I think it could help to have a flag to check_input_data to allow the build to proceed. As @mt5555 noted, we often assume that users who specify different files in the user_nl's know what they're doing and take responsibility for making those files present and correct.
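
The flag suggested here could behave roughly like the following Python sketch. It is a hypothetical stand-in, not the real check_input_data script: the function name, the `allow_missing` parameter, and the message format are all assumptions. Strict checking stays the default, in line with the earlier comment that continuing past missing files should never be the default behavior.

```python
import os


def check_files(files, allow_missing=False, exists=os.path.exists):
    """Return (ok, missing) for a mapping of key -> path.

    With allow_missing=True, missing files are reported as warnings and the
    check still passes; otherwise any missing file fails the check.
    """
    missing = sorted(name for name, path in files.items() if not exists(path))
    level = "WARNING" if allow_missing else "ERROR"
    for name in missing:
        print(f"{level}: missing input file {name} = {files[name]}")
    return (allow_missing or not missing), missing
```

The trade-off raised earlier in the thread still applies: a case that passes only because of `allow_missing=True` can crash on startup after waiting in the queue, so the user takes on responsibility for the files being present at run time.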

mt5555 commented 7 years ago

The CAM approach would be add a step above:

  3a: vi env_mach_pes.xml (to change to a processor count that does not have a graph file in inputdata)
  3b: vi user_nl_mpas (and specify partition_file = '/path/to/mpas-o.graph.info.151209.part.80')

Because of step 3b, and through some perl magic, partition files won't appear in the mpas-o.input_data_list, and thus there won't be any SVN errors.

jonbob commented 7 years ago

@mt5555 - I did that and it still fails to build

mt5555 commented 7 years ago

Right - that's the original issue reported by @erichlf: the perl code which ensures files in user_nl_mpas don't show up in mpas-o.input_data_list is not working in MPAS the way it does in CAM. This might be by design, or maybe, because partition files have derived names, the perl code needs to be tweaked.

@gold2718 - is the above a correct summary of the issue? If it is, maybe @erichlf could track this down? If we can find the perl code which parses the user_nl files and builds the *.input_data_list files I think this should be an easy fix.

jonbob commented 7 years ago

@mt5555 - but it's not perl code in mpas, is it? As far as I can tell, all the components use something like: SetupTools::create_namelist_infile("$CASEROOT", "$CASEROOT/user_nl_mpaso${inst_string}", "$CASEBUILD/mpas-oconf/cesm_namelist"); for that. Which is in CIME, not the components themselves?