Closed erichlf closed 6 years ago
@erichlf - I don't think the buildnml is ignoring user_nl_mpas. We use that capability often and have never had a problem. So could you please change the title? And can you point me to a case where this happens? I just ran a test on anvil and the job is still in the queue, but the changes to mpas-o_in are there, even for config_block_decomp_file_prefix, after case.submit.
@erichlf - can you change other variables with user_nl_mpas*? And what machine is having this problem?
I have tried changing this on multiple machines including Anvil, Edison, Skybridge, and Cori.
Can you point me to a case I can look at? I just tried again, and it works for me
I was running A_WCYCL2000 ne4_oQU240 on Anvil. In particular, try using 48 PEs, since there is no partition file for that count in inputdata. I have partition files for you to use at:
/home/elfost/workspace/acme/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.48
/home/elfost/workspace/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.48
Permissions are readable all the way to those files.
@erichlf - again, it works for me, so I really need to see a specific case where it fails for you.
@jonbob Weird, that part48 is now in inputdata, but I was pretty sure I hadn't added it. Anyway, I just tried running a case with 80 pes and it fails when looking up the inputdata as expected. You can go directly to it at
blues.lcrc.anl.gov:/home/elfost/workspace/ACME/cime/scripts/A_WCYCL2000.ne4_oQU240.MPASFILES
And run
./case.setup -c && ./case.setup && ./case.build && ./case.submit
To verify the initial failure.
The next failure can be verified by editing BuildConf/mpas-*input_data_list and changing the location of part80 to point to
/home/elfost/workspace/acme/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80
/home/elfost/workspace/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80
And trying to run things.
I thought you were having trouble using user_nl_mpas? Why are you testing with changes to BuildConf/mpas-input_data_list? That's not a workflow that the system is designed to support.
user_nl_mpas changes didn't work, so as a work-around I would change BuildConf/mpas-input_data_list.
Well, I'm trying to follow up on this issue with user_nl_mpas -- which is what we should be using. So if you could please point me to a case where that has failed, I'll try to make sense of it
Changing just the BuildConf/mpas-*input_data_list is not expected to work. The changes required that way are likely to be more extensive.
@jonbob I think @erichlf tried to do things the right way by using user_nl_mpas and when that didn't work, he fell back to manually editing BuildConf/mpas_input_data_list.
I understand, but we need to spend the time getting the right way to work then. The other way won't solve this.
@jonbob yeah, that's fine
@jonbob I confirmed the problem with these steps:
% ./create_newcase --compset A_WCYCL2000 --res ne4_oQU240 --case erich
% cd erich
% case.setup
* edit user_nl_mpaso, add this line:
config_block_decomp_file_prefix="/home/jgfouca/mpas-o.graph.info.151209.part"
% case.setup
* open Buildconf/mpas-o.input_data_list
The graph* value should have the prefix /home/jgfouca/... but it doesn't
@jgfouca - it may not show up in the input_data_list, but I believe it will successfully run and use the updated graph file.
@jonbob @erichlf could confirm that, but I highly suspect having the wrong value in input_data_list will impact check_input_data when it gets called.
@jgfouca @erichlf - those files are, I think, just used for a list of files the scripts will make sure are available. If the default one is still in the list, it won't impact anything -- except that you would then be responsible for making sure the user-specified graph file really exists
@jonbob yes, I think that's the whole point. @erichlf wants to use a local file without having to put something in the input repo.
He shouldn't have to touch the input_data_lists -- those are used just for the scripts to make sure all files are downloaded from the svn server. If he changes the graph in the corresponding user_nl file, the code should pick it up and use it when it runs.
@jonbob but he wants to use a local file that doesn't exist on the svn server. This will cause check_input_data to fail, which will fail the entire case.
I don't think so. If he leaves the default graph file in the input_data_list, the scripts won't care -- they'll happily report that the file exists and does not need to be downloaded. And if he points at his own graph file using user_nl, that's what the model will use when it runs.
@jonbob the default graph file also won't exist because he's using a new NTASKS for which there is no partition file in the input repo.
Well, maybe we need a flag for the script that checks data to continue even when files are missing? I think it will be more difficult to add capability to look at the user_nl files when creating the input_data_list, since the user_nl files are used to make runtime changes. And the input_data_list is generated in the setup process.
@jonbob I think it would be better for user_nl_mpaso and BuildConf/... to all be consistent with each other, don't you think?
No, I disagree. The reason the user_nl files are there is to allow users to make runtime changes. The BuildConf and input_data_list are part of the setup, it doesn't make sense to force the defaults and runtime settings to be the same. I think it would be better -- and much easier -- if the script that checks for input_data could report missing files and then have a setting to allow the case to continue despite not finding all files.
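A minimal Python sketch of the kind of opt-in setting proposed here, for illustration only. The function name mirrors check_input_data, but the signature and the allow_missing flag are assumptions made up for this example and are not existing CIME options.

```python
import os


def check_input_data(entries, allow_missing=False):
    """Report missing input files and fail unless the caller opts in.

    entries maps a label (for example "graph80") to a file path.
    allow_missing is a hypothetical flag, not an existing CIME setting.
    """
    missing = [(label, path) for label, path in sorted(entries.items())
               if not os.path.isfile(path)]
    for label, path in missing:
        # Always report, so the user sees what is absent either way.
        print(f"missing file {label} = '{path}'")
    if missing and not allow_missing:
        raise FileNotFoundError(f"{len(missing)} input file(s) not found")
    return missing
```

With the default allow_missing=False the behavior stays strict, which matches the point that continuing with missing files should never be the default.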
@jonbob maybe we need to bring in more people to get a consensus on what the correct behavior is. In my opinion, if namelist settings impact setup files like BuildConf/..., then changing those namelist values via user_nl_* should have an impact on those setup files. @rljacob, any thoughts?
@jonbob, The whole point of check_input_data is to ensure that a job does not crash when it makes its way through the queue and begins running. Allowing a case to continue with missing files guarantees a job that crashes on startup (possibly after being in the queue for quite a while). Each component is responsible for collecting the files needed for a run by build_namelist time. Abdicating that responsibility (if I understand your proposal) does not seem like a good solution.
What's the sequence here? When are the BuildConf/*input_data_list files created and how do they get modified? Seems to me like the final list of files you need shouldn't be set at build time. It should be set at case.submit time. So something needs to first read user_nl to look for changes in filenames, then build the file_list, then check that all files are present.
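The sequence suggested above (read user_nl, build the final file list, then check that every file is present) could be sketched roughly as follows. This is a simplified Python illustration, not the CIME implementation: the helper names are hypothetical, the parser only handles simple name = value lines, and each namelist value is assumed to be a complete file path.

```python
import os
import re


def read_user_nl(text):
    """Parse simple 'name = value' lines, ignoring '!' comments.

    Simplification: a '!' inside a quoted value is not handled.
    """
    overrides = {}
    for line in text.splitlines():
        line = line.split("!", 1)[0].strip()
        match = re.match(r"^(\w+)\s*=\s*(.+)$", line)
        if match:
            overrides[match.group(1)] = match.group(2).strip().strip("'\"")
    return overrides


def final_file_list(defaults, user_nl_text):
    """User settings replace the defaults before any checking happens."""
    merged = dict(defaults)
    merged.update(read_user_nl(user_nl_text))
    return merged


def missing_files(file_list):
    """Names of entries whose file is not present on disk."""
    return sorted(name for name, path in file_list.items()
                  if not os.path.isfile(path))
```

Run at submit time, missing_files(final_file_list(defaults, user_nl_text)) would only complain about a default file if the user had not replaced it.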
@gold2718 - I agree that's the point of check_input_data, and that setting should in no way be the default. But this is a situation where a user wants to test a specific file that does not exist in the input_data repo. There could be a setting that says "I understand that there is a file that's missing and I want to go forward anyway". But I don't feel attached to the outcome -- either way, some of the cime scripts would have to change. Either a modification to check_input_data or a run-time checker that would have to parse all of the user_nl files, modify the input_data_lists, and then rerun check_input_data.
@rljacob I know that BuildConf/... files are initially created by case.setup. I think they get created each time preview_namelists is called, but I could be wrong.
Sounds like check_input_data isn't using fully qualified path names? If the user specifies a complete path to a different file, not in the default input data directory, check_input_data should be ok with that.
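I thought that it was the responsibility of a model's namelist builder to make sure that the filenames in user_nl are expanded properly so that check_input_data can ensure that those files exist. The input_data repo is a red herring in that we have always allowed files to exist in other places; they just have to be specified properly in the user_nl file and then handled correctly by the model's namelist builder. If these long-standing requirements are met, there is no need for the suggested workaround and no loss of functionality.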
The MPAS behavior does not match the behavior of the other components. In CAM, if you specify a file in user_nl_cam, then check_input_data will assume you know what you are doing and not look for that file, and let you proceed with the run. It's a great way to test things without having to put data in the inputdata server. I think Erich wants to try 20 different partitions to find the best one, and then put the best one in the inputdata server.
On Tue, Feb 21, 2017 at 4:00 PM, goldy notifications@github.com wrote:
I thought that it was the responsibility of a model's namelist builder to make sure that the filenames in usernl are expanded properly so that check_input_data can ensure that those files exist. The input_data repo is a red herring in that we have always allowed files to exist in other places, they just have to be specified properly in the usernl file and then handled correctly by the model's namelist builder. If these long-standing requirements are met, there is no need for the suggested workaround and no loss of functionality.
@mt5555, What you are saying only holds for local file names (e.g., ncdata = 'foo.nc').
If you use an absolute path (e.g., ncdata = '/path/to/foo.nc'), then that file existence will be checked.
If you use a relative path (e.g., ncdata = '../foo.nc'), the file existence will be checked.
It sounds like @jgfouca has demonstrated that the MPAS namelist builder is not properly creating the Buildconf/mpas-o.input_data_list file (see above).
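The rule described in this comment can be written down as a small Python sketch. It only restates the three cases given above (bare file name, absolute path, relative path); the function name is hypothetical and this is not the actual CAM or CIME code.

```python
import os


def needs_existence_check(value):
    """True when the value has a directory part, absolute or relative.

    A bare file name such as 'foo.nc' is treated as a local file that
    the user is responsible for, so it is not checked.
    """
    return os.path.dirname(value) != ""


for example in ("foo.nc", "/path/to/foo.nc", "../foo.nc"):
    print(example, needs_existence_check(example))
```

Under this rule only 'foo.nc' is skipped; the other two values would be checked for existence.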
@mt5555 - that's exactly what the MPAS components are doing. If a user puts a file in one of the user_nl's, check_input_data does not look for it and allows the run to proceed. But the default files have to be available for check_input_data to succeed initially -- which is not necessarily true for graph files on all processor counts, because we have only a limited number of them in the inputdata repo.
@jonbob - when you wrote "but the default files have to be available...", do you mean that if a user replaces a default file with a new file using the user_nl_* feature, the older default file still has to be available, even though it won't be used?
Would @erichlf be able to do what he wants if, instead of trying to change the partition file, he specified new versions of all files of the form mpas-o.graph.info.* ?
@mt5555 - The issue with what @erichlf wants to do is, as far as I can tell, this: if he specifies a core count for which there is no default graph file, the check_input_data scripts will fail at setup. So then whether or not the user_nl process works is moot -- the scripts won't let him go forward. If he specified a core count for which there was a default in the inputdata repo, he could use the user_nl capability to point at a different file -- but that default has to exist for the check_input_data script to succeed. As far as I can tell, the options we're discussing are:
@jonbob, check_input_data runs after buildnml so I think the solutions are:
I still don't think any changes to check_input_data are required since it is only checking files specified by buildnml.
Note that we do this all the time in CAM: if there is a case with a missing ncdata file (for example), the case won't build. But if the user specifies ncdata in the user_nl_cam file, then buildnml knows not to check for the default (and hence not to fail), and the run can proceed.
IIUC, this capability is broken in the MPAS buildnml scripts?
@gold2718 - I can look and see if we can make the buildnml script work that way. But MPAS components can't run without a graph file, since it specifies the partitioning. And in general, user_nl files can be modified at any point, and I don't believe the system can get past the build right now if check_input_data fails. So that's requiring changes to a runtime file pre-build, which I do not like. But I'll try to understand it better when I get time.
@jonbob, check_input_data is not called until ./case.submit time, so perhaps you need to use a warning instead of an error for missing graph files during namelist build.
Still, since it is easy to populate the user_nl_mpas file before building, it seems that users can easily develop a workflow that avoids errors and warnings.
However, for all this to work, the MPAS namelist builder needs to handle file paths correctly. @jgfouca's example above seems to indicate that there is a bug.
@gold2718 - I believe it's called earlier than that. I tried a test this morning and it wouldn't even build with a missing file. And I disagree completely about changing workflow to solve a software issue. I also don't believe there's an issue with the namelist builder, other than it doesn't try to parse the user_nl file when determining which files it needs. However, I'm not sure any of the component namelist builders do -- I looked, and as far as I can tell, they all rely on a call to SetupTools::create_namelist_infile.
@jonbob, Are you saying that while trying to execute a ./case.build, you got an error from check_input_data? Can you post your demonstration of that?
@gold2718 - Yes, that's exactly what I'm trying to say. Here's a demonstration using current master:
1. ./create_newcase -case XYZ -compset GMPAS-IAF -mach wolf -compiler gnu -res T62_oQU240
2. cd XYZ/
3. vi env_mach_pes.xml (to change to a processor count that does not have a graph file in inputdata)
4. ./case.setup (which correctly notes that the graph files cannot be found in inputdata)
5. ./case.build FAILS as below:
calling build.case_build with caseroot=/lustre/scratch3/turquoise/jonbob/acme_commit/cime/scripts/XYZ
sharedlib_only is False
model_only is False
Loading input file: 'Buildconf/drof.input_data_list'
Loading input file: 'Buildconf/mpas-cice.input_data_list'
Model mpas-cice missing file graph80 = '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80'
Trying to download file: 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80' to path '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.80'
SUCCESS
Loading input file: 'Buildconf/datm.input_data_list'
Loading input file: 'Buildconf/mpas-o.input_data_list'
Model mpas-o missing file graph80 = '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
Trying to download file: 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80' to path '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
FAIL: SVN repo 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata' does not have file 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
Reason: svn: URL 'https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80' non-existent in that revision
Trying to download file: 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80' to path '/lustre/scratch3/turquoise/jonbob/ACME/input_data/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
FAIL: SVN repo 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata' does not have file 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80'
Reason: svn: OPTIONS of 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ocn/mpas-o/oQU240/mpas-o.graph.info.151209.part.80': authorization failed: Could not authenticate to server: rejected Basic challenge (https://svn-ccsm-inputdata.cgd.ucar.edu)
Loading input file: 'Buildconf/cpl.input_data_list'
ERROR: Failed to download input data
@gold2718 - the build fails at that point and will not continue. This is the point where I think it could help to have a flag to check_input_data to allow the build to proceed. As @mt5555 noted, we often assume that users who specify different files in the user_nl's know what they're doing and take responsibility for making those files present and correct.
The CAM approach would be to split step 3 above in two:
3a: vi env_mach_pes.xml (to change to a processor count that does not have a graph file in inputdata)
3b: vi user_nl_mpas (and specify a partition_file = '/path/to/mpas-o.graph.info.151209.part.80')
Because of step 3b, and through some perl magic, partition files won't appear in the mpas-o.input_data_list, and thus there won't be any SVN errors.
@mt5555 - I did that and it still fails to build
Right, that's the original issue reported by @erichlf: the perl code which ensures files in user_nl_mpas don't show up in mpas-o.input_data_list is not working in MPAS the way it does in CAM. This might be by design, or maybe, because partition files have derived names, the perl code needs to be tweaked.
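To make the "derived names" guess concrete, here is a small Python sketch. It illustrates the hypothesis only and is not the actual perl or CIME code: the function and the mapping table are made up for this example. The point is that the list entry is labeled graph80 (as in the log above), while the user sets config_block_decomp_file_prefix, so a filter that matches on the variable name alone would never drop the default entry.

```python
def drop_overridden(input_data_list, user_overrides, ntasks):
    """Remove list entries whose namelist variable the user has overridden."""
    # Hypothetical mapping from a namelist variable to the label derived
    # from it; the label format follows the "graph80" seen in the log.
    derived = {"config_block_decomp_file_prefix": f"graph{ntasks}"}
    skip = {derived.get(var, var) for var in user_overrides}
    return {label: path for label, path in input_data_list.items()
            if label not in skip}
```

Without the derived mapping, "graph80" does not equal "config_block_decomp_file_prefix", so the default graph file would stay in the list and still be looked up in inputdata.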
@gold2718 - is the above a correct summary of the issue? If it is, maybe @erichlf could track this down? If we can find the perl code which parses the user_nl files and builds the *.input_data_list files I think this should be an easy fix.
@mt5555 - but it's not perl code in mpas, is it? As far as I can tell, all the components use something like: SetupTools::create_namelist_infile("$CASEROOT", "$CASEROOT/user_nl_mpaso${inst_string}", "$CASEBUILD/mpas-oconf/cesm_namelist"); for that. Which is in CIME, not the components themselves?
When trying to use a graph file which is not in inputdata, one should be able to change config_block_decomp_file_prefix to the desired value, and then mpas-* should be able to use this data. However, changing this value currently has no effect. I had a work-around whereby I would attempt a setup and build and then change the Buildconf/mpas-*.input_data_list variable corresponding to the graph file (say graph96). However, this now seems to change when running case.submit. Thus, I am left with no way to specify the location of the graph file. If more details are needed, maybe @jgfouca can fill in the rest.