Closed: eamarais closed this issue 1 year ago
Hi @eamarais, I just transferred this issue to the GCHP Github space. Could you try enabling the ExtData prints which is item #3 here: https://gchp.readthedocs.io/en/13.3.4/supplement/rundir-config.html#enable-maximum-print-output
Once that is done, try rerunning and then post the output file allPEs.log. This might shed more light on the issue. If it is failing during ExtData Run then it is likely an input problem.
Hi @lizziel. Thanks for transferring my issue over to GCHP. I've rerun the model with CAP.EXTDATA root_level in logging.yml set to DEBUG and attached the allPEs.log file. Nothing jumps out at me, but I'm still becoming familiar with GCHP.
Hi Eloise. Thanks for taking a thorough look into it. It looks like you're doing everything right.
@tclune @bena-nasa @atrayano Her simulation hangs at call MAPL_ExtDataPrefetch(...) (code) in the first ExtData Run stage. What could cause MAPL_ExtDataPrefetch() to hang?
Thanks @LiamBindle. There is a message in the build log file about not being able to find doxygen. Is this needed? If so, could this be the cause?
Hi Eloise, just letting you know I haven't forgotten about this. I'm going to ping the MAPL developers offline. This looks like a bug/configuration issue---I'm guessing it's something related to MPI or the filesystem. It might take a bit of time (more than a few days) to track down the issue. I'll follow up when there's something to try.
The warning about doxygen not being available is fine (that's for generating documentation... we don't use it).
Hi all, @tclune is away for a few weeks so might not get back on this for a while. @eamarais, in the meantime you could add a print statement to the code that @LiamBindle linked to and then re-compile and run again. See if it prints anything, stops at a certain file, or gradually slows down to a crawl.
The file to change is src/MAPL/base/MAPL_ExtDataGridComp.F90, right before this line. The variable to print is io_bundle%file_name.
Hi @lizziel, thanks for the suggestion. I added a print statement as you suggested and identified that the code gets stuck when reading in OCEAN_O3_DRYDEP files. Not clear what's causing this, but I commented this out in HEMCO_Config.rc to see whether it progresses further, but then it gets stuck reading Yuan_XLAI in MAPL_ExtDataReadPrefetch. I added a print statement after this line for this test. Is the model slow to read the Yuan_XLAI file?
Below I've included the simulation log file with print statements (log_gchp_test_run) and the allPEs.log file with DEBUG statements.
allPEs.log
log_gchp_test_run.log
An edit to add that I just noticed that the following line appears multiple times in a file called PET00.ESMF_LogFile: "20220726 141315.000 ERROR PET00 ESMF_Time.F90:810 ESMF_TimeGet() Object Set or SetDefault method not called - Object not Initialized". I haven't looked into this yet, but wanted to include it with this comment in case it's useful.
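A quick way to check whether an error like that ESMF_TimeGet message is isolated to PET00 or repeated across all processors is to tally ERROR lines over every per-PET log. The sketch below creates sample PET log files purely for illustration; in a real run directory you would skip the setup lines and run only the final pipeline against the actual PET*.ESMF_LogFile files.

```shell
# Demo setup: fabricate two small PET logs (illustration only).
mkdir -p /tmp/esmf_demo && cd /tmp/esmf_demo
printf 'ERROR PET00 ESMF_TimeGet() Object not Initialized\n%.0s' 1 2 3 > PET00.ESMF_LogFile
printf 'INFO  PET01 all good\n' > PET01.ESMF_LogFile

# Count each distinct ERROR message across all PET logs, most frequent first:
grep -h 'ERROR' PET*.ESMF_LogFile | sort | uniq -c | sort -rn
```

If the same message shows up with a count near the total PET count, it is a global condition rather than a single misbehaving rank.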
Hi Eloise, sorry for the delay in getting back to you. I was off the last two days.
Here at WashU, we happen to be experiencing the same issue you are. Our cluster underwent maintenance two weeks ago, and ever since then our file system has been operating in a degraded state (reading/writing isn't reliable). Now, when we try to run simulations on multiple nodes, they hang in the pre-fetch stage, similar to your situation.
We believe it's related to the file system. The MAPL developers would probably have the best insight, but I believe they're away for a week or two. We'll get back to you later today with an update.
Hi all, unfortunately, looking at the logs there's not much to go on here. It does seem like some sort of file system hang. I wish I had more to add, but since nothing is throwing a return code it is hard to say.
Thanks @bena-nasa for checking this. I'm at a bit of a loss too. Do you, @lizziel or @LiamBindle have any suggestions on what to try to get an idea of what's causing the issue? Change to a different version of GCHP? Try a different combination of library and/or ESMF versions? Turn on/off certain aspects of the model?
Hi @eamarais, we had a discussion about this issue with GMAO yesterday. The GCST team at WashU is going to dig more into this since they are having a similar hang right now at the same spot. The strategy right now is to add prints deeper into the MAPL calls to figure out exactly which call is where it stops. You could do this as well, or you can wait for results from them.
Version 14.0 has a newer version of MAPL but I am not sure that would fix the issue since we think it is a file system problem. If you would like to try out that version, however, I can help get you set up. 14.0 has not been officially released but is in the final stages of the 10-year benchmark.
Correction: The calling routine where the run seems to be stopping does not actually access the files. So there is something else going on. Stay tuned.
Hangs can be tricky - you have to be really certain you understand the state of all the processes at the point of the hang. E.g., a common debugging mistake is to assume the hanging process is the one that appears to be behind the other processes. In fact it has actually progressed to the next time step (or loop iteration, etc.). BARRIER with print can be your friend, but be careful not to put one in conditional logic (or else put one in both branches).
@lizziel, I'll give version 14.0 a try.
Sounds good. You can checkout tag 14.0.0-rc.1. The accompanying documentation is located here: https://gchp.readthedocs.io/en/docs-dev/
Download and compile instructions are the same as version 13. Running, however, has a few changes:
- runConfig.sh is renamed setCommonRunSettings.sh.
- Most run settings are in setCommonRunSettings.sh, but start and end date are not. Start date is set in cap_restart and end date is no longer used.
- cap_restart gets updated at the end of a successful run to restart where the last run left off. To rerun a simulation you therefore need to manually reset the start date.
- Restart files now go in the Restarts subdirectory.
- Sample run scripts are in the runScriptsSamples subdirectory. There is one for batch jobs and another for running interactively. Simply copy to the GCHP run directory main level and edit the execution call.

Let us know if you have questions!
@lizziel, thanks for the guidance. I gave it a try, but it hangs in the same location. I'll wait to see if the WashU folks are able to solve the issue.
@eamarais and @Jourdan-He, I put together some MAPL updates to hone in on this issue. This can be used by both UCL and WashU (the places experiencing the prefetch hang) to help figure out what might be causing the problem.
Here is a summary of what I did:
I ran git grep -in request_data_from_file to find out where that subroutine (which is where it is stalling) is defined. I then put prints throughout it (see code here). Each print has a barrier call in front of it and after it to force all processors to stop there until they have caught up with each other.
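For reference, git grep -in prints case-insensitive matches with line numbers across a repository's tracked files. The same flags work with plain grep, sketched below on a throwaway file (the file name and contents are invented for illustration, since this demo has no git checkout).

```shell
# Demo: locate where a subroutine is defined, with line numbers.
# In the real case this would be `git grep -in request_data_from_file`
# run inside src/MAPL.
mkdir -p /tmp/grep_demo
printf 'subroutine request_data_from_file()\nend subroutine\n' > /tmp/grep_demo/demo.F90
grep -in request_data_from_file /tmp/grep_demo/demo.F90
```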
Here is what to do to test with this:
In the src/MAPL directory, run:
git fetch
git checkout debug/prefetch_hang
After running you should see messages like this in your main GCHP log file (gchp.date.log):
debug: start of request_data_from_file: filename is ./MetDir/2019/01/MERRA2.20190101.A1.05x0625.nc4
debug: request_data_from_file: calling ESMF_FieldBundleGet to get output grid
debug: request_data_from_file: calling MAPL_GridGet
debug: request_data_from_file: calling factory%generate_file_bounds
debug: request_data_from_file: calling ESMF_FieldBundleGet for numVars 1
debug: request_data_from_file: calling ESMF_FieldBundleGet for field name list
debug: request_data_from_file: in loop over fields
debug: request_data_from_file: calling ESMF_FieldBundleGet to get varname
debug: request_data_from_file: success, varname is EVAP
debug: request_data_from_file: calling ESMF_FieldGet to get rank
debug: request_data_from_file: success, rank is 2
debug: request_data_from_file: hasDE is true, calling ESMF_FieldGet to get field
debug: request_data_from_file: calling i_Clients%collective_prefetch_data
debug: request_data_from_file: end of subroutine
It will presumably fail after one of these calls for a certain file and variable. You can do tail --follow logname (where logname is the gchp log file) during the run to track progress and see if it is slowing down at a specific call, or runs fast right up until the hang.
Hopefully this will give us more information to trouble-shoot.
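The log-watching step above can be sketched as follows. The sample log contents are copied from the example debug lines earlier in this comment and written to a demo file purely for illustration; on a real run, point LOG at the actual gchp log file and use tail --follow "$LOG" to watch it live.

```shell
# Demo setup: a small fake log (illustration only).
LOG=/tmp/gchp_demo.log
cat > "$LOG" <<'EOF'
debug: start of request_data_from_file: filename is ./MetDir/2019/01/MERRA2.20190101.A1.05x0625.nc4
debug: request_data_from_file: calling ESMF_FieldBundleGet to get output grid
debug: request_data_from_file: calling MAPL_GridGet
EOF

# The last debug line printed before the hang names the suspect call:
grep '^debug:' "$LOG" | tail -n 1
```

Here the last line reached is the MAPL_GridGet print, so the hang would be in whatever follows that call.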
Hi @lizziel, I gave this a try.
The model first hangs at “debug: request_data_from_file: calling ESMF_FieldBundleGet to get output grid” for file ./HcoDir/OCEAN_O3_DRYDEP/v2020-02/Oi_prj_predicted_iodide_0.125x0.125_No_Skagerrak_Just_Ensemble.nc. So, I set this to false in HEMCO_Config.rc.
Then it hangs at “debug: start of request_data_from_file: filename is ./HcoDir/VerticalScaleFactors/v2021-05/gc_layers.nc”, so I commented these files out too, as these don’t appear to be used anyway.
After that, it reaches the end of HEMCO_Config.rc file, as it progresses through all debug messages to “debug: request_data_from_file: end of subroutine” for India_mask.generic.1x1.nc that is at end of HEMCO_Config file. After that, the model just hangs.
The only collection I noticed took slightly longer than others to cycle through is the new AEIC2019 inventory.
Log file for the test that ends at the India mask: gchp.20190701_0000z.log.
Thanks @eamarais. Did the end of allPEs.log get beyond prefetch for that last run you did?
@lizziel, it reached read-prefetch: allPEs.log
@tclune, as previously discussed the prefetch hang is during the call to request_data_from_file within ExtData prefetch. We now know it is happening for global files that are either extremely high-res (1441x2880) or low-res (4x4). The hang for high-res is happening during this block, and the hang for low-res is happening during this one. That logic assumes I implemented VMBarrier correctly and there are no processors executing elsewhere at the same time.
Skipping those files is still resulting in a hang later but maybe we can start with figuring out what is going wrong with these files.
Any ideas?
Hmm. I can see how extreme low res might tax the decomposition approach in MAPL, though we think we've allowed for that. Your high-res is not terribly high-res from our perspective, esp. if it is only 2D data.
But the annoying thing is that @bena-nasa reports that no file actions even happen at that step.
Just to confirm you have something like this in the code at the point where the hang happens:
<write from all pes to somewhere>
< barrier >
<call that hangs>
<write from all pes to somewhere>
We want to be sure that (A) every process gets to the call that hangs (not just some), and (B) no process gets to the 2nd write, as that would indicate that the "advanced" process might be indirectly triggering a problem.
@tclune Since the hang presumably happens in prefetch, we do a "read", not a "write", in your pseudocode above.
@tclune The code I put to get the processors all to the same spot is this. I put this directly before and after every debug print that I added.
call ESMF_VMBarrier(VM, RC=STATUS)
@atrayano the writes in my pseudo code are to ensure that we made it to that line. A "read" would not accomplish that.
@lizziel OK - but I've not had the time to look closely at the output and the changes. I was hoping you could just say "yes - that is precisely what we see." (Again, I've been burned many times by a "hang" that is actually caused elsewhere by a process that "got ahead".)
@lizziel your code instrumentation for debugging looks right, EXCEPT that the call to MAPL_am_i_root should take vm as argument
@tclune I changed my debug code to print on every thread and label with PET #. This shows VMBarrier isn't quite working, although they are all within a call or two.
@atrayano I have never passed VM to MAPL_am_I_Root before. Does this give different behavior?
@eamarais and @Jourdan-He, can you pull in my latest MAPL push, rebuild, and rerun? I added finer granularity to the debug prints and that, plus the labels of processor number, should help figure this out (I hope). @eamarais, turn back on OCEAN_O3_DRYDEP file before running so we can look at the issue with that file.
@lizziel Not passing VM effectively means that the call tells you if you are root (or not) using the MPI_COMM_WORLD communicator. This could be OK if you are not using dedicated IO nodes for the IO server.
Heads up everyone I am leaving today for vacation and will not be back until Aug 29th. @Jourdan-He and @LiamBindle will be here to help facilitate this issue.
Hi @eamarais, just want to let you know that Liam and I have been trying to run the simulation with @lizziel's MAPL updates; we will let you know once it runs through.
@Jourdan-He, has there been any progress on this?
Hi @eamarais - while I was away there were updates on the cluster at WashU that fixed the problem similar to the one you are seeing. However, the details of the changes to their system are not available because they are so extensive and involve withheld correspondence with IBM. This makes me think the issues you are having are also a system problem. I can walk you through further testing to hone in on the source of the issue. That might help your sysadmins troubleshoot.
Ah. Unfortunate that it's a system issue. Yes, please send me options to troubleshoot.
Hi @eamarais, to start could you pull in the latest changes of MAPL branch debug/prefetch_hang, recompile, and rerun. The steps for that are in the previous post from me on August 10th. A 1-hr 6-core job is sufficient to test it. Let's start with that.
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.
Hi @eamarais, are you still interested in pursuing this? Unfortunately GCST does not have time available in the near future to spend on this. Tagging @sdeastham in case he can help.
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.
I am closing out this issue since we have moved the discussion to email.
I've set up GCHP version 13.3.4 on the University College London (UCL) computational cluster Myriad. The model builds and compiles (with many warning messages that, according to other GitHub posts and/or GCHP documentation, are to be expected), but when I try to run the model it hangs at "GENERIC: INFO: Started the 'Run' stage of the gridded component 'EXTDATA'". This occurs even after waiting more than an hour, increasing memory allocation, and increasing the number of cores. As far as I can tell, there isn't an error message even if I build with DEBUG, enable debugging messages in input.geos, and change verbose to "3" in HEMCO_Config.rc. The outcome is also the same when I turn off all diagnostics and turn off emissions, chemistry, deposition, and other processes in runConfig.sh.
Myriad uses a Sun Grid Engine queueing system. The UCL research computing staff have confirmed that mpi is working correctly, so I suspect it may be an issue with configuration settings, the combination of libraries/dependencies or the environmental settings. I use a programme called GERun developed by a member of the UCL research computing team that provides a wrapper for the mpirun command (https://github.com/UCL/GERun). Model build, compile and runs were done interactively using "qrsh -pe mpi X -l mem=YG -l h_rt=8:0:0 -now no" to establish an interactive session, where for X I've tried 6, 12, and 18 (updating runConfig.sh when I did) and where Y has ranged from 7-10 to get total memory of 60-120 GB.
I've attached a zipped file of the files I've used to build and compile (build_gchp.sh), and run (gchp_test_run_interactive.sh) the model interactively. Also included are my runConfig.sh file and log files generated during model build (log_gchp_test_build), compile (log_gchp_test_compile) and run (log_gchp_test_run, allPEs.log, HEMCO.log).
Any suggestions on how to diagnose what's causing the issue?
UCL-GCHP-files.zip