geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

[QUESTION] Getting GCHP to run on UCL cluster #239

Closed eamarais closed 1 year ago

eamarais commented 2 years ago

I've set up GCHP version 13.3.4 on the University College London (UCL) computational cluster Myriad. The model builds and compiles (with many warning messages that, according to other GitHub posts and/or the GCHP documentation, are to be expected), but when I try to run the model it hangs at "GENERIC: INFO: Started the 'Run' stage of the gridded component 'EXTDATA'". This occurs even after waiting more than an hour, increasing the memory allocation, and increasing the number of cores. As far as I can tell, there isn't an error message even if I build with DEBUG, enable debugging messages in input.geos, and change verbose to "3" in HEMCO_Config.rc. The outcome is also the same when I turn off all diagnostics and turn off emissions, chemistry, deposition, and other processes in runConfig.sh.

Myriad uses a Sun Grid Engine queueing system. The UCL research computing staff have confirmed that MPI is working correctly, so I suspect it may be an issue with configuration settings, the combination of libraries/dependencies, or the environment settings. I use a programme called GERun, developed by a member of the UCL research computing team, that provides a wrapper for the mpirun command (https://github.com/UCL/GERun). Model builds, compiles, and runs were done interactively using "qrsh -pe mpi X -l mem=YG -l h_rt=8:0:0 -now no" to establish an interactive session, where for X I've tried 6, 12, and 18 (updating runConfig.sh when I did) and where Y has ranged from 7 to 10 to get total memory of 60-120 GB.

I've attached a zipped file of the files I've used to build and compile (build_gchp.sh), and run (gchp_test_run_interactive.sh) the model interactively. Also included are my runConfig.sh file and log files generated during model build (log_gchp_test_build), compile (log_gchp_test_compile) and run (log_gchp_test_run, allPEs.log, HEMCO.log).

Any suggestions on how to diagnose what's causing the issue?

UCL-GCHP-files.zip

lizziel commented 2 years ago

Hi @eamarais, I just transferred this issue to the GCHP GitHub space. Could you try enabling the ExtData prints? This is item #3 here: https://gchp.readthedocs.io/en/13.3.4/supplement/rundir-config.html#enable-maximum-print-output Once that is done, try rerunning and then post the output file allPEs.log. This might shed more light on the issue. If it is failing during ExtData Run, then it is likely an input problem.

eamarais commented 2 years ago

Hi @lizziel. Thanks for transferring my issue over to GCHP. I've rerun the model with CAP.EXTDATA root_level in logging.yml set to DEBUG and attached the allPEs.log file. Nothing jumps out at me, but I'm still becoming familiar with GCHP.

allPEs.log

LiamBindle commented 2 years ago

Hi Eloise. Thanks for taking a thorough look at this. It looks like you're doing everything right.

@tclune @bena-nasa @atrayano Her simulation hangs at call MAPL_ExtDataPrefetch(...) (code) in the first ExtData Run stage. What could cause MAPL_ExtDataPrefetch() to hang?

eamarais commented 2 years ago

Thanks @LiamBindle. There is a message in the build log file about not being able to find doxygen. Is this needed? If so, could this be the cause?

LiamBindle commented 2 years ago

Hi Eloise, just letting you know I haven't forgotten about this. I'm going to ping the MAPL developers offline. This looks like a bug/configuration issue; I'm guessing it's something related to MPI or the filesystem. It might take a bit of time (more than a few days) to track down the issue. I'll follow up when there's something to try.

The warning about doxygen not being available is fine (that's only used for generating documentation... we don't use it).

lizziel commented 2 years ago

Hi all, @tclune is away for a few weeks so might not get back on this for a while. @eamarais, in the meantime you could add a print statement to the code that @LiamBindle linked to and then re-compile and run again. See if it prints anything, stops at a certain file, or gradually slows down to a crawl.

The file to change is src/MAPL/base/MAPL_ExtDataGridComp.F90, right before this line. The variable to print is io_bundle%file_name.
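
(For illustration only: a minimal, self-contained Fortran sketch of the kind of print being suggested. The actual one-liner goes directly before the linked line in src/MAPL/base/MAPL_ExtDataGridComp.F90; the io_bundle_t type and the file path below are hypothetical stand-ins for the real MAPL objects.)

 program print_prefetch_file
   implicit none

   ! Stand-in for the io_bundle object in MAPL_ExtDataGridComp.F90; only the
   ! file_name component referenced above is mimicked here.
   type :: io_bundle_t
      character(len=256) :: file_name
   end type io_bundle_t

   type(io_bundle_t) :: io_bundle

   io_bundle%file_name = './MetDir/2019/01/MERRA2.20190101.A1.05x0625.nc4'

   ! The diagnostic print itself; flushing ensures the last file name printed
   ! before a hang actually reaches the log.
   write(*,*) 'debug: prefetching file: ', trim(io_bundle%file_name)
   flush(6)
 end program print_prefetch_file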

eamarais commented 2 years ago

Hi @lizziel, thanks for the suggestion. I added a print statement as you suggested and identified that the code gets stuck when reading in OCEAN_O3_DRYDEP files. Not clear what's causing this, but I commented this out in HEMCO_Config.rc to see whether it progresses further, but then it gets stuck reading Yuan_XLAI in MAPL_ExtDataReadPrefetch. I added a print statement after this line for this test. Is the model slow to read the Yuan_XLAI file?

Below I've included the simulation log file with print statements (log_gchp_test_run) and the allPEs.log file with DEBUG statements.
allPEs.log log_gchp_test_run.log

Edit: I just noticed that the following line appears multiple times in a file called PET00.ESMF_LogFile: "20220726 141315.000 ERROR PET00 ESMF_Time.F90:810 ESMF_TimeGet() Object Set or SetDefault method not called - Object not Initialized". I haven't looked into this yet, but wanted to include it with this comment in case it's useful.

LiamBindle commented 2 years ago

Hi Eloise, sorry for the delay in getting back to you. I was off the last two days.

Here at WashU, we happen to be experiencing the same issue you are. Our cluster underwent maintenance about two weeks ago, and ever since then our file system has been operating in a degraded state (reading/writing isn't reliable). Now, when we try to run simulations on multiple nodes, they hang in the prefetch stage, similar to your situation.

We believe it's related to the file system. The MAPL developers would probably have the best insight, but I believe they're away for a week or two. We'll get back to you later today with an update.

bena-nasa commented 2 years ago

Hi all, unfortunately, looking at the logs there's not much to go on here. It does seem like some sort of file system hang. I wish I had more to add, but nothing is throwing a return code, so it is hard to say.

eamarais commented 2 years ago

Thanks @bena-nasa for checking this. I'm at a bit of a loss too. Do you, @lizziel or @LiamBindle have any suggestions on what to try to get an idea of what's causing the issue? Change to a different version of GCHP? Try a different combination of library and/or ESMF versions? Turn on/off certain aspects of the model?

lizziel commented 2 years ago

Hi @eamarais, we had a discussion about this issue with GMAO yesterday. The GCST team at WashU is going to dig more into this since they are having a similar hang right now at the same spot. The strategy right now is to add prints deeper into the MAPL calls to figure out exactly which call it stops at. You could do this as well, or you can wait for results from them.

Version 14.0 has a newer version of MAPL but I am not sure that would fix the issue since we think it is a file system problem. If you would like to try out that version, however, I can help get you set up. 14.0 has not been officially released but is in the final stages of the 10-year benchmark.

lizziel commented 2 years ago

Correction: The calling routine where the run seems to be stopping does not actually access the files. So there is something else going on. Stay tuned.

tclune commented 2 years ago

Hangs can be tricky - you have to be really certain you understand the state of all the processes at the point of the hang. E.g., a common debugging mistake is assuming the hanging process is the one that appears to be behind the other processes. This is because it has actually progressed to the next time step (or loop iteration, etc.). BARRIER with print can be your friend, but be careful not to put one in conditional logic (or, if you do, put one in both branches).

eamarais commented 2 years ago

@lizziel, I'll give version 14.0 a try.

lizziel commented 2 years ago

Sounds good. You can checkout tag 14.0.0-rc.1. The accompanying documentation is located here: https://gchp.readthedocs.io/en/docs-dev/

Download and compile instructions are the same as version 13. Running, however, has a few changes:

Let us know if you have questions!

eamarais commented 2 years ago

@lizziel, thanks for the guidance. I gave it a try, but it hangs in the same location. I'll wait to see if the WashU folks are able to solve the issue.

lizziel commented 2 years ago

@eamarais and @Jourdan-He, I put together some MAPL updates to home in on this issue. This can be used by both UCL and WashU (the places experiencing the prefetch hang) to help figure out what might be causing the problem.

Here is a summary of what I did: I ran git grep -in request_data_from_file to find out where that subroutine (which is where it is stalling) is defined. I then put prints throughout it (see code here). Each print has a barrier call in front of and after it to force all processors to stop there until they have caught up with each other.
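
(For illustration, a self-contained sketch of what one such barrier-bracketed print can look like, using the ESMF_VMBarrier call @lizziel describes further down; the real instrumentation sits inside request_data_from_file on the debug/prefetch_hang branch, and error checking is omitted here.)

 program bracketed_print_sketch
   use ESMF
   implicit none
   type(ESMF_VM) :: vm
   integer :: rc

   call ESMF_Initialize(rc=rc)
   call ESMF_VMGetGlobal(vm, rc=rc)

   ! Barrier before the print so no PE can race ahead of the message...
   call ESMF_VMBarrier(vm, rc=rc)
   write(*,*) ' debug: request_data_from_file: calling MAPL_GridGet'
   flush(6)
   ! ...and a barrier after it, so all PEs have printed before any of them
   ! enters the next (possibly hanging) call.
   call ESMF_VMBarrier(vm, rc=rc)

   ! <the instrumented MAPL/ESMF call would go here>

   call ESMF_Finalize(rc=rc)
 end program bracketed_print_sketch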

Here is what to do to test with this:

  1. Go to your 14.0 GCHP code
  2. Change directories to src/MAPL
  3. Run git fetch
  4. Run git checkout debug/prefetch_hang
  5. Rebuild GCHP
  6. Rerun the model

After running you should see messages like this in your main GCHP log file (gchp.date.log):

 debug: start of request_data_from_file: filename is ./MetDir/2019/01/MERRA2.20190101.A1.05x0625.nc4
 debug: request_data_from_file: calling ESMF_FieldBundleGet to get output grid
 debug: request_data_from_file: calling MAPL_GridGet
 debug: request_data_from_file: calling factory%generate_file_bounds
 debug: request_data_from_file: calling ESMF_FieldBundleGet for numVars            1
 debug: request_data_from_file: calling ESMF_FieldBundleGet for field name list
 debug: request_data_from_file: in loop over fields
 debug: request_data_from_file: calling ESMF_FieldBundleGet to get varname
 debug: request_data_from_file: success, varname is EVAP
 debug: request_data_from_file: calling ESMF_FieldGet to get rank
 debug: request_data_from_file: success, rank is 2
 debug: request_data_from_file: hasDE is true, calling ESMF_FieldGet to get field
 debug: request_data_from_file: calling i_Clients%collective_prefetch_data
 debug: request_data_from_file: end of subroutine

It will presumably fail after one of these calls for a certain file and variable. You can do tail --follow logname (where logname is the gchp log file) during the run to track progress and see if it is slowing down at a specific call, or runs fast right up until the hang.

Hopefully this will give us more information to trouble-shoot.

eamarais commented 2 years ago

Hi @lizziel, I gave this a try.

The model first hangs at “debug: request_data_from_file: calling ESMF_FieldBundleGet to get output grid” for file ./HcoDir/OCEAN_O3_DRYDEP/v2020-02/Oi_prj_predicted_iodide_0.125x0.125_No_Skagerrak_Just_Ensemble.nc. So, I set this to false in HEMCO_Config.rc.

Then it hangs at “debug: start of request_data_from_file: filename is ./HcoDir/VerticalScaleFactors/v2021-05/gc_layers.nc”, so I commented these files out too, as these don’t appear to be used anyway.

After that, it reaches the end of the HEMCO_Config.rc file, as it progresses through all debug messages to "debug: request_data_from_file: end of subroutine" for India_mask.generic.1x1.nc, which is at the end of the HEMCO_Config.rc file. After that, the model just hangs.

The only collection I noticed took slightly longer than others to cycle through is the new AEIC2019 inventory.

Log file for the test that ends at the India mask: gchp.20190701_0000z.log.

lizziel commented 2 years ago

Thanks @eamarais. Did the end of allPEs.log get beyond prefetch for that last run you did?

eamarais commented 2 years ago

@lizziel, it reached read-prefetch: allPEs.log

lizziel commented 2 years ago

@tclune, as previously discussed, the prefetch hang is during the call to request_data_from_file within ExtData prefetch. We now know it is happening for global files that are either extremely high res (1441x2880) or low res (4x4). The hang for high res is happening during this block, and the hang for low res is happening during this one. That logic assumes I implemented VMBarrier correctly and there are no processors executing elsewhere at the same time.

Skipping those files is still resulting in a hang later but maybe we can start with figuring out what is going wrong with these files.

Any ideas?

tclune commented 2 years ago

Hmm. I can see how extreme low res might tax the decomposition approach in MAPL, though we think we've allowed for that. Your high-res is not terribly high-res from our perspective, esp. if it is only 2D data.

But the annoying thing is that @bena-nasa reports that no file actions even happen at that step.

Just to confirm you have something like this in the code at the point where the hang happens:

<write from all pes to somewhere>
< barrier >
<call that hangs>
<write from all pes to somewhere>

We want to be sure that (A) every process gets to the call that hangs (not just some), and (B) that no process gets to the 2nd write, as that would indicate that the "advanced" process might be indirectly triggering a problem.
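
(A runnable illustration of that pattern, sketched here with plain MPI rather than the ESMF VM calls used in MAPL; suspect_call and the program itself are hypothetical placeholders for the call under suspicion.)

 program hang_check_demo
   use mpi
   implicit none
   integer :: ierr, rank

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! (A) every rank announces it has reached the suspect call...
   write(*,*) 'rank ', rank, ': before suspect call'
   flush(6)
   call MPI_Barrier(MPI_COMM_WORLD, ierr)

   call suspect_call()

   ! (B) ...and if any rank prints this, it has made it past the "hang".
   write(*,*) 'rank ', rank, ': after suspect call'
   flush(6)

   call MPI_Finalize(ierr)

 contains

   subroutine suspect_call()
     ! Placeholder for the collective call being diagnosed.
   end subroutine suspect_call

 end program hang_check_demo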

atrayano commented 2 years ago

@tclune Since the hang presumably happens in prefetch, we do a "read", not a "write", in your pseudocode above.

lizziel commented 2 years ago

@tclune The code I put in to get the processors all to the same spot is call ESMF_VMBarrier(VM, RC=STATUS), placed directly before and after every debug print that I added.

tclune commented 2 years ago

@atrayano the writes in my pseudocode are to ensure that we made it to that line. A "read" would not accomplish that.

@lizziel OK - but I've not had the time to look closely at the output and the changes. I was hoping you could just say "yes - that is precisely what we see." (Again, I've been burned many times by a "hang" that is actually caused elsewhere by a process that "got ahead".)

atrayano commented 2 years ago

@lizziel your code instrumentation for debugging looks right, EXCEPT that the call to MAPL_am_i_root should take vm as an argument.

lizziel commented 2 years ago

@tclune I changed my debug code to print on every thread and label with PET #. This shows VMBarrier isn't quite working, although they are all within a call or two.

@atrayano I have never passed VM to MAPL_am_I_Root before. Does this give different behavior?

@eamarais and @Jourdan-He, can you pull in my latest MAPL push, rebuild, and rerun? I added finer granularity to the debug prints and that, plus the labels of processor number, should help figure this out (I hope). @eamarais, please turn the OCEAN_O3_DRYDEP file back on before running so we can look at the issue with that file.
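
(For reference, a small self-contained sketch of a per-PET labeled print of the kind described above, using ESMF_VMGet to obtain the PET number; the message text is only an example and error checking is omitted.)

 program pet_label_sketch
   use ESMF
   implicit none
   type(ESMF_VM) :: vm
   integer :: localPet, rc

   call ESMF_Initialize(rc=rc)
   call ESMF_VMGetGlobal(vm, rc=rc)
   call ESMF_VMGet(vm, localPet=localPet, rc=rc)

   ! Every PET prints, each line tagged with its PET number, so a hang on a
   ! subset of PETs shows up as missing labels rather than a silent stall.
   write(*,'(a,i0,a)') ' debug PET ', localPet, &
        ': request_data_from_file: calling ESMF_FieldBundleGet to get varname'
   flush(6)

   call ESMF_Finalize(rc=rc)
 end program pet_label_sketch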

atrayano commented 2 years ago

@lizziel Not passing VM effectively means that the call tells you if you are root (or not) using the MPI_COMM_WORLD communicator. This could be OK if you are not using dedicated IO nodes for the IO server.
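
(A hypothetical fragment contrasting the two call forms being discussed, assuming the MAPL module and an ESMF_VM named vm are in scope; exact interfaces depend on the MAPL version.)

 ! Without a VM: "root" means rank 0 of MPI_COMM_WORLD.
 if (MAPL_am_I_Root()) then
    write(*,*) 'debug print from the global root'
 end if

 ! With the component's VM: "root" means the root PET of that VM, which is
 ! the safer choice when the IO server occupies separate PEs.
 if (MAPL_am_I_Root(vm)) then
    write(*,*) 'debug print from the root of this VM'
 end if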

lizziel commented 2 years ago

Heads up everyone I am leaving today for vacation and will not be back until Aug 29th. @Jourdan-He and @LiamBindle will be here to help facilitate this issue.

Jourdan-He commented 2 years ago

Hi @eamarais, just want to let you know that Liam and I have been trying to run the simulation with @lizziel's MAPL updates. We will let you know once it runs through.

lizziel commented 2 years ago

@Jourdan-He, has there been any progress on this?

lizziel commented 2 years ago

Hi @eamarais - while I was away there were updates to the cluster at WashU that fixed a problem similar to the one you are seeing. However, the details of the changes to their system are not available, because they are extensive and involve correspondence with IBM that has not been shared. This makes me think the issues you are having are also a system problem. I can walk you through further testing to home in on the source of the issue. That might help your sysadmins troubleshoot.

eamarais commented 2 years ago

Ah. Unfortunate that it's a system issue. Yes, please send me options to troubleshoot.

lizziel commented 2 years ago

Hi @eamarais, to start, could you pull in the latest changes of MAPL branch debug/prefetch_hang, recompile, and rerun? The steps for that are in my previous post from August 10th. A 1-hr 6-core job is sufficient to test it. Let's start with that.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

lizziel commented 2 years ago

Hi @eamarais, are you still interested in pursuing this? Unfortunately GCST does not have time available in the near future to spend on this. Tagging @sdeastham in case he can help.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

lizziel commented 1 year ago

I am closing out this issue since we have moved the discussion to email.