NCAR / ADF

A unified collection of python scripts used to generate standard plots from CAM outputs.

High-Res Simulations Multiprocessing Issue #271

Closed TeaganKing closed 6 months ago

TeaganKing commented 8 months ago

ADF run type

Model vs. Model

What happened?

When using ADF for high-res simulations, I seem to be running into issues while parallelizing the climatology generation with the multiprocessing pool (the mp.Pool step in scripts/averaging/create_climo_files.py). This part of the code hangs for a while, and then I repeatedly get the error "input in flex scanner failed". I'm wondering if this is some sort of memory or resource allocation issue with the high-res output?

@brianpm also suggested replacing the multiprocessing lines:

    # Parallelize the computation using multiprocessing pool:
    with mp.Pool(processes=number_of_cpu) as p:
        result = p.starmap(process_variable, list_of_arguments)

with serial processing:

    for i in list_of_arguments:
        process_variable(*i)

This caused the job to be killed upon processing the first variable.

ADF Hash you are using

fc809ba

What machine were you running the ADF on?

CISL machine

What python environment were you using?

ADF-provided Conda env

Extra info

No response

brianpm commented 8 months ago

I wonder if this is a memory issue. I'm not sure what the best next step is, but I wonder if @nusbaume will have thoughts?

nusbaume commented 8 months ago

I think this is a fairly low-level error that I imagine could be caused by several different things. If possible, could you try the following experiments, assuming you are running on a Casper compute node with the default ADF code:

  1. Can you try a run with the num_procs config variable set to one?
  2. If that test fails, then what happens if you reduce diag_var_list to just a single 3-D variable? If that doesn't work, how about a single 2-D variable?
  3. If none of those work, then what happens if you reduce the size of the timeseries files you are analyzing (e.g. make the difference between start_year and end_year smaller)?
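For reference, here is a minimal sketch of how the first two experiments might look in the ADF config YAML. The key names (num_procs, diag_var_list, start_year, end_year) are the ones mentioned in this thread; the values, the variable choice, and the exact placement of the keys within the file are illustrative only:

    # Illustrative ADF config excerpt -- key placement may differ in the real file
    num_procs: 1              # experiment 1: force serial climo generation
    diag_var_list:
      - PSL                   # experiment 2: a single 2-D variable (hypothetical choice)
    # For experiment 3, the analyzed period could also be narrowed, e.g.:
    # start_year: 1995
    # end_year: 1996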

Of course, please let me know if you need any help setting up any of those experiments.

Finally, if you want to point me to the ADF config file you are using, I can try to replicate the error myself, although I am mostly on PTO tomorrow and next week so it might be a while before I get to it. Thanks!

TeaganKing commented 8 months ago

Thanks for these suggestions @nusbaume ! I'll try setting those up tomorrow, and of course no worries if you are on PTO for a bit!

TeaganKing commented 7 months ago

Sorry this slipped through the cracks! As a quick clarification, I am actually running on Derecho. With the first test, additional climo files were generated, but the run still hung on the CVDP portion. However, with the second test (the num_procs config variable set to one and fewer variables in diag_var_list) plus a few minor modifications (fixing some issues with expected units), I was able to get past the portion that was hanging and generate plots as expected! So, thank you for these suggestions @nusbaume and @brianpm !

TeaganKing commented 6 months ago

I wanted to document a few more details in case anyone is working with high-res simulations and ADF in the future.

In order to generate climatology files for the 3-D files, I ended up needing to regrid to 1-degree to avoid hanging on the following line: cam_climo_data = cam_ts_data.groupby('time.month').mean(dim='time'). This worked for our particular use case, but there is still the issue of hitting a resource limit when using a single processor with these large files. @nusbaume noted that doing similar operations in the future will likely require implementing dask and ensuring flox is installed. This will be important for both ADF and CUPiD to ensure efficient operations.
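As a rough illustration of that suggestion, here is a minimal sketch (not ADF code) of computing the same monthly climatology on dask-backed data; with flox installed, xarray uses its faster grouped-reduction path for the groupby-mean automatically. The file pattern, variable, and chunk size are placeholders:

    import xarray as xr

    # Open the timeseries files lazily with dask chunks instead of loading
    # everything into memory (file pattern and chunk size are hypothetical).
    cam_ts_data = xr.open_mfdataset("hist_ts/PSL.*.nc", chunks={"time": 12})

    # Same operation that was hanging, now evaluated lazily per chunk;
    # flox (if installed) accelerates this grouped reduction.
    cam_climo_data = cam_ts_data.groupby("time.month").mean(dim="time")

    # Writing the output triggers the actual (chunked) computation.
    cam_climo_data.to_netcdf("PSL_climo.nc")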

#47 discusses implementing dask in ADF.