E3SM-Project / e3sm_diags

E3SM Diagnostics package
https://e3sm-project.github.io/e3sm_diags
BSD 3-Clause "New" or "Revised" License
39 stars 32 forks source link

Fix segfault issue with newest shapely and esmpy #715

Closed xylar closed 1 year ago

xylar commented 1 year ago

This merge fixes a segfault issue related to shapely >=2.0.0 and esmpy >=8.4.0, likely caused by multiprocessing. It appears that things work fine as long as shapely gets imported before e3smpy.

The proposed fix is simply to import shapely, then esmpy at the very start in the e3sm_diags package. This isn't a very elegant fix, so it will be worth exploring better options (perhaps including a bug report to ESMPy and/or shapely developers) in the future. But we need this to work for us sooner than that is likely to work for us.

I also propose constraining esmpy >=8.4.0 so we don't have to deal with the import name change from ESMF to esmpy.

I also propose to constrain shapely >=2.0.0,<3.0.0. There were major changes from shapely 1.x to 2.x and other E3SM analysis tools like polaris, compass, MPAS-Analysis and geometric_features are all using shapely 2.x at this point. e3sm_diags only depends on shapely indirectly (e.g. through cartopy) but for consistency in plots, it seems best to constrain shapely. This may help to prevent issues upgrading E3SM-Unified in the future.

This is my proposed fix for https://github.com/E3SM-Project/zppy/pull/475

xylar commented 1 year ago

Testing

I tested a out the following on Perlmutter:

 mamba remove -y --all -n e3sm_diags_dev
 mamba env create -n e3sm_diags_dev -f conda-env/ci.yml
 mamba activate e3sm_diags_dev
 python -m pip install .

Then, on a compute node, I ran:

#!/usr/bin/env bash

source /global/homes/x/xylar/mambaforge/etc/profile.d/conda.sh
source /global/homes/x/xylar/mambaforge/etc/profile.d/mamba.sh
mamba activate e3sm_diags_dev

e3sm_diags lat_lon --no_viewer \
   --reference_data_path '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology' \
   --test_data_path '/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/climatology/rgr/' \
   --results_dir '/global/homes/x/xylar/tmp/results_conda' --case_id 'GPCP_v3.2' \
   --run_type 'model_vs_obs' --sets 'lat_lon' --variables 'PRECT' --seasons 'ANN' \
   --ref_name 'GPCP_v3.2'

This had failed for me using the main branch (see https://github.com/E3SM-Project/zppy/pull/475#issuecomment-1672940309) but ran successfully with this branch.

xylar commented 1 year ago

@chengzhuzhang and @forsyth2, can you do a little more testing on this (assuming you're okay with it) before we make another RC with it? I'm happy to deploy another E3SM-Unified RC (rc12!) tomorrow if you can get me a e3sm_diags 2.9.0rc3 by then.

forsyth2 commented 1 year ago

Thanks @xylar!!

On my testing branch (https://github.com/E3SM-Project/zppy/pull/477), I was able to successfully run zppy -c tests/integration/generated/test_complete_run_chrysalis.cfg using these changes!

I don't think zppy needs another RC in this case. Once we have another E3SM Diags RC (and then Unified RC), I can do final tests for zppy (and zstash too, which I wouldn't expect to have any issues).

I haven't tested on Perlmutter yet, but it had the same problem as Chrysalis.

Compy's error (https://github.com/E3SM-Project/zppy/pull/475#issuecomment-1670471026) still needs to be fixed before a final release though.

chengzhuzhang commented 1 year ago

@xylar this is a good news. I will test a full run to see if everything works as expected. Though it is mysterious that having the imports in a setting order upfront fixed things (great job to come up with this fix!). I agree that it is worth while to notify shapely/esmf developers about this issue

forsyth2 commented 1 year ago

it is mysterious that having the imports in a setting order upfront fixed things

That is indeed strange. "# isort:skip" has been used in e3sm_diags before though. However, in these cases, it was because we had to run a matplotlib line first. I.e., it wasn't exclusively the import order causing the issue.

For reference, in e3sm_diags/plot/cartopy/streamflow_plot.py:

matplotlib.use("Agg")
import matplotlib.colors as colors  # isort:skip  # noqa: E402                                                
import matplotlib.lines as lines  # isort:skip  # noqa: E402                                                  
import matplotlib.pyplot as plt  # isort:skip  # noqa: E402 

For all appearances: git grep -n "# isort:skip"

tomvothecoder commented 1 year ago

This looks good to me! Thanks @xylar!

it is mysterious that having the imports in a setting order upfront fixed things

I think importing esmpy>=8.4.0 first might initiate code that locks up the GIL without releasing it. Since the GIL is locked, any subsequent code that requires the GIL results in a segmentation fault (e.g., shapely multithreading code).

By importing shapely first, shapely can acquire the GIL and release it for esmpy to use whenever esmpy needs it. This is just my theory and we should let esmpy/shapely people figure it out with a GitHub issue 😂 It might be a lot more complex than that.