CDAT / cdms

Bad file descriptor when using VPN #391

Closed forsyth2 closed 4 years ago

forsyth2 commented 4 years ago

Describe the bug

Running e3sm_diags on a Mac while on VPN causes a Bad File Descriptor error, printed below. @zshaheen explained that this error does not come from the e3sm_diags code, but rather from a problem in CDMS (see https://github.com/E3SM-Project/e3sm_diags/pull/287 for the discussion). The bug is easily worked around by turning off VPN, but it would be nice to be able to stay on VPN.

Fatal error in MPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(572)..............: 
MPID_Init(224).....................: channel initialization failed
MPIDI_CH3_Init(105)................: 
MPID_nem_init(324).................: 
MPID_nem_tcp_init(178).............: 
MPID_nem_tcp_get_business_card(425): 
MPID_nem_tcp_init(384).............: gethostbyname failed, ml-9624328 (errno 1)
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=3191567
:
system msg for write_line failure : Bad file descriptor

To Reproduce

Steps to reproduce the behavior:

  1. Be on VPN (unsure if you have to be on non-LLNL Wi-Fi)
  2. Run e3sm_diags code, for example ./tests/test.sh

Expected behavior

The code should run.

Desktop (please complete the following information):

Environment Information

`conda info`

```
     active environment : e3sm_diags_env_dev
    active env location : /usr/local/anaconda3/envs/e3sm_diags_env_dev
            shell level : 2
       user config file : /Users/forsyth2/.condarc
 populated config files : /Users/forsyth2/.condarc
          conda version : 4.7.12
    conda-build version : 3.18.9
         python version : 3.7.4.final.0
       virtual packages :
       base environment : /usr/local/anaconda3 (writable)
           channel URLs : https://conda.anaconda.org/cdat/label/latest_vtk/osx-64
                          https://conda.anaconda.org/cdat/label/latest_vtk/noarch
                          https://conda.anaconda.org/cdat/label/new_vtk_project_vectors/osx-64
                          https://conda.anaconda.org/cdat/label/new_vtk_project_vectors/noarch
                          https://conda.anaconda.org/cdat/label/nightly/osx-64
                          https://conda.anaconda.org/cdat/label/nightly/noarch
                          https://conda.anaconda.org/conda-forge/osx-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/osx-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /usr/local/anaconda3/pkgs
                          /Users/forsyth2/.conda/pkgs
       envs directories : /usr/local/anaconda3/envs
                          /Users/forsyth2/.conda/envs
               platform : osx-64
             user-agent : conda/4.7.12 requests/2.22.0 CPython/3.7.4 Darwin/18.7.0 OSX/10.14.6
                UID:GID : 26501:26501
             netrc file : None
           offline mode : False
```

`conda config --show-sources`

```
==> /Users/forsyth2/.condarc <==
ssl_verify: False
channel_priority: strict
channels:
  - cdat/label/latest_vtk
  - cdat/label/new_vtk_project_vectors
  - cdat/label/nightly
  - conda-forge
  - defaults
```

`conda list --show-channel-urls`

```
# packages in environment at /usr/local/anaconda3/envs/e3sm_diags_env_dev:
#
# Name                    Version         Build                 Channel
_libgcc_mutex             0.1             main                  defaults
asn1crypto                1.2.0           py37_0                conda-forge
attrs                     19.3.0          py_0                  conda-forge
beautifulsoup4            4.8.1           py37_0                conda-forge
bokeh                     1.3.4           py37_0                conda-forge
bzip2                     1.0.8           h0b31af3_2            conda-forge
ca-certificates           2019.9.11       hecc5488_0            conda-forge
cartopy                   0.17.0          py37h95120c7_1007     conda-forge
cdat_info                 8.2             py_7                  conda-forge
cdms2                     3.1.4           pypi_0                pypi
cdp                       1.6.0           py_0                  conda-forge
cdtime                    3.1.2           py37ha91d4f2_6        conda-forge
cdutil                    8.2             py_2                  cdat/label/v82
certifi                   2019.9.11       py37_0                conda-forge
cffi                      1.13.1          py37h33e799b_0        conda-forge
chardet                   3.0.4           py37_1003             conda-forge
click                     7.0             py_0                  conda-forge
cloudpickle               1.2.2           py_1                  conda-forge
conda                     4.8.2           py37_0                conda-forge
conda-package-handling    1.6.0           py37h0b31af3_1        conda-forge
cryptography              2.7             py37hafa8578_0        conda-forge
curl                      7.65.3          h22ea746_0            conda-forge
cycler                    0.10.0          py_2                  conda-forge
cytoolz                   0.10.0          py37h0b31af3_0        conda-forge
dask                      2.6.0           py_0                  conda-forge
dask-core                 2.6.0           py_0                  conda-forge
dbus                      1.13.6          h2f22bb5_0            conda-forge
decorator                 4.4.1           py_0                  conda-forge
distarray                 2.12.2          py_1                  conda-forge
distributed               2.6.0           py_0                  conda-forge
dv3d                      8.2             py_0                  cdat/label/v82
e3sm-diags                2.0.0           pypi_0                pypi
esmf                      7.1.0           h963e782_1008         conda-forge
esmpy                     7.1.0           py37h5ca1d4c_3        conda-forge
expat                     2.2.5           h4a8c4bd_1004         conda-forge
ffmpeg                    4.2             h5c2b479_0            conda-forge
fontconfig                2.13.1          h6b1039f_1001         conda-forge
freetype                  2.10.0          h24853df_1            conda-forge
fsspec                    0.5.2           py_0                  conda-forge
future                    0.18.1          py37_0                conda-forge
g2clib                    1.6.0           h4e57d6e_9            conda-forge
genutil                   8.2             py37h3b54f70_3        conda-forge
geos                      3.7.2           h6de7cb9_2            conda-forge
gettext                   0.19.8.1        h46ab8bc_1002         conda-forge
ghostscript               9.22            h0a44026_1001         conda-forge
glib                      2.58.3          py37h577aef8_1002     conda-forge
gmp                       6.1.2           h0a44026_1000         conda-forge
gnutls                    3.6.5           h53004b3_1002         conda-forge
gst-plugins-base          1.14.5          hb4a159a_2            conda-forge
gstreamer                 1.14.5          h06b91d7_2            conda-forge
hdf4                      4.2.13          h84186c3_1003         conda-forge
hdf5                      1.10.5          nompi_h3e39495_1104   conda-forge
heapdict                  1.0.1           py_0                  conda-forge
icu                       64.2            h6de7cb9_1            conda-forge
idna                      2.8             py37_1000             conda-forge
importlib_metadata        0.23            py37_0                conda-forge
ipython_genutils          0.2.0           py_1                  conda-forge
jasper                    1.900.1         h636a363_1006         conda-forge
jinja2                    2.10.3          py_0                  conda-forge
jpeg                      9c              h1de35cc_1001         conda-forge
jsonschema                3.1.1           py37_0                conda-forge
jupyter_core              4.5.0           py_0                  conda-forge
kiwisolver                1.1.0           py37ha1b3eb9_0        conda-forge
krb5                      1.16.3          hcfa6398_1001         conda-forge
lame                      3.100           h1de35cc_1001         conda-forge
lazy-object-proxy         1.4.3           py37h0b31af3_1        conda-forge
libblas                   3.8.0           11_openblas           conda-forge
libcblas                  3.8.0           11_openblas           conda-forge
libcdms                   3.1.2           hbe35099_5            conda-forge
libcf                     1.0.3           py37h00f410c_1        conda-forge
libclang                  8.0.1           h770b8ee_1            conda-forge
libcurl                   7.65.3          h16faf7d_0            conda-forge
libcxx                    9.0.1           1                     conda-forge
libdrs                    3.1.2           h1ddc27c_7            conda-forge
libdrs_f                  3.1.2           hb052ab9_6            conda-forge
libedit                   3.1.20170329    hcfe32e1_1001         conda-forge
libffi                    3.2.1           h6de7cb9_1006         conda-forge
libgcc                    4.8.5           1                     conda-forge
libgfortran               4.0.0           2                     conda-forge
libiconv                  1.15            h01d97ff_1005         conda-forge
liblapack                 3.8.0           11_openblas           conda-forge
libllvm8                  8.0.1           h770b8ee_0            conda-forge
libllvm9                  9.0.0           h770b8ee_3            conda-forge
libnetcdf                 4.6.2           h1a02027_1003         conda-forge
libopenblas               0.3.6           h4bb4525_6            conda-forge
libpng                    1.6.37          h2573ce8_0            conda-forge
libssh2                   1.8.2           hcdc9a53_2            conda-forge
libtiff                   4.0.10          h3527a1b_1004         conda-forge
libuuid                   2.32.1          h1de35cc_1000         conda-forge
libxcb                    1.13            h1de35cc_1002         conda-forge
libxml2                   2.9.9           h12c6b28_5            conda-forge
libxslt                   1.1.33          h320ff13_0            conda-forge
llvm-openmp               9.0.1           h28b9765_2            conda-forge
locket                    0.2.0           py_2                  conda-forge
lxml                      4.4.1           py37h08abf6f_0        conda-forge
lz4-c                     1.8.3           h6de7cb9_1001         conda-forge
markupsafe                1.1.1           py37h0b31af3_0        conda-forge
matplotlib                3.1.1           py37_2                conda-forge
matplotlib-base           3.1.1           py37h11da6c2_2        conda-forge
more-itertools            7.2.0           py_0                  conda-forge
mpi                       1.0             mpich                 conda-forge
mpich                     3.3.1           hc856adb_1            conda-forge
msgpack-python            0.6.2           py37ha1b3eb9_0        conda-forge
nbformat                  4.4.0           py_1                  conda-forge
ncurses                   6.1             h0a44026_1002         conda-forge
netcdf-fortran            4.4.5           h1993a31_1004         conda-forge
nettle                    3.4.1           h3efe00b_1002         conda-forge
nspr                      4.20            h0a44026_1000         conda-forge
nss                       3.47            hc0980d9_0            conda-forge
numpy                     1.17.3          py37hde6bac1_0        conda-forge
olefile                   0.46            py_0                  conda-forge
openblas                  0.3.6           h4bb4525_6            conda-forge
openh264                  1.8.0           hd9629dc_1000         conda-forge
openssl                   1.1.1c          h01d97ff_0            conda-forge
output_viewer             1.3.1           py_1                  conda-forge
owslib                    0.18.0          py_0                  conda-forge
packaging                 19.2            py_0                  conda-forge
pandas                    0.25.2          py37h4f17bb1_0        conda-forge
partd                     1.0.0           py_0                  conda-forge
pcre                      8.43            h4a8c4bd_0            conda-forge
pillow                    6.2.1           py37hb6f49c9_0        conda-forge
pip                       19.3.1          py37_0                conda-forge
proj4                     6.1.1           hca663eb_1            conda-forge
psutil                    5.6.3           py37h0b31af3_0        conda-forge
pthread-stubs             0.4             h1de35cc_1001         conda-forge
pycosat                   0.6.3           py37h0b31af3_1002     conda-forge
pycparser                 2.19            py37_1                conda-forge
pyepsg                    0.4.0           py_0                  conda-forge
pykdtree                  1.3.1           py37h3b54f70_1002     conda-forge
pyopenssl                 19.0.0          py37_0                conda-forge
pyparsing                 2.4.2           py_0                  conda-forge
pyproj                    2.3.1           py37h9bb365a_0        conda-forge
pyqt                      5.12.3          py37he22c54c_1        conda-forge
pyqt5-sip                 4.19.18         pypi_0                pypi
pyqtwebengine             5.12.1          pypi_0                pypi
pyrsistent                0.15.5          py37h0b31af3_0        conda-forge
pyshp                     2.1.0           py_0                  conda-forge
pysocks                   1.7.1           py37_0                conda-forge
python                    3.7.3           h93065d6_1            conda-forge
python-dateutil           2.8.0           py_0                  conda-forge
pytz                      2019.3          py_0                  conda-forge
pyyaml                    5.1.2           py37h0b31af3_0        conda-forge
qt                        5.12.5          h1b46049_0            conda-forge
readline                  8.0             hcfe32e1_0            conda-forge
regrid2                   3.1.4           pypi_0                pypi
requests                  2.22.0          py37_1                conda-forge
ruamel_yaml               0.15.71         py37h1de35cc_1000     conda-forge
scipy                     1.3.1           py37h7e0e109_2        conda-forge
setuptools                41.4.0          py37_0                conda-forge
shapely                   1.6.4           py37h5c88e11_1006     conda-forge
six                       1.12.0          py37_1001             conda-forge
sortedcontainers          2.1.0           py_0                  conda-forge
soupsieve                 1.9.4           py37_0                conda-forge
sqlite                    3.30.1          h93121df_0            conda-forge
tblib                     1.4.0           py_0                  conda-forge
tk                        8.6.9           h2573ce8_1003         conda-forge
toolz                     0.10.0          py_0                  conda-forge
tornado                   6.0.3           py37h0b31af3_4        conda-forge
tqdm                      4.42.1          py_0                  conda-forge
traitlets                 4.3.3           py37_0                conda-forge
udunits2                  2.2.27.6        h776b7f1_1001         conda-forge
urllib3                   1.25.6          py37_0                conda-forge
vcs                       8.2             py_2                  cdat/label/v82
vcsaddons                 8.2             py37h1de35cc_1        cdat/label/v82
vtk-cdat                  8.2.0.8.2       py37h3a4d124_0        cdat/label/v82
wheel                     0.33.6          py37_0                conda-forge
x264                      1!152.20180806  h1de35cc_0            conda-forge
xorg-libxau               1.0.9           h1de35cc_0            conda-forge
xorg-libxdmcp             1.1.3           h01d97ff_0            conda-forge
xz                        5.2.4           h1de35cc_1001         conda-forge
yaml                      0.1.7           h1de35cc_1001         conda-forge
zict                      1.0.0           py_0                  conda-forge
zipp                      0.6.0           py_0                  conda-forge
zlib                      1.2.11          h0b31af3_1006         conda-forge
zstd                      1.4.3           he7fca8b_0            conda-forge
```

davidcbaderatllnl commented 4 years ago

See:

https://stackoverflow.com/questions/23112515/mpich2-gethostbyname-failed

https://github.com/conda-forge/fenics-feedstock/issues/44

chengzhuzhang commented 4 years ago

Another related open issue is https://github.com/CDAT/vcdat/issues/295 and some of the discussion there might be helpful for troubleshooting.

davidcbaderatllnl commented 4 years ago

A workaround for the LLNL VPN: add this line to /etc/hosts (you need root privileges):

127.0.0.1 computer_name.llnl.gov
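Before editing /etc/hosts, it may help to check whether the file already maps the machine's name. The following is a minimal sketch, not part of CDMS or MPICH; the function name `hosts_text_maps` is hypothetical, and `computer_name.llnl.gov` is the placeholder from the comment above, not a real host.

```python
def hosts_text_maps(hosts_text: str, hostname: str) -> bool:
    """Return True if hosts-file text contains an entry naming `hostname`.

    Each non-comment line has the form: <address> <name> [<alias> ...].
    Hostnames are compared case-insensitively.
    """
    wanted = hostname.lower()
    for raw_line in hosts_text.splitlines():
        # Drop anything after '#', then split on whitespace.
        fields = raw_line.split("#", 1)[0].split()
        # fields[0] is the address; the remaining fields are names and aliases.
        if len(fields) >= 2 and wanted in (name.lower() for name in fields[1:]):
            return True
    return False


if __name__ == "__main__":
    # Illustrative hosts-file content using the placeholder name from the thread.
    sample = "127.0.0.1 localhost\n127.0.0.1 computer_name.llnl.gov\n"
    print(hosts_text_maps(sample, "computer_name.llnl.gov"))
```

To check a real machine, pass the contents of /etc/hosts and the value of `socket.gethostname()` instead of the sample text.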

A longer-term fix may be to identify where MPI calls gethostname() and replace it with MPI_Get_processor_name(), which is part of the MPI standard and portable.

see https://stackoverflow.com/questions/23112515/mpich2-gethostbyname-failed

chengzhuzhang commented 4 years ago

@downiec @jasonb5 @muryanto1 @gabdulla @painter1 @doutriaux1 Hey guys, I talked to some of you and am pinging you in case someone has been looking into this. I don't have much expertise in how MPI works in CDAT, or in whether a faulty MPI library version is pinned and needs to be updated. This issue has also been seen randomly on a compute node of a cluster.

forsyth2 commented 4 years ago

Possibly related: In early February I was trying to run e3sm_diags on my Mac for the first time (instead of on Cori or Compy). I was getting ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain when onsite at LLNL. Running e3sm_diags offsite allowed it to pull down whatever files or resources it needed, and I've since been able to run e3sm_diags onsite. (However, running offsite on VPN causes the error described in this issue).

forsyth2 commented 4 years ago

Running python -c "import cdms2" is sufficient to produce the error (on my Mac and in this environment) -- no need to actually run e3sm_diags itself.

jasonb5 commented 4 years ago

Here's a little update on the progress of this issue.

The issue is definitely caused by DNS not being able to resolve the system's hostname. My best guess is that connecting to VPN reconfigures DNS and prevents the hostname from resolving. Interestingly enough, I was never able to reproduce this on VPN until I purposely misconfigured my DNS settings.
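One way to check for this condition without involving MPI is to ask the resolver directly. This is a rough sketch using only the Python standard library; the helper name `hostname_resolves` is hypothetical, and it only approximates the lookup MPICH performs, which may differ in detail.

```python
import socket


def hostname_resolves(name: str) -> bool:
    """Return True if `name` resolves to an IPv4 address, False if the lookup fails."""
    try:
        socket.gethostbyname(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    return True


if __name__ == "__main__":
    local_name = socket.gethostname()
    status = "resolves" if hostname_resolves(local_name) else "does NOT resolve"
    print(f"{local_name} {status}")
```

If the local hostname does not resolve while on VPN but does resolve off VPN, that would be consistent with the diagnosis above.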

I've traced the source of the crash to the following line: https://github.com/CDAT/cdms/blob/753fd7a3441e5f073fbb8beb6ab0723d379eec54/regrid2/Lib/mvESMFRegrid.py#L18

This can be verified with python -c "import ESMF; ESMF.Manager()"

I'll be opening up an issue with ESMF.

For the time being, the workaround in https://github.com/CDAT/cdms/issues/391#issuecomment-604695738 will work, or you can run export MPICH_INTERFACE_HOSTNAME=localhost
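The same variable can presumably be set from inside a Python script, as long as that happens before anything initializes MPI. The sketch below assumes MPICH reads MPICH_INTERFACE_HOSTNAME from the process environment at initialization; the thread only confirms the shell `export` form, and the helper name `apply_mpich_hostname_workaround` is hypothetical.

```python
import os


def apply_mpich_hostname_workaround(env=os.environ, hostname="localhost"):
    """Set MPICH_INTERFACE_HOSTNAME unless it is already set; return its value."""
    return env.setdefault("MPICH_INTERFACE_HOSTNAME", hostname)


# Must run before the first import that initializes MPI.
apply_mpich_hostname_workaround()

# import cdms2  # assumed installed; import it only after the variable is set
```

Using `setdefault` leaves any value already exported in the shell untouched, so the script does not override a deliberate setting.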

forsyth2 commented 4 years ago

@jasonb5 Thanks for looking into this! I confirmed that export MPICH_INTERFACE_HOSTNAME=localhost allows me to run e3sm_diags while on VPN.

gabdulla commented 4 years ago

Thank you Jason!

Ghaleb
