OSError: undefined symbol: __netcdf_MOD_nf90_put_var_1d_fourbyteint

treerink commented 6 years ago

I am not sure whether this issue appeared after the latest updates or after I recreated the ece2cmor conda environment:

The error can be reproduced by running ./check-for-obsolete-cmor-variables-in-json-file.py, but it applies to checkvars.py as well. As far as I can judge, the results are not affected.

Traceback (most recent call last):
  File "/usr/people/reerink/anaconda2/envs/ece2cmor3/lib/python2.7/site-packages/ESMF/interface/loadESMF.py", line 122, in <module>
    mode=ct.RTLD_GLOBAL)
  File "/usr/people/reerink/anaconda2/envs/ece2cmor3/lib/python2.7/ctypes/__init__.py", line 366, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libesmf_fullylinked.so: undefined symbol: __netcdf_MOD_nf90_put_var_1d_fourbyteint

treerink commented 6 years ago

On a fresh checkout with a newly created ece2cmor3 environment on cca, this error message is absent ...

goord commented 6 years ago

Can you post the operating system you are using (Fedora version)?

treerink commented 6 years ago

Fedora 26

treerink commented 6 years ago

Even after removing the environment with conda env remove --name ece2cmor3 and doing a fresh install of ece2cmor, this error message persists on my KNMI Fedora workstation (I haven't encountered it on either Mac or cca).

oloapinivad commented 6 years ago

Hi all,

I am new to this, so forgive me if I have not followed the full story. I am trying to catch up with ece2cmor3 at CNR. I am experiencing the same issue on the Marconi HPC at Cineca. I am not able to cmorize any NEMO output for the moment, but I am not sure if this is the main problem.

HPC Traceback (most recent call last):
  File "/marconi_work/Pra13_3311/opt/anaconda/envs/ece2cmor3/lib/python2.7/site-packages/ESMF/interface/loadESMF.py", line 122, in <module>
    mode=ct.RTLD_GLOBAL)
  File "/marconi_work/Pra13_3311/opt/anaconda/envs/ece2cmor3/lib/python2.7/ctypes/__init__.py", line 366, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /marconi_work/Pra13_3311/opt/anaconda/envs/ece2cmor3/lib/libesmf_fullylinked.so: undefined symbol: __netcdf_MOD_nf90_put_var_1d_fourbyteint

goord commented 6 years ago

Hi @oloapinivad, I have been in contact with Kristian Strommen, and I think the problem with your NEMO files on Marconi was that the time axis had disappeared when the parallel netcdf files from the XIOS server processes were merged. Can you confirm this?

oloapinivad commented 6 years ago

Thanks @goord for the reply. Actually, the one you mention was one of the issues that Kristian and I managed to solve together. The problem above is still there, but it does not seem to be the culprit of my NEMO crashes.

Indeed, I have a few extra problems with NEMO (with depth axes, which lead to a crash) and IFS (with Primavera tables, which do not cmorize some high-frequency variables). I will try to figure it out in the next days; in the worst case I am going to open separate issues.

treerink commented 6 years ago

Also, if I remove these lines:

esmf=7.1.0r=0 # 5 depends on hdf5 1.10.1, libgcc, mpich, netcdf-fortran 4.4.*
esmpy=7.0.0=py27_1

from the environment.yml and recreate the environment, the error remains on my KNMI workstation. So I assume this loadESMF.py must be installed automatically somewhere because of a detected dependency?

Removing these lines did not matter for checkvars.py at least.

zklaus commented 5 years ago

@treerink According to this comment the issue vanished for @oloapinivad. Do you still have the problem? Can we close this issue?

treerink commented 5 years ago

This issue still persists on my KNMI Fedora workstation, even after a full clean: a new Anaconda release, a new ece2cmor3 checkout, and a clean environment. On Ubuntu I do not encounter this error. The error does not seem to have any impact, but it appears in all output of the scripts. I gave up on this error, but kept the issue open, because once in a while someone else might experience this.

zklaus commented 5 years ago

I understand that time is short and you might prefer not to address this issue, in which case I would still suggest closing it, or at least marking it as 'wont_fix' or something like that, so that it can easily be ignored.

Having said that, here is another idea for debugging. That symbol is resolved by the libnetcdff library (note the second f for fortran). Could you do a

ldd /marconi_work/Pra13_3311/opt/anaconda/envs/ece2cmor3/lib/libesmf_fullylinked.so

and confirm that there is a line about netcdff similar to

libnetcdff.so.6 => /marconi_work/Pra13_3311/opt/anaconda/envs/ece2cmor3/lib/libnetcdff.so.6

? This should tell us whether it finds the right library (which might nonetheless not contain the right symbol) or whether it pulls in a wrong library, possibly from a system path.

treerink commented 5 years ago

In my case, yes I have:

/usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libesmf_fullylinked.so

and

/usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libnetcdff.so -> libnetcdff.so.6.1.1*
/usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libnetcdff.so.6 -> libnetcdff.so.6.1.1*
/usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libnetcdff.so.6.1.1*

zklaus commented 5 years ago

This looks like the output from ls. With ldd we get to know what the dynamic linker considers to be the appropriate libraries to load as dependencies.
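As a rough illustration of what to look for in that output, here is a small Python sketch that parses ldd text and lists the libraries resolved outside a given environment prefix. The function name `libs_outside_prefix`, the sample text and its paths are all made up for illustration; the only thing assumed about ldd is its usual `name => path (address)` line format.

```python
def libs_outside_prefix(ldd_output, prefix):
    """Return {library name: resolved path} for libraries resolved outside prefix."""
    root = prefix.rstrip("/") + "/"
    outside = {}
    for line in ldd_output.splitlines():
        # Dependency lines look like: "libfoo.so.1 => /path/libfoo.so.1 (0x...)"
        if "=>" not in line:
            continue
        name, _, target = line.partition("=>")
        fields = target.split()
        # Skip entries without a path, e.g. "libfoo.so => not found".
        if not fields or not fields[0].startswith("/"):
            continue
        path = fields[0]
        if not path.startswith(root):
            outside[name.strip()] = path
    return outside


# Hypothetical sample shaped like ldd output; the paths are invented.
sample = """\
\tlinux-vdso.so.1 (0x00007ffd0a5f2000)
\tlibnetcdff.so.6 => /usr/local/other/lib/libnetcdff.so.6 (0x00007f22094c4000)
\tlibnetcdf.so.13 => /opt/env/lib/libnetcdf.so.13 (0x00007f2209100000)
"""
print(libs_outside_prefix(sample, "/opt/env"))
```

In practice the text would come from running ldd on libesmf_fullylinked.so, with the conda environment directory as the prefix. Any library reported here is one the dynamic linker takes from somewhere other than the environment.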

treerink commented 5 years ago

Is this what you were asking for?


 ldd /usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libesmf_fullylinked.so |grep libnetcdff.so.6
 libnetcdff.so.6 => /usr/local/free/installed/netcdf_for_fortran_f26/netcdf-fortran-4.4.4_ifort/lib/libnetcdff.so.6 (0x00007f22094c4000)

zklaus commented 5 years ago

Exactly! Here, we see that the dynamic linker picked up the wrong netcdf fortran library. Instead of the correct

/usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libnetcdff.so.6

which contains the expected symbol

nm /usr/people/reerink/anaconda2/envs/ece2cmor3/lib/libnetcdff.so.6 |egrep put_var_1d_fourbyteint
0000000000051ca0 T __netcdf_MOD_nf90_put_var_1d_fourbyteint

it is using

/usr/local/free/installed/netcdf_for_fortran_f26/netcdf-fortran-4.4.4_ifort/lib/libnetcdff.so.6

which was compiled with an Intel compiler. Because these crucial bits of the ABI are not fixed in the Fortran specification but left to the compiler implementations, it instead has

nm libnetcdff.so.6 |egrep put_var_1d_fourbyteint
00000000000598c0 T netcdf_mp_nf90_put_var_1d_fourbyteint_

i.e. different name mangling: the underscores are placed differently, and modules are marked with mp instead of MOD. So this is a local configuration problem, not an ece2cmor bug, and hence can be closed, I think.
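To make the difference concrete, here is a minimal Python sketch of the two naming schemes, following the patterns seen in the two nm outputs above. The helper names `gfortran_symbol` and `ifort_symbol` are invented for this example, and it only covers plain module procedures on Linux, not every case either compiler handles.

```python
def gfortran_symbol(module, procedure):
    """Symbol name in the gfortran style: __<module>_MOD_<procedure>."""
    return "__{}_MOD_{}".format(module.lower(), procedure.lower())


def ifort_symbol(module, procedure):
    """Symbol name in the Intel style: <module>_mp_<procedure>_."""
    return "{}_mp_{}_".format(module.lower(), procedure.lower())


# The module procedure that ESMF is looking for in this issue.
print(gfortran_symbol("netcdf", "nf90_put_var_1d_fourbyteint"))
print(ifort_symbol("netcdf", "nf90_put_var_1d_fourbyteint"))
```

A library built by one compiler therefore exports none of the module symbols that code built by the other one asks for, even though both were compiled from the same Fortran source.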

To solve your problem, note that the linker gets its search path from three places:

1. The binary, i.e. libesmf_fullylinked.so itself
2. The environment variable $LD_LIBRARY_PATH
3. The linker configuration, typically /etc/ld.so.conf

Usually the culprit for this kind of problem is a rogue $LD_LIBRARY_PATH variable. So you can try

echo $LD_LIBRARY_PATH

to check if it contains something like /usr/local/free/installed/netcdf_for_fortran_f26/netcdf-fortran-4.4.4_ifort/lib and then

unset LD_LIBRARY_PATH
ece2cmor

to see if this way the error vanishes. Of course the problem is that you might actually need that version of netcdf for something else. This is the reason for the widespread module system that takes care of setting and unsetting these variables in a slightly simpler fashion. Indeed, it may well be that also here it has been set by an innocent

module load netcdf

in your .bashrc.
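If the variable is long, a few lines of Python can do the looking for you. This is only a sketch: the function name `dirs_providing` is made up, and it simply checks each directory in a colon-separated search path for a file with the given name, which is a simplification of what the dynamic linker really does.

```python
import os


def dirs_providing(library, search_path):
    """Return the directories in a colon-separated search path that contain library."""
    hits = []
    for directory in search_path.split(":"):
        # Empty entries are skipped here to keep the sketch simple.
        if directory and os.path.exists(os.path.join(directory, library)):
            hits.append(directory)
    return hits


# Directories on the current LD_LIBRARY_PATH that offer a libnetcdff.so.6.
print(dirs_providing("libnetcdff.so.6", os.environ.get("LD_LIBRARY_PATH", "")))
```

Any directory printed that lies outside the conda environment is a candidate for the library being picked up by mistake.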

Good luck!

goord commented 5 years ago

Thanks for your input, @zklaus. I thought we had checked LD_LIBRARY_PATH, no, Thomas?

treerink commented 5 years ago

It took me a while to go through your whole comment (for this admittedly low-priority issue); anyway, @zklaus, thanks for your guidance. Indeed, after checking

echo $LD_LIBRARY_PATH

I finally vaguely remembered that I had adjusted it in my bash alias file when I started here at KNMI: I had some trouble with netcdf on Fedora and was advised to use a fix. Anyway, taking out that LD_LIBRARY_PATH setting does the job: the error appears with the setting and disappears without it. So solved!

EC-Earth / ece2cmor3

OSError: undefined symbol: __netcdf_MOD_nf90_put_var_1d_fourbyteint #110