StevenCTimm closed 9 months ago
To reproduce:
. /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
setup rucio v1_29_11
setup dunesw v09_70_00d00 -q e20:prof
rucio --verbose upload --rse DUNE_US_FNAL_DISK_STAGE --protocol davs --scope testpro --lifetime 86400 --name tutorial_hist_824_1_20230417T142645Z.root /pnfs/dune/scratch/users/kherner/jan2023tutorial/tutorial_hist_824_1_20230417T142645Z.root
Or pick whatever file you want. If rucio and dunesw are set up at the same time, the upload will segfault.
If you set up only the new python v3_9_13 and not the rest of dunesw, you don't have this problem. So something in the 40-some directories of the dunesw PYTHONPATH is giving us a wrong library. The segfault happens while it is going through the gfal2-plugins:
close(8) = 0
open("/lib64/tls/liblfc.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/liblfc.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/liblfc.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib64/liblfc.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
munmap(0x7f8e1ae7d000, 202156) = 0
munmap(0x7f8df2ab4000, 2147432) = 0
geteuid() = 2904
access("/tmp/x509up_u2904", R_OK) = 0
geteuid() = 2904
access("/tmp/x509up_u2904", R_OK) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x1f7b6} ---
+++ killed by SIGSEGV +++
We note that the CVMFS plugins directory for gfal2 has 9 plugins rather than the 6 that are installed on the gpvms.
3347 /cvmfs/dune.opensciencegrid.org/products/dune/python_gfal2_python/v1_11_0_post3/NULL/lib/python3.6/site-packages/gfal2.so
6 /cvmfs/dune.opensciencegrid.org/products/dune/python_requests/v2_25_0/NULL/lib/python3/site-packages/urllib3/contrib/__pycache__/socks.cpython-36.pyc
7 /cvmfs/dune.opensciencegrid.org/products/dune/python_requests/v2_25_0/NULL/lib/python3/site-packages/urllib3/contrib/socks.py
This is now moot since we are no longer using the rucio v1_29_12 client.
While investigating a different issue on the RAL tier1, Ken found that when the dunesw software stack is set up (which brings python 3.9 along with it, python v3_9_2 by default), it is impossible to rucio upload with https: you get a segfault. If dunesw is not set up, things are fine.
The segfault most likely comes from the fact that some of the compiled *.so libraries in the cvmfs python library area are built against python 3.6 rather than python 3.9, but we have not yet been able to track down which one. There are 3 main candidates, the first being gfal2.so from python3-gfal2.
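One way to narrow down the candidates is to walk the PYTHONPATH directories and flag every shared object that looks like it was built for a different interpreter. The sketch below is only a heuristic: it trusts the `cpython-XY` tag in the filename when there is one, and otherwise falls back to a `pythonX.Y` component in the directory path (which is all an untagged file like gfal2.so offers). The function names `classify_extension` and `scan` are made up for this example, and a directory name is a hint, not proof of what a library was compiled against.

```python
import os
import re
import sys
from pathlib import Path

# CPython ABI tag in an extension filename, e.g. "foo.cpython-36m-x86_64-linux-gnu.so".
_TAG_RE = re.compile(r"\.cpython-(\d)(\d+)[a-z]*(?:-[^.]+)?\.so$")
# Versioned directory component such as "python3.6".
_DIR_RE = re.compile(r"python(\d+)\.(\d+)")


def classify_extension(path, running=None):
    """Return 'ok', 'mismatch', or 'unknown' for one shared-object path."""
    running = running or sys.version_info[:2]
    text = str(path)
    tag = _TAG_RE.search(text)
    if tag:
        built_for = (int(tag.group(1)), int(tag.group(2)))
        return "ok" if built_for == running else "mismatch"
    # Untagged file: fall back to the directory name as a hint.
    versions = {(int(a), int(b)) for a, b in _DIR_RE.findall(text)}
    if not versions:
        return "unknown"
    return "ok" if versions == {running} else "mismatch"


def scan(search_path=None, running=None):
    """Yield (verdict, path) for every *.so under the given directories.

    Defaults to the entries of the PYTHONPATH environment variable.
    """
    if search_path is None:
        search_path = os.environ.get("PYTHONPATH", "").split(os.pathsep)
    for entry in search_path:
        if not entry or not Path(entry).is_dir():
            continue
        for so_file in sorted(Path(entry).rglob("*.so")):
            yield classify_extension(so_file, running), so_file


if __name__ == "__main__":
    for verdict, so_file in scan():
        if verdict != "ok":
            print(f"{verdict:8s} {so_file}")
```

Run it with the same interpreter that rucio uses, after both `setup` commands, so that PYTHONPATH and the running version match the failing environment. Walking large cvmfs trees can be slow.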
In theory, comparing the strace of a rucio upload that segfaults with one that does not should tell the story. I have this output but haven't been able to fully parse it yet.
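The comparison could be automated along these lines. This is a minimal sketch that assumes both traces are plain `strace` text logs with `open(...)` or `openat(...)` lines in the usual format; it lists files opened successfully in only one of the two runs. The names `opened_files` and `compare_traces` and the command-line usage are hypothetical.

```python
import re
import sys

# Matches open("path", flags) = fd and openat(AT_FDCWD, "path", flags) = fd.
_OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:[^",]+,\s*)?"([^"]+)"[^)]*\)\s*=\s*(-?\d+)'
)


def opened_files(strace_text):
    """Return the set of paths that were opened successfully in a trace."""
    opened = set()
    for line in strace_text.splitlines():
        match = _OPEN_RE.search(line)
        # A negative return value (e.g. -1 ENOENT) means the open failed.
        if match and int(match.group(2)) >= 0:
            opened.add(match.group(1))
    return opened


def compare_traces(good_text, bad_text):
    """Return (only_in_bad, only_in_good) as sorted lists of paths."""
    good, bad = opened_files(good_text), opened_files(bad_text)
    return sorted(bad - good), sorted(good - bad)


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as good_log, open(sys.argv[2]) as bad_log:
        only_bad, only_good = compare_traces(good_log.read(), bad_log.read())
    for path in only_bad:
        if ".so" in path:  # shared libraries are the main suspects here
            print("only in failing run:", path)
    for path in only_good:
        if ".so" in path:
            print("only in working run:", path)
```

Libraries that appear only in the failing run, especially ones under a python3.6 directory, would be the first things to look at.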
Note that rucio upload does not shell out to gfal-copy or gfal-mkdir; it uses the python bindings instead.
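Since the bindings are loaded in-process, the crash might be reproducible without rucio at all. The sketch below runs a small probe in a child interpreter and reports whether it was killed by a signal. It assumes that importing gfal2 and calling `gfal2.creat_context()` is enough to trigger the plugin loading seen in the strace; that has not been checked here, and `run_probe` is a hypothetical helper name.

```python
import signal
import subprocess
import sys

# Assumption: creating a gfal2 context is what loads the gfal2 plugins.
GFAL2_PROBE = "import gfal2; gfal2.creat_context(); print('gfal2 context created')"


def run_probe(code, python=sys.executable):
    """Run `code` in a child interpreter and describe how it exited."""
    result = subprocess.run([python, "-c", code], capture_output=True, text=True)
    if result.returncode < 0:
        # On POSIX a negative return code is the number of the fatal signal.
        return "killed by " + signal.Signals(-result.returncode).name
    if result.returncode == 0:
        return "ok"
    return f"exited with status {result.returncode}"


if __name__ == "__main__":
    print(run_probe(GFAL2_PROBE))
```

Running this once with only rucio set up and once with rucio plus dunesw would show whether the bindings alone are enough to segfault. A segfault cannot be caught inside the crashing process, which is why the probe runs in a subprocess.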