PySlurm / pyslurm

Python Interface to Slurm
https://pyslurm.github.io
GNU General Public License v2.0

symbol lookup error: undefined symbol: data_init #289

KrisDavie closed 1 year ago

KrisDavie commented 1 year ago

Details

Issue

Hey there!

Our cluster just updated the slurm version and I've updated pyslurm accordingly, but when importing pyslurm I'm getting the following error.

>>> import pyslurm
python3: symbol lookup error: /usr/lib64/slurm/cli_filter_lua.so: undefined symbol: data_init

I've tried both the latest commit 788f445 and the tagged 23.2.0 release f506d63, both result in the same error.

Any ideas what might be going on here?

Many thanks for the great library!

Kris

tazend commented 1 year ago

Hi,

Hm, interesting... If you run nm -D /usr/lib64/slurm/cli_filter_lua.so | grep data_init, it's really not showing up, right?

Could you also check whether the error goes away when you manually remove the slurm_init call from pyslurm/__init__.py and reinstall?

KrisDavie commented 1 year ago

Thanks for the help.

Running nm does find it:

➜ nm -D /usr/lib64/slurm/cli_filter_lua.so | grep data_init
    U data_init

I couldn't find a slurm_init call in pyslurm/__init__.py, but there was one on the last line of pyslurm/pyslurm.pyx. Removing that seems to let me load the library, but then a call to pyslurm.slurmdb_jobs() causes a segfault (maybe not unexpected?).

Cheers,

Kris

tazend commented 1 year ago

Hi,

Oh yeah, the slurm_init call in pyslurm/__init__.py only exists on the most recent commit on the main branch (or the 23.2.x branch), which I recommend using, as it already includes a bit of API rework (there will be a new release soon, though).

Removing slurm_init and then making an API call that potentially segfaults is indeed expected. I just wanted to make sure that the slurm_init call is actually the point where the lookup error is raised.

I will also try to reproduce it on my test cluster and run some tests.

tazend commented 1 year ago

Hi again,

As I found out, that error was introduced with slurm 23.02.

Basically, in 23.02, slurm_init now explicitly loads any client plugins, such as cli_filter, that may be required to interact with the API. The problem, as the error indicates, is that a symbol called data_init is expected to be provided by some shared library: the U (undefined) in the nm output means that cli_filter_lua.so references the symbol but does not define it itself.
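To make the nm check above easier to repeat, here is a minimal Python sketch that classifies a symbol from the text printed by nm -D. The function name symbol_status is made up for illustration, and the second sample line (with its address) is invented to show what a defining library would print; only the "U data_init" line comes from this thread.

```python
def symbol_status(nm_output: str, symbol: str) -> str:
    """Classify `symbol` in the text printed by `nm -D <library>`.

    Returns "undefined", "defined", or "absent".
    """
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # The symbol name is always the last field; drop any "@VERSION" suffix.
        name = fields[-1].split("@")[0]
        if name != symbol:
            continue
        # Undefined entries have no address column, only "<type> <name>".
        return "undefined" if len(fields) == 2 else "defined"
    return "absent"


# First sample: the line reported in this thread.
print(symbol_status("                 U data_init", "data_init"))  # undefined
# Second sample: a made-up line, as a library defining the symbol would print.
print(symbol_status("0000000000012340 T data_init", "data_init"))  # defined
```

An "undefined" result is normal for a plugin. It only becomes a problem when nothing loaded into the process defines the symbol.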

This symbol is in libslurmfull.so, which basically contains the public API plus all internal functions, and every Slurm tool like squeue, sbatch, slurmctld, slurmdbd, ... links to it. That's why no error appears when using the Slurm tools.

It is, however, not in libslurm.so, which is usually the recommended library to link against to interact with Slurm. Because of that, basically any client application that links with libslurm.so in 23.02, like pyslurm, and calls slurm_init (which is mandatory when doing API calls) is broken. If you have some of the tools from the slurm-contribs package installed, like seff, they should yield the same error.
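If you want to verify this on your own system, a rough sketch using Python's ctypes module could look like the following. The helper name exports_symbol and the two library paths are assumptions for illustration (the location of libslurmfull.so in particular varies between installations), so adjust them before running.

```python
import ctypes


def exports_symbol(library_path, symbol, loader=ctypes.CDLL):
    """Return True if the shared library at `library_path` provides `symbol`.

    The lookup goes through the dynamic linker, so a symbol provided by one
    of the library's own dependencies also counts.
    """
    lib = loader(library_path)  # raises OSError if the library cannot be loaded
    # ctypes resolves attributes lazily and raises AttributeError for names
    # the library does not provide, which hasattr() turns into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Assumed library locations; adjust them to your installation.
    for path in ("/usr/lib64/libslurm.so", "/usr/lib64/slurm/libslurmfull.so"):
        try:
            print(path, exports_symbol(path, "data_init"))
        except OSError as exc:
            print(path, "could not be loaded:", exc)
```

The loader parameter exists only so the function can be exercised without Slurm installed; in normal use the default ctypes.CDLL is what you want.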

The bug has already been reported, however: https://bugs.schedmd.com/show_bug.cgi?id=16503 (Not sure if it's already fixed in 23.02.2, but I don't think so.)

But I have been thinking about switching back to libslurmfull for pyslurm anyway actually, as it might make certain things a bit easier to implement in the future.

tazend commented 1 year ago

You can build from this branch for now if you want; it links with libslurmfull and the error should go away.

KrisDavie commented 1 year ago

Just jumping in to say that the branch you linked worked great, thanks a lot for the quick fix!

tazend commented 1 year ago

Hi @KrisDavie ,

just wanted to let you know that the issue with the missing data_init symbol should be fixed in Slurm 23.02.2 (by this commit). If your cluster has already been updated to this version, you can continue to use the normal pyslurm releases instead of the branch I made where it links to libslurmfull.

Also a note on that: I planned on actually merging the change where we link back with libslurmfull to the main branch, but I noticed a specific test was failing. The issue can be triggered with this for example:

python -c "import pyslurm; gg = pyslurm.utils.nodelist_from_range_str('node[001-002]'); print(gg)";

You should probably see some weird unknown error if you are still using the branch and 23.02.1. I have absolutely no idea why it's happening with libslurmfull and not libslurm. It also only happens in a Python context (I can't reproduce it with a simple C program that does the same).

So just a heads up: The version I provided via the branch might not be 100% stable in some cases and slurm 23.02.2 is the minimum requirement to use the normal pyslurm 23.2.x releases if the cluster uses the cli_filter functionality.
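As a small illustration of that minimum-version rule, the sketch below compares a Slurm version string against 23.02.2. It assumes you pass in text such as the line printed by sinfo --version (for example "slurm 23.02.2"); the function names are hypothetical and not part of pyslurm.

```python
import re

# Slurm 23.02.2, the minimum named in this thread for clusters using cli_filter.
MINIMUM_WITH_CLI_FILTER = (23, 2, 2)


def parse_slurm_version(text):
    """Extract (major, minor, micro) from text such as 'slurm 23.02.2'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no Slurm version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(text, minimum=MINIMUM_WITH_CLI_FILTER):
    """Return True if the version found in `text` is at least `minimum`."""
    return parse_slurm_version(text) >= minimum


print(is_supported("slurm 23.02.1"))  # False
print(is_supported("slurm 23.02.2"))  # True
```

Comparing integer tuples avoids the pitfalls of comparing version strings directly, for example "23.02.10" sorting before "23.02.2" as text.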