galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Jobs fail due to underscore in username #11270

Open code4dna opened 3 years ago

code4dna commented 3 years ago

We have Galaxy 20.09 installed on a cluster. There are 60 students in our class, and all were added to Galaxy on the same day. Two of them can log in, but their jobs fail. The only difference I can see is that these two students are the only ones with underscores in their usernames. I am not sure whether the failure is due to Galaxy, Slurm, or both.

galaxy.jobs.runners.drmaa DEBUG 2021-02-01 10:39:18,065 [p:39975,w:1,m:0] [SlurmRunner.work_thread-8] (536) submitting file /scratch/user/galaxy/kaiser/jobs_directory/000/536/galaxy_536.sh
galaxy.jobs.runners.drmaa DEBUG 2021-02-01 10:39:18,065 [p:39975,w:1,m:0] [SlurmRunner.work_thread-8] (536) native specification is: --time=24:00:00 --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=1200 --partition=knl
galaxy.jobs.runners ERROR 2021-02-01 10:39:18,069 [p:39975,w:1,m:0] [SlurmRunner.work_thread-8] (536) Unhandled exception calling queue_job
Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "lib/galaxy/jobs/runners/drmaa.py", line 204, in queue_job
    job_wrapper.change_ownership_for_run()
  File "lib/galaxy/jobs/__init__.py", line 2220, in change_ownership_for_run
    external_chown_script, description="working directory")
  File "lib/galaxy/util/path/__init__.py", line 360, in external_chown
    cmd.extend([path, pwent[0], str(pwent[3])])
TypeError: 'NoneType' object is not subscriptable
galaxy.jobs.runners.drmaa ERROR 2021-02-01 10:39:18,070 [p:39975,w:1,m:0] [SlurmRunner.work_thread-8] (536/None) User killed running job, but error encountered removing from DRM queue
Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/drmaa.py", line 356, in stop_job
    assert ext_id not in (None, 'None'), 'External job id is None'
AssertionError: External job id is None

nsoranzo commented 3 years ago

I guess you are submitting jobs as the real user?

code4dna commented 3 years ago

Yes, submitting as the real user.

nsoranzo commented 3 years ago

Thanks, what's the value of real_system_username in your config/galaxy.yml?

code4dna commented 3 years ago

real_system_username: username

nsoranzo commented 3 years ago

OK, what happens if you try to execute python3 -c 'import pwd; print(pwd.getpwnam("YOUR_USERNAME_WITH_UNDERSCORE"))'?

code4dna commented 3 years ago
(.venv) [kaiser@portal-opa galaxy]$ python --version
Python 3.6.6
(.venv) [kaiser@portal-opa galaxy]$ python -c 'import pwd; print(pwd.getpwnam("joshua_6"))'
pwd.struct_passwd(pw_name='joshua_6', pw_passwd='x', pw_uid=22101, pw_gid=22101, pw_gecos='Joshua X', pw_dir='/home/joshua_6', pw_shell='/bin/bash')

nsoranzo commented 3 years ago

Mmmh, can you try to apply this patch and see if it gives you a different traceback?

diff --git a/lib/galaxy/model/__init__.py b/lib/galaxy/model/__init__.py
index 8bd10f6803..69a42cc978 100644
--- a/lib/galaxy/model/__init__.py
+++ b/lib/galaxy/model/__init__.py
@@ -437,24 +437,17 @@ class User(Dictifiable, RepresentById):
         Gives the system user pwent entry based on e-mail or username depending
         on the value in real_system_username
         """
-        system_user_pwent = None
         if real_system_username == 'user_email':
-            try:
-                system_user_pwent = pwd.getpwnam(self.email.split('@')[0])
-            except KeyError:
-                pass
+            username = self.email.split('@')[0]
         elif real_system_username == 'username':
-            try:
-                system_user_pwent = pwd.getpwnam(self.username)
-            except KeyError:
-                pass
+            username = self.username
         else:
-            try:
-                system_user_pwent = pwd.getpwnam(real_system_username)
-            except KeyError:
-                log.warning("invalid configuration of real_system_username")
-                system_user_pwent = None
-        return system_user_pwent
+            username = real_system_username
+        try:
+            return pwd.getpwnam(username)
+        except Exception:
+            log.warning(f"Error getting the password database entry for user {username}")
+            raise

     def all_roles(self):
         """

code4dna commented 3 years ago

I noticed that in the admin interface, the Users list shows the email as joshua_6@email.com, while the second column, 'User name', shows joshua-6 (a hyphen instead of the underscore).
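
This mismatch would explain the traceback above: with real_system_username set to username, the lookup uses the hyphenated Galaxy name, finds no system account, and the pre-patch code turns the resulting KeyError into None, which external_chown then tries to index. A minimal sketch of that behaviour follows. The helpers lookup_pwent and chown_args are hypothetical names used for illustration only; they mirror the logic in the diff above but are not Galaxy's actual functions.

```python
import pwd


def lookup_pwent(username):
    """Return the passwd entry for username, or None if no such account exists.

    Hypothetical helper: mirrors the pre-patch logic, which swallowed KeyError.
    """
    try:
        return pwd.getpwnam(username)
    except KeyError:
        return None


def chown_args(path, username):
    """Build [path, user name, gid] and fail clearly when the account is missing.

    Hypothetical helper: without the None check, indexing the result raises
    TypeError: 'NoneType' object is not subscriptable.
    """
    pwent = lookup_pwent(username)
    if pwent is None:
        raise LookupError(f"no system account named {username!r}")
    # pwent[0] is pw_name and pwent[3] is pw_gid in pwd.struct_passwd
    return [path, pwent[0], str(pwent[3])]
```

Calling chown_args with a name such as joshua-6 on a system that only has joshua_6 would raise the LookupError instead of the opaque TypeError.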

nsoranzo commented 3 years ago

Mmmh, how were these users added?

code4dna commented 3 years ago

We have a bash script that adds users to the galaxy group. A user shows up in the Users list only after they first log in using CAS authentication with their university email and password.

code4dna commented 3 years ago

The patch returns the same results.

nsoranzo commented 3 years ago

Unfortunately I have no idea where in your CAS auth setup (I imagine this is using nginx/HTTP proxy auth) the _ in the username is changed to -. Maybe someone else can help you there.

I am a bit puzzled that the patch is not changing the traceback; it should fail with a different error message.

code4dna commented 3 years ago

We are using Apache. Would it work to just change the username in the Postgres database for now?

nsoranzo commented 3 years ago

> Would it work to just change the username in the Postgres database for now?

Worth a try :)

code4dna commented 3 years ago

Thanks, changing the username in the database works :)
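
For anyone hitting the same problem with a larger class, one way to spot affected accounts before jobs fail is to check each Galaxy username against the system password database. This is only a sketch: missing_system_accounts is a hypothetical helper, and the list of usernames is assumed to be copied from the admin Users list by hand.

```python
import pwd


def missing_system_accounts(usernames):
    """Return the usernames that have no entry in the system password database."""
    missing = []
    for name in usernames:
        try:
            pwd.getpwnam(name)
        except KeyError:
            # No system account with exactly this name (e.g. joshua-6 vs joshua_6)
            missing.append(name)
    return missing
```

Any name it returns is a candidate for the same database fix, for example a Galaxy username with a hyphen where the system account has an underscore.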