galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

On restart, queued dynamic destination jobs lose destination_params #7022

Open pcm32 opened 6 years ago

pcm32 commented 6 years ago

Using the CLI runner (LSF) with dynamic destinations, I notice that when the instance restarts, the destination parameters of jobs that were left running or queued get scrambled. This causes the plugin parameter to be lost, producing this error:

galaxy.jobs.runners ERROR 2018-11-12 20:28:07,243 [p:96911,w:1,m:0] [Dummy-2] Unhandled exception checking active jobs
Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/__init__.py", line 594, in monitor
    self.check_watched_items()
  File "lib/galaxy/jobs/runners/cli.py", line 152, in check_watched_items
    job_states = self.__get_job_states()
  File "lib/galaxy/jobs/runners/cli.py", line 204, in __get_job_states
    shell, job_interface = self.get_cli_plugins(shell_params, job_params)
  File "lib/galaxy/jobs/runners/cli.py", line 40, in get_cli_plugins
    return self.cli_interface.get_plugins(shell_params, job_params)
  File "lib/galaxy/jobs/runners/util/cli/__init__.py", line 57, in get_plugins
    job_interface = self.get_job_interface(job_params)
  File "lib/galaxy/jobs/runners/util/cli/__init__.py", line 70, in get_job_interface
    raise ValueError(ERROR_MESSAGE_NO_JOB_PLUGIN)
ValueError: No job plugin parameter found, cannot create CLI job interface

(Line numbers might be slightly off, as I had added some log statements here and there.) This error brings down the runner's entire check_watched_items loop, which in turn means that no new jobs are detected as running (they are submitted to the scheduler fine and they do run, but for Galaxy they stay in the queued state).
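A hypothetical sketch (not Galaxy's actual code) of the failure mode just described: the monitor checks all watched jobs in a single pass, so one job with broken params aborts the status check for every job that cycle, whereas a per-job guard would keep the other checks alive.

```python
def get_state(job):
    # Stand-in for querying the scheduler; the "bad" job simulates the
    # missing-plugin-parameter error from the traceback above.
    if job == "bad":
        raise ValueError("No job plugin parameter found")
    return "running"

def monitor_once_unguarded(watched):
    # Mirrors the reported behavior: one bad job aborts the whole pass,
    # so healthy jobs never get their state updated.
    return {job: get_state(job) for job in watched}

def monitor_once_guarded(watched):
    # Per-job error handling keeps the remaining jobs' checks alive.
    states = {}
    for job in watched:
        try:
            states[job] = get_state(job)
        except ValueError:
            states[job] = "check-failed"
    return states
```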

My current hypothesis is that on restart, the job's URL is used to reconstruct its destination, and in that process the parameters of dynamic-destination jobs get scrambled. After a restart you can see in the database that the job.destination_params field has been shortened to something like \x7b7d (which is hex for {}, an empty JSON object). Before the restart, the same field for the same job is much longer. So I guess the parameters should somehow be rescued from the database before the URL-reconstructed destination is persisted back on restart.
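For reference, the truncated database value decodes to an empty JSON object, i.e. every destination parameter was dropped. A quick sanity check in Python:

```python
import json

raw = b"\x7b\x7d"            # the value observed in job.destination_params
print(raw.decode("ascii"))   # → {}
print(json.loads(raw))       # → {} (an empty dict: all params were dropped)
```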

The following error seems to be related to the URL-to-destination issue:

galaxy.jobs.handler ERROR 2018-11-16 09:38:28,694 [p:178352,w:1,m:0] [MainThread] Unable to convert legacy job runner URL 'cli' to job destination, destination will be the 'cli' runner with no params
Traceback (most recent call last):
  File "lib/galaxy/jobs/handler.py", line 907, in url_to_destination
    return self.job_runners[runner_name].url_to_destination(url)
  File "lib/galaxy/jobs/runners/cli.py", line 44, in url_to_destination
    shell_params, job_params = url.split('/')[2:4]
ValueError: need more than 0 values to unpack
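A minimal sketch of what the two tracebacks suggest (class names and the fallback dict are hypothetical; only the split logic and the delegation come from the tracebacks): the handler delegates to the runner named by the URL's scheme, and when the legacy URL is just 'cli' with no embedded params, the split yields nothing to unpack, so the handler falls back to a bare destination, which would explain the emptied destination_params.

```python
class FakeCliRunner:
    # Mirrors the split in lib/galaxy/jobs/runners/cli.py:url_to_destination
    def url_to_destination(self, url):
        # 'cli://shell_x/job_y'.split('/') → ['cli:', '', 'shell_x', 'job_y'];
        # a bare 'cli' gives ['cli'], so [2:4] is empty and unpacking raises.
        shell_params, job_params = url.split('/')[2:4]
        return {'shell': shell_params, 'job': job_params}

class FakeDispatcher:
    # Mirrors the delegation in lib/galaxy/jobs/handler.py:url_to_destination
    def __init__(self, runners):
        self.job_runners = runners

    def url_to_destination(self, url):
        runner_name = url.split(':', 1)[0]
        try:
            return self.job_runners[runner_name].url_to_destination(url)
        except Exception:
            # "destination will be the 'cli' runner with no params"
            return {}

dispatcher = FakeDispatcher({'cli': FakeCliRunner()})
print(dispatcher.url_to_destination('cli://shell_ssh/job_lsf'))  # params recovered
print(dispatcher.url_to_destination('cli'))                      # → {} : params lost
```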
pcm32 commented 5 years ago

I'm still seeing this, unfortunately.

pcm32 commented 5 years ago

@mvdbeek I remember you mentioned something you wanted me to print the value of; I thought you wrote it here, but now I cannot find it. Sorry I only got back to this now. Thanks.

pcm32 commented 5 years ago

I've found it:

@pcm32 any chance you could log url in https://github.com/galaxyproject/galaxy/blob/4eed32dc4c86da17a525419ba985d4d06d7a768b/lib/galaxy/jobs/runners/cli.py#L43 ? (when it fails)

pcm32 commented 5 years ago

There is a log.debug in that same method (url_to_destination) that is never reached from the location you mention, as I never see the Converted URL ... message. However, this other part:

https://github.com/galaxyproject/galaxy/blob/a11632756a0b2ec043c596eafd412fea287b4126/lib/galaxy/jobs/handler.py#L186

gets executed, as I see the log message Converted job from a URL to a destination and recovered. On that execution path, self.dispatcher.url_to_destination gets called (instead of, perhaps, self.runner, which I guess would trigger the method you are after).