
Issue with resubmission + dynamic destination? #7118

Open hexylena opened 5 years ago

hexylena commented 5 years ago

Hi @jmchilton, since we debugged last time I'm having an issue with the resubmission, and maybe you can shed some light on it. Since I can't really pass around parameters, I've settled on having three destinations: 1x, 1.5x, and 2x memory.

I have the following dynamic destination:

from galaxy.jobs import JobDestination

def testway(tool_id, user, memory_scale=1.0, next_dest=None):
    # Only build a resubmit definition if there is a next destination to
    # escalate to; the last destination in the chain gets none.
    resubmit = []
    if next_dest:
        resubmit = [{
            'condition': 'any_failure and attempt <= 3',
            'destination': next_dest
        }]

    return JobDestination(
        id='blah',
        runner='condor',
        params={'description': 'aaaa', 'request_memory': '%sG' % memory_scale},
        resubmit=resubmit,
    )

def gateway_1x(tool_id, user):
    return testway(tool_id, user, memory_scale=1, next_dest='gateway_1_5x')

def gateway_1_5x(tool_id, user):
    return testway(tool_id, user, memory_scale=1.5, next_dest='gateway_2x')

def gateway_2x(tool_id, user):
    return testway(tool_id, user, memory_scale=2)

And the following job conf:

<job_conf>
    <plugins workers="4">
        <plugin id="condor" type="runner" load="galaxy.jobs.runners.condor:CondorJobRunner" />
    </plugins>
    <destinations default="gateway_1x">

        <destination id="gateway_1x" runner="dynamic">
            <param id="function">gateway</param>
            <param id="tmp_dir">True</param>
            <param id="type">python</param>
        </destination>

        <destination id="gateway_1_5x" runner="dynamic">
            <param id="function">gateway_1_5x</param>
            <param id="tmp_dir">True</param>
            <param id="type">python</param>
        </destination>

        <destination id="gateway_2x" runner="dynamic">
            <param id="function">gateway_2x</param>
            <param id="tmp_dir">True</param>
            <param id="type">python</param>
        </destination>

    </destinations>
</job_conf>

With the expectation that a job will be submitted to the first destination, then resubmitted to the second, and then to the third.
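
For a quick sanity check, the chain can be walked directly (illustrative only; this assumes the gateway_* functions above are importable and that JobDestination exposes params and resubmit as attributes):

chain = {'gateway_1_5x': gateway_1_5x, 'gateway_2x': gateway_2x}
dest, attempt = gateway_1x('some_tool', None), 1
while dest.resubmit:
    print('attempt %s: request_memory=%s, on failure -> %s'
          % (attempt, dest.params['request_memory'], dest.resubmit[0]['destination']))
    dest = chain[dest.resubmit[0]['destination']]('some_tool', None)
    attempt += 1
print('attempt %s: request_memory=%s (end of chain)' % (attempt, dest.params['request_memory']))

which should print 1G -> 1.5G -> 2G across three attempts.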

In practice I see that it is only resubmitted once:

Galaxy logs
galaxy.tools DEBUG 2018-12-14 10:25:18,113 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Validated and populated state for tool request (12.920 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,158 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Handled output named out_file1 for tool test_history_sanitization (26.258 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,174 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Added output datasets to history (15.416 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,181 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Setup for job Job[unflushed,tool_id=test_history_sanitization] complete, ready to flush (6.363 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,216 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Flushed transaction for job Job[id=33,tool_id=test_history_sanitization] (35.057 ms)
galaxy.tools.execute DEBUG 2018-12-14 10:25:18,216 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Tool [test_history_sanitization] created job [33] (91.104 ms)
galaxy.tools.execute DEBUG 2018-12-14 10:25:18,222 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Executed 1 job(s) for tool test_history_sanitization request: (108.683 ms)
galaxy.jobs.mapper DEBUG 2018-12-14 10:25:18,520 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Mapped job to destination id: blah
galaxy.jobs.handler DEBUG 2018-12-14 10:25:18,528 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Dispatching to condor runner
galaxy.jobs DEBUG 2018-12-14 10:25:18,536 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Persisting job destination (destination id: blah)
galaxy.jobs DEBUG 2018-12-14 10:25:18,538 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Working directory for job is: /home/hxr/arbeit/galaxy/database/jobs_directory/000/33
galaxy.jobs.runners DEBUG 2018-12-14 10:25:18,554 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] Job [33] queued (25.845 ms)
galaxy.jobs.handler INFO 2018-12-14 10:25:18,577 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Job dispatched
galaxy.jobs.command_factory INFO 2018-12-14 10:25:18,700 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] Built script [/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh] for tool command [sleep 0; env | sort> /home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat; exit 1]
galaxy.jobs.runners DEBUG 2018-12-14 10:25:18,749 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) command is: rm -rf working; mkdir -p working; cd working; /home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh; return_code=$?; cd '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
python "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/set_metadata_9qLh3K.py" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/registry.xml" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/working/galaxy.json" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_in_HistoryDatasetAssociation_33_ZIRRRv,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_kwds_HistoryDatasetAssociation_33_E92cHz,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_results_HistoryDatasetAssociation_33_05ETQK,/home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_override_HistoryDatasetAssociation_33_6CUyJJ" 5242880; sh -c "exit $return_code"
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:18,780 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) submitting file /home/hxr/arbeit/galaxy/database/pbs/galaxy_33.sh
galaxy.jobs.runners.util.condor DEBUG 2018-12-14 10:25:18,847 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] condor external id: 320
galaxy.jobs.runners.condor INFO 2018-12-14 10:25:18,848 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) queued as 320
galaxy.jobs DEBUG 2018-12-14 10:25:18,849 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) Persisting job destination (destination id: blah)
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:19,043 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/320) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:22,115 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/320) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:22,175 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/320) job has completed
galaxy.jobs.output_checker INFO 2018-12-14 10:25:22,200 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] Job 33: Fatal error: Exit code 1 ()
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:22,200 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] Tool exit code indicates an error, failing job.
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:22,200 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] job failed, standard error is - [Fatal error: Exit code 1 ()
]
galaxy.jobs.runners.state_handlers.resubmit INFO 2018-12-14 10:25:22,213 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33/320) Job will be resubmitted to 'gateway_1_5x' because tool detected error at the 'blah' destination
galaxy.jobs.mapper DEBUG 2018-12-14 10:25:22,217 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Mapped job to destination id: blah
galaxy.jobs DEBUG 2018-12-14 10:25:22,218 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Working directory for job is: /home/hxr/arbeit/galaxy/database/jobs_directory/000/33
galaxy.jobs DEBUG 2018-12-14 10:25:22,218 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Previous working directory moved to /home/hxr/arbeit/galaxy/database/jobs_directory/_cleared_contents/000/33/20181214-102522
galaxy.jobs DEBUG 2018-12-14 10:25:22,253 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Persisting job destination (destination id: blah)
galaxy.jobs.handler DEBUG 2018-12-14 10:25:22,717 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Job was resubmitted and is being dispatched immediately
galaxy.jobs.handler DEBUG 2018-12-14 10:25:22,719 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Recovered destination id (blah) does not exist in job config (but this may be normal in the case of a dynamically generated destination)
galaxy.jobs.handler DEBUG 2018-12-14 10:25:22,719 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Dispatching to condor runner
galaxy.jobs DEBUG 2018-12-14 10:25:22,736 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Persisting job destination (destination id: blah)
galaxy.jobs DEBUG 2018-12-14 10:25:22,743 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Working directory for job is: /home/hxr/arbeit/galaxy/database/jobs_directory/000/33
galaxy.jobs.runners DEBUG 2018-12-14 10:25:22,762 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] Job [33] queued (42.119 ms)
galaxy.jobs.command_factory INFO 2018-12-14 10:25:22,884 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] Built script [/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh] for tool command [sleep 0; env | sort> /home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat; exit 1]
galaxy.jobs.runners DEBUG 2018-12-14 10:25:22,976 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) command is: rm -rf working; mkdir -p working; cd working; /home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh; return_code=$?; cd '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
python "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/set_metadata_R8Sdq5.py" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/registry.xml" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/working/galaxy.json" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_in_HistoryDatasetAssociation_33_8pYEJ6,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_kwds_HistoryDatasetAss
ociation_33_Jwg8le,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_iBInaR,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_results_HistoryDatasetAss
ociation_33_Ndsn0g,/home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_override_HistoryDatasetAssociation_33_hbTGcI" 5242880; sh -c "exit $ret
urn_code"
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:23,010 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) submitting file /home/hxr/arbeit/galaxy/database/pbs/galaxy_33.sh
galaxy.jobs.runners.util.condor DEBUG 2018-12-14 10:25:23,069 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] condor external id: 321
galaxy.jobs.runners.condor INFO 2018-12-14 10:25:23,069 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) queued as 321
galaxy.jobs DEBUG 2018-12-14 10:25:23,070 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) Persisting job destination (destination id: blah)
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:38,388 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/321) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:41,447 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/321) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:41,501 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/321) job has completed
galaxy.jobs.output_checker INFO 2018-12-14 10:25:41,544 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Job 33: Fatal error: Exit code 1 ()
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:41,544 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Tool exit code indicates an error, failing job.
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:41,544 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] job failed, standard error is - [Fatal error: Exit code 1 ()
]
galaxy.jobs DEBUG 2018-12-14 10:25:41,561 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] (33) setting dataset 33 state to ERROR
galaxy.jobs INFO 2018-12-14 10:25:41,606 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Collecting metrics for Job 33
galaxy.jobs DEBUG 2018-12-14 10:25:41,617 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] job 33 ended (finish() executed in (70.957 ms))
galaxy.tools.error_reports DEBUG 2018-12-14 10:25:41,626 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Bug report plugin  generated response None
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,626 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Cleaning up external metadata files
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup MetadataTempFile temp files from /home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X: [Errno 2] No such file or directory: u'/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_in) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_in_HistoryDatasetAssociation_33_ZIRRRv'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_out) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_results_code) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_results_HistoryDatasetAssociation_33_05ETQK'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_kwds) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_kwds_HistoryDatasetAssociation_33_E92cHz'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,632 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_override_metadata) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_override_HistoryDatasetAssociation_33_6CUyJJ'
bernt-matthias commented 4 years ago

Found this comment:

https://github.com/galaxyproject/galaxy/blob/737775f586c5ab4c4cef4f486dd8826a978ed6b1/lib/galaxy/jobs/__init__.py#L1024

we "just" need to find out why this happens. Seems that destinations / job-destination mappings are cached ..

Also odd is the None id in the log output.

bernt-matthias commented 4 years ago

With static job mapping, the log output contains the names of the destinations.

bernt-matthias commented 4 years ago

OK, the None id is unrelated: https://github.com/galaxyproject/galaxy/pull/9742

bernt-matthias commented 4 years ago

One tiny step further:

https://github.com/galaxyproject/galaxy/blob/737775f586c5ab4c4cef4f486dd8826a978ed6b1/lib/galaxy/jobs/handler.py#L215

If I understand it correctly, the JobHandler class loads the Job from the DB (?), which does not persist the resubmit attribute of the JobDestination but restores it from the job_conf file. All this info is then used to construct a JobWrapper, which has now lost the resubmit info...

With my current knowledge I see two options:

  1. make the resubmit info persistent
  2. postpone the calculation of the new destination (or recalculate it) after the job wrapper has been reconstructed

Intuitively, 1 seems the easier solution (but I guess in reality it's quite a lot of work).
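
A rough sketch of what option 1 could look like (field names and the storage format are assumptions, not the real Galaxy API): serialize the resubmit definitions next to the persisted destination params and rebuild them when the Job is reloaded.

import json

from galaxy.jobs import JobDestination  # same import as the gateway code above

def persist_destination(job, destination):
    # Hypothetical: stash the resubmit definitions alongside the params so
    # they survive a reload of the Job from the DB.
    params = dict(destination.params)
    params['__resubmit'] = json.dumps(destination.resubmit or [])
    job.destination_id = destination.id
    job.destination_params = params

def restore_destination(job):
    # Hypothetical inverse: rebuild the JobDestination with its resubmit list
    # intact instead of re-reading (and losing) it from job_conf; runner etc.
    # omitted for brevity.
    params = dict(job.destination_params)
    resubmit = json.loads(params.pop('__resubmit', '[]'))
    return JobDestination(id=job.destination_id, params=params, resubmit=resubmit)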

For 2.: Currently the dynamic destination calculation is triggered via this route:

The last function in that route sets the destination for the resubmission and triggers galaxy.jobs.mapper.cache_job_destination, which replaces the cached JobDestination with the result of the dynamic destination's Python function (i.e. setting a new runner, params, env, and resubmit).

The idea would be to split _handle_resubmit_definitions into two parts, such that the failure handling would just mark the job for resubmission, and the calculation of the new destination would happen when the job re-enters the queue.
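
As a sketch, the split could look like this (hypothetical names; only the shape of the change, not working Galaxy code):

def mark_for_resubmission(job, resubmit_definition):
    # Failure handler: record only *where* to resubmit; do not resolve the
    # (possibly dynamic) destination yet.
    job.destination_id = resubmit_definition['destination']
    job.state = 'resubmitted'

def dispatch_resubmitted(job, job_config):
    # Queue re-entry: resolve the destination fresh, so the new runner,
    # params, env, and resubmit list are all recomputed from the dynamic rule.
    return job_config.resolve_destination(job.destination_id, job)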

Interestingly, the destination is already set in the JobHandler when the job enters the queue for the first time.

bernt-matthias commented 4 years ago

And maybe another wee step: https://github.com/galaxyproject/galaxy/pull/9747

bernt-matthias commented 4 years ago

For my own reference, some debug notes: https://gist.github.com/bernt-matthias/2c75e2266f626ba8bb5c331e178bce30