This is happening to me as well on 17.01. I have a Jenkins job which launches workflows. I just hit the button, and maybe 50% of them only showed three jobs when 50+ should have been scheduled.
This is 17.01 I assume. Are the jobs created / finished? How complex are the workflows (some scheduling issues?)
On Sun, Mar 26, 2017, 14:54 Björn Grüning notifications@github.com wrote:
We are getting more and more requests that workflows do not put datasets into the history for hours; a few users have encountered a 24h delay.
We do have the setting history_local_serial_workflow_scheduling (https://github.com/galaxyproject/galaxy/blob/dev/config/galaxy.ini.sample#L1094) enabled, as our users need to have the order of datasets preserved in a history.
Has anyone seen this as well?
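For reference, the setting quoted above is just a boolean flag in galaxy.ini. A minimal sketch of flipping it with Python's standard configparser, assuming the usual [app:main] section of a PasteDeploy-style config (the path and section name may differ per install; in practice you would just edit the line by hand, this only shows where the flag lives):

```python
import configparser

# Hypothetical path; adjust to your Galaxy root. interpolation=None avoids
# choking on PasteDeploy-style %(here)s variables elsewhere in the file.
config_path = "config/galaxy.ini"

parser = configparser.ConfigParser(interpolation=None)
parser.read(config_path)

# The option discussed in this thread.
parser.set("app:main", "history_local_serial_workflow_scheduling", "True")

# Note: rewriting the file this way drops comments, so hand-editing is the
# sane approach on a real server; this is purely illustrative.
with open(config_path, "w") as handle:
    parser.write(handle)
```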
As for me, no idea if the jobs will be created. I killed the histories and re-started with a delay. This time even fewer jobs are scheduled.
Will be watching over the course of the day.
15 minutes later: ok, looks like more are queued. That was just very, very unsettling! I imagine @bgruening has this problem much worse than I do, based on the workflows I see him pictured with on Twitter.
@natefoo @jmchilton ping
I'm glad I'm not the only one who sees this. It can also be related to how loaded the cluster is, so the visual feedback is lacking even more, I suppose.
Locally, if I run a workflow with one tool (htseq-count), it is considerably slower than running this one tool 10 times without a workflow. I tried this on usegalaxy.org; it is faster than locally, but I was able to produce the following, which is also bad: multiple identical HIDs in one history on usegalaxy.org.
Duplicate HIDs seem unrelated IMO, so I have created a separate issue for that in #3818.
Locally, if I run a workflow with one tool (htseq-count) it is considerably slower than running this one tool 10 times without a workflow.
Locally with history_local_serial_workflow_scheduling enabled?
There is some overhead associated with backgrounding the workflows and waiting for a job handler to pick them up, versus just scheduling 10 jobs right in a web thread. There are a lot of optimizations that have been applied to the tool execution thread that the workflow scheduling thread cannot leverage as architected, because it processes workflows one at a time. Creating 10 jobs from a tool submission is not 10x slower than creating 1 job, but scheduling 10 workflows is 10x slower than scheduling 1 workflow.
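To make that scaling argument concrete, here is a deliberately over-simplified toy model (made-up timings, not Galaxy code): the per-submission overhead is paid once for a batch of tool jobs, but once per workflow when invocations are processed one at a time.

```python
import time

SETUP_OVERHEAD = 0.05   # pretend cost of loading state, validating inputs, etc.
PER_JOB_COST = 0.01     # pretend cost of actually creating one job

def run_tool_batch(n_jobs):
    """One tool submission: pay the setup overhead once, then create N jobs."""
    time.sleep(SETUP_OVERHEAD)
    for _ in range(n_jobs):
        time.sleep(PER_JOB_COST)

def schedule_workflows_serially(n_workflows):
    """One-at-a-time workflow scheduling: pay the setup overhead per workflow."""
    for _ in range(n_workflows):
        time.sleep(SETUP_OVERHEAD)   # each invocation is processed in isolation
        time.sleep(PER_JOB_COST)     # ...just to create a single job

if __name__ == "__main__":
    for label, fn in [("tool batch", run_tool_batch),
                      ("serial workflows", schedule_workflows_serially)]:
        start = time.time()
        fn(10)
        print(f"{label}: {time.time() - start:.2f}s for 10 jobs")
```

With these made-up numbers, 10 jobs from one tool batch cost roughly 2.5x a single job, while 10 single-tool workflows cost roughly 10x one workflow, which is the asymmetry described above.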
So yes - running tools is faster than running workflows with single tools - though this would be magnified by history_local_serial_workflow_scheduling. With history_local_serial_workflow_scheduling enabled you may have to walk through everyone's open/ready workflows on the server before the first one of yours schedules, and then again for the second - and so on. The behavior @erasche is seeing above - 15 minutes to schedule potentially thousands of datasets - is concerning and we need to optimize (and ask the Canadians - we've done a lot to optimize this, and I still have two WIP threads pursuing more optimizations) - but all of this slowdown would be magnified for individual users with history_local_serial_workflow_scheduling if they have many workflows ready to go in the same history.
There are some things we can do to improve the setting history_local_serial_workflow_scheduling - we can load the workflows in order so that the oldest one for each history is loaded first. That would probably improve the turnaround time for these users, since you should only have to walk all the open workflows once to get these things scheduled. We could also move the double-nested loop that determines whether "this" workflow is the correct workflow to schedule into SQL, to speed up every check on every loop.
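As a sketch of the first idea - only the oldest ready invocation per history is eligible under the serial setting, so everything else can be filtered out up front - here is a small illustration with a hypothetical Invocation record (Galaxy's real models differ):

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Hypothetical stand-in for a workflow invocation row; not Galaxy's real model.
Invocation = namedtuple("Invocation", ["id", "history_id", "create_time", "state"])

def oldest_ready_per_history(invocations):
    """Return the oldest 'ready' invocation for each history.

    With history_local_serial_workflow_scheduling only the oldest open
    invocation in a history can schedule, so selecting these up front avoids
    re-checking every other invocation in a nested loop on every pass.
    """
    ready = sorted(
        (inv for inv in invocations if inv.state == "ready"),
        key=attrgetter("history_id", "create_time"),
    )
    return [next(group) for _, group in groupby(ready, key=attrgetter("history_id"))]

if __name__ == "__main__":
    invs = [
        Invocation(1, history_id=7, create_time=3, state="ready"),
        Invocation(2, history_id=7, create_time=1, state="ready"),
        Invocation(3, history_id=9, create_time=2, state="ready"),
        Invocation(4, history_id=9, create_time=5, state="scheduled"),
    ]
    print(oldest_ready_per_history(invs))  # -> invocations 2 and 3
```

The same selection could presumably be pushed down into SQL (group invocations by history and take the minimum create time), which is the second idea above.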
I also need to write up some documentation on separating workflow and job scheduling threads - I think performance problems and debugging would be much clearer if these were separate processes. Right now, I/O problems and the like in job schedulers can potentially slow down workflow scheduling in more ways than they should be able to. It also means that if a job runner process dies (which we have observed happening with SLURM and PBS), workflow scheduling dies with it. I also think https://github.com/galaxyproject/galaxy/pull/3659 would mean Galaxy would clean up older workflows that may have errored out in a way I don't understand yet (possible ideas in https://github.com/galaxyproject/galaxy/issues/3555, which sort of stalled waiting on the merge of #3619).
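The process-separation idea, reduced to a toy (this is not Galaxy's handler configuration, just an illustration of why isolation helps): run the two loops in separate OS processes so that a blocked or dead job runner loop cannot take the workflow scheduling loop down with it.

```python
import multiprocessing as mp
import time

def workflow_scheduler_loop():
    # Stand-in for the workflow scheduling loop.
    for i in range(3):
        print(f"[workflow scheduler] pass {i}: scheduling ready invocations")
        time.sleep(0.5)

def job_runner_loop():
    # Stand-in for a job runner loop that may block on cluster I/O or crash.
    for i in range(3):
        print(f"[job runner] pass {i}: talking to the cluster")
        time.sleep(0.5)

if __name__ == "__main__":
    procs = [mp.Process(target=workflow_scheduler_loop),
             mp.Process(target=job_runner_loop)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```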
I'll open a PR for the two optimizations to history_local_serial_workflow_scheduling on Monday, as well as trying to figure out the failed test case to get #3659 into 17.01 - I think these are the best things we can do right away to improve the Freiburg situation. Later in the week I'll open a PR with instructions on splitting up workflow scheduler threads, and I'll continue working on the test cases in #3555 to see if I can find workflow bugs that might lead to stuck workflow invocations that would slow down scheduling. If I get through all of that this week I will also finish up work on #1957, which could speed up workflow scheduling, and see if I can make improvements to tool submission bursting in an optimized way with workflow threads - that could also really speed up workflow invocations.
Update: I believe the above shows a good-faith effort to address the problems caused by invoking many flat workflows with individual datasets - but I do want to point out that, in addition to better history UX organization, I believe Galaxy will schedule a single workflow invocation with collections much faster, since it can leverage the tool execution optimizations aimed at creating homogeneous jobs together.
Locally with history_local_serial_workflow_scheduling enabled?
Yes, on my side yes. @erasche has probably not enabled this.
Update: I believe the above shows a good-faith effort to address the problems caused by invoking many flat workflows with individual datasets - but I do want to point out that, in addition to better history UX organization, I believe Galaxy will schedule a single workflow invocation with collections much faster, since it can leverage the tool execution optimizations aimed at creating homogeneous jobs together.
Believe me, we are trying to convince people to use collections more and more.
One more thing I have seen; not sure how this relates. I have a history with 6 BAM files and a workflow with one step (rmdup). Running this workflow on all 6 BAM files shows me 5 new datasets in around 20s. The last one only shows up once the other 5 have finished computing - this can take minutes to hours.
I have not enabled it, no.
The behavior @erasche is seeing above - 15 minutes to schedule potentially thousands of datasets - is concerning and we need to optimize (and ask the Canadians - we've done a lot to optimize this, and I still have two WIP threads pursuing more optimizations) - but all of this slowdown would be magnified for individual users with history_local_serial_workflow_scheduling if they have many workflows ready to go in the same history.
Oh, I imagine they have it much, much worse. At my org, for now / the foreseeable future, I'm the only person launching 20 histories with 50 steps in each. I don't mind knowing that it can take some time; I can document this for others here. I can also provide timing data, etc.
Believe me, we are trying to convince people to use collections more and more.
Us too! But @jmchilton, I tried collections, and they look like they'll be great, but I cannot recommend that my users use them until https://github.com/galaxyproject/galaxy/issues/740 is solved. Please don't get me wrong, I'm excited about how much they'll simplify this sort of data processing.
@bgruening How many of your immediate problems have been solved by #3820 and #3830 and disabling history_local_serial_workflow_scheduling? Have we gotten the delay down from hours to minutes, at least?
I think you solved this; at least, my testing so far looks very good. I would need both of these merged, plus a few days to test with more and bigger workflows, but so far everything looks good.
Thanks so much!