CovertLab / sisyphus

Eternally execute tasks

Sisyphus gets stuck pulling an input file #24

Closed: 1fish2 closed this issue 4 years ago

1fish2 commented 4 years ago

I'm running the wcEcoli/runscripts/cloud/util/demo_workflow.py using the list-workflows branch on gaia-base.

The first task (lines) runs fine. The second task (count) waits forever trying to download its input file, even though:

gsutil cp gs://sisyphus-jerry/DemoWorkflow/20191202.165350/lines.txt /tmp/lines.txt

fetches it just fine from my laptop and from the Sisyphus worker.

Testing with Gaia on master, new Sisyphus workers still get stuck on the old workflow's task.

I'm tempted to purge or reset rabbitmq, or whatever the command is, and for that matter to do apt-get updates to rabbit-prime and gaia-prime, but it's better to ask first.

prismofeverything commented 4 years ago

Okay yeah, several things going on here. First is the stuck download, which I would say is the greatest mystery, especially since gsutil seems to work fine. I'll check whether there is something about that file in particular or whether it is something in the code, but either way this is the unknown failure mode.

Halting a workflow should stop tasks that are downloading. Right now it isn't doing that.

Agreed. We have a comment in the code currently that says:

Defer any termination requests until there's a docker process to kill.

So it looks like we currently only consider termination to mean terminating the docker container, but downloading happens outside of that. Interrupting a download would require additional logic, and perhaps wrapping the executing task in a data structure that lets us cancel it from another thread (I am imagining something like putting the future in our state atom so we can future-cancel it? I'll have to check whether there are issues with that).
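
Roughly what I'm imagining, as a minimal sketch; the names here are hypothetical, not the current Sisyphus code:

(defonce state (atom {:task-future nil}))

(defn run-task!
  "Run the task body on a future and record it so another thread can cancel it."
  [task-fn]
  (let [fut (future (task-fn))]
    (swap! state assoc :task-future fut)
    @fut))

(defn terminate-task!
  "Cancel the in-flight task, if any. future-cancel interrupts the thread,
  which only helps if the blocking call honors interrupts."
  []
  (when-let [fut (:task-future @state)]
    (future-cancel fut)))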

Halting the workflow then restarting the Sisyphus worker -- without resuming the workflow -- restarts that task, which it shouldn't (and the task again gets stuck downloading).

Right, so if we just stop the Sisyphus worker instead of letting it finish and send the acknowledgement to rabbit, the same task will get sent over and over again. However we solve the problem of terminating a stuck download, it will have to involve sending the acknowledgement back to rabbit so we don't get stuck in a loop. The acknowledgement is sent when the task follows the normal code path, but not when we kill Sisyphus.

Sisyphus should have timeouts for downloading files and for pulling docker images.

Yes. Ideally we wouldn't need timeouts on downloads at all? That seems to be the real problem here. But if this is a reality we have to deal with, then yes, the earlier we can detect the stuck state the better. I'll look into this stuck download and see if I can find a cause besides "intermittent storage access failure".
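
If we do end up needing a timeout, one possible shape (an assumed approach, not the current code) would be to run the download on a future and give up after some limit:

;; assumes (require '[sisyphus.cloud :as cloud])
(defn download-with-timeout!
  "Run cloud/download! on a future; throw if it hasn't finished after timeout-ms."
  [storage bucket key local timeout-ms]
  (let [fut (future (cloud/download! storage bucket key local))]
    (when (= ::timeout (deref fut timeout-ms ::timeout))
      (future-cancel fut)
      (throw (ex-info "download timed out"
                      {:bucket bucket :key key :local local :timeout-ms timeout-ms})))))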

I'm tempted to purge or reset rabbitmq, or whatever the command is, and for that matter to do apt-get updates to rabbit-prime and gaia-prime, but it's better to ask first.

Feel free to purge the rabbit queue whenever; it becomes necessary if we get into a situation where we have to kill a worker node. Here are the commands:

To list queues and how many messages remain:

sudo rabbitmqctl list_queues

It will show something like this:

Listing queues
sisyphus-queue  3

That means there are three messages remaining in the queue sisyphus-queue. To purge a queue:

sudo rabbitmqctl purge_queue sisyphus-queue

I just did this so you should be clear. I'm about to do some testing of my own.

prismofeverything commented 4 years ago

If it really is some intermittent quality of the Google Cloud Storage Java API, then we could always shell out to gsutil : P

prismofeverything commented 4 years ago

One data point: I can pull the file fine using sisyphus.cloud/download!. Maybe it is hanging somewhere else in the task logic?

1fish2 commented 4 years ago

I can pull the file fine using sisyphus.cloud/download!

Very interesting! Is it still failing in the worker or has the problem (server-side?) perhaps cleared up?

Halting the workflow then restarting the Sisyphus worker -- without resuming the workflow -- restarts that task, which it shouldn't (and the task again gets stuck downloading).

Right, so if we just stop the Sisyphus worker instead of letting it finish and send the acknowledgement to rabbit, the same task will get sent over and over again. However we solve the problem of terminating a stuck download, it will have to involve sending the acknowledgement back to rabbit so we don't get stuck in a loop. The acknowledgement is sent when the task follows the normal code path, but not when we kill Sisyphus.

When Gaia halts a workflow, it should purge that workflow's messages from rabbit. Maybe this has to wait until each workflow has a separate (virtual?) queue. It also needs to be safe for a worker to acknowledge a message that rabbit has already purged.

Since rabbit is not part of the user model, users should not need to know how to administer it in order to recover from workflow problems.

Once Gaia becomes responsible for starting and stopping Sisyphus workers, halting a workflow should ensure that those workers eventually shut down.

1fish2 commented 4 years ago

I can pull the file fine using sisyphus.cloud/download!. Maybe it is hanging somewhere else in the task logic?

Excellent question. One way to test that hypothesis would be to look on the worker for the file:

wc /tmp/sisyphus/inputs/DemoWorkflow/20191202.165350/lines.txt

wc should print 100 400 1590. Sisyphus deletes all task outputs before starting the next task, and it uses separate directories for inputs & outputs, so if the file is present while the worker is stuck, it must've downloaded it for the stuck task.

The last thing that journalctl printed before Sisyphus got stuck was:

Dec 03 01:47:42 sisyphus-jerry-demo-20191202-174449-0 bash[1157]: INFO: downloading /tmp/sisyphus/inputs/DemoWorkflow/20191202.165350/lines.txt from sisyphus-jerry:DemoWorkflow/20191202.165350/lines.txt

Pretty much the only thing that can get stuck between

(log/info! "downloading" local "from" (str bucket ":" key))
(cloud/download! storage bucket key local)

and the next log message

(log/info! "created container" id config)

is

(docker/create! docker config)

We could check if the new docker container got created. The worker now deletes containers so this should be clearcut.

prismofeverything commented 4 years ago

Hmmm I can imagine docker/create! failing to pull the image. Almost more plausible than storage. I've seen docker pull fail before. Apparently we don't have enough log messages (!)

We could check if the new docker container got created. The worker now deletes containers so this should be clearcut.

This would be awesome except of course it has deleted itself! We should leave these failed VMs around for debugging purposes.

1fish2 commented 4 years ago

Sorry to say that I deleted that worker and its predecessor. That's a lesson for next time.

I'll add log messages around these steps.
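
Something along these lines; the function name and bindings here are invented for illustration, and the real code inlines these steps:

;; assumes the existing log, cloud, and docker namespaces are required
(defn pull-input-and-create!
  [storage bucket key local docker config]
  (log/info! "downloading" local "from" (str bucket ":" key))
  (cloud/download! storage bucket key local)
  (log/info! "finished downloading" local)   ; new
  (log/info! "creating container" config)    ; new
  (let [id (docker/create! docker config)]
    (log/info! "created container" id config)
    id))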

Note that this task pulls python:2.7.16, which the previous task already pulled, and for that matter pulling the image happens before pulling the inputs. So it's just creating the docker container that could've failed there.

prismofeverything commented 4 years ago

So it seems either storage is hanging or docker create is hanging... Creating a container is not necessarily a simple operation, especially inside a VM and through a Java client. I can imagine something going wrong there, probably in the interaction. More investigation required : )

1fish2 commented 4 years ago

I added some logging and looked around after the worker got stuck. There is no /tmp/sisyphus/inputs directory, which means either (1) it got stuck before trying to download the file (and certainly before creating the container) or (2) it threw an exception that it didn't log and then the finally clause called delete-local-trees!.

When su'd to the sisyphus user, I can mkdir /tmp/sisyphus/newdir just fine.

It must be stuck in cloud/download! or its subroutine download-blob! or in error handling.

1fish2 commented 4 years ago

gsutil downloaded that file just fine while the worker was stuck there.

I'm not having success (nor a useful stack trace) calling cloud/download! directly.

sisyphus.core=> (def s (cloud/connect-storage! {}))
#'sisyphus.core/s

sisyphus.core=> (cloud/download! s "sisyphus-jerry" "DemoWorkflow/20191202.234412/lines.txt" "/tmp/debug/lines.txt")
Execution error (NoSuchMethodError) at com.google.cloud.storage.spi.v1.HttpStorageRpc/createReadRequest (HttpStorageRpc.java:655).
'com.google.api.client.googleapis.services.AbstractGoogleClientRequest com.google.api.services.storage.Storage$Objects$Get.setReturnRawInputStream(boolean)'

...
sisyphus.core=> (cloud/download-blob! blob "/tmp/debug/lines.txt")
Execution error (NoSuchMethodError) at com.google.cloud.storage.spi.v1.HttpStorageRpc/createReadRequest (HttpStorageRpc.java:655).
'com.google.api.client.googleapis.services.AbstractGoogleClientRequest com.google.api.services.storage.Storage$Objects$Get.setReturnRawInputStream(boolean)'

prismofeverything commented 4 years ago

Oh yeah! On my branch I have the line [com.google.api-client/google-api-client "1.30.4"] in project.clj uncommented; I think I had to do that awhile back. Not sure how it got commented out in the first place, but it sounds like you are missing the Google client API. That might be the whole problem.
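
For reference, roughly what that looks like in project.clj (everything except the api-client line is a placeholder here, and the other entries are omitted):

(defproject sisyphus "0.1.0"                                            ; placeholder version
  :dependencies [[org.clojure/clojure "1.10.0"]                         ; placeholder
                 [com.google.api-client/google-api-client "1.30.4"]])   ; must stay uncommented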

1fish2 commented 4 years ago

I'll fix a bug in cloud/download!: if blob is nil, it currently calls log/error! but it ought to throw an exception rather than press on.

In this case it isn't logging that message, and it can tell the file exists in cloud storage:

(cloud/list-directory s "sisyphus-jerry" "DemoWorkflow/20191202.234412/")
("sisyphus-jerry:DemoWorkflow/20191202.234412/" "sisyphus-jerry:DemoWorkflow/20191202.234412/lines.txt" "sisyphus-jerry:DemoWorkflow/20191202.234412/logs/" "sisyphus-jerry:DemoWorkflow/20191202.234412/logs/lines.log")

1fish2 commented 4 years ago

There are more calls to log/exception! in that namespace but I'll revisit them another day.

1fish2 commented 4 years ago

it sounds like you are missing the google client API. That might be the whole problem.

Whoa! That could be. How'd you connect that to the skimpy error msg with a missing stack trace? Why is it only now failing (in Sisyphus workers)? Why does pushing files work but pulling fails?

1fish2 commented 4 years ago

It worked!!! I uncommented com.google.api-client/google-api-client and restarted the worker (which of course didn't time out all night), and it quickly completed the task!

(We could update to com.google.api-client/google-api-client 1.30.5 and there are probably newer versions of the other cloud libs.)

Does Gaia need this updated Sisyphus lib, or is it OK just listing cloud storage blobs?

prismofeverything commented 4 years ago

How'd you connect that to the skimpy error msg with a missing stack trace?

It came to me like a flash, Jerry.

It worked!!!

Awesome! Yeah I'll make a new package for Sisyphus that Gaia can use.

1fish2 commented 4 years ago

Split out #25 and #26.