DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0
265 stars 44 forks source link

dsub 0.4.3 crashes on ubuntu 20.04 #219

Closed hylkedonker closed 3 years ago

hylkedonker commented 3 years ago

When I run the hello world example

dsub \
   --provider local \
   --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
   --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
   --command 'echo "Hello World" > "${OUT}"' \
   --wait

it crashes as follows:

***WARNING: No Docker image specified. The default, `ubuntu:14.04` will be used.
***WARNING: For reproducible pipelines, specify an image with the `--image` flag.
Job properties:
  job-id: echo--hylke--201220-165144-45
  job-name: echo
  user-id: hylke
Launched job-id: echo--hylke--201220-165144-45
To check the status, run:
  dstat --provider local --jobs 'echo--hylke--201220-165144-45' --users 'hylke' --status '*'
To cancel the job, run:
  ddel --provider local --jobs 'echo--hylke--201220-165144-45' --users 'hylke'
Waiting for job to complete...
Waiting for: echo--hylke--201220-165144-45.
Traceback (most recent call last):
  File "/home/hylke/.local/bin/dsub", line 8, in <module>
    sys.exit(main())
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1106, in main
    dsub_main(prog, argv)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1091, in dsub_main
    launched_job = run_main(args)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1168, in run_main
    return run(
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1322, in run
    error_messages = _wait_after(provider, [job_metadata['job-id']],
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 786, in _wait_after
    jobs_left = _wait_for_any_job(provider, job_ids_to_check, poll_interval,
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1015, in _wait_for_any_job
    tasks = provider.lookup_job_tasks({'*'}, job_ids=job_ids)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/providers/local.py", line 519, in lookup_job_tasks
    task = self._get_task_from_task_dir(j, u, task_id, task_attempt)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/providers/local.py", line 669, in _get_task_from_task_dir
    end_time = self._get_end_time_from_task_dir(task_dir)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/providers/local.py", line 583, in _get_end_time_from_task_dir
    datetime.datetime.strptime(f.readline().strip(),
  File "/usr/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '' does not match format '%Y-%m-%d %H:%M:%S.%f

Any ideas how to get the Hello World example to work?

Versions: dsub: 0.4.3 Ubuntu: 20.04 Python: 3.8.5 Docker: 19.03.11

Thanks,

Hylke

wnojopra commented 3 years ago

Hi @hylkedonker,

That is a very interesting error, and one that I haven't ran across before. When using the local provider, it writes a timestamp to a tmp file named 'end-time.txt' in /tmp. The error seems to indicate that that either that file is empty, or was read incorrectly. There are a few things you could do to help debug:

1) Do you still get the same error if you run the hello world example again? 2) Perhaps there is a problem getting the timestamp? Run this command: python -c 'import datetime; print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))' You should see a timestamp. If not, then we'll need to debug your system python. 3) Perhaps there is a problem writing to the file? Run this command: cat /tmp/dsub-local/echo--hylke--201220-165144-45/task/end-time.txt (or replace echo--hylke--201220-165144-45 with any job id). Let me know what is output here.

hylkedonker commented 3 years ago

Thanks for getting in touch.

  1. Yes, I can confirm that this problem still persists when running the example again.
  2. python3 -c 'import datetime; print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))' gives the output 2020-12-21 12:01:42.924656.
  3. The output indicates an empty file:
    
    $ cat /tmp/dsub-local/echo--hylke--201221-115840-24/task/end-time.txt 

$

4. When I uninstall the package and reinstall in the virtual environment:
```bash
python3 -m venv dsub_libs
source dsub_libs/bin/activate
pip3 install dsub
dsub \        
   --provider local \
   --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
   --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
   --command 'echo "Hello World" > "${OUT}"' \
   --wait

I get a slightly different error:

***WARNING: No Docker image specified. The default, `ubuntu:14.04` will be used.
***WARNING: For reproducible pipelines, specify an image with the `--image` flag.
Job properties:
  job-id: echo--hylke--201221-123051-85
  job-name: echo
  user-id: hylke
Launched job-id: echo--hylke--201221-123051-85
To check the status, run:
  dstat --provider local --jobs 'echo--hylke--201221-123051-85' --users 'hylke' --status '*'
To cancel the job, run:
  ddel --provider local --jobs 'echo--hylke--201221-123051-85' --users 'hylke'
Waiting for job to complete...
Waiting for: echo--hylke--201221-123051-85.
  echo--hylke--201221-123051-85: FAILURE
['2020-12-21 12:30:54 I: Copying the logs before cleanup']
JobExecutionError: One or more jobs finished with status FAILURE or CANCELED during wait.
echo--hylke--201221-123051-85

with the stderr output:

$ cat stderr.txt 
Error: No such container: dsub-echo--hylke--201221-123051-85.task

I hope this helps.

wnojopra commented 3 years ago

Hi @hylkedonker, do you have Docker installed? The Hello World example uses the local provider, and installing Docker is necessary if using the local provider.

Going back to your first error, I imagine looking at cat /tmp/dsub-local/echo--hylke--201220-165144-45/task/stderr.txt may also reveal information about whether Docker is installed.

The error we are seeing here is certainly not helpful. In the past, I recall seeing errors about Docker not being installed. I will work on a fix about improving the error handling here.

hylkedonker commented 3 years ago

I do have docker installed on my system.

On Mon, 21 Dec 2020, 18:37 Willy Nojopranoto, notifications@github.com wrote:

Hi @hylkedonker https://github.com/hylkedonker, do you have Docker installed? The Hello World example uses the local provider, and installing Docker is necessary https://github.com/DataBiosphere/dsub#post-installation-steps if using the local provider.

Going back to your first error, I imagine looking at cat /tmp/dsub-local/echo--hylke--201220-165144-45/task/stderr.txt may also reveal information about whether Docker is installed.

The error we are seeing here is certainly not helpful. In the past, I recall seeing errors about Docker not being installed. I will work on a fix about improving the error handling here.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DataBiosphere/dsub/issues/219#issuecomment-749102867, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALRD5VBYQIEEY6UHK6OMHNTSV6BV5ANCNFSM4VDFZCMQ .

wnojopra commented 3 years ago

I was able to get a machine with ubuntu 20.04 running, and was able to reproduce your error regardless of whether Docker was installed.

The root issue is that the local provider makes a call to python (not python3) to get timestamps. On ubuntu 20.04, python3 is installed, but python is not. This is not a problem if you are using dsub from a venv, but we'll fix it in a future release. Using a venv would be my recommended workaround, but if you're unable to resolve the 'No such container' error, you could also try creating a symlink named 'python' that points to 'python3'. This command would do that:

update-alternatives --install /usr/bin/python python /usr/bin/python3 10

hylkedonker commented 3 years ago

I tried the command and was able to verify that python now points to python3. However the "No such container" error still persists (both with and without venv). Any suggestions for further debugging?

wnojopra commented 3 years ago

I'm unable to reproduce the "no such container" error, and doing a couple searches mostly shows that its a transient issue, which isn't too helpful here. There are a couple things we could do to debug further:

1) In the same /tmp/.../task dir, can you share the output of the log.txt and runner-log.txt files? There may be more detailed error messages here. 2) Can we confirm that docker is set up correctly? Does running docker run hello-world produce any errors? 3) Are there any unexpected containers showing when you run docker ps -a?

hylkedonker commented 3 years ago
  1. Output of runner-log.txt:
    2020-12-23 13:19:19 I: Localizing inputs.
    2020-12-23 13:19:19 I: Checking image userid.
    2020-12-23 13:19:19 I: Task image: ubuntu:14.04
    2020-12-23 13:19:21 I: Running Docker image.
    docker: open /tmp/dsub-local/echo--hylke--201223-131919-18/task/docker.env: no such file or directory.
    See 'docker run --help'.
    Error response from daemon: No such container: dsub-echo--hylke--201223-131919-18.task
    2020-12-23 13:19:21 I: Docker exit code .
    2020-12-23 13:19:21 I: Ensure host user (1000:1000) owns Docker-written data
    2020-12-23 13:19:22 I: Copying outputs.
    CommandException: No URLs matched: /tmp/dsub-local/echo--hylke--201223-131919-18/task/data/output/file/tmp/dsub-test/output/out.txt
    CommandException: 1 file/object could not be transferred.
    2020-12-23 13:19:23 E: Error on or near line 47; /tmp/dsub-local/echo--hylke--201223-131919-18/task/runner.sh exiting with status 1
    2020-12-23 13:19:23 I: Copying the logs before cleanup

    As for the error docker: open /tmp/dsub-local/echo--hylke--201223-131919-18/task/docker.env: no such file or directory., that file does seem to exist:

    $ cat /tmp/dsub-local/echo--hylke--201223-131919-18/task/docker.env       
    DATA_ROOT=/mnt/data
    TMPDIR=/mnt/data/tmp
    OUT=/mnt/data/output/file/tmp/dsub-test/output/out.txt
    $

Output of log.txt

2020-12-23 13:19:19 I: Localizing inputs.
2020-12-23 13:19:19 I: Checking image userid.
2020-12-23 13:19:19 I: Task image: ubuntu:14.04
2020-12-23 13:19:21 I: Running Docker image.
2020-12-23 13:19:21 I: Docker exit code .
2020-12-23 13:19:21 I: Ensure host user (1000:1000) owns Docker-written data
2020-12-23 13:19:22 I: Copying outputs.
2020-12-23 13:19:23 E: Error on or near line 47; /tmp/dsub-local/echo--hylke--201223-131919-18/task/runner.sh exiting with status 1
2020-12-23 13:19:23 I: Copying the logs before cleanup
  1. The hello-world seems to work correctly.
    
    $ docker run hello-world

Hello from Docker! This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
  3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
  4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

To try something more ambitious, you can run an Ubuntu container with: $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID: https://hub.docker.com/

For more examples and ideas, visit: https://docs.docker.com/get-started/ $

Also, I haven't experienced any problems with Docker lately. No problems building, running and pushing containers.
I am running
```Bash
$ docker --version      
Docker version 19.03.11, build dd360c7
  1. Nothing pops up that strikes me as out of the ordinary.

I would be more inclined to think that some arguments may be empty, resulting in arguments being passed in the incorrect order.

wnojopra commented 3 years ago

The missing docker.env file looks to be the most suspicious to me, and I'm not sure why we're seeing that. It is a little bit interesting that it is the last file dsub writes before executing the runner.sh script. I wonder if the file is still buffered and not flushed or synced before command to run docker is executed.

I believe the expectation is that because we're using with open(env_fname, 'wt') as f: construct, Python guarantees that the file descriptor close method is called, which flushes the file. This should mean that the subprocess for runner.sh should be able to read the docker.env file, but I'm not sure...

hylkedonker commented 3 years ago

Thanks for the pointer. I'll try to take a closer look after the Christmas holidays.

On Thu, 24 Dec 2020, 00:35 Willy Nojopranoto, notifications@github.com wrote:

The missing docker.env file looks to be the most suspicious to me, and I'm not sure why we're seeing that. It is a little bit interesting that it is the last file dsub writes https://github.com/DataBiosphere/dsub/blob/master/dsub/providers/local.py#L377 before executing the runner.sh script. I wonder if the file is still buffered and not flushed or synced before command to run docker is executed.

I believe the expectation is that because we're using with open(env_fname, 'wt') as f: construct, Python guarantees that the file descriptor close method is called, which flushes the file. This should mean that the subprocess for runner.sh should be able to read the docker.env file, but I'm not sure...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DataBiosphere/dsub/issues/219#issuecomment-750570092, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALRD5VHL6CACEQB6EM24C43SWJ5FJANCNFSM4VDFZCMQ .

hylkedonker commented 3 years ago

It seems that the problem is indeed related to Docker, which does not like the /tmp folder. For example, this does not work

$ docker run \
   --env-file=/tmp/docker.env \
   "ubuntu:14.04"
docker: open /tmp/docker.env: no such file or directory.
See 'docker run --help'.

But if I copy /tmp/docker.env to my $HOME folder, it works okay:

$ docker run \
   --env-file=$HOME/docker.env \
   "ubuntu:14.04"
$

Solution: I removed the snap Docker package

sudo snap remove docker

and installed the apt Docker package instead (after adding the Docker apt repository)

sudo apt-get install docker-ce docker-ce-cli containerd.io

Now it seems to be working as expected (I am now running Docker version 20.10.1, build 831ebea):

$ dsub \                
   --provider local \
   --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
   --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
   --command 'echo "Hello World" > "${OUT}"' \
   --wait
***WARNING: No Docker image specified. The default, `ubuntu:14.04` will be used.
***WARNING: For reproducible pipelines, specify an image with the `--image` flag.
Job properties:
  job-id: echo--hylke--201230-204230-26
  job-name: echo
  user-id: hylke
Launched job-id: echo--hylke--201230-204230-26
To check the status, run:
  dstat --provider local --jobs 'echo--hylke--201230-204230-26' --users 'hylke' --status '*'
To cancel the job, run:
  ddel --provider local --jobs 'echo--hylke--201230-204230-26' --users 'hylke'
Waiting for job to complete...
Waiting for: echo--hylke--201230-204230-26.
  echo--hylke--201230-204230-26: SUCCESS
echo--hylke--201230-204230-26
wnojopra commented 3 years ago

I'm happy to see you were able to get it working. I am not familiar with snap, but it looks like the way it containerizes docker prevents it from accessing /tmp. I can't think of a workaround with dsub other than to use the apt package of docker as you have done.

On our end we'll update the documentation to make this clear, and provider better error handling in the local provider.