codalab / codabench

Codabench is a flexible, easy-to-use, and reproducible benchmarking platform. Check out our paper in Patterns (Cell Press): https://hubs.li/Q01fwRWB0

Update Python using Poetry (Issue #1413) #1416

Closed: bbearce closed this 2 weeks ago

bbearce commented 2 months ago

Reviewers

@Didayolo, @ihsaan-ullah

Summary

This PR is meant to transition us from pip to Poetry for dependency management. I only bumped the Python version to 3.9, and not even for all Dockerfiles; I ran into issues with Python 3.10 because too many packages would need to change. I think the only realistic way to get there is to upgrade one package at a time and slowly bump the Python version over time. For now I really just switched us from pip to Poetry while keeping the same package versions. The one exception is ipdb, which I upgraded a bit because Python 3.9 required it.

Dockerfiles updated:

PS: Do we use Dockerfile.celery at all?

Issues this PR resolves

Checklist

Didayolo commented 2 months ago

Nice progress. This is a fundamental change that needs to be tested thoroughly before merging.

ihsaan-ullah commented 2 months ago

Nice work, @bbearce. I see that there is now Poetry for the compute worker. Will this affect how a compute worker is set up? https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup

Didayolo commented 2 months ago

Also, what is the file compute_worker/poetry.lock?

Didayolo commented 3 weeks ago
bbearce commented 2 weeks ago

I was working on this over the weekend. A couple of questions.

What do you think?

ihsaan-ullah commented 2 weeks ago

If you want to use a CUDA image, you can check this one, which we are using in another project:

https://github.com/FAIR-Universe/HEP-Challenge/blob/master/docker/Dockerfile

bbearce commented 2 weeks ago

Exactly. The more I think about it, this is great, but how closely should the GPU worker and the CPU worker match, Dockerfile-wise? My take is that you'd want them to be as similar as possible, but I'm worried that using an NVIDIA image for the CPU worker doesn't make sense, so the two would end up with different bases. Does anyone else think having both workers start from Ubuntu is too low-level?

PS: Totally down to do an nvidia image for gpu worker and python or ubuntu for cpu worker if folks don't think the different base matters much. Would greatly simplify gpu setup, effectively allowing us to skip cuda install altogether.

Didayolo commented 2 weeks ago

Totally down to do an nvidia image for gpu worker and python or ubuntu for cpu worker if folks don't think the different base matters much. Would greatly simplify gpu setup, effectively allowing us to skip cuda install altogether.

Yeah, I guess we can go for this.

Didayolo commented 2 weeks ago

We updated the compute worker docker images:

The CPU one is currently running submissions on the test server without problems.

We are in the process of testing the GPU version. Bug:

Traceback (most recent call last):
  File "/usr/bin/celery", line 5, in <module>
    from celery.__main__ import main
ModuleNotFoundError: No module named 'celery'

Didayolo commented 2 weeks ago

Testing the GPU compute worker, I get this error on a submission of the "GPU test" bundle:

[2024-06-19 18:02:02,324: INFO/MainProcess] compute-worker@88d49e766880 ready.
[2024-06-19 18:02:02,325: INFO/MainProcess] Received task: compute_worker_run[ef4800b6-9ff8-4626-8702-260f99d2c8b5]  
[2024-06-19 18:02:02,426: INFO/ForkPoolWorker-1] Received run arguments: {'user_pk': 7, 'submissions_api_url': 'https://www.codabench.org/api', 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15', 'docker_image': 'codalab/codalab-legacy:gpu', 'execution_time_limit': 600, 'id': 71224, 'is_scoring': False, 'prediction_result': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/prediction_result/2024-06-19-1718819984/6d99e685bc54/prediction_result.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=34B6XI%2FOee1gMw1L8p88gx5UNw8%3D&content-type=application%2Fzip&Expires=1718906385', 'ingestion_only_during_scoring': False, 'program_data': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-06-19-1718819979/0b3a8502929c/submission.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=Q4%2FLO4Va4yIINYzRn2JSCcz%2BgBQ%3D&Expires=1718906385', 'prediction_stdout': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/87a4537c79e3/prediction_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=BHhGAaPiK%2BBz0VgIyuBkzyy3hOM%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/ea6d07365e4b/prediction_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=5G2icoWHEBZgMLXJrbSxAAUKsGA%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_ingestion_stdout': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/a890fb3b7207/prediction_ingestion_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=FSOuv3o9UGq4hjRnpJTa8jKe8Nk%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_ingestion_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/e030e979fdb9/prediction_ingestion_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=oFdZi6FjiY1BckMc%2BoXct6l%2BwpY%3D&content-type=application%2Fzip&Expires=1718906385'}
[2024-06-19 18:02:02,427: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/71224/ with data = {'status': 'Preparing', 'status_details': None, 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15'}
[2024-06-19 18:02:02,702: INFO/ForkPoolWorker-1] Submission updated successfully!
[2024-06-19 18:02:02,702: INFO/ForkPoolWorker-1] Checking if cache directory needs to be pruned...
[2024-06-19 18:02:02,703: INFO/ForkPoolWorker-1] Cache directory does not need to be pruned!
[2024-06-19 18:02:02,703: INFO/ForkPoolWorker-1] Getting bundle https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-06-19-1718819979/0b3a8502929c/submission.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=Q4%2FLO4Va4yIINYzRn2JSCcz%2BgBQ%3D&Expires=1718906385 to unpack @ program
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Beginning MD5 checksum of submission: /codabench/tmp42i5mcin/bundles/tmpctn6s1_z
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Checksum result: 573f142dfb0c45b19c063131db595bb5
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/71224/ with data = {'md5': '573f142dfb0c45b19c063131db595bb5', 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15'}
[2024-06-19 18:02:02,870: INFO/ForkPoolWorker-1] Submission updated successfully!
[2024-06-19 18:02:02,870: INFO/ForkPoolWorker-1] Running pull for image: codalab/codalab-legacy:gpu
[2024-06-19 18:02:02,873: INFO/ForkPoolWorker-1] Destroying submission temp dir: /codabench/tmp42i5mcin
[2024-06-19 18:02:02,876: ERROR/ForkPoolWorker-1] Task compute_worker_run[ef4800b6-9ff8-4626-8702-260f99d2c8b5] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/compute_worker.py", line 115, in run_wrapper
    run.prepare()
  File "/compute_worker.py", line 803, in prepare
    self._get_container_image(self.container_image)
  File "/compute_worker.py", line 367, in _get_container_image
    container_engine_pull = check_output(cmd)
  File "/usr/local/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/local/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-docker'

This comes from the new Dockerfile; with the older docker image the error does not appear. Indeed, we are now using the new NVIDIA Container Toolkit, so we should update the code to stop calling nvidia-docker.
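
For reference, a minimal sketch of what the image pull could look like without the nvidia-docker wrapper (the function and constant names below are illustrative, not the actual compute_worker.py code). Pulling an image never needs GPU access; with the Container Toolkit, GPUs are only requested at docker run time via --gpus.

import os
from subprocess import check_output

# Illustrative stand-in for the worker's configured engine ("docker" or "podman").
CONTAINER_ENGINE_EXECUTABLE = os.environ.get("CONTAINER_ENGINE_EXECUTABLE", "docker")

def pull_container_image(image_name):
    # Pull with the plain container engine; the legacy nvidia-docker binary is
    # not needed here, since GPU access is only requested later when the image is run.
    cmd = [CONTAINER_ENGINE_EXECUTABLE, 'pull', image_name]
    return check_output(cmd)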

Didayolo commented 2 weeks ago

Now the CPU version is broken...

[2024-06-19 19:26:28,933: INFO/ForkPoolWorker-1] Connecting to wss://codabench-test.lri.fr/submission_input/6/797/d3673c8d-e345-4f34-aa44-73e01e63c530/
[2024-06-19 19:26:31,802: WARNING/ForkPoolWorker-1] WS: b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n'
[2024-06-19 19:26:31,802: INFO/ForkPoolWorker-1] Process exited with 125
[2024-06-19 19:26:31,802: INFO/ForkPoolWorker-1] Disconnecting from websocket wss://codabench-test.lri.fr/submission_input/6/797/d3673c8d-e345-4f34-aa44-73e01e63c530/
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] [exited with 125]
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] [stderr]
b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n'
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] Putting raw data b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n' in https://minio-test.lri.fr/codabench-private/submission_details/2024-06-19-1718825186/183b292a2ba8/scoring_stderr.txt?AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Signature=08Orj1ptjBxO7cguAlMCA83YwGs%3D&content-type=application%2Fzip&Expires=1718911586

The code in compute_worker.py:

        engine_cmd = [
            CONTAINER_ENGINE_EXECUTABLE,
            'run',
            # Remove it after run
            '--rm',
            f'--name={self.ingestion_container_name if kind == "ingestion" else self.program_container_name}',

            # Don't allow subprocesses to raise privileges
            '--security-opt=no-new-privileges',

            # GPU or not
            '--gpus', 
            'all' if os.environ.get("USE_GPU") else '0',

            # Set the volumes
            '-v', f'{self._get_host_path(program_dir)}:/app/program',
            '-v', f'{self._get_host_path(self.output_dir)}:/app/output',
            '-v', f'{self.data_dir}:/app/data:ro',

            # Start in the right directory
            '-w', '/app/program',

            # Don't buffer python output, so we don't lose any
            '-e', 'PYTHONUNBUFFERED=1',
        ]
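
A likely direction for the fix (a sketch only, reusing the USE_GPU flag from the snippet above; the helper name is hypothetical): only append --gpus when the worker is actually a GPU worker. Passing the flag at all, even as '--gpus 0', makes Docker look for the NVIDIA device driver and fail on a CPU host with "could not select device driver".

import os

CONTAINER_ENGINE_EXECUTABLE = 'docker'  # illustrative; the worker defines its own

def build_engine_cmd(container_name, use_gpu=None):
    # Request GPUs only when explicitly asked for; a CPU worker must not pass
    # --gpus at all, otherwise Docker tries to load the absent NVIDIA driver.
    if use_gpu is None:
        use_gpu = bool(os.environ.get('USE_GPU'))
    gpu_args = ['--gpus', 'all'] if use_gpu else []
    return [
        CONTAINER_ENGINE_EXECUTABLE,
        'run',
        # Remove the container after the run
        '--rm',
        f'--name={container_name}',
        # Don't allow subprocesses to raise privileges
        '--security-opt=no-new-privileges',
        *gpu_args,
        # Don't buffer python output, so we don't lose any
        '-e', 'PYTHONUNBUFFERED=1',
    ]
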
Didayolo commented 2 weeks ago

Fixed!