Closed by bbearce 2 weeks ago
Nice progress. This is a fundamental change that needs to be tested deeply before merging.
Nice work, @bbearce. I see that there is now poetry for the compute worker. Will this affect setting up a compute worker? https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup
Also, what is the file compute_worker/poetry.lock?
This currently runs on https://codabench-test.lri.fr/
We also need to test the compute workers.
Building the GPU compute worker Dockerfile did NOT work
We are supposed to remove the requirements.txt files in this PR
Was working on this over the weekend. A couple of questions.
What do you think?
If you want to use a CUDA image, you can check this one, which we are using in another project:
https://github.com/FAIR-Universe/HEP-Challenge/blob/master/docker/Dockerfile
Exactly. The more I think about it, this is great, but how much should the GPU worker and the CPU worker match, Dockerfile-wise? My take is that you'd want them to be as similar as possible, and I'm worried that using an nvidia image for the CPU worker doesn't make sense, so the two would come from different bases. Does anyone else think having both workers start from ubuntu is too low-level?
PS: Totally down to do an nvidia image for the GPU worker and python or ubuntu for the CPU worker if folks don't think the different bases matter much. It would greatly simplify GPU setup, effectively allowing us to skip the CUDA install altogether.
> Totally down to do an nvidia image for the GPU worker and python or ubuntu for the CPU worker if folks don't think the different bases matter much. It would greatly simplify GPU setup, effectively allowing us to skip the CUDA install altogether.
Yeah I guess we can go for this
We updated the compute worker docker images:
- test: for testing CPU
- gpu: for GPU version

The CPU one is currently running submissions on the test server without problems. We are in the process of testing the GPU version. Bug:
Traceback (most recent call last):
File "/usr/bin/celery", line 5, in <module>
from celery.__main__ import main
ModuleNotFoundError: No module named 'celery'
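Possibly a poetry virtualenv issue: if poetry installs dependencies into its own virtualenv, the system-wide /usr/bin/celery entry point is left pointing at an interpreter that does not have the package. A quick diagnostic sketch (hypothetical, to be run inside the worker image) to check:

import shutil
import subprocess
import sys

# Diagnostic sketch: compare the interpreter behind the `celery` entry
# point with the one running this script, then try the import directly.
print("celery entry point:", shutil.which("celery"))
print("current interpreter:", sys.executable)

# A non-zero exit code here reproduces the ModuleNotFoundError above.
result = subprocess.run([sys.executable, "-c", "import celery"])
print("import exit code:", result.returncode)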
Testing the GPU compute worker, I get this error on a submission of the "GPU test" bundle:
[2024-06-19 18:02:02,324: INFO/MainProcess] compute-worker@88d49e766880 ready.
[2024-06-19 18:02:02,325: INFO/MainProcess] Received task: compute_worker_run[ef4800b6-9ff8-4626-8702-260f99d2c8b5]
[2024-06-19 18:02:02,426: INFO/ForkPoolWorker-1] Received run arguments: {'user_pk': 7, 'submissions_api_url': 'https://www.codabench.org/api', 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15', 'docker_image': 'codalab/codalab-legacy:gpu', 'execution_time_limit': 600, 'id': 71224, 'is_scoring': False, 'prediction_result': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/prediction_result/2024-06-19-1718819984/6d99e685bc54/prediction_result.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=34B6XI%2FOee1gMw1L8p88gx5UNw8%3D&content-type=application%2Fzip&Expires=1718906385', 'ingestion_only_during_scoring': False, 'program_data': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-06-19-1718819979/0b3a8502929c/submission.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=Q4%2FLO4Va4yIINYzRn2JSCcz%2BgBQ%3D&Expires=1718906385', 'prediction_stdout': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/87a4537c79e3/prediction_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=BHhGAaPiK%2BBz0VgIyuBkzyy3hOM%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/ea6d07365e4b/prediction_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=5G2icoWHEBZgMLXJrbSxAAUKsGA%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_ingestion_stdout': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/a890fb3b7207/prediction_ingestion_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=FSOuv3o9UGq4hjRnpJTa8jKe8Nk%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_ingestion_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/e030e979fdb9/prediction_ingestion_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=oFdZi6FjiY1BckMc%2BoXct6l%2BwpY%3D&content-type=application%2Fzip&Expires=1718906385'}
[2024-06-19 18:02:02,427: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/71224/ with data = {'status': 'Preparing', 'status_details': None, 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15'}
[2024-06-19 18:02:02,702: INFO/ForkPoolWorker-1] Submission updated successfully!
[2024-06-19 18:02:02,702: INFO/ForkPoolWorker-1] Checking if cache directory needs to be pruned...
[2024-06-19 18:02:02,703: INFO/ForkPoolWorker-1] Cache directory does not need to be pruned!
[2024-06-19 18:02:02,703: INFO/ForkPoolWorker-1] Getting bundle https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-06-19-1718819979/0b3a8502929c/submission.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=Q4%2FLO4Va4yIINYzRn2JSCcz%2BgBQ%3D&Expires=1718906385 to unpack @ program
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Beginning MD5 checksum of submission: /codabench/tmp42i5mcin/bundles/tmpctn6s1_z
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Checksum result: 573f142dfb0c45b19c063131db595bb5
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/71224/ with data = {'md5': '573f142dfb0c45b19c063131db595bb5', 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15'}
[2024-06-19 18:02:02,870: INFO/ForkPoolWorker-1] Submission updated successfully!
[2024-06-19 18:02:02,870: INFO/ForkPoolWorker-1] Running pull for image: codalab/codalab-legacy:gpu
[2024-06-19 18:02:02,873: INFO/ForkPoolWorker-1] Destroying submission temp dir: /codabench/tmp42i5mcin
[2024-06-19 18:02:02,876: ERROR/ForkPoolWorker-1] Task compute_worker_run[ef4800b6-9ff8-4626-8702-260f99d2c8b5] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 650, in __protected_call__
return self.run(*args, **kwargs)
File "/compute_worker.py", line 115, in run_wrapper
run.prepare()
File "/compute_worker.py", line 803, in prepare
self._get_container_image(self.container_image)
File "/compute_worker.py", line 367, in _get_container_image
container_engine_pull = check_output(cmd)
File "/usr/local/lib/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/local/lib/python3.9/subprocess.py", line 505, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/local/lib/python3.9/subprocess.py", line 951, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/local/lib/python3.9/subprocess.py", line 1837, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-docker'
This comes from the new Dockerfile; with the older docker image the error does not appear. Indeed, we are now using the new NVIDIA Container Toolkit, so we should update the code to stop using nvidia-docker.
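For reference, a minimal sketch of the change, assuming _get_container_image currently shells out to a hard-coded 'nvidia-docker' (the environment-variable default below is an assumption): with the NVIDIA Container Toolkit, GPU access is requested at run time via --gpus, so both workers can pull through the same engine executable.

import os
from subprocess import check_output

# Sketch: pull through the configured container engine instead of the
# deprecated `nvidia-docker` wrapper. With the NVIDIA Container Toolkit,
# GPUs are requested at `run` time via `--gpus`, so `pull` needs no
# special-casing for the GPU worker.
CONTAINER_ENGINE_EXECUTABLE = os.environ.get("CONTAINER_ENGINE_EXECUTABLE", "docker")

def _get_container_image(container_image):
    cmd = [CONTAINER_ENGINE_EXECUTABLE, "pull", container_image]
    return check_output(cmd)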
Now the CPU version is broken...
[2024-06-19 19:26:28,933: INFO/ForkPoolWorker-1] Connecting to wss://codabench-test.lri.fr/submission_input/6/797/d3673c8d-e345-4f34-aa44-73e01e63c530/
[2024-06-19 19:26:31,802: WARNING/ForkPoolWorker-1] WS: b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n'
[2024-06-19 19:26:31,802: INFO/ForkPoolWorker-1] Process exited with 125
[2024-06-19 19:26:31,802: INFO/ForkPoolWorker-1] Disconnecting from websocket wss://codabench-test.lri.fr/submission_input/6/797/d3673c8d-e345-4f34-aa44-73e01e63c530/
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] [exited with 125]
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] [stderr]
b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n'
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] Putting raw data b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n' in https://minio-test.lri.fr/codabench-private/submission_details/2024-06-19-1718825186/183b292a2ba8/scoring_stderr.txt?AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Signature=08Orj1ptjBxO7cguAlMCA83YwGs%3D&content-type=application%2Fzip&Expires=1718911586
The code in compute_worker.py:
engine_cmd = [
CONTAINER_ENGINE_EXECUTABLE,
'run',
# Remove it after run
'--rm',
f'--name={self.ingestion_container_name if kind == "ingestion" else self.program_container_name}',
# Don't allow subprocesses to raise privileges
'--security-opt=no-new-privileges',
# GPU or not
'--gpus',
'all' if os.environ.get("USE_GPU") else '0',
# Set the volumes
'-v', f'{self._get_host_path(program_dir)}:/app/program',
'-v', f'{self._get_host_path(self.output_dir)}:/app/output',
'-v', f'{self.data_dir}:/app/data:ro',
# Start in the right directory
'-w', '/app/program',
# Don't buffer python output, so we don't lose any
'-e', 'PYTHONUNBUFFERED=1',
]
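The culprit is the --gpus option: even on a CPU worker the command passes '--gpus 0', and passing --gpus at all makes Docker look for the nvidia device driver, which a CPU-only host doesn't have. A minimal sketch of a fix, appending the flag only when USE_GPU is set:

# Sketch of a fix (same context as the block above): only request GPUs
# when the worker is configured for them. Passing `--gpus` at all, even
# as `--gpus 0`, triggers "could not select device driver" on hosts
# without the nvidia runtime.
engine_cmd = [
    CONTAINER_ENGINE_EXECUTABLE,
    'run',
    '--rm',
    # ... same name, security, volume, workdir and env options as above ...
]
if os.environ.get("USE_GPU"):
    engine_cmd += ['--gpus', 'all']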
Fixed!
Reviewers
@Didayolo, @ihsaan-ullah
Summary
Meant to transition us from pip to poetry for dependency management. I only bumped the Python version to 3.9, and not even for all Dockerfiles; I ran into issues using Python 3.10, since there are too many changes across a variety of packages. I think the only realistic way to do this is one package at a time, slowly bumping the Python version over time. For now I really just switched us from pip to poetry but kept the same package versions. The one exception is ipdb, which I upgraded a bit because Python 3.9 required it.
Dockerfiles updated:
Dockerfile
Dockerfile.compute_worker
Dockerfile.compute_worker_gpu
Dockerfile.flower
PS: Do we use Dockerfile.celery at all?
Issues this PR resolves
#1413
Checklist