codalab / codabench

Codabench is a flexible, easy-to-use and reproducible benchmarking platform. Check our paper at Patterns Cell Press https://hubs.li/Q01fwRWB0
Apache License 2.0

Pull for codalab/codalab-legacy:py39 failed #1462

Closed. johanneskruse closed this issue 4 weeks ago.

johanneskruse commented 1 month ago

Hi,

I am running a competition using the default Docker image (codalab/codalab-legacy).

The submissions have started to fail, giving me this error:

pull-docker-failed

When resubmitting, it sometimes goes through; other times, it doesn't. Any explanation or suggestions on what I can do?

I started to notice this on May 29. I haven't had this issue before. Competition has been running since April.

Best, Johannes

Didayolo commented 1 month ago

Hi @johanneskruse,

Is your competition using the default queue?

@ObadaS Could this be due to the new setup of workers with 60 GB allocated for Docker?

ObadaS commented 1 month ago

> @ObadaS Could this be due to the new setup of workers with 60 GB allocated for Docker?

It is possible, and it would explain why it sometimes works, since not all workers have the same images stored. However, I changed the crontab to prune every 6 hours, so it would be odd for the same problem to occur several days in a row unless a competition is using very large images.

johanneskruse commented 1 month ago

Hi @Didayolo,

I am running the competition using my own remote workers. I had 2-3 days where it was bad, then it got better, and now it happens all the time again.

johanneskruse commented 1 month ago

@ObadaS - should I remove the crontab; it is currently quite a big issue.

Didayolo commented 1 month ago

@johanneskruse

If you are using your own compute workers, you can find more logs by connecting to the machines and running the following command:

docker logs -f compute_worker

This should help you understand why the docker pull command sometimes fails. It may be due to connection issues, lack of storage, etc.
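A few read-only checks on the worker machine can help tell a storage problem from a connection problem. This is only a sketch: the helper name `disk_use_percent` is hypothetical, and it assumes Docker stores its data on the root filesystem (the default on Linux is /var/lib/docker).

```shell
#!/bin/sh
# Hypothetical helper: print the use% (digits only) of the filesystem holding a path.
disk_use_percent() {
    # df -P gives POSIX-format output; on line 2, column 5 looks like "42%".
    df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# How full is the root filesystem?
echo "Disk use on /: $(disk_use_percent /)%"

# If Docker is installed, show the space taken by images, containers and volumes.
if command -v docker >/dev/null 2>&1; then
    docker system df || echo "could not query the Docker daemon" >&2
fi

# To reproduce the failing pull by hand and read the full error message:
#   docker pull codalab/codalab-legacy:py39
```

If the disk is nearly full, storage is the likely cause. If there is plenty of space and the manual pull still fails, a network or registry issue is more plausible.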

> should I remove the crontab; it is currently quite a big issue.

The goal of the crontab is to remove unused Docker images and avoid cluttering the disk. It should not cause any issue. However, if your workers are linked to only one competition, only one Docker image will be used, so you can indeed remove the crontab.
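For illustration, a scheduled cleanup of this kind could look roughly like the sketch below. The script path, log file, schedule and the `DRY_RUN` switch are all assumptions made for this example, not the exact setup used on the Codabench workers; the wiki page is the reference for the actual crontab entry.

```shell
#!/bin/sh
# Hypothetical cleanup script, e.g. saved as /usr/local/bin/prune_docker_images.sh
# and scheduled with a crontab line such as (every 6 hours, assumed schedule):
#   0 */6 * * * DRY_RUN=0 /usr/local/bin/prune_docker_images.sh >> /var/log/docker_prune.log 2>&1

prune_unused_images() {
    # Safe by default: only print the command unless DRY_RUN=0 is set.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: docker image prune -af"
        return 0
    fi
    # Skip quietly on machines without Docker.
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found, nothing to prune" >&2
        return 0
    fi
    # -a removes every image not used by a container, -f skips the confirmation prompt.
    docker image prune -af || echo "prune failed" >&2
}

prune_unused_images
```

Note that pruning with `-a` means a removed image has to be pulled again the next time a submission needs it.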

johanneskruse commented 1 month ago

It seems the worker had run out of storage. I deleted it and started a new one, and it seems to be working again.

Is there a way to prevent it from running out of storage? It is good for a period, but then suddenly it's all full.

@Didayolo thanks for the quick reply.

ObadaS commented 4 weeks ago

@johanneskruse Docker images are the main objects taking up space. How big is the storage that the worker has access to?

johanneskruse commented 4 weeks ago

The default storage on the worker is 45 GB. This can be increased.

ObadaS commented 4 weeks ago

I recommend you increase it to 100 GB, it should fix your storage issues.
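To check what a worker currently has, something like the following sketch can be run on the machine. The helper name `fs_size_gb` is hypothetical, and the path assumes Docker's default data directory on Linux; adjust it if your Docker data lives elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: total size, in whole GiB, of the filesystem holding a path.
fs_size_gb() {
    # df -P -k prints POSIX-format output in 1024-byte blocks; column 2 is the total size.
    df -P -k "$1" | awk 'NR==2 { printf "%d\n", $2 / 1024 / 1024 }'
}

# Assumed path: Docker's default data directory on Linux; fall back to / if it is missing.
target=/var/lib/docker
[ -d "$target" ] || target=/

recommended_gb=100   # value suggested in this thread
size_gb=$(fs_size_gb "$target")

if [ "$size_gb" -lt "$recommended_gb" ]; then
    echo "Filesystem for $target is about ${size_gb} GB, below the suggested ${recommended_gb} GB"
else
    echo "Filesystem for $target is about ${size_gb} GB"
fi
```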

Didayolo commented 4 weeks ago

Also, if your worker is pulling different docker images (from different competitions) it is important to include the crontab (see https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup).

I marked this issue as solved, but feel free to come back to us if you are still experiencing issues.

johanneskruse commented 4 weeks ago

> I recommend you increase it to 100 GB, it should fix your storage issues.

It might be worth mentioning this on the Compute Worker Management Setup page as a recommendation.

Thank you for the help; it has been running smoothly since.