ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🐛 Bug: Investigate "No space left on device" in our pipelines #1067

Closed DhanshreeA closed 8 months ago

DhanshreeA commented 8 months ago

Describe the bug.

A number of our model pipelines fail with "No space left on device", typically in the "upload model to dockerhub" stage. Exhibits:

Describe the steps to reproduce the behavior

Go to any one of the jobs above and re-run it. I tried running these jobs when no other jobs were running on our runners, and they still failed.

Expected behavior.

These jobs should pass.

Screenshots.

No response

Operating environment

Runner OS

Additional context

Potentially useful resources:

GemmaTuron commented 8 months ago

Hi @DhanshreeA, I am highlighting this model: https://github.com/ersilia-os/eos9taz/actions as I need it for chemsampler, but it is not passing either :)

DhanshreeA commented 8 months ago

@GemmaTuron, I'm aware of this (https://github.com/ersilia-os/eos9taz/issues/11). For now, I've pushed it manually to unblock you. However, there's a related issue that I'm on top of: #1068, which I opened specifically with this model in mind.

DhanshreeA commented 8 months ago

Root cause: One of the guarantees of GitHub-hosted runners is regular software updates. In practice this frequently means some or all of the following: runner updates, runner provisioner updates, patches to the runner OS, and updates to the software bundled with the OS. Each such update is likely to eat into the available disk space. For example, here's the list of all the tools installed on the runner OS that we do not need for our builds.

So far there is no straightforward solution other than an aggressive disk cleanup, which has been implemented at the level of the eos-template repository. While this works for now, there is no guarantee that the issue will not come up again.
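
For anyone who wants to see what an "aggressive disk cleanup" step might look like, here is a minimal Python sketch. It is not the implementation in eos-template. The directory list (`CANDIDATE_DIRS`) and the helper names `free_gib` and `clean_dirs` are assumptions made for illustration; the listed paths are ones commonly reported as large on Ubuntu hosted runners and should be checked against the actual runner image before anything is deleted.

```python
import shutil
from pathlib import Path

# Hypothetical candidates: directories often reported as large on Ubuntu
# hosted runners. Verify them on your own runner image before deleting.
CANDIDATE_DIRS = [
    "/usr/share/dotnet",
    "/usr/local/lib/android",
    "/opt/ghc",
]


def free_gib(path="/"):
    """Return free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def clean_dirs(dirs, dry_run=True):
    """Remove each existing directory in `dirs` and return those handled.

    With dry_run=True nothing is deleted; the function only reports
    which directories exist and would be removed.
    """
    handled = []
    for d in dirs:
        p = Path(d)
        if not p.is_dir():
            continue  # not present on this image, skip it
        if not dry_run:
            # ignore_errors keeps the cleanup going if one entry fails
            shutil.rmtree(p, ignore_errors=True)
        handled.append(str(p))
    return handled


if __name__ == "__main__":
    print(f"free before: {free_gib():.1f} GiB")
    for d in clean_dirs(CANDIDATE_DIRS, dry_run=True):
        print("would remove:", d)
```

The script defaults to a dry run so it can be tried on a runner without side effects. Passing `dry_run=False` performs the deletion, and calling `free_gib()` again afterwards shows how much space was actually reclaimed before the Docker build and upload stage.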