bioimage-io / collection

Maintains the resources displayed on bioimage.io (Successor to collection-bioimage-io)
https://bioimage-io.github.io/collection/

CI stopped due to error "System.IO.IOException: No space left on device" #93

Open mese79 opened 2 weeks ago

mese79 commented 2 weeks ago

I uploaded a model, but the CI stopped at the pytorch_state_dict test with this error:

System.IO.IOException: No space left on device : '/home/runner/runners/2.319.1/_diag/Worker_20240829-130542-utc.log'

You can see the CI here: https://github.com/bioimage-io/collection/actions/runs/10613914843

I'm not sure if the test server capacity became full or if there is something wrong with the model. The local tests are successful.

FynnBe commented 2 weeks ago

It seems the disk limit for GH Actions workflows is 14 GB in our case: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources The uploaded weights file model_weights.pth for plucky-goat/draft is only 2 KB, so the upload itself seems to have failed somehow?

how big is the model you tried to upload?
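A quick way to answer the size question locally is to list the uncompressed size of every file in the model package zip and flag weight files that look truncated. This is only a minimal sketch using the Python standard library: the function names, the `.pth`/`.pt` suffixes, and the 10 kB threshold are illustrative assumptions, not part of the bioimage.io tooling.

```python
import io
import zipfile


def list_package_sizes(package):
    """Return {member name: uncompressed size in bytes} for a zip archive.

    `package` may be a path or a file-like object.
    """
    with zipfile.ZipFile(package) as zf:
        return {
            info.filename: info.file_size
            for info in zf.infolist()
            if not info.is_dir()
        }


def suspiciously_small(sizes, suffixes=(".pth", ".pt"), min_bytes=10_000):
    """Names of weight-like files below a (hypothetical) size threshold."""
    return sorted(
        name
        for name, size in sizes.items()
        if name.endswith(suffixes) and size < min_bytes
    )


if __name__ == "__main__":
    # Build a tiny in-memory package for demonstration only.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("rdf.yaml", "name: demo\n")
        zf.writestr("model_weights.pth", b"\0" * 2048)
    buf.seek(0)
    sizes = list_package_sizes(buf)
    for name, size in sizes.items():
        print(f"{size:>10d}  {name}")
    print("possibly truncated:", suspiciously_small(sizes))
```

A 2 KB weights file like the one above would be flagged, which matches the observation that the uploaded model_weights.pth looked too small to be a real state dict.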

FynnBe commented 2 weeks ago

@oeway should we maybe warn uploaders of that size limit? We should test how big the limit actually is in practice. But as it seems to be below 14 GB, we might need to think about ways to raise that limit.
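To measure the limit in practice, a workflow step could print the free disk space before and after the heavy steps. The following is a rough sketch with the standard library only; `free_gib`, `has_room`, and the safety factor of 1.5 are hypothetical names and values chosen for illustration.

```python
import shutil


def free_gib(path="."):
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_room(required_bytes, path=".", safety_factor=1.5):
    """True if free space exceeds the requirement times a safety margin.

    The margin is an assumption: conda environments and caches
    typically need more space than the model files alone.
    """
    return shutil.disk_usage(path).free >= required_bytes * safety_factor


if __name__ == "__main__":
    print(f"free disk space: {free_gib():.1f} GiB")
    # Example: would a 2 GiB model package plus margin fit?
    print("room for 2 GiB package:", has_room(2 * 2**30))
```

Logging this value at the start of each job would show how much of the nominal 14 GB is actually available to the tests.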

mese79 commented 2 weeks ago

Do you think it would be a good idea to have a test server, run all the tests for newly uploaded models there via backend scripts, and then push the model to the BMZ server, rather than using GitHub CI?
That way, we could put the uploader on the test server as well.

FynnBe commented 2 weeks ago

That would be one of the alternatives, along with paying for a higher tier of GH Actions or using another "free tier" provider.

We did consider this server approach in the past. It does incur an additional maintenance burden, so it would be nice to get away with using GH Actions... Not sure what other ways there are to circumvent the disk space issue...

oeway commented 2 weeks ago

@FynnBe @mese79 Could you explain to me what exactly the issue is? In the previous message you said it's a small weight file, and it was working before even with big weight files, so what has changed?

We are setting up a cluster at KTH which can hopefully be used for running tests. @FynnBe I am not following the development of the core library. Is it possible that I spin up a docker container and just use the bioimage.io core library to run bioimageio test-model <MODEL>? Can it run all the models? Do we need several conda envs? Maybe it's a bit tricky to build conda envs on the fly though.

mese79 commented 2 weeks ago

Sorry @FynnBe I missed your first comment.
So I needed to upload a Cellpose model, and as a workaround I made a wrapper model whose forward method calls the Cellpose API, which downloads the Cellpose weights. But even those weights are small (maybe ~50 MB).

FynnBe commented 2 weeks ago

is it possible that I spin up a docker container, and just run the bioimage.io core library to run bioimageio test-model <MODEL>? Can it run all the models? We need several conda env?

Given the right environment, core can run any model (except those with only tensorflow_js weights; I'm not aware of any that use it, though). So yes, the right dependencies are needed. The code to create conda environments from the spec (and the default version choices) is implemented in the conda_env submodule of the collection backoffice.

I don't know what went wrong with @mese79's initial uploads. Maybe it was just a fluke, some worker availability issue on GH's side? The logged error only hints at a full disk:

[call / test (pytorch_state_dict)](https://github.com/bioimage-io/collection/actions/runs/10613914843/job/29424193673)
System.IO.IOException: No space left on device : '/home/runner/runners/2.319.1/_diag/Worker_20240829-130542-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/home/runner/runners/2.319.1/_diag/Worker_20240829-130542-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/runners/2.319.1/_diag/Worker_20240829-130542-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)

... inside its forward method It calls the Cellpose API which downloads the Cellpose weights...

The weights should be packaged with the model and referenced under weights.pytorch_state_dict.source. The forward method should never download anything: that would make using the model offline impossible and would also circumvent bioimageio's caching. @mese79 maybe try to change the model and just give it another go. If it fails, you could share the model package zip file with us and we can debug the issue further.
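One way to check locally that a forward pass never touches the network is to run it while socket creation is blocked. The sketch below is an assumption-laden illustration, not part of bioimageio: `no_network` and `NetworkAccessError` are hypothetical names, and the guard only catches code that looks up `socket.socket` at call time (which includes `socket.create_connection`, used by the standard library's HTTP clients). Libraries that kept their own reference to the original class before the guard was entered are not blocked.

```python
import contextlib
import socket


class NetworkAccessError(RuntimeError):
    """Raised when code tries to open a socket inside the guard."""


@contextlib.contextmanager
def no_network():
    """Temporarily replace socket.socket so any new connection fails."""
    original = socket.socket

    def guarded(*args, **kwargs):
        raise NetworkAccessError(
            "network access attempted while the offline guard was active"
        )

    socket.socket = guarded
    try:
        yield
    finally:
        # Always restore the real class, even if the body raised.
        socket.socket = original


if __name__ == "__main__":
    def forward(x):
        # Stand-in for a model's forward pass that works offline.
        return [2 * v for v in x]

    with no_network():
        print(forward([1, 2, 3]))
```

Wrapping the wrapper model's forward call in `with no_network():` would make a hidden weight download fail immediately instead of silently succeeding on a machine with internet access.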