Lilypad-Tech / lilypad

Run AI workloads easily in a decentralized GPU network. https://www.youtube.com/watch?v=yQnB2Yxia4Y
https://lilypad.tech
Apache License 2.0

fix: Handle accept result and download result errors #416

Closed bgins closed 2 weeks ago

bgins commented 3 weeks ago

Summary

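This pull request handles errors returned when accepting and downloading results, switches the error channel to a buffered channel, and adds a check that skips downloads for results we already have.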

We would like to handle accept result and download result errors so that jobs run from the CLI do not fail silently. Reporting these errors also makes them observable in traces.

A send on the unbuffered error channel blocks until a receiver is ready, and the receiver is not set up before the send. A buffered channel holds the message until the receiver is available to receive it.

Task/Issue reference

Closes: #414

Test plan

Run a few jobs. Check that everything still works.

Running more than one job exercises the download check. Without the check, a second download attempt for the same job would fail with a "file exists" error.

To test the error cases, return a temporary error from the downloadResult and acceptResult functions.

Details (optional)

The new downloads check determines if we have already downloaded the results for a job. If the download path already exists, we skip the download. This check is necessary because our control loop checks for any completed deals and re-downloads the results. The solver may report completed deals for jobs whose results we already have.