bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org

Retrieve execution logs from published results after job is complete #2337

Open rossjones opened 1 year ago

rossjones commented 1 year ago

When a job is complete, the output from that job (stdout, stderr and the actual output data) is available from the place where it was published, but no longer from the compute node. As a user, I want to see the logs for a job that completed earlier today; currently that is problematic, giving me only a summary unless I download the potentially large output from the job.

Alternatively, we could download all of the output from the job and access the stdout and stderr files from there. We could improve this by enabling the download of single files from the output. This still feels like the wrong solution, as it forces the logs to live with the output data, which seems inflexible.

Instead, we should separate long-term storage of the logs from storage of the output data. Currently they live in the output folder and are stored in the same location; instead, we should allow configuring a dedicated storage provider for log files. Initially this could default to IPFS, with the CID retrievable in the same way we retrieve the output CID. It is also worth considering whether we need to keep stdout and stderr separate or whether we can store them in a single file.
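To make the idea concrete, here is a minimal sketch of what a pluggable log store could look like. The names and signatures (`LogPublisher`, `FsLogPublisher`) are illustrative assumptions, not part of Bacalhau's actual codebase:

```go
// Hypothetical sketch of a log-storage provider decoupled from the result publisher.
package logstore

import (
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// LogPublisher stores a job's stdout/stderr independently of its output data
// and returns an opaque reference (e.g. an IPFS CID) that could be surfaced
// alongside the result CID.
type LogPublisher interface {
	Publish(ctx context.Context, executionID string, stdout, stderr io.Reader) (ref string, err error)
}

// FsLogPublisher is a stand-in implementation that writes both streams to a
// single file on local disk; an IPFS-backed implementation would instead add
// the file to IPFS and return its CID.
type FsLogPublisher struct {
	Dir string
}

func (p *FsLogPublisher) Publish(ctx context.Context, executionID string, stdout, stderr io.Reader) (string, error) {
	path := filepath.Join(p.Dir, executionID+".log")
	f, err := os.Create(path)
	if err != nil {
		return "", fmt.Errorf("creating log file: %w", err)
	}
	defer f.Close()

	// Store both streams in one file, which is one answer to the question above
	// of whether stdout and stderr really need to stay separate.
	if _, err := io.Copy(f, stdout); err != nil {
		return "", err
	}
	if _, err := io.Copy(f, stderr); err != nil {
		return "", err
	}
	return path, nil
}
```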

TODO

aronchick commented 9 months ago

still needed?

rossjones commented 9 months ago

Sort of. This was really some thinking around Logs v2, where we move the logs out of the current download and into a place where they are published independently - that way the logs could be pushed to an org's chosen log platform, or just S3, or whatever. Happy to close this in favour of a Logs v2 doc.

wdbaruni commented 6 months ago

Still needed so far. Options we should consider:

  1. Do not delete containers and local resources from the compute node as soon as the job completes; only garbage collect them at a later time (in days). This should allow compute nodes to serve logs even after the job is completed. It wouldn't allow serving months-old logs, but may be enough for most use cases. This feature is needed regardless, as we are seeing race conditions today for quick jobs where the requester sees the job hasn't finished yet, routes log fetching to the compute node, but the compute node then fails the request because the container no longer exists.
  2. Have a dedicated logging store (e.g. Loki or CWL) that is independent from the result store. The expectation is for compute nodes to publish their logs periodically, not only when the job is completed. This should be optional; by default, the logs would only live on the compute nodes until they are garbage collected. The requester node would attempt to fetch logs from the dedicated log store either by default or when they are not found on the compute node.
  3. Use the logs in the published results. This won't work with long-running jobs, and won't work if no publisher is defined.

Overall, I believe option 1 will solve most of the issues for now, and then we can explore option 2 in the future.
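A minimal sketch of how option 1's deferred cleanup might look on the compute node, assuming hypothetical names (`DeferredCleaner`, `MarkCompleted`, `Sweep`) that are not taken from the actual codebase:

```go
// Illustrative sketch of option 1: instead of removing a container as soon as
// its job completes, record the completion time and let a periodic sweeper
// clean up anything older than a retention window.
package gc

import (
	"context"
	"sync"
	"time"
)

type completedExecution struct {
	executionID string
	finishedAt  time.Time
}

type DeferredCleaner struct {
	mu        sync.Mutex
	completed []completedExecution
	retention time.Duration                                       // e.g. 72 * time.Hour
	cleanup   func(ctx context.Context, executionID string) error // removes container + local dirs
}

// MarkCompleted is called when a job finishes; resources are NOT removed yet,
// so log requests can still be served from the compute node.
func (d *DeferredCleaner) MarkCompleted(executionID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.completed = append(d.completed, completedExecution{executionID, time.Now()})
}

// Sweep removes resources only for executions older than the retention window.
func (d *DeferredCleaner) Sweep(ctx context.Context) {
	d.mu.Lock()
	defer d.mu.Unlock()
	cutoff := time.Now().Add(-d.retention)
	kept := d.completed[:0]
	for _, e := range d.completed {
		if e.finishedAt.Before(cutoff) {
			_ = d.cleanup(ctx, e.executionID) // in practice, errors would be logged and retried
			continue
		}
		kept = append(kept, e)
	}
	d.completed = kept
}
```

Because cleanup no longer happens at the moment of completion, a log request that arrives just after a quick job finishes can still be served, which would also remove the race condition described in option 1.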

wdbaruni commented 3 months ago

#4175 can help if we decouple the logic of job completion from cleaning up all of its resources.