bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

The UI for failed jobs needs a lot of work #4461

Closed by aronchick 3 minutes ago

aronchick commented 5 days ago
❯ bacalhau-1.5.0 docker run non_existent_image
Job successfully submitted. Job ID: j-713a0da5-edd7-475e-9ad0-d271f334619c
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME                 EXEC. ID    TOPIC            EVENT
 Sep 18 12:18:22.931  e-8acfdd37  Requesting Node  Could not inspect image "non_existent_image" - could be due
                                                   to repo/image not existing, or registry needing authorizatio
                                                   n: Error response from daemon: errors:
                                                   denied: requested access to the resource is denied
                                                   unauthorized: authentication required
                                                   * Hint: If the image is private, supply the node with valid
                                                   Docker login credentials using the DOCKER_USERNAME and DOCKE
                                                   R_PASSWORD environment variables
 Sep 18 12:18:24.029  e-59b92e6c  Requesting Node  Could not inspect image "non_existent_image" - could be due
                                                   to repo/image not existing, or registry needing authorizatio
                                                   n: Error response from daemon: errors:
                                                   denied: requested access to the resource is denied
                                                   unauthorized: authentication required
                                                   * Hint: If the image is private, supply the node with valid
                                                   Docker login credentials using the DOCKER_USERNAME and DOCKE
                                                   R_PASSWORD environment variables
 Sep 18 12:18:25.318  e-cf7e80c2  Requesting Node  Could not inspect image "non_existent_image" - could be due
                                                   to repo/image not existing, or registry needing authorizatio
                                                   n: Error response from daemon: errors:
                                                   denied: requested access to the resource is denied
                                                   unauthorized: authentication required
                                                   * Hint: If the image is private, supply the node with valid
                                                   Docker login credentials using the DOCKER_USERNAME and DOCKE
                                                   R_PASSWORD environment variables
 Sep 18 12:18:26.561  e-f1d22717  Requesting Node  Could not inspect image "non_existent_image" - could be due
                                                   to repo/image not existing, or registry needing authorizatio
                                                   n: Error response from daemon: errors:
                                                   denied: requested access to the resource is denied
                                                   unauthorized: authentication required
                                                   * Hint: If the image is private, supply the node with valid
                                                   Docker login credentials using the DOCKER_USERNAME and DOCKE
                                                   R_PASSWORD environment variables
 Sep 18 12:18:26.574              Scheduling       not enough nodes to run job. requested: 1, available: 2, sui
                                                   table: 0.
                                                   • 2 of 2 nodes: job already executed on this node more than
                                                   once
Error: job failed

Job Results By Node:

• 4 runs on 2 nodes: Could not inspect image "non_existent_image" - could be due to repo/image not existing, or registry needing authorization: Error response from
                       daemon: errors:
                       denied: requested access to the resource is denied
                       unauthorized: authentication required

To get more details about the run, execute:
    /Users/daaronch/code/bacalhau-versions/v1.5.0/bacalhau job describe j-713a0da5-edd7-475e-9ad0-d271f334619c

To get more details about the run executions, execute:
    /Users/daaronch/code/bacalhau-versions/v1.5.0/bacalhau job executions j-713a0da5-edd7-475e-9ad0-d271f334619c

Issues:

wdbaruni commented 16 hours ago

PR #4488 mitigates most of the concerns raised:

Prints in red (why?)

Errors are no longer printed in red. Only the Error prefix is red.

Doesn't wrap properly

Fixed

The animated fish are still not showing

Brought back the animated fish, though not exactly how it used to be. Your feedback is needed here.

It's passing through raw Go errors and not explaining what's going on

That comes down to the errors reported by the servers. We have done some work to improve the returned errors and add hints, but there is clearly still room for improvement here.

It says "4 runs on 2 nodes" - i don't understand this? It's one job submission, one run.

That is no longer printed by default. There is already a --node-details flag that prints a summary of executions across all nodes, and we used to print that summary whenever there were failures, even if --node-details wasn't set. That was confusing, and it is no longer needed because the progress tracker now prints more detailed information anyway.

It's not a "not enough nodes to run a job" error, it's a bad image. Saying it's not enough nodes to run a job is backwards.

I agree. This is a bigger problem to solve: what should the job-level error be when there are multiple executions and each failed for a different reason? Right now the scheduler keeps retrying until it cannot find any more nodes to retry on, and then prints that error message, which is not great. Hopefully this will improve with https://github.com/bacalhau-project/bacalhau/issues/4015. In the meantime, as a workaround, I filter out that message on the client side and print only execution-level errors.
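For anyone following along, here is a rough sketch of what that kind of client-side filtering could look like. It is not the code from the PR: the `Event` struct and the `executionErrors` helper are hypothetical names made up for illustration, and the sketch assumes that job-level (scheduler) events can be recognized by an empty execution ID, as in the output above where the Scheduling row has no EXEC. ID. Collapsing identical messages is an extra choice in this sketch, not something stated in the thread.

```go
package main

import "fmt"

// Event is a hypothetical, simplified stand-in for one row of the
// progress output. ExecutionID is empty for job-level (scheduler) events.
type Event struct {
	ExecutionID string
	Topic       string
	Message     string
}

// executionErrors keeps only execution-level messages and drops job-level
// ones such as the scheduler's "not enough nodes to run job" message.
// Identical messages are collapsed so that retries of the same failure
// are reported once (an assumption of this sketch).
func executionErrors(events []Event) []string {
	seen := make(map[string]bool)
	var out []string
	for _, e := range events {
		if e.ExecutionID == "" {
			continue // job-level event, not tied to a single execution
		}
		if seen[e.Message] {
			continue // same failure already reported
		}
		seen[e.Message] = true
		out = append(out, e.Message)
	}
	return out
}

func main() {
	events := []Event{
		{ExecutionID: "e-8acfdd37", Topic: "Requesting Node", Message: `Could not inspect image "non_existent_image"`},
		{ExecutionID: "e-59b92e6c", Topic: "Requesting Node", Message: `Could not inspect image "non_existent_image"`},
		{ExecutionID: "", Topic: "Scheduling", Message: "not enough nodes to run job"},
	}
	for _, msg := range executionErrors(events) {
		fmt.Println("Error:", msg)
	}
}
```

With the sample events above, only the image error is printed, once, and the scheduling message is dropped.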