bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
Apache License 2.0

Confusing error message when data not found on IPFS #2847

Open simonwo opened 11 months ago

simonwo commented 11 months ago

I am running a bacalhau job and it always ends in an error state. Running bacalhau describe 3e57f703-8877-4b87-a82f-71bc16c16031 always shows nodes giving this error:

State: Failed
Status: 'error calculating resource requirements for job: error getting job disk space requirements: Post "":

What is this about?

We could provide a better error message here about content not being found on IPFS.
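As a rough illustration of what a clearer message could look like, here is a minimal Go sketch. It assumes the underlying failure wraps context.DeadlineExceeded, as the later output in this thread suggests. The names describeFetchError and sampleTimeoutError are hypothetical and are not part of the Bacalhau codebase.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// describeFetchError is a hypothetical helper, not a Bacalhau API.
// It turns a low-level fetch failure into a message that names the CID
// and hints at the likely cause when the failure was a timeout.
func describeFetchError(cid string, timeout time.Duration, err error) string {
	if errors.Is(err, context.DeadlineExceeded) {
		return fmt.Sprintf(
			"content %q could not be retrieved from IPFS within %s; "+
				"check that the CID is correct and that a reachable node is providing it",
			cid, timeout)
	}
	// Any other failure keeps its original text, but still names the CID.
	return fmt.Sprintf("retrieving content %q from IPFS failed: %v", cid, err)
}

// sampleTimeoutError builds an error shaped like the one in the report:
// a deadline error wrapped in several layers of context. The empty URL
// mirrors the empty Post "" seen in the reported message.
func sampleTimeoutError() error {
	inner := fmt.Errorf("Post %q: %w", "", context.DeadlineExceeded)
	return fmt.Errorf("error getting job disk space requirements: %w", inner)
}

func main() {
	cid := "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
	fmt.Println(describeFetchError(cid, 2*time.Minute, sampleTimeoutError()))
}
```

The point of the sketch is that errors.Is can see through the wrapping layers, so the user-facing text can be chosen from the root cause instead of concatenating every layer.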

wdbaruni commented 2 months ago

This is how the errors look on v1.4.0:

→ bacalhau docker run -i ipfs://QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD:/input ubuntu cat /input
Job successfully submitted. Job ID: j-ab71100d-2497-43d0-b303-8db9c99657f1
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

    Communicating with the network  ................  done ✅  0.2s
       Creating job for submission  ................  err  ❌  16m01.9s

Error: calculating resource usage of job: error getting job disk space requirements: IPFS storage provider was unable to retrieve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD" before timeout 2m0s: Post
       "": context deadline exceeded

Job Results By Node:

• 8 runs on 4 nodes: calculating resource usage of job: error getting job disk space requirements: IPFS storage provider was unable to retrieve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD" before timeout 2m0s: Post
                       "": context deadline exceeded

To get more details about the run, execute:
    bacalhau job describe j-ab71100d-2497-43d0-b303-8db9c99657f1

To get more details about the run executions, execute:
    bacalhau job executions j-ab71100d-2497-43d0-b303-8db9c99657f1
→ bacalhau job describe j-ab71100d-2497-43d0-b303-8db9c99657f1
ID            = j-ab71100d-2497-43d0-b303-8db9c99657f1
Name          = j-ab71100d-2497-43d0-b303-8db9c99657f1
Namespace     = default
Type          = batch
State         = Failed
Message       = not enough nodes to run job. requested: 1, available: 4, suitable: 0.
• 4 of 4 nodes: job already executed on this node more than once
Count         = 1
Created Time  = 2024-07-01 07:57:29
Modified Time = 2024-07-01 08:13:31
Version       = 0

Failed = 8

Job History
 TIME                 REV.  STATE    TOPIC       EVENT
 2024-07-01 07:57:29  1     Pending  Submission  Job submitted
 2024-07-01 08:13:31  2     Failed   Scheduling  not enough nodes to run job. requested: 1, available: 4, sui
                                                 table: 0.
                                                 • 4 of 4 nodes: job already executed on this node more than

 e-cbe986da  n-7ea9ef64  Failed  Stopped  3     59m53s ago  57m53s ago  calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded
 e-a60d261b  n-b75224b7  Failed  Stopped  3     1h1m ago    59m53s ago  calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded
 e-834854aa  n-f50db1f9  Failed  Stopped  3     1h3m ago    1h1m ago    calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded
 e-f19c166d  n-d42422fd  Failed  Stopped  3     1h5m ago    1h3m ago    calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded
 e-e9b7c800  n-7ea9ef64  Failed  Stopped  3     1h7m ago    1h5m ago    calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded
 e-458a5bf6  n-b75224b7  Failed  Stopped  3     1h9m ago    1h7m ago    calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded
 e-7c579992  n-d42422fd  Failed  Stopped  3     1h11m ago   1h9m ago    calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded
 e-85a299dd  n-f50db1f9  Failed  Stopped  3     1h13m ago   1h11m ago   calculating resource usage of job: error
                                                                         getting job disk space requirements: IP
                                                                        FS storage provider was unable to retrie
                                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72v
                                                                        edxjQkddTCYkzzLbmD" before timeout 2m0s:
                                                                         Post "
                                                                        KLwHCnL72vedxjQkddTCYkzzLbmD": context d
                                                                        eadline exceeded

Execution e-cbe986da History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 08:11:31  1     New
 2024-07-01 08:11:31  2     AskForBid
 2024-07-01 08:13:31  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

Execution e-a60d261b History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 08:09:31  1     New
 2024-07-01 08:09:31  2     AskForBid
 2024-07-01 08:11:31  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

Execution e-834854aa History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 08:07:31  1     New
 2024-07-01 08:07:31  2     AskForBid
 2024-07-01 08:09:31  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

Execution e-f19c166d History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 08:05:30  1     New
 2024-07-01 08:05:30  2     AskForBid
 2024-07-01 08:07:31  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

Execution e-e9b7c800 History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 08:03:30  1     New
 2024-07-01 08:03:30  2     AskForBid
 2024-07-01 08:05:30  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

Execution e-458a5bf6 History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 08:01:30  1     New
 2024-07-01 08:01:30  2     AskForBid
 2024-07-01 08:03:30  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

Execution e-7c579992 History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 07:59:30  1     New
 2024-07-01 07:59:30  2     AskForBid
 2024-07-01 08:01:30  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

Execution e-85a299dd History
 TIME                 REV.  STATE      TOPIC            EVENT
 2024-07-01 07:57:29  1     New
 2024-07-01 07:57:29  2     AskForBid
 2024-07-01 07:59:30  3     Failed     Requesting Node  calculating resource usage of job: error getting job disk sp
                                                        ace requirements: IPFS storage provider was unable to retrie
                                                        ve content "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD"
                                                        before timeout 2m0s: Post "
                                                        YkzzLbmD": context deadline exceeded

MichaelHoepler commented 2 months ago

That is super convoluted. The problem is that the text adds less and less information the further down you go. We can do two things from here. First, we can rewrite the errors to cut off after a certain point and just return 'IPFS storage provider was unable to retrieve content. Execution failed: x/x'. Second, if we consider the full information really crucial, we can think about chunking it up and providing some other way of getting the error logs (e.g. a download link for the logs, or a dedicated extra command for retrieving all the details, like bacalhau job debug ...). What are your thoughts @aronchick @frrist?
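The truncation idea above could be sketched roughly as follows in Go. This is only an illustration under assumptions: shorten and summarize are hypothetical names, the cut-off length is arbitrary, and nothing here reflects how Bacalhau actually formats its output.

```go
package main

import "fmt"

// shorten trims a message to at most max runes and marks the cut with "...".
func shorten(msg string, max int) string {
	r := []rune(msg)
	if len(r) <= max {
		return msg
	}
	return string(r[:max]) + "..."
}

// summarize collapses repeated execution errors into one line per distinct
// (shortened) message, followed by how many executions reported it.
func summarize(errs []string, max int) []string {
	counts := make(map[string]int)
	var order []string // first-seen order keeps the output deterministic
	for _, e := range errs {
		key := shorten(e, max)
		if counts[key] == 0 {
			order = append(order, key)
		}
		counts[key]++
	}
	lines := make([]string, 0, len(order))
	for _, key := range order {
		lines = append(lines,
			fmt.Sprintf("%s. Execution failed: %d/%d", key, counts[key], len(errs)))
	}
	return lines
}

func main() {
	// The same message repeated for each of the 8 failed executions.
	msg := `IPFS storage provider was unable to retrieve content ` +
		`"QmXoypizjW3WknFiJnKLwHCnL72vedxjQkddTCYkzzLbmD" before timeout 2m0s: ` +
		`context deadline exceeded`
	errs := make([]string, 8)
	for i := range errs {
		errs[i] = msg
	}
	for _, line := range summarize(errs, 60) {
		fmt.Println(line)
	}
}
```

Grouping before printing means eight identical failures collapse to a single line with a count, while any execution that failed differently would still get its own line.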

wdbaruni commented 2 months ago

@MichaelHoepler I was using an outdated client. I have updated the reported error output using v1.4.0. The errors are slightly better, but there is still plenty of room for improvement. One improvement is controlling the number of retries, which is tracked by #4015. Another is improving the reported error along with a better hint, which is tracked by #3791. Both are planned for v1.5.0.