bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Job stuck in "Running" State? #4197

Open aronchick opened 2 months ago

aronchick commented 2 months ago

I don't know how I got here.

Job spec:

Constraints: []
Labels: {}
Count: 1
Namespace: default
Priority: 0
Tasks:
  - Engine:
      Params:
        Image: docker.io/bacalhauproject/stress-ng:0.0.2
        Parameters:
          - "--cpu"
          - "2"
          - "--timeout"
          - "60"
        WorkingDirectory: ""
      Type: docker
    Name: main
    Network:
      Type: None
    Publisher:
      Type: ""
    Resources: {}
    Timeouts:
      ExecutionTimeout: 600
      QueueTimeout: 600
Type: batch

❯ bacalhau job describe j-6a7ed0b8
ID            = j-6a7ed0b8-48c3-453d-bdfc-2d7362611488
Name          = j-6a7ed0b8-48c3-453d-bdfc-2d7362611488
Namespace     = default
Type          = batch
State         = Running
Count         = 1
Created Time  = 2024-07-02 17:29:30
Modified Time = 2024-07-02 17:29:44
Version       = 0

Summary
Completed = 1

Job History
 TIME                 REV.  STATE    TOPIC       EVENT
 2024-07-02 17:29:30  1     Pending  Submission  Job submitted
 2024-07-02 17:29:44  2     Running

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED  MODIFIED   COMMENT
 e-c720091f  n-589109c0  Completed  Stopped  6     7h ago   6h22m ago  Accepted job

Execution e-c720091f History
 TIME                 REV.  STATE              TOPIC            EVENT
 2024-07-02 17:29:30  1     New
 2024-07-02 17:29:31  2     AskForBid
 2024-07-02 17:29:31  3     AskForBidAccepted  Requesting Node  Accepted job
 2024-07-02 17:29:44  4     AskForBidAccepted
 2024-07-02 17:29:44  5     BidAccepted
 2024-07-02 18:08:10  6     Completed

Standard Output
stress-ng: info:  [1] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1] dispatching hogs: 2 cpu
stress-ng: info:  [1] skipped: 0
stress-ng: info:  [1] passed: 2: cpu (2)
stress-ng: info:  [1] failed: 0
stress-ng: info:  [1] metrics untrustworthy: 0
stress-ng: info:  [1] successful run completed in 1 min, 0.01 secs

❯ bacalhau node describe n-589109c0
Connection: CONNECTED
Info:
  BacalhauVersion:
    BuildDate: "0001-01-01T00:00:00Z"
    GOARCH: ""
    GOOS: ""
    GitCommit: ""
    GitVersion: ""
  ComputeNodeInfo:
    AvailableCapacity:
      CPU: 0.8
      Disk: 22845820108
      Memory: 2851392716
    EnqueuedExecutions: 0
    ExecutionEngines:
    - docker
    - wasm
    MaxCapacity:
      CPU: 0.8
      Disk: 22845820108
      Memory: 2851392716
    MaxJobRequirements:
      CPU: 0.8
      Disk: 22845820108
      Memory: 2851392716
    Publishers:
    - local
    - noop
    QueueCapacity: {}
    RunningExecutions: 0
    StorageSources:
    - inline
    - urldownload
  Labels:
    Architecture: amd64
    Operating-System: linux
    count: "5"
  NodeID: n-589109c0-824e-4f50-8604-3c6a2839b434
  NodeType: Compute
Membership: APPROVED

The job state is "Running", but the execution history says "Completed" and the container has finished.
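To make the mismatch concrete, here is a minimal sketch of the consistency check that the output above fails. It is not part of Bacalhau: `JobSnapshot` and `looks_stuck` are hypothetical names, and the values have to be copied by hand from `bacalhau job describe` (the job's State and Count, plus the STATE column of the Executions table). It assumes a batch job should leave "Running" once `Count` executions have reached "Completed".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobSnapshot:
    """Hand-copied fields from `bacalhau job describe` (hypothetical helper type)."""
    job_state: str               # the "State" field of the job
    count: int                   # the "Count" field of the job
    execution_states: List[str]  # STATE column of the Executions table


def looks_stuck(snapshot: JobSnapshot) -> bool:
    """Return True if the job is still Running although enough executions completed.

    Assumption: a batch job is done once `count` executions are Completed.
    """
    completed = sum(1 for state in snapshot.execution_states if state == "Completed")
    return snapshot.job_state == "Running" and completed >= snapshot.count


if __name__ == "__main__":
    # Values taken from the describe output in this report.
    reported = JobSnapshot(job_state="Running", count=1, execution_states=["Completed"])
    print(looks_stuck(reported))
```

For the job in this report the check returns `True`: the only requested execution is Completed, yet the job itself never left Running.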

wdbaruni commented 1 month ago

I tried running the job, and the job state was marked as Completed for me. This is an example where the Reliable Orchestrator epic would help: today there can be a disconnect between the different components in the network, which can result in this kind of out-of-sync, orphaned state. That work is currently in progress.

→ bacalhau job describe j-fcc9711f-19fb-48a8-ad7e-0638a6f6041a
ID            = j-fcc9711f-19fb-48a8-ad7e-0638a6f6041a
Name          = j-fcc9711f-19fb-48a8-ad7e-0638a6f6041a
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-08-11 13:56:06
Modified Time = 2024-08-11 13:57:11
Version       = 0

Summary
Completed = 1

Job History
 TIME                 TOPIC       EVENT
 2024-08-11 13:56:06  Submission  Job submitted
 2024-08-11 13:56:08
 2024-08-11 13:57:11

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED    MODIFIED  COMMENT
 e-032daba7  n-e002001e  Completed  Stopped  6     1m48s ago  43s ago   Accepted job

Execution e-032daba7 History
 TIME                 TOPIC            EVENT
 2024-08-11 13:56:06
 2024-08-11 13:56:06
 2024-08-11 13:56:08  Requesting Node  Accepted job
 2024-08-11 13:56:08
 2024-08-11 13:56:08
 2024-08-11 13:57:11

Standard Output
stress-ng: info:  [1] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1] dispatching hogs: 2 cpu
stress-ng: info:  [1] skipped: 0
stress-ng: info:  [1] passed: 2: cpu (2)
stress-ng: info:  [1] failed: 0
stress-ng: info:  [1] metrics untrustworthy: 0
stress-ng: info:  [1] successful run completed in 1 min, 0.00 secs