bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

[Bug] web ui shows Connection Lost #4624

Closed YuriyGavrilov closed 3 weeks ago

YuriyGavrilov commented 1 month ago

Bug Description

Briefly describe the unexpected behavior or performance regression. What happened that wasn’t supposed to?

Just run bacalhau serve --node-type requester,compute --web-ui

Expected Behavior

Detail what you expected to happen instead of the bug.

Steps to Reproduce

  1. Run bacalhau serve --node-type requester,compute --web-ui
  2. Open a browser
  3. See the "Connection Lost" message
  4. Run a compute node: bacalhau serve --node-type=compute --orchestrators=192.168.0.105
  5. See that there are no nodes in the web UI

Screenshot 2024-10-14 at 21 37 18

But there is one, and a second, on the requester:

Screenshot 2024-10-14 at 21 38 16

Bacalhau Versions - 1.5

Host Environment

Provide details about the environment where the bug occurred:

Job Specification

(If applicable, provide the job spec used when the issue occurred.)

Logs

Agent Logs:

(Include here if applicable.)

Client Logs:

(Include here if applicable.)

There is also a panic when I try to run the Ubuntu hello-world job:

bacalhau docker run ubuntu echo hello --api-host=192.168.0.105

(base) yuriygavrilov@MBP-Yuriy trino % bacalhau serve --node-type=compute --orchestrators=192.168.0.105
Flag --node-type has been deprecated, Use --orchestrator and/or --compute to set the node type.
Flag --orchestrators has been deprecated, Use --config Compute.Orchestrators=<value> to set this configuration
21:39:50.94 | INF cmd/cli/serve/serve.go:102 > Config loaded from: [/Users/yuriygavrilov/.bacalhau/config.yaml], and with data-dir /Users/yuriygavrilov/.bacalhau
21:39:50.942 | INF cmd/cli/serve/serve.go:181 > Starting bacalhau...
21:39:51.502 | INF cmd/cli/serve/serve.go:256 > bacalhau node running [address:0.0.0.0:1234] [capacity:"{CPU: 8.40, Memory: 24 GB, Disk: 319 GB, GPU: 0}"] [compute_enabled:true] [engines:["docker","wasm"]] [name:QmTeDSDo6QCUuZw17qEU9LHMMtNFTWs1vLP46nwe7V5txw] [orchestrator_enabled:false] [orchestrators:["192.168.0.105"]] [publishers:["noop","s3","local"]] [storages:["s3","urldownload","inline"]] [webui_enabled:false]

To connect to this node from the local client, run the following commands in your shell:
export BACALHAU_API_HOST=127.0.0.1
export BACALHAU_API_PORT=1234

A copy of these variables have been written to: /Users/yuriygavrilov/.bacalhau/bacalhau.run
panic: runtime error: index out of range [-1]

goroutine 26 [running]:
github.com/bacalhau-project/bacalhau/pkg/docker.(*Client).SupportedPlatforms(0x0?, {0xcd26278?, 0xc0008255c0?})
    github.com/bacalhau-project/bacalhau/pkg/docker/docker.go:244 +0x250
github.com/bacalhau-project/bacalhau/pkg/executor/docker/bidstrategy/semantic.(*ImagePlatformBidStrategy).ShouldBid(0xc000092a40, {0xcd26278, _}, {{0xc000e06300, 0x2e}, {{0xc00005e930, 0x26}, {0xc00005e960, 0x26}, {0xc000e81ca0, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/executor/docker/bidstrategy/semantic/image_platform.go:52 +0x125
github.com/bacalhau-project/bacalhau/pkg/executor/docker.(*Executor).ShouldBid(0x20?, {0xcd26278, _}, {{0xc000e06300, 0x2e}, {{0xc00005e930, 0x26}, {0xc00005e960, 0x26}, {0xc000e81ca0, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/executor/docker/executor.go:103 +0x88
github.com/bacalhau-project/bacalhau/pkg/executor/util.(*bidStrategyFromExecutor).ShouldBid(0xc0005c44c0?, {0xcd26278, _}, {{0xc000e06300, 0x2e}, {{0xc00005e930, 0x26}, {0xc00005e960, 0x26}, {0xc000e81ca0, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/executor/util/executors_bid_strategy.go:47 +0xc8
github.com/bacalhau-project/bacalhau/pkg/bidstrategy.(*ChainedBidStrategy).ShouldBid(0xc000769410, {0xcd26278, _}, {{0xc000e06300, 0x2e}, {{0xc00005e930, 0x26}, {0xc00005e960, 0x26}, {0xc000e81ca0, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/bidstrategy/chained.go:53 +0x10b
github.com/bacalhau-project/bacalhau/pkg/compute.Bidder.runSemanticBidding({{0xc000e06300, 0x2e}, {0xcd39e28, 0xc0006eda40}, {0xcd02e80, 0xc0003398a8}, {0xcd0f738, 0xc0004bba40}, {0xcd27258, 0xc0002373a0}, ...}, ...)
    github.com/bacalhau-project/bacalhau/pkg/compute/bidder.go:271 +0x1f2
github.com/bacalhau-project/bacalhau/pkg/compute.Bidder.doBidding({{0xc000e06300, 0x2e}, {0xcd39e28, 0xc0006eda40}, {0xcd02e80, 0xc0003398a8}, {0xcd0f738, 0xc0004bba40}, {0xcd27258, 0xc0002373a0}, ...}, ...)
    github.com/bacalhau-project/bacalhau/pkg/compute/bidder.go:229 +0x5d
github.com/bacalhau-project/bacalhau/pkg/compute.Bidder.RunBidding({{0xc000e06300, 0x2e}, {0xcd39e28, 0xc0006eda40}, {0xcd02e80, 0xc0003398a8}, {0xcd0f738, 0xc0004bba40}, {0xcd27258, 0xc0002373a0}, ...}, ...)
    github.com/bacalhau-project/bacalhau/pkg/compute/bidder.go:103 +0xde
created by github.com/bacalhau-project/bacalhau/pkg/compute.BaseEndpoint.AskForBid in goroutine 73
    github.com/bacalhau-project/bacalhau/pkg/compute/endpoint.go:71 +0x505
linear[bot] commented 1 month ago

ENG-276 [Bug] web ui shows Connection Lost

YuriyGavrilov commented 1 month ago

Also, I don't know how to run in Docker mode. I get: Node n-f4ce17b5: does not support docker, only wasm"]

aronchick commented 1 month ago

i'm so sorry, we're on it!

For your second problem, it's likely you don't have docker running on the machine you're running on.

wdbaruni commented 1 month ago

Hi Yuriy,

Issue 1:

I wasn't able to reproduce the issue. From the screenshot, I can see that the webui is having trouble connecting to the orchestrator node. What URL are you using to reach the webui? Is it 0.0.0.0:8438, where the orchestrator is deployed locally, or a remote node?

If you are calling the webui on a remote node (I'd guess the node is 192.168.0.105), see whether this works: bacalhau serve --node-type requester,compute --web-ui --config WebUI.Backend=192.168.0.105:1234

If this works, it is just telling the frontend to connect to the orchestrator at 192.168.0.105:1234 instead of the default endpoint 0.0.0.0:1234 which would only work for local deployments. More documentation is required from our end.
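To check whether that address is reachable from the machine running the browser, one option is to request the liveness path that shows up in the webui logs later in this thread, /api/v1/agent/alive. The following is only a minimal sketch: it assumes the API answers over plain HTTP on port 1234, and the helper names aliveURL and checkAlive are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// aliveURL builds the liveness URL for a backend given as "host:port".
// Assumption: the API is served over plain HTTP (no TLS).
func aliveURL(backend string) string {
	return "http://" + backend + "/api/v1/agent/alive"
}

// checkAlive requests the liveness endpoint and returns the HTTP status code.
func checkAlive(client *http.Client, backend string) (int, error) {
	resp, err := client.Get(aliveURL(backend))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	backend := "192.168.0.105:1234" // address used in this thread
	if len(os.Args) > 1 {
		backend = os.Args[1] // optionally pass another host:port
	}

	client := &http.Client{Timeout: 5 * time.Second}
	status, err := checkAlive(client, backend)
	if err != nil {
		fmt.Println("backend not reachable:", err)
		return
	}
	fmt.Println("GET", aliveURL(backend), "->", status)
}
```

If this reports a connection error from the browser's machine, the webui will not be able to reach the orchestrator at that address either.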

Issue 2:

The panic you are getting seems to originate from the code where we query the Docker daemon's information. It seems we are not handling edge cases well when the returned information is not what we expect. This code hasn't changed in a while, and it seems the bug just wasn't triggered until now. Do you mind providing us with more information about your setup? Mainly the output of docker version.

https://github.com/bacalhau-project/bacalhau/blob/3fd303df4b0de71520a771b53c85b8dbd68eb72d/pkg/docker/docker.go#L233-L241

YuriyGavrilov commented 1 month ago

> i'm so sorry, we're on it!
>
> For your second problem, it's likely you don't have docker running on the machine you're running on.

Thanks for helping 🙏🏻 @aronchick @wdbaruni

@aronchick yep, you were right: I normally use Podman, so I ran Docker on the second node. I checked, but got the same results.

@wdbaruni

  1. run the node with bacalhau serve --node-type=compute --orchestrators=192.168.0.105
  2. run the job bacalhau docker run ubuntu echo hello --api-host=192.168.0.105
  3. receive:
panic: runtime error: index out of range [-1]

goroutine 29 [running]:
github.com/bacalhau-project/bacalhau/pkg/docker.(*Client).SupportedPlatforms(0x205?, {0xaff6278?, 0xc0009df350?})
    github.com/bacalhau-project/bacalhau/pkg/docker/docker.go:244 +0x250
github.com/bacalhau-project/bacalhau/pkg/executor/docker/bidstrategy/semantic.(*ImagePlatformBidStrategy).ShouldBid(0xc000990008, {0xaff6278, _}, {{0xc0000cbe60, 0x2e}, {{0xc000b021e0, 0x26}, {0xc000b02210, 0x26}, {0xc0008a6630, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/executor/docker/bidstrategy/semantic/image_platform.go:52 +0x125
github.com/bacalhau-project/bacalhau/pkg/executor/docker.(*Executor).ShouldBid(0x20?, {0xaff6278, _}, {{0xc0000cbe60, 0x2e}, {{0xc000b021e0, 0x26}, {0xc000b02210, 0x26}, {0xc0008a6630, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/executor/docker/executor.go:103 +0x88
github.com/bacalhau-project/bacalhau/pkg/executor/util.(*bidStrategyFromExecutor).ShouldBid(0xc00057e1c0?, {0xaff6278, _}, {{0xc0000cbe60, 0x2e}, {{0xc000b021e0, 0x26}, {0xc000b02210, 0x26}, {0xc0008a6630, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/executor/util/executors_bid_strategy.go:47 +0xc8
github.com/bacalhau-project/bacalhau/pkg/bidstrategy.(*ChainedBidStrategy).ShouldBid(0xc00088c8a0, {0xaff6278, _}, {{0xc0000cbe60, 0x2e}, {{0xc000b021e0, 0x26}, {0xc000b02210, 0x26}, {0xc0008a6630, ...}, ...}})
    github.com/bacalhau-project/bacalhau/pkg/bidstrategy/chained.go:53 +0x10b
github.com/bacalhau-project/bacalhau/pkg/compute.Bidder.runSemanticBidding({{0xc0000cbe60, 0x2e}, {0xb009e28, 0xc00085e2e0}, {0xafd2e80, 0xc000011290}, {0xafdf738, 0xc0004ad9d0}, {0xaff7258, 0xc000926400}, ...}, ...)
    github.com/bacalhau-project/bacalhau/pkg/compute/bidder.go:271 +0x1f2
github.com/bacalhau-project/bacalhau/pkg/compute.Bidder.doBidding({{0xc0000cbe60, 0x2e}, {0xb009e28, 0xc00085e2e0}, {0xafd2e80, 0xc000011290}, {0xafdf738, 0xc0004ad9d0}, {0xaff7258, 0xc000926400}, ...}, ...)
    github.com/bacalhau-project/bacalhau/pkg/compute/bidder.go:229 +0x5d
github.com/bacalhau-project/bacalhau/pkg/compute.Bidder.RunBidding({{0xc0000cbe60, 0x2e}, {0xb009e28, 0xc00085e2e0}, {0xafd2e80, 0xc000011290}, {0xafdf738, 0xc0004ad9d0}, {0xaff7258, 0xc000926400}, ...}, ...)
    github.com/bacalhau-project/bacalhau/pkg/compute/bidder.go:103 +0xde
created by github.com/bacalhau-project/bacalhau/pkg/compute.BaseEndpoint.AskForBid in goroutine 82
    github.com/bacalhau-project/bacalhau/pkg/compute/endpoint.go:71 +0x505
  4. On the client:

(base) yuriygavrilov@MBP-Yuriy mvn % bacalhau --api-host=192.168.0.105 node list                   
 ID          TYPE     APPROVAL  STATUS     LABELS                                      CPU     MEMORY      DISK         GPU  
 QmTeDSDo    Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=darwin  8.4 /   22.4 GB /   296.8 GB /   0 /  
                                                                                       8.4     22.4 GB     296.8 GB     0    
 n-f4ce17b5  Compute  APPROVED  CONNECTED  Architecture=arm64 Operating-System=linux   4.2 /   2.6 GB /    24.6 GB /    0 /  
                                                                                       4.2     2.6 GB      24.6 GB      0    

(base) yuriygavrilov@MBP-Yuriy mvn % bacalhau docker run ubuntu echo hello --api-host=192.168.0.105
Job successfully submitted. Job ID: j-0c765d7d-dba2-49a3-9328-f800eab9318d
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT         
 18:37:06.391              Submission       Job submitted 
 18:37:06.442  e-147eba88  Scheduling       Requested execution on QmTeDSDo 
 Processing    ..................🐟..
  5. Funny, but now it shows this after running with --config WebUI.Backend=192.168.0.105:1234: Screenshot 2024-10-15 at 21 41 01

on the server side:


ration:1.104249] [path:/192.168.0.105:1234/api/v1/agent/alive] [status:404]
18:40:36.912 | WRN webui/webui.go:117 > File not found [attempted_paths:["build/192.168.0.105:1234/api/v1/agent/alive","build/192.168.0.105:1234/api/v1/agent/alive.html"]] [duration:18.804029] [path:/192.168.0.105:1234/api/v1/agent/alive] [status:404]
18:40:55.493 | WRN webui/webui.go:117 > File not found [attempted_paths:["build/192.168.0.105:1234/api/v1/agent/alive","build/192.168.0.105:1234/api/v1/agent/alive.html"]] [duration:0.808791] [path:/192.168.0.105:1234/api/v1/agent/alive] [status:404]
18:41:00.514 | WRN webui/webui.go:117 > File not found [attempted_paths:["build/192.168.0.105:1234/api/v1/agent/alive","build/192.168.0.105:1234/api/v1/agent/alive.html"]] [duration:1.139832] [path:/192.168.0.105:1234/api/v1/agent/alive] [status:404]
18:41:05.523 | WRN webui/webui.go:117 > File not found [attempted_paths:["build/192.168.0.105:1234/api/v1/agent/alive","build/192.168.0.105:1234/api/v1/agent/alive.html"]] [duration:0.990499] [path:/192.168.0.105:1234/api/v1/agent/alive] [status:404]
18:41:14.543 | WRN webui/webui.go:117 > File not found [attempted_paths:["build/192.168.0.105:1234/api/v1/agent/alive","build/192.168.0.105:1234/api/v1/agent/alive.html"]] [duration:1.255332] [path:/192.168.0.105:1234/api/v1/agent/alive] [status:404]
18:41:23.39 | WRN webui/webui.go:117 > File not found [attempted_paths:["build/192.168.0.105:1234/api/v1/agent/alive","build/192.168.0.105:1234/api/v1/agent/alive.html"]] [duration:0.821625] [path:/192.168.0.105:1234/api/v1/agent/alive] [status:404]
  6. Today I installed the latest version (macOS, Intel):
    
    (base) yuriygavrilov@MBP-Yuriy trino % docker version
    Client:
     Version:           27.2.0
     API version:       1.47
     Go version:        go1.21.13
     Git commit:        3ab4256
     Built:             Tue Aug 27 14:14:45 2024
     OS/Arch:           darwin/amd64
     Context:           desktop-linux

    Server: Docker Desktop 4.34.3 (170107)
     Engine:
      Version:          27.2.0
      API version:      1.47 (minimum version 1.24)
      Go version:       go1.21.13
      Git commit:       3ab5c7d
      Built:            Tue Aug 27 14:15:15 2024
      OS/Arch:          linux/amd64
      Experimental:     false
     containerd:
      Version:          1.7.20
      GitCommit:        8fc6bcff51318944179630522a095cc9dbf9f353
     runc:
      Version:          1.1.13
      GitCommit:        v1.1.13-0-g58aa920
     docker-init:
      Version:          0.19.0
      GitCommit:        de40ad0


The orchestrator was run with: `bacalhau serve --node-type requester,compute --web-ui --config WebUI.Backend=192.168.0.105:1234`. I also tried running with sudo, and with port 8438, for example.

On the Docker node:

Client: Docker Engine - Community
 Version:           24.0.2
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        cb74dfc
 Built:             Thu May 25 21:51:03 2023
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       659604f
  Built:            Thu May 25 21:51:03 2023
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


7. Finally, I tried to run with only one node:

(base) yuriygavrilov@MBP-Yuriy lib % bacalhau --api-host=192.168.0.105 node list
 ID          TYPE     APPROVAL  STATUS        LABELS                                      CPU     MEMORY      DISK         GPU
 QmTeDSDo    Compute  APPROVED  DISCONNECTED  Architecture=amd64 Operating-System=darwin  8.4 /   22.4 GB /   296.8 GB /   0 /
                                                                                          8.4     22.4 GB     296.8 GB     0
 n-f4ce17b5  Compute  APPROVED  CONNECTED     Architecture=arm64 Operating-System=linux   4.2 /   2.6 GB /    24.6 GB /    0 /
                                                                                          4.2     2.6 GB      24.6 GB      0


So I ran this: `bacalhau docker run ubuntu echo hello --api-host=192.168.0.105`

and received:

(base) yuriygavrilov@MBP-Yuriy mvn % bacalhau docker run ubuntu echo hello --api-host=192.168.0.105
Job successfully submitted. Job ID: j-dbcadbd1-515b-4fcf-8cf8-4c2319dd6585
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT
 18:50:12.212              Submission       Job submitted
 18:50:12.283  e-7a14776e  Scheduling       Requested execution on n-f4ce17b5
 18:50:13.661  e-7a14776e  Execution        Running
 18:50:31.253  e-7a14776e  Execution        Completed successfully

To get more details about the run, execute:
    bacalhau job describe j-dbcadbd1-515b-4fcf-8cf8-4c2319dd6585

To get more details about the run executions, execute:
    bacalhau job executions j-dbcadbd1-515b-4fcf-8cf8-4c2319dd6585

But there are no jobs in the web UI:

<img width="625" alt="Screenshot 2024-10-15 at 21 54 06" src="https://github.com/user-attachments/assets/7cf83f51-2932-4c9a-8e11-9f020871f437">

OK, and finally:

bacalhau job describe j-dbcadbd1-515b-4fcf-8cf8-4c2319dd6585 --api-host=192.168.0.105
ID            = j-dbcadbd1-515b-4fcf-8cf8-4c2319dd6585
Name          = j-dbcadbd1-515b-4fcf-8cf8-4c2319dd6585
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-10-15 18:50:12
Modified Time = 2024-10-15 18:50:31
Version       = 0

Summary
Completed = 1

Job History
 TIME                 TOPIC         EVENT
 2024-10-15 18:50:12  Submission    Job submitted
 2024-10-15 18:50:13  State Update  Running
 2024-10-15 18:50:31  State Update  Completed

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED    MODIFIED   COMMENT
 e-7a14776e  n-f4ce17b5  Completed  Stopped  6     5m10s ago  4m51s ago

Execution e-7a14776e History
 TIME                 TOPIC       EVENT
 2024-10-15 18:50:12  Scheduling  Requested execution on n-f4ce17b5
 2024-10-15 18:50:13  Execution   Running
 2024-10-15 18:50:31  Execution   Completed successfully

Standard Output
hello

wdbaruni commented 1 month ago

@YuriyGavrilov sorry this took so long to resolve. A fix, #4645, for the WebUI request routing has just landed and will be released with v1.5.1 in the next couple of days.

YuriyGavrilov commented 1 month ago

@wdbaruni thank you 🙏 happy to know it

wdbaruni commented 3 weeks ago

v1.5.1 is released with the fix. Thank you for your patience. Please feel free to reopen the issue if you are still facing problems.