bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

user_id.pem is empty in prod network #4562

Open wdbaruni opened 2 days ago

wdbaruni commented 2 days ago

Deployment failed in the prod network. For some reason user_id.pem is empty, which causes a nil pointer panic on startup:

walid_expanso_io@bacalhau-vm-prod-0:~$ ls -la /data/user_id.pem
-rw------- 1 root root 0 Sep 30 16:10 /data/user_id.pem
walid_expanso_io@bacalhau-vm-prod-0:~$
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.118 | INF cmd/cli/serve/serve.go:102 > Config loaded from: [/data/config.yaml], and with data-dir /data
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.118 | INF cmd/cli/serve/serve.go:102 > Config loaded from: [/data/config.yaml], and with data-dir /data
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.119 | INF pkg/repo/migration.go:46 > Migrating repo to latest version...
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.136 | INF pkg/repo/migration.go:55 > Migration successful
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.139 | DBG pkg/compute/store/boltdb/store.go:83 > creating new bbolt database at /data/compute/state_boltdb.db [NodeID:QmbxGSsM]
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.161 | DBG pkg/compute/capacity/system/provider.go:70 > Cannot inspect Nvidia GPUs so they will not be used: tool "nvidia-smi" is not installed or not on PATH: exec: "nvidia-smi": executable file not found in $PATH [NodeID:QmbxGSsM]
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.161 | DBG pkg/compute/capacity/system/provider.go:70 > Cannot inspect AMD GPUs so they will not be used: tool "rocm-smi" is not installed or not on PATH: exec: "rocm-smi": executable file not found in $PATH [NodeID:QmbxGSsM]
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.161 | DBG pkg/compute/capacity/system/provider.go:70 > Cannot inspect Intel GPUs so they will not be used: tool "xpu-smi" is not installed or not on PATH: exec: "xpu-smi": executable file not found in $PATH [NodeID:QmbxGSsM]
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.169 | DBG pkg/compute/capacity/system/provider.go:70 > Cannot inspect Nvidia GPUs so they will not be used: tool "nvidia-smi" is not installed or not on PATH: exec: "nvidia-smi": executable file not found in $PATH
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.169 | DBG pkg/compute/capacity/system/provider.go:70 > Cannot inspect AMD GPUs so they will not be used: tool "rocm-smi" is not installed or not on PATH: exec: "rocm-smi": executable file not found in $PATH
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.169 | DBG pkg/compute/capacity/system/provider.go:70 > Cannot inspect Intel GPUs so they will not be used: tool "xpu-smi" is not installed or not on PATH: exec: "xpu-smi": executable file not found in $PATH
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.169 | DBG pkg/node/config_compute.go:221 > Compute config: {TotalResourceLimits:{CPU:1.4 Memory:5827434905 Disk:1451835638579 GPU:0 GPUs:[]} JobResourceLimits:{CPU:1.4 Memory:5827434905 Disk:1451835638579 GPU:0 GPUs:[]} DefaultJobResourceLimits:{CPU:1.4 Memory:5827434905 Disk:1451835638579 GPU:0 GPUs:[]} IgnorePhysicalResourceLimits:false JobNegotiationTimeout:3m0s MinJobExecutionTimeout:500ms MaxJobExecutionTimeout:2562047h47m16s DefaultJobExecutionTimeout:2562047h47m16s JobSelectionPolicy:{Locality:Anywhere RejectStatelessJobs:false AcceptNetworkedJobs:true ProbeHTTP: ProbeExec:/terraform_node/apply-http-allowlist.sh} LogRunningExecutionsInterval:10s LogStreamBufferSize:0 FailureInjectionConfig:{IsBadActor:false} BidSemanticStrategy:<nil> BidResourceStrategy:<nil> ExecutionStore:0xc00072c020 LocalPublisher:{Address:35.245.161.250 Port:6001 Directory:/data/compute/executions/bacalhau-local-publisher} ControlPlaneSettings:{InfoUpdateFrequency:1m0s ResourceUpdateFrequency:30s HeartbeatFrequency:15s HeartbeatTopic:}}
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.174 | DBG pkg/node/config_requester.go:90 > Requester config: {JobDefaults:{Batch:{Priority:0 Task:{Resources:{CPU:500m Memory:1Gb Disk: GPU:} Publisher:{Config:{Type:local Params:map[]}} Timeouts:{TotalTimeout:0s ExecutionTimeout:0s}}} Ops:{Priority:0 Task:{Resources:{CPU:500m Memory:1Gb Disk: GPU:} Publisher:{Config:{Type:local Params:map[]}} Timeouts:{TotalTimeout:0s ExecutionTimeout:0s}}} Daemon:{Priority:0 Task:{Resources:{CPU:500m Memory:1Gb Disk: GPU:}}} Service:{Priority:0 Task:{Resources:{CPU:500m Memory:1Gb Disk: GPU:}}}} HousekeepingBackgroundTaskInterval:30s HousekeepingTimeoutBuffer:2m0s NodeRankRandomnessRange:5 OverAskForBidsFactor:3 JobSelectionPolicy:{Locality:Anywhere RejectStatelessJobs:false AcceptNetworkedJobs:true ProbeHTTP: ProbeExec:/terraform_node/apply-http-allowlist.sh} ExternalValidatorWebhook:<nil> FailureInjectionConfig:{IsBadActor:false} MinBacalhauVersion:{Major:1 Minor:0 GitVersion:v1.0.4 GitCommit: BuildDate:0001-01-01 00:00:00 +0000 UTC GOOS: GOARCH:} RetryStrategy:<nil> EvalBrokerVisibilityTimeout:1m0s EvalBrokerInitialRetryDelay:1s EvalBrokerSubsequentRetryDelay:30s EvalBrokerMaxRetryCount:10 WorkerCount:2 WorkerEvalDequeueTimeout:5s WorkerEvalDequeueBaseBackoff:1s WorkerEvalDequeueMaxBackoff:30s SchedulerQueueBackoff:0s NodeOverSubscriptionFactor:1.5 TranslationEnabled:true S3PreSignedURLDisabled:false S3PreSignedURLExpiration:30m0s JobStore:0xc000c3c120 NodeInfoStoreTTL:10m0s DefaultApprovalState:APPROVED ControlPlaneSettings:{HeartbeatCheckFrequency:30s HeartbeatTopic: NodeDisconnectedAfter:30s}}
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: 16:10:44.174 | INF cmd/cli/serve/serve.go:228 > Starting bacalhau...
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: panic: runtime error: invalid memory address or nil pointer dereference
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1d9ba68]
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: goroutine 23 [running]:
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/bacalhau-project/bacalhau/pkg/lib/crypto.(*UserKey).PublicKey(...)
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/bacalhau-project/bacalhau/pkg/lib/crypto/keys.go:27
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/bacalhau-project/bacalhau/pkg/node.createAPIServer({{0xc000befaa0, 0x2e}, 0xc000446de0, {0x2790fc5, 0x7}, 0x4d2, {0x0, 0x0}, {0x0, 0x0}, ...}, ...)
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/bacalhau-project/bacalhau/pkg/node/node.go:376 +0xa8
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/bacalhau-project/bacalhau/pkg/node.NewNode({_, _}, {{{0x2790fc5, 0x7}, 0x4d2, {{0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...}, ...)
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/bacalhau-project/bacalhau/pkg/node/node.go:147 +0x212
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/bacalhau-project/bacalhau/cmd/cli/serve.serve(_, {{{0x2790fc5, 0x7}, 0x4d2, {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, ...}, ...}, ...}, ...)
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/bacalhau-project/bacalhau/cmd/cli/serve/serve.go:229 +0xde5
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/bacalhau-project/bacalhau/cmd/cli/serve.NewCmd.func2(0xc00099a008, {0x27848b7?, 0x4?, 0x278479b?})
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/bacalhau-project/bacalhau/cmd/cli/serve/serve.go:111 +0x299
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/spf13/cobra.(*Command).execute(0xc00099a008, {0xc00014c908, 0x21, 0x23})
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/spf13/cobra@v1.8.0/command.go:983 +0xaaa
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/spf13/cobra.(*Command).ExecuteC(0xc0005cc008)
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/spf13/cobra.(*Command).Execute(...)
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]:         github.com/spf13/cobra@v1.8.0/command.go:1039
Sep 30 16:10:44 bacalhau-vm-prod-0 bash[5879]: github.com/bacalhau-project/bacalhau/cmd/cli.Execute({0x308b9e8, 0xc000590c40})

frrist commented 2 days ago

root@bacalhau-vm-prod-0:/data# pwd
/data

root@bacalhau-vm-prod-0:/data# ll
total 72
drwxr-xr-x  8 root root  4096 Sep 30 16:27 ./
drwxr-xr-x 22 root root  4096 Sep 30 16:31 ../
drwxr-xr-x  2 root root  4096 Nov  9  2023 .bacalhau/
-rwxr-xr-x  1 root root    65 Sep 30 16:31 bacalhau.run*
drwxr-xr-x  3 root root  4096 Sep 30 16:10 compute/
-rw-r--r--  1 root root   332 Sep 30 16:10 config.yaml
drwxr-xr-x  5 root root  4096 Sep 30 16:38 ipfs/
-rw-------  1 root root  1597 Nov  9  2023 libp2p_private_key
drwx------  2 root root 16384 Nov  9  2023 lost+found/
drwxr-xr-x  3 root root  4096 Sep 30 16:10 orchestrator/
drwx------  2 root root  4096 Nov  9  2023 plugins/
-rw-------  1 root root    13 Sep 30 16:10 repo.version
-rw-r--r--  1 root root   825 Sep 30 16:31 secrets.sh
-rw-------  1 root root   197 Sep 30 16:31 system_metadata.yaml
-rw-------  1 root root  1679 Sep 30 16:27 user_id.pem

root@bacalhau-vm-prod-0:/data# which bacalhau
/usr/local/bin/bacalhau

root@bacalhau-vm-prod-0:~# bacalhau --repo=/data version
Flag --repo has been deprecated, Use --data-dir=<path> to set this configuration
 CLIENT          SERVER          LATEST  UPDATE MESSAGE 
 v1.5.0-alpha10  v1.5.0-alpha10  1.4.0                  

Checking the instance now, the key appears to have content in it. Was this a one-time issue, or were manual steps taken to mitigate it?

wdbaruni commented 2 days ago

I had to delete user_id.pem. Not sure why that happened. The logs don't say much, and I am not sure what state the repo was in before the migration. It does show that the file was created today and was empty.

frrist commented 2 days ago

My hunch right now is that the key in the repo was already empty before the migration ran, and thus a new (valid) one was never initialized. But that's a pretty weak hunch.

wdbaruni commented 1 day ago

But how did the node previously run on v1.4 with an empty user_id.pem? Did we change anything related to the user key in v1.5?