bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Job get stuck at "Creating job for submission" until it times out while looking for input CID #1592

Closed. enricorotundo closed this issue 6 months ago.

enricorotundo commented 1 year ago

Issue

Originally raised in this Slack thread: the command bacalhau docker run -v bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcm:/inputs mihirsinhparmar/bac reportedly got stuck at "Creating job for submission ..." (on v0.3.16). The report indicates c90bad95-8202-4638-96a0-4e09a545f180 as the related job-id.

Suspiciously few event logs relate to that job-id. These were recorded between 2023-01-02T16:40:00 and 2023-01-02T16:45:00:

{"Type":"model.JobEvent","Event":{"APIVersion":"V1beta1","JobID":"c90bad95-8202-4638-96a0-4e09a545f180","ClientID":"b98095e56dd5f8f2de2063e2c6d0a21af540e2dbe2de5e0c54ae8c5a146347ec","SourceNodeID":"QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL","EventName":"Created","Spec":{"Engine":"Docker","Verifier":"Noop","Publisher":"Estuary","Docker":{"Image":"mihirsinhparmar/bac"},"Language":{"JobContext":{}},"Wasm":{},"Resources":{"GPU":""},"Timeout":1800,"inputs":[{"StorageSource":"IPFS","CID":"bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcm","path":"/inputs"}],"outputs":[{"StorageSource":"IPFS","Name":"outputs","path":"/outputs"}],"Sharding":{"BatchSize":1,"GlobPatternBasePath":"/inputs"},"ExecutionPlan":{},"Deal":{"Concurrency":1}},"JobExecutionPlan":{"ShardsTotal":1},"Deal":{"Concurrency":1},"VerificationResult":{},"PublishedResult":{},"EventTime":"2023-01-02T16:42:39.814365508Z","SenderPublicKey":"CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDVRKPgCfY2fgfrkHkFjeWcqno+MDpmp8DgVaY672BqJl/dZFNU9lBg2P8Znh8OTtHPPBUBk566vU3KchjW7m3uK4OudXrYEfSfEPnCGmL6GuLiZjLf+eXGEez7qPaoYqo06gD8ROdD8VVse27E96LlrpD1xKshHhqQTxKoq1y6Rx4DpbkSt966BumovWJ70w+Nt9ZkPPydRCxVnyWS1khECFQxp5Ep3NbbKtxHNX5HeULzXN5q0EQO39UN6iBhiI34eZkH7PoAm3Vk5xns//FjTAvQw6wZUu8LwvZTaihs+upx2zZysq6CEBKoeNZqed9+Tf+qHow0P5pxmiu+or+DAgMBAAE="},"time":"2023-01-02T16:42:39.821139891Z"}
{"Type":"model.JobEvent","Event":{"APIVersion":"V1beta1","JobID":"c90bad95-8202-4638-96a0-4e09a545f180","ClientID":"b98095e56dd5f8f2de2063e2c6d0a21af540e2dbe2de5e0c54ae8c5a146347ec","SourceNodeID":"QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL","EventName":"Created","Spec":{"Engine":"Docker","Verifier":"Noop","Publisher":"Estuary","Docker":{"Image":"mihirsinhparmar/bac"},"Language":{"JobContext":{}},"Wasm":{},"Resources":{"GPU":""},"Timeout":1800,"inputs":[{"StorageSource":"IPFS","CID":"bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcm","path":"/inputs"}],"outputs":[{"StorageSource":"IPFS","Name":"outputs","path":"/outputs"}],"Sharding":{"BatchSize":1,"GlobPatternBasePath":"/inputs"},"ExecutionPlan":{},"Deal":{"Concurrency":1}},"JobExecutionPlan":{"ShardsTotal":1},"Deal":{"Concurrency":1},"VerificationResult":{},"PublishedResult":{},"EventTime":"2023-01-02T16:42:39.814365508Z"},"time":"2023-01-02T16:42:39.82430572Z"}
{"Type":"model.JobEvent","Event":{"JobID":"c90bad95-8202-4638-96a0-4e09a545f180","SourceNodeID":"QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL","EventName":"Error","Spec":{"Docker":{},"Language":{"JobContext":{}},"Wasm":{},"Resources":{"GPU":""},"Sharding":{},"ExecutionPlan":{},"Deal":{}},"JobExecutionPlan":{},"Deal":{},"Status":"[QmdZQ7Zb] shard: c90bad95-8202-4638-96a0-4e09a545f180:0 at state: Error error completing job due to shard timed out while in state EnqueuingBids, see --timeout","VerificationResult":{},"PublishedResult":{},"EventTime":"2023-01-02T16:44:47.080957365Z"},"time":"2023-01-02T16:44:47.0872095Z"}
{"Type":"model.JobEvent","Event":{"JobID":"c90bad95-8202-4638-96a0-4e09a545f180","SourceNodeID":"QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL","EventName":"Error","Spec":{"Docker":{},"Language":{"JobContext":{}},"Wasm":{},"Resources":{"GPU":""},"Sharding":{},"ExecutionPlan":{},"Deal":{}},"JobExecutionPlan":{},"Deal":{},"Status":"[QmdZQ7Zb] shard: c90bad95-8202-4638-96a0-4e09a545f180:0 at state: Error error completing job due to shard timed out while in state EnqueuingBids, see --timeout","VerificationResult":{},"PublishedResult":{},"EventTime":"2023-01-02T16:44:47.080957365Z","SenderPublicKey":"CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDVRKPgCfY2fgfrkHkFjeWcqno+MDpmp8DgVaY672BqJl/dZFNU9lBg2P8Znh8OTtHPPBUBk566vU3KchjW7m3uK4OudXrYEfSfEPnCGmL6GuLiZjLf+eXGEez7qPaoYqo06gD8ROdD8VVse27E96LlrpD1xKshHhqQTxKoq1y6Rx4DpbkSt966BumovWJ70w+Nt9ZkPPydRCxVnyWS1khECFQxp5Ep3NbbKtxHNX5HeULzXN5q0EQO39UN6iBhiI34eZkH7PoAm3Vk5xns//FjTAvQw6wZUu8LwvZTaihs+upx2zZysq6CEBKoeNZqed9+Tf+qHow0P5pxmiu+or+DAgMBAAE="},"time":"2023-01-02T16:44:47.088332539Z"}

Furthermore, there seem to be no other logs pointing to that job-id around the time of the incident, which is weird.

Investigation

Two things to note here:

  1. mihirsinhparmar/bac is not an existing image (verified on Docker Hub). We can probably rule this out: despite https://github.com/filecoin-project/bacalhau/issues/1587 and https://github.com/filecoin-project/bacalhau/issues/1589, the submit command should still return quickly with a generic "Error while executing the job".
  2. The input CID was created a few moments (minutes?) before the issue was encountered. Maybe it had not propagated yet and bacalhau failed to output a proper error message? Submitting a job with a non-existing input CID shows the same symptoms reported in Slack, i.e. stuck at "Creating job for submission ..."; try it yourself with the command below 👇
❯ bacalhau docker run -v bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcz:/inputs ubuntu ls /inputs
Job successfully submitted. Job ID: c4963f3c-69f5-40de-8bfb-a59f793e9b43
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

           Creating job for submission ... done ✅
      Unknown error while running job.
...
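Until bacalhau surfaces this lookup failure explicitly, a client-side pre-flight check can bound the CID lookup before submitting. This is a rough sketch, not an official workaround: it assumes the ipfs CLI is installed, a local daemon is running, and the 10-second bound is an arbitrary choice.

```shell
# Hypothetical pre-flight check: bound the CID lookup so a missing or
# not-yet-propagated CID fails fast instead of hanging at job submission.
CID=bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcz
if timeout 10 ipfs block stat "$CID" >/dev/null 2>&1; then
  STATUS="reachable"    # safe to run: bacalhau docker run -v "$CID:/inputs" ...
else
  STATUS="unreachable"  # submission would hang in EnqueuingBids until --timeout
fi
echo "input CID is $STATUS"
```

If the lookup cannot complete within the bound, the job would almost certainly sit in EnqueuingBids until the shard timeout fires, so refusing to submit is the cheaper failure.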
enricorotundo@bacalhau-vm-production-0:/data/.bacalhau$ journalctl -u bacalhau-daemon --since="2022-12-30" | grep --text c4963f3c-69f5-40de-8bfb-a59f793e9b43
Jan 04 14:10:09 bacalhau-vm-production-0 bash[359111]: 14:10:09.269 | DBG localdb/eventhandler.go:47 > JobID = c4963f3c-69f5-40de-8bfb-a59f793e9b43
Jan 04 14:10:09 bacalhau-vm-production-0 bash[359111]: 14:10:09.282 | DBG frontend/service.go:54 > job created: c4963f3c-69f5-40de-8bfb-a59f793e9b43 [NodeID:QmdZQ7Zb]
Jan 04 14:12:09 bacalhau-vm-production-0 bash[359111]: 14:12:09.301 | INF eventhandler/chained_handlers.go:132 > Handled event [ClientID:bae9c3b2adfa04cc647a2457e8c0c605cef8ed93bdea5ac5f19f94219f722dfe] [EventName:Created] [HandleDuration:120036.02186] [HandlerError:"error calculating job requirements: error getting job disk space requirements: Post \"http://127.0.0.1:5001/api/v0/object/stat?arg=%2Fipfs%2Fbafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcy\": context deadline exceeded"] [JobID:c4963f3c-69f5-40de-8bfb-a59f793e9b43] [NodeID:QmdZQ7Zb] [ShardIndex:0] [SourceNodeID:QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3] [Status:] [TargetNodeID:]
Jan 04 14:12:09 bacalhau-vm-production-0 bash[359111]: {APIVersion:V1beta1 JobID:c4963f3c-69f5-40de-8bfb-a59f793e9b43 ShardIndex:0 ClientID:bae9c3b2adfa04cc647a2457e8c0c605cef8ed93bdea5ac5f19f94219f722dfe SourceNodeID:QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3 TargetNodeID: EventName:Created Spec:{Engine:Docker Verifier:Noop Publisher:Estuary Docker:{Image:ubuntu Entrypoint:[ls /inputs] EnvironmentVariables:[] WorkingDirectory:} Language:{Language: LanguageVersion: Deterministic:false Context:{StorageSource:storageSourceUnknown Name: CID: URL: Path: Metadata:map[]} Command: ProgramPath: RequirementsPath:} Wasm:{EntryPoint: Parameters:[] EnvironmentVariables:map[] ImportModules:[]} Resources:{CPU: Memory: Disk: GPU:} Timeout:1800 Inputs:[{StorageSource:IPFS Name: CID:bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcz URL: Path:/inputs Metadata:map[]}] Contexts:[] Outputs:[{StorageSource:IPFS Name:outputs CID: URL: Path:/outputs Metadata:map[]}] Annotations:[] Sharding:{GlobPattern: BatchSize:1 BasePath:/inputs} DoNotTrack:false ExecutionPlan:{TotalShards:0} Deal:{Concurrency:1 Confidence:0 MinBids:0}} JobExecutionPlan:{TotalShards:1} Deal:{Concurrency:1 Confidence:0 MinBids:0} Status: VerificationProposal:[] VerificationResult:{Complete:false Result:false} PublishedResult:{StorageSource:storageSourceUnknown Name: CID: URL: Path: Metadata:map[]} EventTime:2023-01-04 14:10:09.258495872 +0000 UTC SenderPublicKey:[8 0 18 166 2 48 130 1 34 48 13 6 9 42 134 72 134 247 13 1 1 1 5 0 3 130 1 15 0 48 130 1 10 2 130 1 1 0 197 230 150 26 77 219 120 83 12 236 148 244 67 175 197 197 94 253 92 150 219 70 24 133 132 54 89 196 8 181 106 179 10 208 86 158 113 172 203 8 255 158 204 172 149 44 48 0 236 207 166 191 114 42 125 44 236 98 202 109 219 101 82 30 103 149 166 110 177 105 227 193 116 89 190 209 8 66 192 177 71 37 255 60 120 20 112 138 248 13 133 42 169 174 134 150 61 55 37 77 26 234 72 12 21 2 62 240 150 176 54 252 237 74 255 253 175 65 116 35 132 78 230 155 227 
158 251 123 123 197 9 173 223 19 239 99 159 145 110 134 134 180 155 150 123 25 251 153 181 77 22 216 85 250 157 206 21 33 251 30 49 106 214 164 112 254 6 167 191 145 123 62 241 17 65 85 109 128 99 44 206 242 80 156 30 212 66 224 182 245 50 93 85 230 23 40 0 221 105 123 156 159 86 227 229 79 218 178 2 0 167 31 121 200 33 107 111 17 140 252 6 116 131 129 117 211 29 212 254 55 238 165 174 111 104 50 9 77 135 180 233 79 166 91 154 152 168 33 178 90 149 233 64 202 17 183 2 3 1 0 1] RunOutput:<nil>} [NodeID:QmdZQ7Zb]
Jan 04 14:12:17 bacalhau-vm-production-0 bash[359111]: 14:12:17.16 | DBG localdb/eventhandler.go:47 > JobID = c4963f3c-69f5-40de-8bfb-a59f793e9b43
Jan 04 14:12:17 bacalhau-vm-production-0 bash[359111]: 14:12:17.173 | INF eventhandler/chained_handlers.go:132 > Handled event [ClientID:] [EventName:Error] [HandleDuration:16.854123] [HandlerError:"error getting active execution: no executions found for shard: c4963f3c-69f5-40de-8bfb-a59f793e9b43:0"] [JobID:c4963f3c-69f5-40de-8bfb-a59f793e9b43] [NodeID:QmdZQ7Zb] [ShardIndex:0] [SourceNodeID:QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3] [Status:"[QmYgxZiy] shard: c4963f3c-69f5-40de-8bfb-a59f793e9b43:0 at state: Error error completing job due to shard timed out while in state EnqueuingBids, see --timeout"] [TargetNodeID:]
Jan 04 14:12:17 bacalhau-vm-production-0 bash[359111]: 14:12:17.173 | ERR libp2p/libp2p.go:475 > error in handle event: error getting active execution: no executions found for shard: c4963f3c-69f5-40de-8bfb-a59f793e9b43:0
Jan 04 14:12:17 bacalhau-vm-production-0 bash[359111]: {APIVersion: JobID:c4963f3c-69f5-40de-8bfb-a59f793e9b43 ShardIndex:0 ClientID: SourceNodeID:QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3 TargetNodeID: EventName:Error Spec:{Engine:engineUnknown Verifier:verifierUnknown Publisher:publisherUnknown Docker:{Image: Entrypoint:[] EnvironmentVariables:[] WorkingDirectory:} Language:{Language: LanguageVersion: Deterministic:false Context:{StorageSource:storageSourceUnknown Name: CID: URL: Path: Metadata:map[]} Command: ProgramPath: RequirementsPath:} Wasm:{EntryPoint: Parameters:[] EnvironmentVariables:map[] ImportModules:[]} Resources:{CPU: Memory: Disk: GPU:} Timeout:0 Inputs:[] Contexts:[] Outputs:[] Annotations:[] Sharding:{GlobPattern: BatchSize:0 BasePath:} DoNotTrack:false ExecutionPlan:{TotalShards:0} Deal:{Concurrency:0 Confidence:0 MinBids:0}} JobExecutionPlan:{TotalShards:0} Deal:{Concurrency:0 Confidence:0 MinBids:0} Status:[QmYgxZiy] shard: c4963f3c-69f5-40de-8bfb-a59f793e9b43:0 at state: Error error completing job due to shard timed out while in state EnqueuingBids, see --timeout VerificationProposal:[] VerificationResult:{Complete:false Result:false} PublishedResult:{StorageSource:storageSourceUnknown Name: CID: URL: Path: Metadata:map[]} EventTime:2023-01-04 14:12:17.15292983 +0000 UTC SenderPublicKey:[8 0 18 166 2 48 130 1 34 48 13 6 9 42 134 72 134 247 13 1 1 1 5 0 3 130 1 15 0 48 130 1 10 2 130 1 1 0 197 230 150 26 77 219 120 83 12 236 148 244 67 175 197 197 94 253 92 150 219 70 24 133 132 54 89 196 8 181 106 179 10 208 86 158 113 172 203 8 255 158 204 172 149 44 48 0 236 207 166 191 114 42 125 44 236 98 202 109 219 101 82 30 103 149 166 110 177 105 227 193 116 89 190 209 8 66 192 177 71 37 255 60 120 20 112 138 248 13 133 42 169 174 134 150 61 55 37 77 26 234 72 12 21 2 62 240 150 176 54 252 237 74 255 253 175 65 116 35 132 78 230 155 227 158 251 123 123 197 9 173 223 19 239 99 159 145 110 134 134 180 155 150 123 25 251 153 181 77 22 216 85 250 
157 206 21 33 251 30 49 106 214 164 112 254 6 167 191 145 123 62 241 17 65 85 109 128 99 44 206 242 80 156 30 212 66 224 182 245 50 93 85 230 23 40 0 221 105 123 156 159 86 227 229 79 218 178 2 0 167 31 121 200 33 107 111 17 140 252 6 116 131 129 117 211 29 212 254 55 238 165 174 111 104 50 9 77 135 180 233 79 166 91 154 152 168 33 178 90 149 233 64 202 17 183 2 3 1 0 1] RunOutput:<nil>} [NodeID:QmdZQ7Zb]

enricorotundo@bacalhau-vm-production-0:/data/.bacalhau$ cat bacalhau-event-tracer.json | grep --text c4963f3c-69f5-40de-8bfb-a59f793e9b43
{"Type":"model.JobEvent","Event":{"APIVersion":"V1beta1","JobID":"c4963f3c-69f5-40de-8bfb-a59f793e9b43","ClientID":"bae9c3b2adfa04cc647a2457e8c0c605cef8ed93bdea5ac5f19f94219f722dfe","SourceNodeID":"QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3","EventName":"Created","Spec":{"Engine":"Docker","Verifier":"Noop","Publisher":"Estuary","Docker":{"Image":"ubuntu","Entrypoint":["ls","/inputs"]},"Language":{"JobContext":{}},"Wasm":{},"Resources":{"GPU":""},"Timeout":1800,"inputs":[{"StorageSource":"IPFS","CID":"bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcz","path":"/inputs"}],"outputs":[{"StorageSource":"IPFS","Name":"outputs","path":"/outputs"}],"Sharding":{"BatchSize":1,"GlobPatternBasePath":"/inputs"},"ExecutionPlan":{},"Deal":{"Concurrency":1}},"JobExecutionPlan":{"ShardsTotal":1},"Deal":{"Concurrency":1},"VerificationResult":{},"PublishedResult":{},"EventTime":"2023-01-04T14:10:09.258495872Z","SenderPublicKey":"CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDF5pYaTdt4UwzslPRDr8XFXv1clttGGIWENlnECLVqswrQVp5xrMsI/57MrJUsMADsz6a/cip9LOxiym3bZVIeZ5WmbrFp48F0Wb7RCELAsUcl/zx4FHCK+A2FKqmuhpY9NyVNGupIDBUCPvCWsDb87Ur//a9BdCOETuab4577e3vFCa3fE+9jn5Fuhoa0m5Z7GfuZtU0W2FX6nc4VIfseMWrWpHD+Bqe/kXs+8RFBVW2AYyzO8lCcHtRC4Lb1Ml1V5hcoAN1pe5yfVuPlT9qyAgCnH3nIIWtvEYz8BnSDgXXTHdT+N+6lrm9oMglNh7TpT6ZbmpioIbJalelAyhG3AgMBAAE="},"time":"2023-01-04T14:10:09.265541705Z"}
{"Type":"model.JobEvent","Event":{"JobID":"c4963f3c-69f5-40de-8bfb-a59f793e9b43","SourceNodeID":"QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3","EventName":"Error","Spec":{"Docker":{},"Language":{"JobContext":{}},"Wasm":{},"Resources":{"GPU":""},"Sharding":{},"ExecutionPlan":{},"Deal":{}},"JobExecutionPlan":{},"Deal":{},"Status":"[QmYgxZiy] shard: c4963f3c-69f5-40de-8bfb-a59f793e9b43:0 at state: Error error completing job due to shard timed out while in state EnqueuingBids, see --timeout","VerificationResult":{},"PublishedResult":{},"EventTime":"2023-01-04T14:12:17.15292983Z","SenderPublicKey":"CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDF5pYaTdt4UwzslPRDr8XFXv1clttGGIWENlnECLVqswrQVp5xrMsI/57MrJUsMADsz6a/cip9LOxiym3bZVIeZ5WmbrFp48F0Wb7RCELAsUcl/zx4FHCK+A2FKqmuhpY9NyVNGupIDBUCPvCWsDb87Ur//a9BdCOETuab4577e3vFCa3fE+9jn5Fuhoa0m5Z7GfuZtU0W2FX6nc4VIfseMWrWpHD+Bqe/kXs+8RFBVW2AYyzO8lCcHtRC4Lb1Ml1V5hcoAN1pe5yfVuPlT9qyAgCnH3nIIWtvEYz8BnSDgXXTHdT+N+6lrm9oMglNh7TpT6ZbmpioIbJalelAyhG3AgMBAAE="},"time":"2023-01-04T14:12:17.160197219Z"}

Now, the log "Error error completing job due to shard timed out while in state EnqueuingBids, see --timeout" appears in both my repro and in the reported issue, so I'd say the root cause is probably a silent timeout while looking up the input CID on IPFS.
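The "context deadline exceeded" in the journalctl output comes from a POST to /api/v0/object/stat on the local IPFS API, so the hang can be reproduced directly against that endpoint. A sketch assuming a local daemon on the default port 5001 and an arbitrary 10-second bound:

```shell
# Query the same endpoint the disk-space requirements check uses; with an
# unresolvable CID the call hangs until curl's own deadline fires, which
# mirrors the timeout bacalhau currently swallows.
CID=bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcz
curl --silent --show-error --max-time 10 -X POST \
  "http://127.0.0.1:5001/api/v0/object/stat?arg=/ipfs/${CID}" \
  || echo "object/stat gave no answer within 10s (CID likely not propagated)"
```

Seeing curl time out here while a known-good CID answers immediately would confirm that the stall is in the IPFS lookup rather than in bacalhau's scheduling.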

enricorotundo commented 1 year ago

More info from Mihirsinh Parmar (link to slack thread here):

Hey, the docker image mihirsinhparmar/bac was deleted, as it was giving an error, and even when I updated the image, bacalhau was still running the old container. So I moved to a different image name, mihirsinhparmar/bacalhau-ie. This one works on bacalhau, though I still have to try it with inputs provided.

I believe bacalhau is waiting at "Creating job for submission" because the IPFS hash had not propagated to Filecoin. I pinned the file using Web3Storage a few days ago, but I recently checked and it is still in the queue. The file: bafybeidnwj4wlfsshgsfbuinelg5hsovlg6ibozyzyii7uw6dwgv6lytcm

I just uploaded a new file today with Web3Storage and it is still queuing. Can this IPFS hash be used for inputs, or does a Filecoin deal need to be made before the IPFS hash can be used with bacalhau?