camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

Task '/bin/sleep 60' has failed #76

Open gabrielecastellano opened 3 years ago

gabrielecastellano commented 3 years ago

Hello everyone, I managed to run Firmament using the provided Docker image. When I start the container, it prints the following error (I don't know whether it is related to my issue):

$ docker run -p 9999:9999 -w /firmament camsas/firmament:dev /firmament/build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8081 --http_ui_port 9999 --task_lib_dir=/firmament/build/src
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/H2GR6RBYUIPHBXDMSGKPBAYWNE:/var/lib/docker/overlay2/l/NKEZN6MLXD4DGK5HNNI2K4SN7K:/var/lib/docker/overlay2/l/5H5GK4TBC5MY7NFNYEW2P7MPRP:/var/lib/docker/overlay2/l/2DVGBZKGQHVXVENMWHAW3HNEGB:/var/lib/docker/overlay2/l/DA5VWJ6IOM3MFNW3T6VLSZ4ZDR:/var/lib/docker/overlay2/l/NFSSHKRHC7XPWN7BXCLFDMXHF6:/var/lib/docker/overlay2/l/C4RYQ3MDIDZ376KHATSEPRHOOC:/var/lib/docker/overlay2/l/23CTT2D5BDVQOVVUTHAGP4SPKX:/var/lib/docker/overlay2/l/UTO3PZRTFU4CU'

Despite this, the server seems to be running correctly, and I am able to access the GUI at http://:9999/

However, when I tried to submit a job with

$ python scripts/job/job_submit.py 172.17.0.2 9999 /bin/sleep 60

I got the following error:

E1116 17:19:04.534961 6 task_health_checker.cc:51] Task 18085502784089753274 has failed!

Here is /tmp/coordinator.INFO:

I1116 17:16:14.514029     1 coordinator_main.cc:36] Firmament coordinator starting ...
I1116 17:16:14.531463     1 coordinator.cc:120] Using Quincy-style min cost flow-based scheduler.
I1116 17:16:14.531641     1 coordinator.cc:133] Coordinator starting on host tcp:0.0.0.0:8081, UUID 42f151f8-deef-46b8-b8a6-88ab53e5e6a7
I1116 17:16:14.531744     1 coordinator.cc:221] Detecting resource topology:
I1116 17:16:14.531754     1 topology_manager.cc:212] *** LEVEL: 0
I1116 17:16:14.531767     1 topology_manager.cc:217] Index: 0: Machine#0(7470MB)
I1116 17:16:14.531774     1 topology_manager.cc:212] *** LEVEL: 1
I1116 17:16:14.531781     1 topology_manager.cc:217] Index: 0: Socket#0
I1116 17:16:14.531786     1 topology_manager.cc:212] *** LEVEL: 2
I1116 17:16:14.531793     1 topology_manager.cc:217] Index: 0: L3(6144KB)
I1116 17:16:14.531800     1 topology_manager.cc:212] *** LEVEL: 3
I1116 17:16:14.531805     1 topology_manager.cc:217] Index: 0: L2(256KB)
I1116 17:16:14.531812     1 topology_manager.cc:217] Index: 1: L2(256KB)
I1116 17:16:14.531819     1 topology_manager.cc:217] Index: 2: L2(256KB)
I1116 17:16:14.531826     1 topology_manager.cc:217] Index: 3: L2(256KB)
I1116 17:16:14.531831     1 topology_manager.cc:212] *** LEVEL: 4
I1116 17:16:14.531838     1 topology_manager.cc:217] Index: 0: L1d(32KB)
I1116 17:16:14.531846     1 topology_manager.cc:217] Index: 1: L1d(32KB)
I1116 17:16:14.531852     1 topology_manager.cc:217] Index: 2: L1d(32KB)
I1116 17:16:14.531859     1 topology_manager.cc:217] Index: 3: L1d(32KB)
I1116 17:16:14.531864     1 topology_manager.cc:212] *** LEVEL: 5
I1116 17:16:14.531870     1 topology_manager.cc:217] Index: 0: Core#0
I1116 17:16:14.531877     1 topology_manager.cc:217] Index: 1: Core#1
I1116 17:16:14.531883     1 topology_manager.cc:217] Index: 2: Core#2
I1116 17:16:14.531889     1 topology_manager.cc:217] Index: 3: Core#3
I1116 17:16:14.531894     1 topology_manager.cc:212] *** LEVEL: 6
I1116 17:16:14.531900     1 topology_manager.cc:217] Index: 0: PU#0
I1116 17:16:14.531908     1 topology_manager.cc:217] Index: 1: PU#1
I1116 17:16:14.531913     1 topology_manager.cc:217] Index: 2: PU#2
I1116 17:16:14.531920     1 topology_manager.cc:217] Index: 3: PU#3
I1116 17:16:14.531926     1 coordinator.cc:176] Found 4 local PUs.
I1116 17:16:14.531932     1 coordinator.cc:177] Resource URI is tcp:0.0.0.0:8081
I1116 17:16:14.534741     1 coordinator_http_ui.cc:1321] Coordinator HTTP interface up!
I1116 17:16:22.949242    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:16:23.151162     9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:16:23.151223     9 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:17:25.160835     9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:17:25.308948    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:17:25.308990    16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:28.951195    14 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:18:29.114184    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:18:29.114243    16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:57.184258    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /job/submit/
I1116 17:18:57.198359    16 coordinator.cc:865] NEW JOB: 1468db75-43d3-417e-9e26-f9843eba8c8e
I1116 17:18:57.198387    16 flow_scheduler.cc:405] START SCHEDULING (via 1468db75-43d3-417e-9e26-f9843eba8c8e)
W1116 17:18:57.198391    16 flow_scheduler.cc:406] This way of scheduling a job is slow in the flow scheduler! Consider using ScheduleAllJobs() instead.
I1116 17:18:57.198488    16 utils.cc:341] External execution of command: build/third_party/cs2/src/cs2/cs2.exe
I1116 17:18:57.475673    20 local_executor.cc:393] COMMAND LINE for task 18085502784089753274: perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60
I1116 17:18:57.476095    16 coordinator.cc:911] Attempted to schedule job 1468db75-43d3-417e-9e26-f9843eba8c8e, successfully scheduled 1 tasks.
E1116 17:19:04.534961     6 task_health_checker.cc:51] Task 18085502784089753274 has failed!
I1116 17:19:04.535176     6 event_driven_scheduler.cc:144] Task 18085502784089753274 has not reported heartbeats for 60s and its handler thread has exited. Declaring it FAILED!
I1116 17:19:04.535195     6 local_executor.cc:145] kill(2) for task 18085502784089753274 returned -1

And here is what I get from the GUI: [screenshot: firmament]

Clicking on the stderr link, I get:

E1116 17:18:57.757828 21 local_executor.cc:443] execvp failed for task command 'perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60 ': No such file or directory [2]

What am I missing?

Thanks! Gabriele

5symx commented 6 months ago

I fixed it by adding the file aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf to /tmp/firmament-perf/ in the Docker container. It seems to be working well now.
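
For anyone hitting the same stderr message: execvp reports "No such file or directory [2]" (ENOENT) when it cannot find the program to execute, so one plausible reading is that the perf binary is not on the PATH inside the container. The workaround above suggests the /tmp/firmament-perf/ output directory may matter as well. The following is a minimal sketch that checks both conditions from inside the container. The helper name diagnose_exec is hypothetical and not part of Firmament; the program name and directory are taken from the command line in the log above.

```python
import os
import shutil


def diagnose_exec(program, output_dir):
    """Return a list of human-readable problems that could explain the failure."""
    problems = []
    # execvp() fails with ENOENT (errno 2) when the program cannot be found on PATH.
    if shutil.which(program) is None:
        problems.append(f"'{program}' not found on PATH")
    # In the logged command, 'perf stat -o' writes its report into this directory.
    if not os.path.isdir(output_dir):
        problems.append(f"directory '{output_dir}' does not exist")
    return problems


if __name__ == "__main__":
    # Values assumed from the logged command line; adjust for your setup.
    for problem in diagnose_exec("perf", "/tmp/firmament-perf"):
        print(problem)
```

If this prints nothing, both the binary and the directory are present and the cause is probably elsewhere. If it reports perf as missing, installing perf in the image would be the first thing to try.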