camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

hello world task failed! #60

Closed lilelr closed 6 years ago

lilelr commented 6 years ago

Hi Malte and everyone, I ran into a problem trying to run the 'hello world' example after installing Firmament on an Ubuntu 14.04 virtual machine. I had built Firmament from source and all `ctest` tests passed successfully. I started the coordinator with `./build/src/coordinator --listen_uri tcp:127.0.0.1:8081 --task_lib_dir=$(pwd)/build/src/`, and could then observe the Firmament status at http://127.0.0.1:8080. However, when I tried to run the 'hello world' example using the command `python job_submit.py localhost 8080 /home/lilelr/opensource/firmament/firmament/build/src/examples/hello_world/hello_world`

in the '/firmament/scripts/job' directory, I could see the job was submitted successfully, but it quickly showed me: `E0823 14:33:35.263828 784 task_health_checker.cc:51] Task 2611075011106894433 has failed!`

(screenshot: snip20170823_3)

(screenshot: snip20170823_1)

(screenshot: snip20170823_2)

The log coordinator.INFO shows:

(screenshot: snip20170823_5)

From your code in 'hello_world.cc', I guess that if the 'hello world' job had been scheduled successfully, I should have seen 'Hello world' on the terminal. It seems the task failed because it didn't send a heartbeat within one minute, so it was killed by the coordinator. Is that right? How can I fix this problem? Or is it just that my virtual machine runs Firmament too slowly? By the way, when I tried to run the example

`python job_submit.py localhost 8080 /bin/sleep 60`, the same problem occurred again.

CHENLIELIE commented 6 years ago

Hello, lilelr. How did you install from source code? Did you have any trouble downloading grpc? Did you use a VPS or VPN? Thanks.

lilelr commented 6 years ago

Haha, CHENLIELIE, maybe you are also from China. A week ago, I succeeded in downloading grpc. But now, when I try to build Firmament from source on a new machine, it doesn't work because it fails to download boringssl from a Google server.

ms705 commented 6 years ago

Hi @lilelr,

The most likely reason for this problem is that the task doesn't report back to the coordinator once it has started up. A common cause is that the `--task_lib_dir` parameter is incorrect, so the `LD_PRELOAD` fails to inject the task communication library into the binary.
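To illustrate the failure mode, here is a simplified sketch of how an executor might inject a library via `LD_PRELOAD` when spawning a task (illustrative only, not Firmament's actual executor code; the library name `libtask_lib_inject.so` is the one mentioned later in this thread):

```python
import os
import subprocess

def run_with_task_lib(binary, task_lib_dir, args=()):
    # Build the child environment with LD_PRELOAD pointing at the
    # injection library inside task_lib_dir.
    env = dict(os.environ)
    env["LD_PRELOAD"] = os.path.join(task_lib_dir, "libtask_lib_inject.so")
    # If task_lib_dir is wrong, the dynamic linker merely prints a
    # warning on stderr and runs the task WITHOUT the library -- so the
    # task itself succeeds but never reports back to the coordinator,
    # and the health checker eventually marks it failed.
    return subprocess.run([binary, *args], env=env,
                          capture_output=True, text=True)
```

Note that with a bad `task_lib_dir`, e.g. `run_with_task_lib("/bin/echo", "/nonexistent", ["hi"])`, the child still runs and prints normally, which matches the symptom above: the task's own output looks fine, yet the coordinator never hears from it.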

There are several ways you might deal with this:

1) If you haven't already, look at the stdout and stderr of the task binary itself. They should be stored as log files (`src/engine/local_executor.cc` redirects the output). If the task communication library is correctly injected, it should produce some output there. Alternatively, you might see dynamic linking errors, which would point to a missing dependency of the task communication library. If you can locate this output, please post it here.

2) You could use Poseidon, which integrates Firmament with the Kubernetes cluster manager. The cluster management components of standalone Firmament are a research prototype, so they're not as stable or production-ready as Kubernetes; if you want to just play with Firmament scheduling policies, the Kubernetes integration might make that easier for you.

I hope this helps!

CHENLIELIE commented 6 years ago

@lilelr yes, your guess is right. I have found the reason I couldn't build successfully: it was my company's network, which cannot access foreign websites and blocks shadowsocks. On my personal computer, the download succeeds when using shadowsocks and fails without it. But I've hit another problem building grpc: it gives the error "cannot find -lpthreads". Which library file is missing? I wrote a C file to test the thread library, built it with `gcc file.c -lpthread`, and it works fine. I don't understand the relationship between -lpthreads and -lpthread. If you don't mind, we can communicate by mail in Chinese; my QQ mail is 1247735366@qq.com. Thanks.

lilelr commented 6 years ago

@ms705, thank you very much for your answer, but I still have trouble running the 'hello world' task. By adding `--task_lib_dir=/home/lilelr/firmament` (an absolute path) to the coordinator startup command, the job `/bin/sleep 60` completed successfully. However, the tasks of the 'hello world' and 'timespin' jobs still failed. I still used the command `python job_submit.py localhost 8080 /home/lilelr/opensource/firmament/firmament/build/src/examples/hello_world/hello_world` to submit the job. The log file coordinator.INFO still showed: (screenshot: snip20170829_3)

The error in the task stderr file was the following:

(screenshot: snip20170829_4)

The task stdout file was empty. The same problem occurred when I tried to run the 'timespin' example job.

I could only find `libtask_lib_inject.so` in the /firmament/build/src directory; I cannot find `task_lib.a` anywhere in the firmament directory. The error saying that task_lib.cc is being linked both statically and dynamically confused me; I hope you can help me solve it.

ms705 commented 6 years ago

@lilelr,

The problem is likely that the Hello World example binary links the task communication library statically, but Firmament then also tries to inject it via `LD_PRELOAD` for dynamic linking. One thing you could try is to change line 38 of the `job_submit.py` script to:

job_desc.root_task.inject_task_lib = False

... and see if that resolves the problem.
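For context, the field above is part of the job descriptor that `job_submit.py` builds before submission. A toy stand-in (the `root_task.inject_task_lib` field name comes from the snippet above; everything else here is hypothetical) shows where the change sits:

```python
from types import SimpleNamespace

# Toy stand-in for the protobuf job descriptor that job_submit.py
# assembles; the real script fills in many more fields.
job_desc = SimpleNamespace(
    root_task=SimpleNamespace(
        binary="/path/to/hello_world",  # hypothetical path
        inject_task_lib=True,           # default: preload the task lib
    )
)

# For binaries that already link task_lib statically, disable the
# LD_PRELOAD injection so two copies of the library don't collide.
job_desc.root_task.inject_task_lib = False
```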

More broadly, however, we should probably remove the example binaries from the repo or mark them more clearly as deprecated. In practice, almost all use cases involve running unmodified legacy binaries, so dynamic injection of the task communication library is the better way to operate. (When using Kubernetes, this issue does not arise, because we monitor the health of the container instead of injecting a task communication library.)

lilelr commented 6 years ago

@ms705, after I changed `job_desc.root_task.inject_task_lib = False` and ran the hello world example again, the log soon showed that the task had still failed. (screenshot: snip20170830_1) But the task stdout successfully printed 'Hello world (stdout)!', and the task stderr showed a segmentation fault, as follows: (screenshot: snip20170830_2)

From your code in hello_world.cc, I cannot see any problem. (screenshot: snip20170830_3)

So what do you think the cause is? By the way, I also want to know how to write code for my own jobs. For example, if I want to implement quicksort, are there programming interfaces I need to comply with? And how would I write a MapReduce job in Firmament? I'm sorry to ask so many questions. I am a graduate student researching cluster resource management, and I need to work out the design principles of Firmament from your code, so I might not use Kubernetes soon.

ms705 commented 6 years ago

I'm not sure what causes the segfault; note however that it only occurs once the task is shutting down.

I suspect the reason a task failure is reported here is actually that the "hello world" task exits so quickly that its monitor thread doesn't even get around to sending a status update to the coordinator. You can probably work around this by adding a `sleep()` invocation below the code in `HelloWorldTask::Invoke()`.
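This race can be modelled in a few lines (a toy sketch, not Firmament code): a monitor thread reports in at a fixed interval, so a task body that returns before the first interval elapses produces no status updates at all.

```python
import threading
import time

def run_task(work, heartbeat_interval=0.1):
    """Toy model of the race: count heartbeats a monitor thread
    manages to send while the task body runs."""
    heartbeats = []
    done = threading.Event()

    def monitor():
        # Sleep one interval, then report in -- unless the task
        # already finished, in which case no heartbeat is ever sent.
        while not done.is_set():
            time.sleep(heartbeat_interval)
            if not done.is_set():
                heartbeats.append(time.time())

    t = threading.Thread(target=monitor)
    t.start()
    work()       # the task body (e.g. print "Hello world")
    done.set()   # task exits; monitor stops
    t.join()
    return len(heartbeats)
```

A body that returns immediately yields zero heartbeats, while one that sleeps past the interval yields at least one; which is why inserting a `sleep()` into the example task masks the failure.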

That said, I think you should consider writing your own jobs as standalone binaries and using the task lib injection -- it's better tested and works with any binary, so it is our preferred way of running Firmament jobs. Within your own binaries, you can do whatever you want -- including implementing MapReduce. However, since Firmament is a cluster-level scheduler, it won't do any of the framework-level MapReduce orchestration (e.g., the shuffle between mappers and reducers) for you, and you'll have to implement your own controller, mapper and reducer binaries.

lilelr commented 6 years ago

@ms705, thanks for your good advice. The "hello world" example now completes successfully on a single machine (Ubuntu 14.04). I changed the code as you suggested. First, I modified job_submit.py by setting `job_desc.root_task.inject_task_lib = False`. The picture below shows that I let the task thread sleep for 70 s (line 27) before printing "hello world". (screenshot: snip20170910_2) But the perf file of this job does not contain data. (screenshots: snip20170910_14, snip20170910_15)

The other question is about what happened when I wrote a primitive "hello world" program, as shown in the picture below, without using your "task_main" function. (screenshot: snip20170910_9)

This time, the perf file of the job did contain data. (screenshot: snip20170910_16)

But the task failed, and coordinator.INFO showed a similar error. However, the task stdout successfully printed what I wanted it to print. The task stdout: (screenshot: snip20170910_12)

The coordinator.INFO: (screenshot: snip20170910_10)

Do I need to use the "task_main" function every time?

Another problem occurred when I ran Firmament on two machines, both Ubuntu 14.04. Assume the parent coordinator is named "246" and the child coordinator is "247". I started Firmament on "247" with the following command:

(screenshot: snip20170910_3) The flow graph in the web UI showed that coordinator 247 successfully connected to coordinator 246, and the log from 247 confirmed this. (screenshot: snip20170910_8)

When I submitted the "hello world" job on the parent coordinator 246, the parent coordinator seemed to schedule its task to run on the child coordinator 247. However, the task_node became NULL, and the connection between 246 and 247 was closed. The coordinator 246 log showed: (screenshot: snip20170910_7) The coordinator 247 log showed: (screenshot: snip20170910_5) My guess is that coordinator 246 could not transfer the task info to coordinator 247. What do you think? How can I fix it?

ms705 commented 6 years ago

@lilelr Sorry for the delayed response.

I suspect the coordinator failure you saw occurred because both your parent and your child coordinator are using the flow scheduler. While this ought to work in principle, it's not a setting we've tested. The easiest way to work around it is to use the simple scheduler on the child coordinator. This will have no impact on the quality of scheduling decisions, since the parent coordinator already has global knowledge when using the flow scheduler.

ms705 commented 6 years ago

@lilelr Did this get fixed / did my suggestions help? If so, can we close the issue?