douban / dpark

Python clone of Spark, a MapReduce-like framework in Python
BSD 3-Clause "New" or "Revised" License

DPark failed to submitTasks when running on mesos #27

Closed GoSteven closed 11 years ago

GoSteven commented 11 years ago

I set up a Mesos cluster on Amazon EC2 using the Mesos EC2 scripts.

Then I ran `python27 demo.py -m mesos://master@ec2-54-224-207-120.compute-1.amazonaws.com:5050 -p 2`.
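For context, demo.py is a small DPark job roughly along these lines (a minimal sketch only; the actual examples/demo.py may differ slightly, and DPark's `DparkContext` reads the `-m`/`-p`/`-M` options from the command line):

```python
# Minimal sketch of a demo.py-style DPark job (assumed contents, Python 2).
from dpark import DparkContext

dpark = DparkContext()                    # picks up -m/-p/-M from sys.argv
nums = dpark.parallelize(range(100), 4)   # 4 partitions -> "a job with 4 tasks"
print nums.count()                        # the call that hangs in the traceback below
```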

The program hung, stuck at submitTasks() in schedule.py. After pressing Ctrl-C:

```
[ec2-user@ip-10-31-194-149 examples]$ python27 demo.py -m mesos://master@ec2-54-224-207-120.compute-1.amazonaws.com:5050 -p 2
2013-05-20 12:30:43,786 [INFO] [scheduler] Got a job with 4 tasks
^CTraceback (most recent call last):
  File "demo.py", line 10, in <module>
    print nums.count()
  File "/home/ec2-user/dpark/dpark/rdd.py", line 271, in count
    return sum(self.ctx.runJob(self, lambda x: ilen(x)))
  File "/home/ec2-user/dpark/dpark/context.py", line 204, in runJob
    for it in self.scheduler.runJob(rdd, func, partitions, allowLocal):
  File "/home/ec2-user/dpark/dpark/schedule.py", line 269, in runJob
    submitStage(finalStage)
  File "/home/ec2-user/dpark/dpark/schedule.py", line 231, in submitStage
    submitMissingTasks(stage)
  File "/home/ec2-user/dpark/dpark/schedule.py", line 267, in submitMissingTasks
    self.submitTasks(tasks)
  File "/home/ec2-user/dpark/dpark/schedule.py", line 436, in _
    r = f(self, *a, **kw)
  File "/usr/lib64/python2.7/threading.py", line 154, in __exit__
    self.release()
  File "/usr/lib64/python2.7/threading.py", line 142, in release
    raise RuntimeError("cannot release un-acquired lock")
RuntimeError: cannot release un-acquired lock
```

In the Mesos logs:

```
Log file created at: 2013/05/20 11:40:49
Running on machine: ip-10-31-194-149.ec2.internal
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0520 11:40:49.170337 2099 logging.cpp:70] Logging to /mnt/mesos-logs
I0520 11:40:49.172806 2099 main.cpp:95] Build: 2011-12-03 06:24:10 by root
I0520 11:40:49.172871 2099 main.cpp:96] Starting Mesos master
I0520 11:40:49.176777 2099 webui.cpp:81] Starting master web server on port 8080
I0520 11:40:49.176911 2101 master.cpp:264] Master started at mesos://master@10.31.194.149:5050
I0520 11:40:49.177106 2104 webui.cpp:47] Master web server thread started
I0520 11:40:49.177109 2101 master.cpp:279] Master ID: 201305201140-0
I0520 11:40:49.177775 2101 master.cpp:462] Elected as master!
I0520 11:40:49.191300 2104 webui.cpp:59] Loading webui/master/webui.py
I0520 11:40:54.348682 2101 master.cpp:814] Attempting to register slave 201305201140-0-0 at slave@10.28.0.219:33513
I0520 11:40:54.349149 2101 master.cpp:1057] Master now considering a slave at ip-10-28-0-219.ec2.internal:33513 as active
I0520 11:40:54.349210 2101 master.cpp:1588] Adding slave 201305201140-0-0 at ip-10-28-0-219.ec2.internal with cpus=2; mem=677
I0520 11:40:54.349393 2101 simple_allocator.cpp:71] Added slave 201305201140-0-0 with cpus=2; mem=677
W0520 12:00:52.365500 2101 protobuf.hpp:260] Initialization errors: framework.executor
```

Question:

Does DPark require a specific Mesos version? Is there any relevant documentation for setting up DPark with Mesos?

davies commented 11 years ago

For simpler testing, you could run Mesos in local mode on the same host:

```
# mesos-local
```

Then you could test DPark with it:

```
python demo.py -m localhost:5050 -M 100
```

By default, DPark needs at least 1 GB of memory per task; you can use -M to specify a different value, such as 100 MB.

  1. To run DPark over Mesos in cluster mode, you MUST install DPark on all Mesos slaves.
  2. The user who runs DPark MUST also exist on all Mesos slaves.
  3. All slaves MUST be reachable by their hostnames (a quick per-slave sanity check is sketched below).
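Since these requirements are easy to get wrong, a rough per-slave sanity check could look like the following (an illustration only, assuming Python 2 as used in this thread; it is not part of DPark):

```python
# Rough per-slave sanity check for the three requirements above
# (illustrative only, Python 2).
import getpass
import socket

# 1. DPark must be installed on every slave.
import dpark
print "dpark importable from:", dpark.__file__

# 2. The submitting user must exist on every slave
#    (run this check while logged in as that user).
print "running as user:", getpass.getuser()

# 3. Every slave must be reachable by its hostname.
hostname = socket.gethostname()
print "%s resolves to %s" % (hostname, socket.gethostbyname(hostname))
```
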
GoSteven commented 11 years ago

Hi @davies, your tips are really helpful. I think my problem could be due to a wrong version of an underlying dependency.

I have tried using mesos-local:

```
[root@ip-10-28-135-104 bin]# ./mesos-local
I0520 23:54:55.486122 2988 logging.cpp:70] Logging to /root/mesos/logs
I0520 23:54:55.503355 2989 master.cpp:264] Master started at mesos://master@10.28.135.104:5050
I0520 23:54:55.503468 2989 master.cpp:279] Master ID: 201305202354-0
I0520 23:54:55.503769 2989 master.cpp:462] Elected as master!
I0520 23:54:55.503969 2989 slave.cpp:257] Slave started at slave@10.28.135.104:5050
I0520 23:54:55.504026 2989 slave.cpp:258] Slave resources: cpus=1; mem=1024
I0520 23:54:55.504513 2989 slave.cpp:320] New master detected at master@10.28.135.104:5050
I0520 23:54:55.504706 2989 master.cpp:814] Attempting to register slave 201305202354-0-0 at slave@10.28.135.104:5050
I0520 23:54:55.505198 2989 master.cpp:1057] Master now considering a slave at ip-10-28-135-104.ec2.internal:5050 as active
I0520 23:54:55.505271 2989 master.cpp:1588] Adding slave 201305202354-0-0 at ip-10-28-135-104.ec2.internal with cpus=1; mem=1024
I0520 23:54:55.505342 2989 simple_allocator.cpp:71] Added slave 201305202354-0-0 with cpus=1; mem=1024
I0520 23:54:55.505429 2989 slave.cpp:340] Registered with master; given slave ID 201305202354-0-0
libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "mesos.internal.RegisterFrameworkMessage" because it is missing required fields: framework.executor
W0520 23:56:08.176432 2989 protobuf.hpp:260] Initialization errors: framework.executor
libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "mesos.internal.RegisterFrameworkMessage" because it is missing required fields: framework.executor
W0520 23:58:35.337236 2989 protobuf.hpp:260] Initialization errors: framework.executor
```

I installed protobuf with pip; the protobuf version is 2.5.0.
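To confirm which protobuf package the Python bindings actually pick up, a quick check like this may help (a small diagnostic sketch, assuming the installed protobuf package exposes `__version__`, which the 2.x releases do):

```python
# Quick check of the protobuf package the interpreter imports (Python 2).
import google.protobuf
print google.protobuf.__version__   # prints 2.5.0 in this setup
print google.protobuf.__file__      # shows which installation is being used
```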

davies commented 11 years ago

From the Mesos error log, it looks like this Mesos build is not compatible with DPark; you should try Mesos 0.9. We have used 0.9 in production for more than a year, with several patches. Just wait a moment, we will push them out.

davies commented 11 years ago

Mesos 0.9 with our patches: https://github.com/windreamer/mesos/tree/master

GoSteven commented 11 years ago

Thanks a lot for your help @davies. DPark works fine with Mesos 0.9 when I run mesos-local. I will try to run it on an EC2 cluster shortly.