douban / dpark

Python clone of Spark, a MapReduce-like framework in Python
BSD 3-Clause "New" or "Revised" License

Memory consumption is over the limit #34

Closed: andr0s closed this issue 11 years ago

andr0s commented 11 years ago

I have a txt file to process. I use the -M option to set the memory limit (I tried -M 400, which should mean 400 megabytes as far as I understand). But on a test run, the memory consumption jumps up to 3.2GB, which is really high. The script works as intended and the results are valid; the only issue is the memory consumption. I'm afraid that on larger files it will start swapping and simply choke.

I use no extra parameters, only -M, so I guess everything else should work with the defaults.

The memory consumption jumps up rather than growing smoothly, so I don't think it's a garbage-collection issue.

It looks like DPark creates a heap or something like that, but how do I set a limit on it? As I wrote, I'm afraid it will eat all the memory. Right now my test file is around 1GB, but what if I want to process 20GB?

In general I really like DPark and don't want to switch to any other map-reduce framework (I've tried most of them, I guess). Please tell me how to predict and control the memory consumption. Thank you very much!

andr0s commented 11 years ago

Oh, I forgot to mention, this might be helpful: I use the .collect() method to collect all the results, not collectAsMap().

davies commented 11 years ago

The -M option sets the maximum memory used by a worker process on a Mesos slave. If a worker uses more memory than was offered, the worker process will be killed and the task retried in another offer with more memory (2x as much).

DPark has no memory limit for the client process (the process you run). If you collect large datasets, Python will consume a huge amount of memory for large sets of small objects. It's better to do the follow-up computation with map/reduce on the nodes, or to save the results with saveAsTextFile() and then process them in streaming mode.
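To illustrate the difference, here is a minimal sketch of the two approaches for a word-count-style job. The input path, output path, and the exact chain of calls are illustrative assumptions in the usual DPark/Spark RDD style (textFile, flatMap, map, reduceByKey, collect, saveAsTextFile), not taken from this thread.

    from dpark import DparkContext

    dpark = DparkContext()

    # hypothetical input: one big text file to count words in
    counts = (dpark.textFile('/mfs/input/big.txt')
                   .flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # 1) collect() pulls every (word, count) pair back into the client
    #    process as Python objects, so the client's memory grows with the
    #    size of the result and is not bounded by -M.
    # results = counts.collect()

    # 2) saveAsTextFile() keeps the work on the worker side and writes the
    #    results out (e.g. to MooseFS); the client stays small and the
    #    output files can later be read back line by line in streaming mode.
    counts.saveAsTextFile('/mfs/output/wordcounts')

The key point from davies' reply is that the -M limit applies to the workers, while collect() materializes the whole result in the client process, which has no such limit.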

andr0s commented 11 years ago

Wow davies, thanks for the quick reply! =) Unfortunately I don't have deep knowledge of DPark or its internals. I do collect large results, that's true; the output files are supposed to be huge. So what should I do in this case? By 'computing with map/reduce in the nodes' you meant using Mesos, right? But how do I figure out how many nodes I need? I haven't used Mesos before.

davies commented 11 years ago

There is no short answer for this. You'd better learn more about the details of DPark.

To see how the computation behaves, you could use the multiprocess mode via -m process. It will show you how many tasks there are and how much memory each task uses.
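For example, assuming the job lives in a script (the name wordcount.py here is made up), running it in multiprocess mode would look something like the line below; the -m process and -M flags come from this thread, the rest is illustrative.

    # run the same DPark script with the local multiprocessing scheduler
    # and a 400 MB memory setting per task (script name is hypothetical)
    python wordcount.py -m process -M 400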

windreamer commented 11 years ago

@andr0s If you really want to collect large results, you should use saveAsTextFile() to save them to MooseFS, as @davies said.

andr0s commented 11 years ago

Thank you guys, I'll try =)