Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

Number of mappers on a single-node cluster #905

Closed andr0s closed 9 years ago

andr0s commented 10 years ago

I'm using Cloudera Manager 5 with YARN 2 on a powerful DigitalOcean instance with 12 cores. This is a simple single-node cluster, but it should run reasonably fast because it has plenty of RAM and CPU cores, plus an SSD.

I'm executing the job like this: python script.py -r hadoop --jobconf mapred.map.tasks=10 input_file.gz > output_file.

After the source file (a big gzip archive created with cat "somefile" | gzip -c > input_file.gz) has been pushed into HDFS, I see that the server load is still really low and the cores are doing nothing. In htop I see only one Python process running my script with the --step-num 0 --mapper arguments.

Isn't that extremely inefficient? How can I finally get many mappers running on a single node?
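
For context, script.py itself isn't shown anywhere in this thread. A hypothetical single-step mrjob job of the same shape (a word-count sketch, purely for illustration) would look something like this:

```python
# Hypothetical stand-in for script.py (the real script is not shown in this
# thread) -- a minimal single-step MRJob, just to illustrate the kind of job
# being launched above.
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit one count per word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the per-word counts emitted by the mappers.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```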

andr0s commented 10 years ago

Update: I have tried a plain-text source file instead of a gzipped one. Same thing - only one mapper running =(

tarnfeld commented 10 years ago

This is because the gzip compression format is not splittable. Unfortunately, the format does not let you seek to an arbitrary byte offset and start reading/decompressing from there, which is exactly what Hadoop needs in order to split a file across mappers.

If you want to use a compression format that does support splitting, check out Snappy or LZO. If you use LZO you'll need to use the custom input format provided by that library, as well as index the files so Hadoop knows where it's able to split.
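
To sketch what this looks like in mrjob, a job can point Hadoop at hadoop-lzo's splittable input format via the HADOOP_INPUT_FORMAT attribute. This assumes the hadoop-lzo library is installed on the cluster and that the input files have been indexed with its indexer tools (LzoIndexer / DistributedLzoIndexer); the input-format class name below comes from hadoop-lzo and may differ between versions.

```python
# Sketch: read LZO-compressed input with hadoop-lzo's splittable input format.
# Assumes hadoop-lzo is installed on the cluster and the .lzo files have been
# indexed; without an index Hadoop still treats each file as a single split.
from mrjob.job import MRJob


class MRWordCountLzo(MRJob):

    # Tell Hadoop to read the input with the LZO-aware input format so it can
    # split the compressed files at the indexed block boundaries.
    HADOOP_INPUT_FORMAT = 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCountLzo.run()
```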

coyotemarin commented 9 years ago

Huh, how big is your file? Hadoop may have decided it's not worth the trouble to split it, and it'd rather leave those mappers/reducers ready for the next job.

I believe there's a jobconf that allows you to reduce the split size.
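
As a hedged sketch of that: on Hadoop 2 / YARN the property is typically mapreduce.input.fileinputformat.split.maxsize (older releases call it mapred.max.split.size), and it can be set for a single job through mrjob's JOBCONF attribute or, equivalently, with --jobconf on the command line.

```python
# Sketch: cap the input split size so a single large *splittable* file is
# broken into more splits, and therefore more map tasks. The property name is
# the Hadoop 2 / YARN one; older clusters use mapred.max.split.size instead.
from mrjob.job import MRJob


class MRWordCountSmallSplits(MRJob):

    JOBCONF = {
        # ~32 MB per split (value is in bytes); tune to taste.
        'mapreduce.input.fileinputformat.split.maxsize': '33554432',
    }

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCountSmallSplits.run()
```

Note that this only helps once the input is actually splittable; a single gzip file will still be handled by one mapper regardless of the split size.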