Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

attach default hadoop formats/jobconf to protocols? #808

Open coyotemarin opened 10 years ago

coyotemarin commented 10 years ago

Protocols should be allowed to have HADOOP_*_FORMAT and JOBCONF fields, as well as hadoop_*_format() and jobconf() methods, which supply defaults if something is not already specified for that step. That way, they can be made to do specific useful things on their own (e.g. read binary data from a sequence file).
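A minimal sketch of what such a protocol might look like. The class name, the format class string, and the jobconf key are illustrative assumptions, not mrjob's actual API; the point is just that the protocol carries defaults the runner could pick up when the step doesn't specify its own:

```python
# Hypothetical protocol that supplies Hadoop format/jobconf defaults.
# All names here are illustrative, not part of mrjob's real API.
class SequenceFileValueProtocol(object):
    # applied only if the step doesn't already set these itself
    HADOOP_INPUT_FORMAT = (
        'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat')
    JOBCONF = {
        'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}

    def read(self, line):
        # protocols return a (key, value) pair
        return None, line

    def write(self, key, value):
        return value
```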

For jobconfs, we just combine them, with step-specific jobconf taking priority.
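The jobconf merge is a plain dict combine; a sketch (helper name is made up):

```python
# Combine a protocol's default jobconf with the step's jobconf;
# step-specific keys take priority. Name is illustrative.
def combine_jobconfs(protocol_jobconf, step_jobconf):
    combined = dict(protocol_jobconf or {})
    combined.update(step_jobconf or {})  # step wins on conflicts
    return combined
```

For example, `combine_jobconfs({'a': '1', 'b': '2'}, {'b': '3'})` yields `{'a': '1', 'b': '3'}`.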

With input/output formats, it gets trickier. Say we're looking at the first step in the job. In decreasing order of precedence, it seems like it would make the most sense to pick the input format based on:

(input_protocol is either step.input_protocol, job.input_protocol(), job.INPUT_PROTOCOL or the default, RawValueProtocol)
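That resolution order for the input protocol could be sketched as a first-non-None lookup (function and parameter names are illustrative):

```python
# Resolve the step's input protocol in decreasing order of precedence,
# as described in the parenthetical above. Names are illustrative.
def resolve_input_protocol(step_input_protocol,
                           job_input_protocol,
                           job_INPUT_PROTOCOL,
                           default):
    """Return the first candidate that is set, else the default
    (RawValueProtocol in mrjob)."""
    for candidate in (step_input_protocol,
                      job_input_protocol,
                      job_INPUT_PROTOCOL):
        if candidate is not None:
            return candidate
    return default
```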

What sucks about this is we can't just pick an input format for the step and then combine it with information about the job, because whether it takes precedence over job.HADOOP_INPUT_FORMAT depends on whether it was set explicitly, or derived from the step's input protocol.

Maybe it should look more like this:

This isn't actually as complicated as it looks; we just give first priority to the step definition, and use information from the job to fill in anything that's missing.
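The "step first, job fills in the gaps" idea could be sketched like this, treating the step and the job-level defaults as dicts (the helper and field names are assumptions for illustration):

```python
# Give first priority to the step definition; fall back to job-level
# values (which may themselves come from a protocol's defaults) for
# anything the step leaves unset. Names are illustrative.
def fill_in_step(step, job_defaults):
    filled = dict(job_defaults)
    # only copy fields the step actually sets
    filled.update((k, v) for k, v in step.items() if v is not None)
    return filled
```

So a step that sets only its jobconf would inherit `hadoop_input_format` from the job, while a step that sets both would override the job entirely.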

coyotemarin commented 6 years ago

This is probably not worth the complication; better to use manifests to read arbitrary binary files (see #754).