Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

attach default hadoop formats/jobconf to protocols? #808

Open coyotemarin opened 10 years ago

coyotemarin commented 10 years ago

Protocols should be allowed to have HADOOP_*_FORMAT and JOBCONF fields, as well as hadoop_*_format() and jobconf() methods, which supply defaults if something is not already specified for that step. That way, they can be made to do specific useful things on their own (e.g. read binary data from a sequence file).
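A minimal sketch of what such a protocol might look like. The class name, the format class string, and the jobconf key are illustrative assumptions, not mrjob's actual API; the point is just that the protocol carries defaults the runner could pick up when the step doesn't specify its own:

```python
# Hypothetical protocol that supplies Hadoop format/jobconf defaults.
# All names here are illustrative, not part of mrjob's real API.
class SequenceFileValueProtocol(object):
    # applied only if the step doesn't already set these itself
    HADOOP_INPUT_FORMAT = (
        'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat')
    JOBCONF = {
        'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}

    def read(self, line):
        # protocols return a (key, value) pair
        return None, line

    def write(self, key, value):
        return value
```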

For jobconfs, we just combine them, with step-specific jobconf taking priority.
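The jobconf merge is a plain dict combine; a sketch (helper name is made up):

```python
# Combine a protocol's default jobconf with the step's jobconf;
# step-specific keys take priority. Name is illustrative.
def combine_jobconfs(protocol_jobconf, step_jobconf):
    combined = dict(protocol_jobconf or {})
    combined.update(step_jobconf or {})  # step wins on conflicts
    return combined
```

For example, `combine_jobconfs({'a': '1', 'b': '2'}, {'b': '3'})` yields `{'a': '1', 'b': '3'}`.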

With input/output formats, it gets trickier. Say we're looking at the first step in the job. In decreasing order of precedence, it seems like it would make the most sense to pick the input format based on:

(input_protocol is either step.input_protocol, job.input_protocol(), job.INPUT_PROTOCOL or the default, RawValueProtocol)
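That resolution order for the input protocol could be sketched as a first-non-None lookup (function and parameter names are illustrative):

```python
# Resolve the step's input protocol in decreasing order of precedence,
# as described in the parenthetical above. Names are illustrative.
def resolve_input_protocol(step_input_protocol,
                           job_input_protocol,
                           job_INPUT_PROTOCOL,
                           default):
    """Return the first candidate that is set, else the default
    (RawValueProtocol in mrjob)."""
    for candidate in (step_input_protocol,
                      job_input_protocol,
                      job_INPUT_PROTOCOL):
        if candidate is not None:
            return candidate
    return default
```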

What sucks about this is we can't just pick an input format for the step and then combine it with information about the job, because whether it takes precedence over job.HADOOP_INPUT_FORMAT depends on whether it was set explicitly, or derived from the step's input protocol.

Maybe it should look more like this:

This isn't actually as complicated as it looks; we just give first priority to the step definition, and use information from the job to fill in anything that's missing.
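The "step first, job fills in the gaps" idea could be sketched like this, treating the step and the job-level defaults as dicts (the helper and field names are assumptions for illustration):

```python
# Give first priority to the step definition; fall back to job-level
# values (which may themselves come from a protocol's defaults) for
# anything the step leaves unset. Names are illustrative.
def fill_in_step(step, job_defaults):
    filled = dict(job_defaults)
    # only copy fields the step actually sets
    filled.update((k, v) for k, v in step.items() if v is not None)
    return filled
```

So a step that sets only its jobconf would inherit `hadoop_input_format` from the job, while a step that sets both would override the job entirely.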

coyotemarin commented 6 years ago

This is probably not worth the complication; better to use manifests to read arbitrary binary files (see #754).