Protocols should be allowed to have HADOOP_*_FORMAT and JOBCONF fields, as well as hadoop_*_format() and jobconf() methods, which supply defaults if something is not already specified for that step. That way, they can be made to do specific useful things on their own (e.g. read binary data from a sequence file).
For jobconfs, we just combine them, with step-specific jobconf taking priority.
With input/output formats, it gets trickier. Say we're looking at the first step in the job. In decreasing level of precedence, it seems like it would make the most sense to pick input format based on:
step.hadoop_input_format
job.hadoop_input_format()
job.HADOOP_INPUT_FORMAT
input_protocol.hadoop_input_format()
input_protocol.HADOOP_INPUT_FORMAT
the default (None)
(input_protocol is either step.input_protocol, job.input_protocol(), job.INPUT_PROTOCOL or the default, RawValueProtocol)
What sucks about this is we can't just pick an input format for the step and then combine it with information about the job, because whether it takes precedence over job.HADOOP_INPUT_FORMAT depends on whether it was set explicitly, or derived from the step's input protocol.
Maybe it should look more like this:
step.hadoop_input_format
step.input_protocol.hadoop_input_format()
step.input_protocol.HADOOP_INPUT_FORMAT
job.hadoop_input_format()
job.HADOOP_INPUT_FORMAT
job.input_protocol().hadoop_input_format()
job.input_protocol().HADOOP_INPUT_FORMAT
job.INPUT_PROTOCOL.hadoop_input_format()
job INPUT_PROTOCOL.HADOOP_INPUT_FORMAT
the default (None)
This isn't actually as complicated as it looks; we just give first priority to the step definition, and use information from the job to fill in anything that's missing.
Protocols should be allowed to have
HADOOP_*_FORMAT
andJOBCONF
fields, as well ashadoop_*_format()
andjobconf()
methods, which supply defaults if something is not already specified for that step. That way, they can be made to do specific useful things on their own (e.g. read binary data from a sequence file).For jobconfs, we just combine them, with step-specific jobconf taking priority.
With input/output formats, it gets trickier. Say we're looking at the first step in the job. In decreasing level of precedence, it seems like it would make the most sense to pick input format based on:
step.hadoop_input_format
job.hadoop_input_format()
job.HADOOP_INPUT_FORMAT
input_protocol.hadoop_input_format()
input_protocol.HADOOP_INPUT_FORMAT
None
)(
input_protocol
is eitherstep.input_protocol
,job.input_protocol()
,job.INPUT_PROTOCOL
or the default,RawValueProtocol
)What sucks about this is we can't just pick an input format for the step and then combine it with information about the job, because whether it takes precedence over
job.HADOOP_INPUT_FORMAT
depends on whether it was set explicitly, or derived from the step's input protocol.Maybe it should look more like this:
step.hadoop_input_format
step.input_protocol.hadoop_input_format()
step.input_protocol.HADOOP_INPUT_FORMAT
job.hadoop_input_format()
job.HADOOP_INPUT_FORMAT
job.input_protocol().hadoop_input_format()
job.input_protocol().HADOOP_INPUT_FORMAT
job.INPUT_PROTOCOL.hadoop_input_format()
job INPUT_PROTOCOL.HADOOP_INPUT_FORMAT
None
)This isn't actually as complicated as it looks; we just give first priority to the step definition, and use information from the job to fill in anything that's missing.