Found a bug in the Scoobi job planner; discussed it with @etorreborre on IRC last night. When Scoobi joins two DLists of the same type, and one of those DLists was computed from an input that was .join()ed to a zipped DObject, the planner generates a broken job plan.
Expected: 1 MR job: 1 Map stage that processes both inputs, 1 shuffle, 1 reduce.
Actual: 2 MR jobs:
Job 1: a Map over input 1, a shuffle, and a reduce. This job complains about an input channel with no files; the missing path is where the output of job 2 is supposed to be, but job 2 hasn't run yet.
Job 2: a Map-only job over input 2 that produces a secondary input for job 1, yet runs AFTER job 1.
Minimal code that reproduces the issue is here: https://gist.github.com/ivmaykov/5c9b9fc7febc117e3ed8
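For reference, the failing shape looks roughly like the sketch below. This is an outline only: the input paths, the tab-separated parsing, and the use of materialise to build the zipped DObject are my assumptions, not the gist's exact code.

```scala
import com.nicta.scoobi.Scoobi._

object Repro extends ScoobiApp {
  def run() {
    // Hypothetical inputs: tab-separated (key, count) records.
    def keyed(path: String): DList[(String, Int)] =
      fromTextFile(path).map { line =>
        val Array(k, v) = line.split("\t")
        (k, v.toInt)
      }

    val in1  = keyed("in1")
    val raw2 = keyed("in2")

    // Zip two DObjects together...
    val zipped = in1.materialise zip raw2.materialise

    // ...and join the zipped DObject to input 2. The DList derived from
    // this join is what trips up the planner.
    val in2: DList[(String, Int)] = (zipped join raw2).map { case (_, kv) => kv }

    // Joining two DLists of the same type: expected to plan as a single MR
    // job, but actually planned as two jobs scheduled in the wrong order.
    persist((in1 join in2).toTextFile("out"))
  }
}
```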
Verbose output of a local run (Hadoop local mode, not the in-memory mode) is here: https://gist.github.com/ivmaykov/cbdd2524f606feb0b60a