crs4 / pydoop

A Python MapReduce and HDFS API for Hadoop
Apache License 2.0

Console output of MR jobs fails to properly update progress #323

Open simleo opened 6 years ago

simleo commented 6 years ago

This is a problem with our mapreduce version of the submitter. The original mapred submitter is unaffected.

The minimal setup is a map-only app that uses the Java record reader and writer:

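import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class Mapper(api.Mapper):
    def map(self, context):
        # Emit the input key unchanged, with the length of the value.
        context.emit(context.key, len(context.value))

# Entry point: pydoop submit calls MODULE.__main__() by default.
def __main__():
    pipes.run_task(pipes.Factory(mapper_class=Mapper))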

Run this with a single mapper on a substantial amount of input (e.g., replicate examples/input/alice_1.txt 1000 times) and monitor the job on the console. With our mapreduce submitter, progress stays stuck at 0%, then jumps to 100% right before the end of the job. With the mapred submitter, progress is updated gradually, as expected.
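
For reference, one way to build the replicated input is sketched below. This is only a local helper, not part of Pydoop: the function name replicate_file and the output file name in the usage note are made up for illustration.

```python
import shutil


def replicate_file(src_path, dst_path, times):
    """Write `times` back-to-back copies of src_path into dst_path.

    Hypothetical helper for building a large test input; it only
    touches the local filesystem.
    """
    with open(dst_path, "wb") as dst:
        for _ in range(times):
            # Reopen the source on each pass and stream it into the
            # destination, so the file is never held in memory whole.
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dst)
    return dst_path
```

For example, replicate_file("examples/input/alice_1.txt", "alice_x1000.txt", 1000) would produce the local file, which then has to be copied to HDFS (e.g., with hdfs dfs -put) before submitting the job.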

Note that this was NOT fixed by https://github.com/crs4/pydoop/pull/322.