Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

Spark harness is not populating counters when counter-output-dir is not an S3 path #2176

Closed: 88manpreet closed this issue 4 years ago

88manpreet commented 4 years ago

Hadoop counters, an integral feature of Hadoop MapReduce, provide a way to measure progress or count operations that occur within a map/reduce job. The Spark harness runs a regular Hadoop streaming job on Spark via the Spark runner, and it emulates the counters feature so the same job runs on Spark without any modifications. The harness script stores the computed counter values under the given counter_output_dir (--spark-tmp-dir) using Spark's saveAsTextFile API, and the Spark runner then reads those values back to report them to the application user.

This logic works well when the path for --spark-tmp-dir is an S3 path. With a regular local file path (the default), however, saveAsTextFile creates the counter files (part-*) on the Spark executors' local filesystems, not on the driver's, so the driver cannot read them unless the executors happen to run on the same host as the driver.
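A minimal PySpark sketch of the failure mode described above (this is not mrjob's actual harness code; the RDD contents and the `counter_output_dir` value are illustrative, with the name taken from the issue text):

```python
from pyspark import SparkContext

sc = SparkContext(appName="counter-demo")

# Emulate Hadoop counters: count operations per (group, name) on the executors.
events = sc.parallelize([
    ("mapper", "lines_read"),
    ("mapper", "lines_read"),
    ("reducer", "keys_seen"),
])
counters = events.map(lambda kv: (kv, 1)).reduceByKey(lambda a, b: a + b)

# saveAsTextFile writes part-* files wherever each *executor* runs. With an
# S3 path (s3://...) every executor writes to shared storage, so the driver
# can read the counters back. With a local path like the one below, the
# files land on the executors' local disks, which the driver cannot see
# unless it runs on the same host.
counter_output_dir = "file:///tmp/counter-output"  # problematic on a cluster
counters.saveAsTextFile(counter_output_dir)

sc.stop()
```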

coyotemarin commented 4 years ago

Fixed by #2177.