Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

Spark harness is not populating counters when counter-output-dir is not an S3 path #2176

Closed: 88manpreet closed this issue 4 years ago

88manpreet commented 4 years ago

Hadoop counters, an integral feature of Hadoop MapReduce, provide a way to measure progress or count operations that occur within a map/reduce job. The Spark harness runs a regular Hadoop streaming job on Spark via the Spark runner, and it emulates the counters feature so the same job runs on Spark without any modifications. The harness script stores the computed counter values under the given counter_output_dir (--spark-tmp-dir) using Spark's saveAsTextFile API, and the Spark runner then reads those values back to report them to the application user.

This logic works well when the path for --spark-tmp-dir is an S3 path. With a regular local file path (the default), however, saveAsTextFile creates the counter files (part-*) on the Spark executors' local filesystems, not on the driver's, so the driver cannot read them unless the executors happen to run on the same host as the driver.
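A minimal PySpark sketch of the failure mode described above (this is not mrjob's actual harness code; the RDD contents and the `counter_output_dir` value are illustrative, with the name taken from the issue text):

```python
from pyspark import SparkContext

sc = SparkContext(appName="counter-demo")

# Emulate Hadoop counters: count operations per (group, name) on the executors.
events = sc.parallelize([
    ("mapper", "lines_read"),
    ("mapper", "lines_read"),
    ("reducer", "keys_seen"),
])
counters = events.map(lambda kv: (kv, 1)).reduceByKey(lambda a, b: a + b)

# saveAsTextFile writes part-* files wherever each *executor* runs. With an
# S3 path (s3://...) every executor writes to shared storage, so the driver
# can read the counters back. With a local path like the one below, the
# files land on the executors' local disks, which the driver cannot see
# unless it runs on the same host.
counter_output_dir = "file:///tmp/counter-output"  # problematic on a cluster
counters.saveAsTextFile(counter_output_dir)

sc.stop()
```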

coyotemarin commented 4 years ago

Fixed by #2177.