commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Removing System.exit() calls as they interfere with Spark execution #8

Closed athulj closed 4 years ago

athulj commented 4 years ago

I built the Spark app and tried to run it on AWS EMR. However, I encountered a strange error where all the WARC files were generated twice.

Upon investigating, I found that Spark does not play nice with System.exit() and will mark the job as FAILED after retrying it once.

https://github.com/apache/spark/blob/a829234df35c87c169425f2c79fd1963b5420888/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L238

Thus, it is better to remove the System.exit() call at the end.
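For illustration, here is a minimal sketch of the pattern that triggers the problem; the class and method names are hypothetical and not taken from the actual cc-index-table code:

```java
import org.apache.spark.sql.SparkSession;

public class ExportJob {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ExportJob")
                .getOrCreate();

        int status = runExport(spark, args); // hypothetical job logic

        // Ending the driver with System.exit() bypasses Spark's normal
        // shutdown path. Per the linked ApplicationMaster code, the YARN
        // attempt is then marked FAILED and retried once, so the output
        // (here, the WARC files) ends up being written twice.
        System.exit(status);
    }

    private static int runExport(SparkSession spark, String[] args) {
        // ... actual export logic would go here ...
        return 0;
    }
}
```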

sebastian-nagel commented 4 years ago

Thanks, @athulj! Well, after a deeper look I think the core of the problem is that sparkSession.stop() is never called.

If we remove the System.exit(success), we would have to make sure that all errors are signaled via exceptions. Spark uses exceptions exclusively, but the option parsing code relies on return values. In my experience, 50% of errors are simple configuration mistakes, and those must also be caught somehow, e.g. when running a series of jobs from a shell script or a workflow framework.

To address the Spark session not being stopped, see 3a71762. I'll test this change by tomorrow. I'll also think about handling errors during option parsing differently, to make the System.exit() obsolete.
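A minimal sketch of the shape this fix takes, as described above; this is not the actual commit 3a71762, and the names are illustrative:

```java
import org.apache.spark.sql.SparkSession;

public class ExportJob {

    public static void main(String[] args) {
        // Option parsing happens before the Spark session exists, so a
        // usage error can still exit with a non-zero status, which keeps
        // shell scripts and workflow frameworks able to detect failures.
        if (!parseOptions(args)) {
            System.exit(1);
        }

        SparkSession spark = SparkSession.builder()
                .appName("ExportJob")
                .getOrCreate();
        try {
            // Runtime errors are signaled via exceptions so Spark itself
            // can mark the job as failed.
            runExport(spark, args);
        } finally {
            // Always shut the session down; per the analysis above, this
            // call was the missing piece.
            spark.stop();
        }
    }

    private static boolean parseOptions(String[] args) {
        // ... validate command-line options, return false on usage errors ...
        return args.length > 0;
    }

    private static void runExport(SparkSession spark, String[] args) {
        // ... actual export logic; throw an exception on failure ...
    }
}
```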

sebastian-nagel commented 4 years ago

Hi @athulj, I've successfully tested both CCIndexExport and CCIndexWarcExport using Spark on YARN. The potential errors should be fixed now: the Spark session is shut down properly, and System.exit(...) is only called when there are errors in the command-line options. Thanks for your contribution!

athulj commented 4 years ago

@sebastian-nagel Thanks for taking the time to fix this issue! Much appreciated.