cleanzr / dblink

Distributed Bayesian Entity Resolution in Apache Spark

Path for Spark checkpoints #2

Open YathishK opened 5 years ago

YathishK commented 5 years ago

When running in YARN mode, I get the following warning message:

WARN SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp/spark_checkpoint/' appears to be on the local filesystem.

ngmarchant commented 5 years ago

It looks like you're using one of the example config files to submit a job with spark-submit. The examples assume you're running Spark locally, so the key checkpointPath is set to /tmp/spark_checkpoint/. If you're running Spark in cluster mode, you should instead set checkpointPath to a location on HDFS, for example hdfs:///my-project-name/checkpoints/.

You should also ensure that the output (MCMC samples, saved state, etc.) is saved to HDFS when running in cluster mode. To do this, you'll need to change the outputPath setting to an HDFS URI.
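
As a rough sketch, the two settings might look like this in the config file once both point at HDFS. The key names checkpointPath and outputPath are the ones mentioned above; the enclosing dblink block is assumed to match the layout of the example config files, and the results path is a hypothetical placeholder, so adjust both paths to locations that exist on your cluster.

```hocon
dblink : {
  // ... other settings from the example config left unchanged ...

  // Hypothetical HDFS location for MCMC samples and saved state
  outputPath : "hdfs:///my-project-name/results/"

  // Must not be on the local filesystem when running in cluster mode
  checkpointPath : "hdfs:///my-project-name/checkpoints/"
}
```

With both paths on HDFS, the SparkContext warning about the local checkpoint directory should no longer appear.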

Incidentally, we should probably make checkpointPath an optional setting so that it falls back to the default if not specified.