commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

PageRank & other jobs: check if output directory already exists #62

Open sylvinus opened 7 years ago

sylvinus commented 7 years ago

Checking for the output directory before the job starts would avoid errors that only surface late in the run, like this one:

Traceback (most recent call last):
  File "/cosr/back/spark/jobs/pagerank.py", line 459, in <module>
    job.run()
  File "/cosr/back/cosrlib/spark.py", line 207, in run
    self.run_job(sc, sqlc)
  File "/cosr/back/spark/jobs/pagerank.py", line 75, in run_job
    self.custom_pagerank(sc, sqlc)
  File "/cosr/back/spark/jobs/pagerank.py", line 289, in custom_pagerank
    compression="gzip" if self.args.gzip else "none"
  File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 632, in text
  File "/usr/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'path file:/cosr/back/out/pagerank already exists.;'

Reported by @HenriqueLimas.
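A minimal sketch of the kind of up-front check this issue asks for, not the project's actual implementation. It assumes the output location is on the local filesystem, as the `file:/cosr/back/out/pagerank` path in the traceback suggests. The names `ensure_output_path_is_free` and `OutputPathExistsError` are hypothetical. Remote schemes such as `hdfs://` or `s3a://` would need a filesystem-specific check and are rejected here.

```python
import os
from urllib.parse import urlparse


class OutputPathExistsError(Exception):
    """Raised when a job's output location is already present (hypothetical)."""


def ensure_output_path_is_free(path):
    """Fail fast if `path` already exists on the local filesystem.

    Hypothetical helper: accepts plain paths or file: URIs only.
    Returns the local path so the caller can pass it on to the writer.
    """
    parsed = urlparse(path)
    if parsed.scheme not in ("", "file"):
        # Assumption: remote filesystems are out of scope for this sketch.
        raise ValueError("unsupported scheme: %r" % parsed.scheme)

    # Strip the file: prefix so os.path can inspect the location.
    local_path = parsed.path if parsed.scheme == "file" else path

    if os.path.exists(local_path):
        raise OutputPathExistsError(
            "output path already exists: %s" % local_path
        )
    return local_path
```

Calling such a helper at the very start of `run_job`, before any Spark stage runs, would turn the late `AnalysisException` into an immediate failure. Another option is to set `mode("overwrite")` on the DataFrame writer, but that silently replaces earlier output, which may not be what a long-running job wants.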