commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

spark-submit stopped suddenly #21

Closed aliebrahiiimi closed 2 years ago

aliebrahiiimi commented 2 years ago

I use the following command to download files from Common Crawl, but after a few hours the process stopped and I received the error below. It may depend on the configuration and parameters of spark-submit. Could you please assist me?

script:

spark-submit --driver-memory 24g --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR --csv ../csvs/CC-MAIN-2018-51   --numRecordsPerWarcFile 10000  --warcPrefix persian-CC  s3://commoncrawl/cc-index/table/cc-main/warc/ ../data/CC-MAIN-2018-51/ > ../log.out 2>&1 &

logs:

22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376827137.61/warc/CC-MAIN-20181215222234-20181216004234-00575.warc.gz 13132840 8337 for http://amvaj-e-bartar.ir/news/detail/val/6724/%D8%B9%D9%85%D9%84%DA%A9%D8%B1%D8%AF%20%DB%B6.%DB%B1%20%D9%85%DB%8C%D9%84%DB%8C%D8%A7%D8%B1%D8%AF%20%D8%AF%D9%84%D8%A7%D8%B1%DB%8C%20%D8%B5%D9%86%D8%B9%D8%AA%20%D8%A2%D8%A8%20%D9%88%20%D8%A8%D8%B1%D9%82%20%D8%AF%D8%B1%20%D8%AE%D8%A7%D8%B1%D8%AC%20%D8%A7%D8%B2%20%DA%A9%D8%B4%D9%88%D8%B1.html
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376826842.56/warc/CC-MAIN-20181215083318-20181215105318-00200.warc.gz 411716885 7390 for http://www.jenabmusic.com/tag/%D8%A2%D9%87%D9%86%DA%AF-%D9%85%D8%AD%D9%85%D8%AF-%D9%86%D8%AC%D9%81%DB%8C
22/05/07 11:00:33 INFO SparkContext: Invoking stop() from shutdown hook
22/05/07 11:00:33 INFO SparkUI: Stopped Spark web UI at http://server.domain.com:4040
22/05/07 11:00:33 INFO DAGScheduler: ResultStage 7 (runJob at SparkHadoopWriter.scala:83) failed in 15440.843 s due to Stage cancelled because SparkContext was shut down
22/05/07 11:00:33 INFO DAGScheduler: Job 5 failed: runJob at SparkHadoopWriter.scala:83, took 15440.906693 s
22/05/07 11:00:33 ERROR SparkHadoopWriter: Aborting job job_202205070643121922168778011744804_0031.
org.apache.spark.SparkException: Job 5 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1166)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1164)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1164)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2666)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2566)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
    at org.apache.spark.SparkContext.$anonfun$new$38(SparkContext.scala:667)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:83)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopDataset$1(PairRDDFunctions.scala:1077)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1075)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$2(PairRDDFunctions.scala:994)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
    at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:825)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:199)
    at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:195)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:208)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:106)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopDataset$1(PairRDDFunctions.scala:1077)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1075)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$2(PairRDDFunctions.scala:994)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
    at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:825)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:199)
    at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:195)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:208)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job 5 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1166)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1164)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1164)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2666)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2566)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
    at org.apache.spark.SparkContext.$anonfun$new$38(SparkContext.scala:667)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:83)
    ... 28 more
22/05/07 11:00:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376829140.81/warc/CC-MAIN-20181218102019-20181218124019-00288.warc.gz 466742218 12898 for http://www.tarabarnews.com/view/87815/%D9%86%D9%85%D8%A7%DB%8C%D9%86%D8%AF%DA%AF%D8%A7%D9%86-%D9%85%D8%AC%D9%84%D8%B3-%D9%81%D8%B1%D8%AF%DB%8C-%D9%82%D9%88%DB%8C%E2%80%8C%D8%AA%D8%B1-%D8%A7%D8%B2-%D8%A2%D8%AE%D9%88%D9%86%D8%AF%DB%8C-%D8%A8%D8%B1%D8%A7%DB%8C-%D9%88%D8%B2%D8%A7%D8%B1%D8%AA-%D8%B1%D8%A7%D9%87-%D9%88-%D8%B4%D9%87%D8%B1%D8%B3%D8%A7%D8%B2%DB%8C-%D8%B3%D8%B1%D8%A7%D8%BA-%D9%86%D8%AF%D8%A7%D8%B1%D9%86%D8%AF-%D8%AA%D8%A7%DA%A9%DB%8C%D8%AF-%D8%A8%D8%B1-%D8%AD%D9%85%D8%A7%DB%8C%D8%AA-%D9%82%D8%A7%D8%B7%D8%B9-%D8%A7%D8%B2-%D8%A2%D8%AE%D9%88%D9%86%D8%AF%DB%8C-%D8%A8%D8%B1%D8%A7%DB%8C-%D8%A7%D9%86%D8%AC%D8%A7%D9%85-%D8%A7%D9%87%D8%AF%D8%A7%D9%81-%D9%88-%D8%A8%D8%B1%D9%86%D8%A7%D9%85%D9%87%E2%80%8C%D9%87%D8%A7
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376828697.80/warc/CC-MAIN-20181217161704-20181217183704-00220.warc.gz 416114387 15552 for http://www.licamall.com/tags/%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD+%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD+%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD+%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD+%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD+GTF3027GBX
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376825363.58/warc/CC-MAIN-20181214044833-20181214070333-00379.warc.gz 801554484 29772 for https://www.aparat.com/v/jUGBL/%D9%84%DB%8C%D8%A7%D9%85_%D9%84%D8%A7%DB%8C%D9%81_3_-_%D8%A2%D9%85%D9%88%D8%B2%D8%B4_%D9%85%D8%AF%D9%84_%D9%87%D8%A7%DB%8C_%D8%A8%D8%B3%D8%AA%D9%86_%D8%B4%D8%A7%D9%84
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376823657.20/warc/CC-MAIN-20181211151237-20181211172737-00471.warc.gz 626970615 5027 for https://golbarhar.persianblog.ir/7qZpoZMnXjFll98ejwaM-%D8%A8%D9%87-%D8%AE%D8%A7%DA%A9-%D8%B3%D9%BE%D8%B1%D8%AF
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376823872.13/warc/CC-MAIN-20181212112626-20181212134126-00254.warc.gz 630430449 3559 for https://hairextensions.persianblog.ir/tag/%D8%A8%D9%87%D8%AA%D8%B1%DB%8C%D9%86_%D9%85%D8%B1%DA%A9%D8%B2_%D8%A2%D9%85%D9%88%D8%B2%D8%B4_%DA%A9%D8%A7%D8%B4%D8%AA_%D9%86%D8%A7%D8%AE%D9%86
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376823565.27/warc/CC-MAIN-20181211040413-20181211061913-00319.warc.gz 904908127 16662 for https://www.hikvision-cctv.com/%D8%AF%D8%B1%D8%A8%D8%A7%D8%B1%D9%87-%D9%BE%D8%B1%D8%B4%DB%8C%D8%A7%D8%B3%DB%8C%D8%B3%D8%AA%D9%85/
22/05/07 11:00:33 INFO MemoryStore: MemoryStore cleared
22/05/07 11:00:33 INFO BlockManager: BlockManager stopped
22/05/07 11:00:33 INFO BlockManagerMaster: BlockManagerMaster stopped
22/05/07 11:00:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/05/07 11:00:33 INFO SparkContext: Successfully stopped SparkContext
22/05/07 11:00:33 INFO ShutdownHookManager: Shutdown hook called
22/05/07 11:00:33 INFO CCIndexExport: Fetching WARC record crawl-data/CC-MAIN-2018-51/segments/1544376823738.9/warc/CC-MAIN-20181212044022-20181212065522-00145.warc.gz 296813532 7780 for http://wikipg.com/tag/%D8%A2%D8%A8-%DA%AF%D8%B1%D9%85-%D8%AE%D8%A7%D9%86%DA%AF%DB%8C/
22/05/07 11:00:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-29d09e64-59a8-4ec9-a825-a1b460ad8a32
22/05/07 11:00:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-b98232d1-9c18-4986-be3e-eebd4284a405
sebastian-nagel commented 2 years ago

Hi @aliebrahiiimi, I cannot find any possible reason in the stack trace for why the SparkContext was shut down.

after a few hours the process stopped

As a general recommendation, I'd split the input into multiple parts so that every part finishes in a shorter time span (30-60 minutes). Re-running a small failed job is then not a big issue.
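For illustration only, a rough sketch of how the input could be split and processed part by part. The file names, the part size, and the assumption that the exported index rows sit in a single CSV file with a header line are placeholders, not part of the original command; it also assumes --csv accepts a single file, since Spark's CSV reader handles both files and directories:

    # keep the header line so every part is a valid CSV on its own
    mkdir -p csv-parts
    head -n 1 all-rows.csv > header.csv
    # split the remaining rows into chunks of 100,000 lines each
    tail -n +2 all-rows.csv | split -l 100000 - rows-part-
    for p in rows-part-*; do
      cat header.csv "$p" > "csv-parts/$p.csv"
      rm "$p"
    done
    # run one export job per part; each job writes its own output directory and log
    for f in csv-parts/*.csv; do
      name=$(basename "$f" .csv)
      spark-submit --driver-memory 24g \
        --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
        --csv "$f" \
        --numRecordsPerWarcFile 10000 \
        --warcPrefix persian-CC \
        s3://commoncrawl/cc-index/table/cc-main/warc/ \
        "../data/CC-MAIN-2018-51/$name/" \
        > "../log-$name.out" 2>&1
    done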

aliebrahiiimi commented 2 years ago

Hi @sebastian-nagel, thanks for your response. I will implement your suggestion. However, do you think it has something to do with Spark's memory usage? Do --driver-memory or --executor-memory need to be configured?

sebastian-nagel commented 2 years ago

If you run Spark in local mode (without a cluster), setting the driver memory should be sufficient because the executors run in the same JVM instance. Running the job on a cluster also requires configuring the executor memory, see spark-submit.
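For example, a cluster submission could look roughly like the sketch below. The master URL, memory sizes, executor counts, and the HDFS paths are placeholders to be adapted to your cluster; they are not values from this thread:

    # sketch only: tune memory and executor settings to your cluster and data size
    spark-submit \
      --master yarn \
      --driver-memory 8g \
      --executor-memory 16g \
      --num-executors 4 \
      --executor-cores 4 \
      --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
      --csv hdfs:///user/you/csvs/CC-MAIN-2018-51 \
      --numRecordsPerWarcFile 10000 \
      --warcPrefix persian-CC \
      s3://commoncrawl/cc-index/table/cc-main/warc/ \
      hdfs:///user/you/data/CC-MAIN-2018-51/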

sebastian-nagel commented 2 years ago

Closing this issue. @aliebrahiiimi: if there are any more questions, feel free to reopen or ask for help on the Common Crawl user forum. Thanks!