USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Failed to create thread #227

Open keiranFTW opened 3 years ago

keiranFTW commented 3 years ago

Issue Description

Please describe your issue, along with:

The crawler crashes unexpectedly after a while, claiming that resource limits have been reached.

How to reproduce it

If you are describing a bug, please describe here how to reproduce it.

Seed the crawler with 10,000 unique URLs and crawl using the default fetcher; you will be greeted with the following:

2021-04-15 13:45:06 INFO FairFetcher$:71 - Adding doc to SOLR
[15128.721s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
2021-04-15 13:45:06 WARN BlockManager:69 - Block rdd_25_0 could not be removed as it was not found on disk or in memory
2021-04-15 13:45:06 ERROR Executor:94 - Exception in task 0.0 in stage 15.0 (TID 11)
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.lang.Thread.start0(Native Method) ~[?:?]
    at java.lang.Thread.start(Thread.java:799) ~[?:?]
    at shaded.org.apache.http.impl.client.IdleConnectionEvictor.start(IdleConnectionEvictor.java:96) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at shaded.org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:1227) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:319) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:330) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:268) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:255) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.solr.client.solrj.impl.HttpSolrClient.<init>(HttpSolrClient.java:204) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.build(HttpSolrClient.java:952) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at edu.usc.irds.sparkler.storage.solr.SolrProxy.newClient(SolrProxy.scala:45) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at edu.usc.irds.sparkler.storage.solr.SolrProxy.<init>(SolrProxy.scala:78) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at edu.usc.irds.sparkler.storage.StorageProxyFactory.getProxy(StorageProxyFactory.scala:33) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at edu.usc.irds.sparkler.model.SparklerJob.newStorageProxy(SparklerJob.scala:54) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:72) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:29) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:311) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.scheduler.Task.run(Task.scala:127) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) [sparkler-app-0.2.2-SNAPSHOT.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:830) [?:?]
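The `pthread_create failed (EAGAIN)` warning means the JVM could not get another native thread from the OS, so the useful metric to watch is the crawler's native thread count, not its heap. A minimal sketch for observing it (Linux `/proc` only; the `pgrep` pattern below is an assumption and may need adjusting to however you launch Sparkler):

```shell
#!/bin/sh
# count_threads: print the native thread count of a process via Linux /proc.
# Each entry under /proc/<pid>/task is one native thread of that process.
count_threads() {
    ls "/proc/$1/task" | wc -l
}

# Example: the thread count of this shell itself.
count_threads "$$"
```

To watch the crawler, something like `while true; do count_threads "$(pgrep -f sparkler-app | head -n 1)"; sleep 10; done` should show the count climbing steadily until the error appears, if a thread leak is the cause.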

Environment and Version Information

Please indicate relevant versions, if applicable:

External links for reference

If you think any other resources on the internet will be helpful to understand and/or resolve this issue, please share them here.

Contributing

If you'd like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it!

I have raised the maximum number of processes to unlimited. Checking the system while the crawl was in progress, there were 27,302 processes, 26,540 of which belonged to sparkler. This looks like a thread leak somewhere.
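The stack trace itself suggests where the leak could be: every `FairFetcher.next` call builds a new `SolrProxy`, and `HttpClientBuilder.build` starts an `IdleConnectionEvictor` thread for each new client. If those clients are never closed, each fetched document would leave one native thread behind, which would be consistent with ~26,500 lingering sparkler threads. Separately, note that `ulimit -u` is not the only cap that can make `pthread_create` fail with `EAGAIN`; the checks below cover the other common limits (standard Linux paths; the cgroup path is an assumption and varies by distro and cgroup version):

```shell
#!/bin/sh
# Limits that can cause pthread_create to fail with EAGAIN
# even when `ulimit -u` is unlimited.
ulimit -u                               # per-user process/thread limit
cat /proc/sys/kernel/threads-max        # system-wide thread cap
cat /proc/sys/kernel/pid_max            # size of the PID space
# cgroup v2 pid limit, if the crawler runs in a container or slice
# (path is an assumption; absent on cgroup v1 hosts):
cat /sys/fs/cgroup/pids.max 2>/dev/null || true
```

If all of these are comfortably high and the thread count still climbs past tens of thousands, that points back at unclosed clients rather than a misconfigured host.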