USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
411 stars, 143 forks

Add content limit in the default fetcher #78

Status: Closed (karanjeets closed this issue 6 years ago)

karanjeets commented 7 years ago

With no content limit, the default fetcher retrieves the entire content of each resource and holds it in memory. This results in out-of-memory (OOM) errors.
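To illustrate the idea, here is a minimal sketch of a bounded read in Java. It is not Sparkler's actual fetcher code: the class name `BoundedFetch`, the method `readBounded`, and the 10 MB `DEFAULT_CONTENT_LIMIT` are all hypothetical names and values chosen for the example. The point is only that the fetcher stops copying bytes once a cap is reached, so one large response cannot exhaust the heap.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class BoundedFetch {
    // Hypothetical default cap (10 MB); an assumption, not a value taken from Sparkler.
    static final int DEFAULT_CONTENT_LIMIT = 10 * 1024 * 1024;

    // Reads at most `limit` bytes from `in` and returns them.
    // Anything beyond the limit is left unread, so memory use stays bounded.
    static byte[] readBounded(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.min(limit, 8192));
        byte[] buf = new byte[8192];
        int remaining = limit;
        while (remaining > 0) {
            // Never request more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the limit
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body; a real fetcher would pass the connection's stream.
        byte[] body = new byte[100_000];
        byte[] kept = readBounded(new ByteArrayInputStream(body), 4096);
        System.out.println(kept.length);
    }
}
```

In a real fetcher the stream would come from the HTTP connection, and the limit would presumably be read from configuration rather than hard-coded. Whether truncated content should still be parsed or be flagged as incomplete is a separate design decision.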