USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Read large files using Tika #149

Closed voltek62 closed 6 years ago

voltek62 commented 6 years ago

Tika stops extraction at its 100KB limit (100,000 characters, according to the log below), so we might lose content from larger web pages and documents.

Can you increase the limit?


WARN ParseFunction$:75 [Executor task launch worker-0] - Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.

thammegowda commented 6 years ago

@voltek62 Thanks for describing the issue. I recently increased the limit to 100MB. Closing this.
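For readers who hit the same warning, here is a minimal sketch of how a character write limit on a SAX content handler behaves. It is a simplified stand-in written against the JDK only, not Tika's or Sparkler's actual code: the names `WriteLimitSketch`, `LimitedTextHandler`, and `extract` are hypothetical. In Tika itself, the limit is usually passed to the handler constructor, for example `new BodyContentHandler(writeLimit)`, where `-1` disables the limit. The thread does not show the exact change made in Sparkler.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical limit-enforcing handler; illustrative only, not Tika's class. */
    static class LimitedTextHandler extends DefaultHandler {
        private final int writeLimit; // max characters to keep; -1 means unlimited
        private final StringBuilder text = new StringBuilder();

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
                return;
            }
            // Keep the part that still fits, then signal that the limit was hit.
            int remaining = writeLimit - text.length();
            text.append(ch, start, remaining);
            throw new SAXException("Write limit of " + writeLimit + " characters reached");
        }

        String getText() {
            return text.toString();
        }
    }

    /** Feeds the input to the handler in chunks and returns the text that was kept. */
    public static String extract(String input, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chars = input.toCharArray();
        int chunk = 4096; // arbitrary chunk size, chosen for the sketch
        try {
            for (int i = 0; i < chars.length; i += chunk) {
                handler.characters(chars, i, Math.min(chunk, chars.length - i));
            }
        } catch (SAXException e) {
            // The text up to the limit is still available, as the warning says.
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        // Build a 250,000-character document without relying on newer JDK APIs.
        String doc = new String(new char[250_000]).replace('\0', 'x');
        System.out.println(extract(doc, 100_000).length()); // prints 100000
        System.out.println(extract(doc, -1).length());      // prints 250000
    }
}
```

With a limit of 100,000 the extracted text is cut off silently apart from the exception, which matches the warning in the log. A larger limit, or `-1`, keeps the full text but uses more memory per document.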