USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

silly question #187

Closed vwoloszyn closed 3 years ago

vwoloszyn commented 4 years ago

Hi Guys,

Everything is working fine. However, I cannot find the HTML content stored in Solr. What would be the best way to access the HTML content of the crawled web pages?

All the best, Vinicius

ravituduru commented 3 years ago

You can find the fields raw_text and extracted_text in Solr on documents where status:Fetched. I think this is what you are looking for.
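A minimal sketch of the query described above, using Solr's standard `/select` handler from Python. The Solr URL and the core name `crawldb` are assumptions about a local setup, the helper names are hypothetical, and the exact stored status value (written `FETCHED` here) should be checked against your own index.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Solr instance with a core named "crawldb".
SOLR_CORE_URL = "http://localhost:8983/solr/crawldb"


def build_select_url(core_url, status="FETCHED",
                     fields=("id", "extracted_text"), rows=10):
    """Build a Solr /select URL that filters documents by the status field.

    The default status value is an assumption; use whatever your index stores.
    """
    params = {
        "q": f"status:{status}",   # only documents with this status
        "fl": ",".join(fields),    # fields to return for each document
        "rows": rows,              # maximum number of documents
        "wt": "json",              # ask Solr for a JSON response
    }
    return f"{core_url}/select?{urlencode(params)}"


def fetch_docs(core_url=SOLR_CORE_URL, **kwargs):
    """Run the query and return the list of matching documents."""
    with urlopen(build_select_url(core_url, **kwargs)) as resp:
        return json.load(resp)["response"]["docs"]
```

Calling `fetch_docs()` against a running Solr returns a list of dicts, one per crawled page, containing only the fields named in `fields`. Adding another field name to `fields` is a quick way to check whether it is actually stored in the index.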

vwoloszyn commented 3 years ago

Hi @ravituduru, thank you very much. However, it only provides "extracted_text", which is raw text rather than HTML. Is there a way to enable the extraction of HTML as well? Thank you in advance.