USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Integrate Tika Parser to Parse Function for extracting Text and metadata #10

Closed thammegowda closed 7 years ago

thammegowda commented 8 years ago

The current parse function only extracts outlinks.

thammegowda commented 8 years ago

TODO: Parse should also produce a few metadata fields and the plain-text content, along with the links.
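
As a rough illustration of the shape this TODO describes, here is a minimal sketch of a parse result that carries plain text and metadata alongside outlinks. Everything in it is an assumption made for illustration: `ParseSketch`, `ParsedData`, and `parse` are hypothetical names, not Sparkler's actual API, and the regex-based extraction is only a naive stand-in for the Tika parser this issue proposes, which would handle far more formats and malformed markup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout; this is not Sparkler's actual API.
final class ParseSketch {

    /** Everything one parse call would return: links plus text and metadata. */
    static final class ParsedData {
        final List<String> outlinks;
        final String plainText;
        final Map<String, String> metadata;

        ParsedData(List<String> outlinks, String plainText, Map<String, String> metadata) {
            this.outlinks = outlinks;
            this.plainText = plainText;
            this.metadata = metadata;
        }
    }

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Naive stand-in for a real parser such as Tika; handles simple HTML only. */
    static ParsedData parse(String html) {
        // Outlinks: what the parse function already produces today.
        List<String> outlinks = new ArrayList<>();
        Matcher links = HREF.matcher(html);
        while (links.find()) {
            outlinks.add(links.group(1));
        }

        // Metadata: only the title here; a real parser would expose many more fields.
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher title = TITLE.matcher(html);
        if (title.find()) {
            metadata.put("title", title.group(1).trim());
        }

        // Plain text: drop the head section, strip tags, then collapse whitespace.
        String plainText = html
                .replaceAll("(?is)<head.*?</head>", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();

        return new ParsedData(outlinks, plainText, metadata);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example</title></head>"
                + "<body><p>Hello <a href=\"http://example.com/a\">world</a></p></body></html>";
        ParsedData data = parse(html);
        System.out.println(data.outlinks);
        System.out.println(data.metadata);
        System.out.println(data.plainText);
    }
}
```

Returning all three pieces from a single call would let the text and title be indexed without fetching or parsing the page a second time; swapping the regex stand-in for a real parser should not need to change the result type.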

MuhammadTalhaAfzal commented 8 years ago

Has anybody started implementing a plugin for plain-text content and title indexing in Solr?

karanjeets commented 8 years ago

@MuhammadTalhaAfzal,

I have yet to start on this and will probably do it over the weekend. Please let me know if you would like to collaborate.