USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
410 stars 143 forks

Extract fields using XPath or CSS selectors and map them to Solr fields? #104

Open mzeidhassan opened 7 years ago

mzeidhassan commented 7 years ago

First, thanks for the project. It sounds great. I am wondering whether it is possible to extract particular text items and images from web pages and map these extracted fields to Solr fields right away. That way, you could build a front-end application that fetches results from Solr and displays these fields to end users.

Is this possible? If not, can you add it to your enhancement list?

Thanks, Mohamed

karanjeets commented 7 years ago

@mzeidhassan We are glad that you liked the project and thank you for your suggestion. I envision it as a CSS/XPath extractor added to Sparkler plugins. We can definitely add it as an enhancement.

Please let us know if you are interested in working on this, and we can discuss the implementation further.
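
To make the idea concrete, here is a minimal sketch of a rule-driven extractor that maps XPath matches to Solr-style field names. It is not Sparkler's plugin API: the `FIELD_RULES` mapping, the `extract_fields` helper, the field names (`title_t`, `price_s`, `image_urls_ss`), and the sample page are all hypothetical. It also assumes well-formed XHTML, because Python's `xml.etree.ElementTree` supports only a limited XPath subset and does not parse arbitrary HTML; real pages would need a proper HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: Solr field name -> (XPath expression, attribute name).
# An attribute of None means "take the element's text content".
FIELD_RULES = {
    "title_t": (".//h1", None),
    "price_s": (".//span[@class='price']", None),
    "image_urls_ss": (".//img", "src"),
}


def extract_fields(xhtml, rules):
    """Apply each rule to a well-formed XHTML string and return a dict
    of field name -> list of extracted values (empty matches are skipped)."""
    root = ET.fromstring(xhtml)
    doc = {}
    for field, (xpath, attr) in rules.items():
        values = []
        for node in root.findall(xpath):
            if attr:
                value = node.get(attr)
            else:
                value = "".join(node.itertext()).strip()
            if value:
                values.append(value)
        if values:
            doc[field] = values
    return doc


# Hypothetical sample page, kept well-formed so ElementTree can parse it.
PAGE = """<html><body>
<h1>Blue Widget</h1>
<span class="price">$9.99</span>
<img src="/img/widget-front.png"/>
<img src="/img/widget-back.png"/>
</body></html>"""

doc = extract_fields(PAGE, FIELD_RULES)
print(doc)
```

The resulting dictionary has the shape of a document that could be sent to a Solr index, so a front end could query those fields directly. Keeping the rules in a plain mapping means new fields can be added without touching the extraction logic.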

mzeidhassan commented 7 years ago

Hi @karanjeets,

Thanks for your reply and for adding it to your enhancement list. I can help with testing.

Thanks, Mohamed

karanjeets commented 7 years ago

Cool, @mzeidhassan :+1: