USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Parser Extension Points and Interfaces #164

Closed micheladennis closed 3 years ago

micheladennis commented 6 years ago

What changes were proposed in this pull request?

Create extension points for the different aspects of parsing, along with a default parser. The extension points are: Metadata, Header, Outlinks, and Text.
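To illustrate the shape of this split, here is a minimal sketch of one interface per parsing aspect plus a default parser that implements all four. Every name in it (`MetadataParser`, `HeaderParser`, `OutlinkParser`, `TextParser`, `ParsedPage`, `DefaultParser`, `PageParser`) is hypothetical and chosen for illustration only. None of them come from the Sparkler code base or from this branch, and the regex-based parsing is a stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extension points, one per parsing aspect named in this PR.
// These names are assumptions and are not taken from Sparkler.
interface MetadataParser { Map<String, String> parseMetadata(String html); }
interface HeaderParser { Map<String, String> parseHeaders(Map<String, String> rawHeaders); }
interface OutlinkParser { List<String> parseOutlinks(String html); }
interface TextParser { String parseText(String html); }

// Combined result of running all four aspects over one fetched page.
final class ParsedPage {
    final Map<String, String> metadata;
    final Map<String, String> headers;
    final List<String> outlinks;
    final String text;

    ParsedPage(Map<String, String> metadata, Map<String, String> headers,
               List<String> outlinks, String text) {
        this.metadata = metadata;
        this.headers = headers;
        this.outlinks = outlinks;
        this.text = text;
    }
}

// Default parser: a naive, regex-based implementation of every extension point.
final class DefaultParser implements MetadataParser, HeaderParser, OutlinkParser, TextParser {
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    @Override
    public Map<String, String> parseMetadata(String html) {
        // Only the <title> element is extracted in this sketch.
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            metadata.put("title", m.group(1).trim());
        }
        return metadata;
    }

    @Override
    public Map<String, String> parseHeaders(Map<String, String> rawHeaders) {
        // Normalise header names to lower case; values are kept as-is.
        Map<String, String> headers = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : rawHeaders.entrySet()) {
            headers.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return headers;
    }

    @Override
    public List<String> parseOutlinks(String html) {
        // Collect every double-quoted href value, in document order.
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    @Override
    public String parseText(String html) {
        // Replace tags with spaces, then collapse runs of whitespace.
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }
}

// Wires the four aspects together; a plugin can replace any one of them.
final class PageParser {
    private final MetadataParser metadataParser;
    private final HeaderParser headerParser;
    private final OutlinkParser outlinkParser;
    private final TextParser textParser;

    PageParser(MetadataParser m, HeaderParser h, OutlinkParser o, TextParser t) {
        this.metadataParser = m;
        this.headerParser = h;
        this.outlinkParser = o;
        this.textParser = t;
    }

    static PageParser withDefaults() {
        DefaultParser d = new DefaultParser();
        return new PageParser(d, d, d, d);
    }

    ParsedPage parse(String html, Map<String, String> rawHeaders) {
        return new ParsedPage(metadataParser.parseMetadata(html),
                              headerParser.parseHeaders(rawHeaders),
                              outlinkParser.parseOutlinks(html),
                              textParser.parseText(html));
    }
}
```

Under these assumptions, `PageParser.withDefaults()` uses the default parser for every aspect, while a plugin could override a single aspect by passing its own implementation, for example a custom `OutlinkParser`, to the `PageParser` constructor and leaving the other three as the default.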

Is this related to an already existing issue on sparkler?

#20: https://github.com/USCDataScience/sparkler/issues/20

How was this patch tested?

We are particularly interested in unit tests, integration tests, manual tests you did to ensure that the patch works as expected, so briefly describe them.

Tested with a sample plugin and the default parser, both of which are part of this branch.

Please review https://github.com/USCDataScience/sparkler/blob/master/.github/CONTRIBUTING.md before opening a pull request.

chrismattmann commented 3 years ago

It's been a few years and this hasn't been integrated ... closing out. If there is interest in bringing it up to date, I'll be happy to review.