Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

BoilerPipe integration #318

Closed angelo337 closed 7 years ago

angelo337 commented 7 years ago

Hi there, I am wondering whether it is possible to integrate BoilerPipe (https://github.com/kohlschutter/boilerpipe) into this collector. Could you please point me in the right direction? Thanks, Angelo

essiembre commented 7 years ago

You should be able to integrate it. It depends on what you want to use it for. Can you please give more details?

I am not familiar with this library but I understand it is mainly used to extract text from HTML documents? Text extraction from HTML is already handled by the HTTP Collector (relying on Tika HTML parser). Maybe the collector already provides what you are looking for? Do you want to use boilerpipe for something else?

If you want to replace the default HTML parsing with one that uses boilerpipe, I would look at implementing your own IDocumentParser that uses boilerpipe. You add it via configuration under your <importer> section, like this:

<importer>
...
  <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
      <parsers>
          <parser contentType="text/html" class="your.own.Class" />
      </parsers>
  </documentParserFactory>
...
</importer>

Let me know if that works for you. I would be interested to know what exact features you are after, in case we can improve the HTTP Collector to support them (if not already).

angelo337 commented 7 years ago

Hi there. My objective is to extract the main block of text from a page without having to work through the HTML myself. You can see this library at work here: http://boilerpipe-web.appspot.com/

Here is an example. Original page: https://www.nytimes.com/2017/01/24/us/politics/keystone-dakota-pipeline-trump.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news&_r=0

Extracted result: http://boilerpipe-web.appspot.com/extract?url=https%3A%2F%2Fwww.nytimes.com%2F2017%2F01%2F24%2Fus%2Fpolitics%2Fkeystone-dakota-pipeline-trump.html%3Fhp%26action%3Dclick%26pgtype%3DHomepage%26clickSource%3Dstory-heading%26module%3Dfirst-column-region%26region%3Dtop-news%26WT.nav%3Dtop-news%26_r%3D0&extractor=ArticleExtractor&output=htmlFragment&extractImages=&token=

It reads the whole page and extracts most of the relevant text, without navigation bars or external links. This makes the text much easier to process and improves its relevance. I will try to test your solution later next week. Thanks a lot.

essiembre commented 7 years ago

If you want to use it to extract text from HTML that has already been downloaded by the crawler's document fetcher, then writing your own parser as suggested should work. But if you would rather use BoilerPipe to also download the document, you may want to look at implementing your own IHttpDocumentFetcher instead (the default implementation is GenericDocumentFetcher).
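
For illustration, a custom fetcher would presumably be registered in the crawler configuration in much the same way as the parser. The fragment below is only a sketch: it assumes the fetcher is declared with a <documentFetcher> element inside a <crawler> section, as in the 2.x configuration layout, and both the crawler id and the class name are hypothetical placeholders. Check the configuration reference for your version before relying on it.

```xml
<crawler id="my-crawler">
...
  <!-- Hypothetical class; it would implement IHttpDocumentFetcher
       and replace the default GenericDocumentFetcher. -->
  <documentFetcher class="your.own.FetcherClass" />
...
</crawler>
```

With the fetcher replaced, the <importer> section can stay as it is unless you also want to swap the HTML parser.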