How to exclude common headers and footers available on all pages?

Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

https://opensource.norconex.com/crawlers

Apache License 2.0


How to exclude common headers and footers available on all pages? #418

Closed essiembre closed 6 years ago

essiembre commented 6 years ago

Copied from https://github.com/Norconex/collector-http/issues/412#issuecomment-340241616, by @krishnateja-ravipati :

I have a question regarding extracting content from a document.

I would like to pick up only the content available in the body of the page and exclude the common headers and footers that appear on all pages. The body is separated from the rest of the page by a <main> HTML tag. The <main> tag may also carry class and id attributes on a few pages.

Can you suggest which tagger would be suitable to implement my requirement?

essiembre commented 6 years ago

@krishnateja-ravipati, you have a few options. I suggest you keep the <head> section if you want metadata fields extracted by the Importer parser. So I would remove everything from <body... to <main... and strip everything after </main> up to </body>. Example using the StripBetweenTransformer pre-parse handler (not tested):

  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer">
      <stripBetween>
          <start><![CDATA[<body.*?>]]></start>
          <end><![CDATA[<main.*?>]]></end>
      </stripBetween>
      <stripBetween>
          <start><![CDATA[</main>]]></start>
          <end><![CDATA[</body>]]></end>
      </stripBetween>
      <restrictTo field="document.contentType">text/html</restrictTo>
  </transformer>

I suggest you restrict this to HTML documents only, to avoid issues with non-HTML documents (as done in the example above).
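To see what those two stripBetween rules are meant to do, here is a minimal Java sketch that approximates the behavior with plain java.util.regex. It is not the Importer's actual implementation: the class and method names (StripBetweenSketch, stripBetween) are made up for illustration, and it assumes that the matched start and end tags themselves are kept, and that matching is case-insensitive and spans line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT the Norconex implementation.
class StripBetweenSketch {

    /**
     * Removes the text found between each start/end match and keeps the
     * delimiters themselves (assumed behavior for this sketch).
     */
    static String stripBetween(String text, String startRegex, String endRegex) {
        int flags = Pattern.DOTALL | Pattern.CASE_INSENSITIVE; // assumed flags
        Matcher start = Pattern.compile(startRegex, flags).matcher(text);
        Matcher end = Pattern.compile(endRegex, flags).matcher(text);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (start.find(pos) && end.find(start.end())) {
            // Keep everything up to and including the start delimiter.
            out.append(text, pos, start.end());
            if (end.start() == pos) {
                break; // zero-length matches would never advance: stop
            }
            pos = end.start(); // resume at the end delimiter, which is kept
        }
        out.append(text.substring(pos)); // remainder, untouched
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head>"
                + "<body class=\"x\"><header>Nav</header>"
                + "<main id=\"m\"><p>Hello</p></main>"
                + "<footer>Foot</footer></body></html>";
        // Same two rules as in the XML above, applied in order.
        String noHeader = stripBetween(html, "<body.*?>", "<main.*?>");
        String bodyOnly = stripBetween(noHeader, "</main>", "</body>");
        System.out.println(bodyOnly);
    }
}
```

The <head> section is left alone in this sketch, which matches the suggestion above to keep it for metadata extraction.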

If you want more flexibility you can also have a look at ReplaceTransformer.

Please confirm this works for you.

krishnateja-ravipati commented 6 years ago

@essiembre

Thanks for the solution. It works according to my expectations.

I use ReplaceTransformer to replace special characters with text, for example, &amp; (i.e. &) is replaced by "and". I will also explore the functionality of this transformer to strip the header and footer from my pages.

Thank You Krishna Teja

krishnateja-ravipati commented 6 years ago

Hello @essiembre ,

Continuing with the scenario above, I would like to understand the default behavior when the content doesn't contain a match for the expressions given in StripBetweenTransformer.

Thank You Krishna Teja

essiembre commented 6 years ago

If it does not match anything, it should leave the content as is. Are you witnessing something different?

One possible exception is if you try to read a binary file as text; then it may mess it up. That's where the <restrictTo ...> comes into play.
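As a rough illustration of both points, the Java sketch below uses a plain regex replacement. It is only an analogy, not the transformer's code: NoMatchSketch and stripHeader are hypothetical names, and the content-type check is a simplified stand-in for <restrictTo ...>. When the pattern finds no <main> tag, replaceAll returns the input unchanged, and non-HTML content is never touched.

```java
// Hypothetical illustration, NOT the Norconex implementation.
class NoMatchSketch {

    /** Strips what sits between <body...> and <main...>, for HTML content only. */
    static String stripHeader(String contentType, String content) {
        // Simplified stand-in for the <restrictTo> guard (assumption).
        if (!"text/html".equals(contentType)) {
            return content;
        }
        // (?is) = case-insensitive, and '.' also matches line breaks.
        // $1 and $2 put the two matched tags back, dropping what was between.
        return content.replaceAll("(?is)(<body.*?>).*?(<main.*?>)", "$1$2");
    }

    public static void main(String[] args) {
        String withMain = "<body><nav>x</nav><main><p>y</p></main></body>";
        String withoutMain = "<html><body><p>No main here</p></body></html>";

        System.out.println(stripHeader("text/html", withMain));
        // No <main> tag, so nothing matches and the page comes back as is.
        System.out.println(stripHeader("text/html", withoutMain));
        // Non-HTML content is skipped entirely.
        System.out.println(stripHeader("application/pdf", withMain));
    }
}
```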

krishnateja-ravipati commented 6 years ago

No, I haven't encountered such a situation yet. I just wanted to understand the default behavior when a page doesn't contain a tag matching the regex.

I am sure we don't have any binary files on our websites. It's all HTML pages.

Thank you Krishna Teja