Closed by essiembre 6 years ago
@krishnateja-ravipati, you have a few options. I suggest you keep the <head> section if you want metadata fields extracted by the Importer parser. So I would remove everything from <body... to <main... and strip everything after </main> up to </body>. Example using the StripBetweenTransformer pre-parse handler (not tested):
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer">
  <stripBetween>
    <start><![CDATA[<body.*?>]]></start>
    <end><![CDATA[<main.*?>]]></end>
  </stripBetween>
  <stripBetween>
    <start><![CDATA[</main>]]></start>
    <end><![CDATA[</body>]]></end>
  </stripBetween>
  <restrictTo field="document.contentType">text/html</restrictTo>
</transformer>
I suggest you restrict this to HTML documents only, to avoid issues with non-HTML documents (as done in the example above).
If you want more flexibility you can also have a look at ReplaceTransformer.
Please confirm this works for you.
@essiembre
Thanks for the solution. It works according to my expectations.
I use ReplaceTransformer to replace special characters with their text form; for example, &amp; (i.e. &) is replaced by "and". I will also explore using this transformer to strip the header and footer from my pages.
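For readers following along, a replacement like the one described above might be configured roughly as follows. This is an untested sketch that assumes the Importer 2.x ReplaceTransformer syntax (a <replace> element holding <fromValue> and <toValue>). The exact pattern used in the original setup is not shown in this thread, so the &amp; pattern below is an assumption.

```xml
<!-- Sketch only: assumes Importer 2.x configuration syntax; not tested. -->
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <replace>
    <!-- fromValue is treated as a regular expression. CDATA keeps the XML
         parser from decoding the entity before the transformer sees it. -->
    <fromValue><![CDATA[&amp;]]></fromValue>
    <toValue>and</toValue>
  </replace>
  <!-- Same restriction as the StripBetweenTransformer example: HTML only. -->
  <restrictTo field="document.contentType">text/html</restrictTo>
</transformer>
```

Used as a pre-parse handler, this would act on the raw HTML source, which is why the pattern targets the entity rather than the decoded character.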
Thank You Krishna Teja
Hello @essiembre ,
Continuing the scenario above, I would like to understand the default behavior when the document parser doesn't find a match for the expression given in StripBetweenTransformer.
Thank You Krishna Teja
If it does not match anything, it should leave the content as is. Are you witnessing something different?
One possible exception is if you try to read a binary file as text; then it may mess it up. That's where the <restrictTo ...> comes into play.
No, I haven't encountered such a situation yet. I just wanted to understand the default behavior when a page doesn't contain a tag matching the regex.
I am sure we don't have any binary files on our websites. It's all HTML pages.
Thank you Krishna Teja
Copied from https://github.com/Norconex/collector-http/issues/412#issuecomment-340241616, by @krishnateja-ravipati :