Closed: Navaminavu closed this issue 6 years ago
This is somewhat related to #365, where your content is dynamically loaded/generated by JavaScript. The file you shared does not have the dynamically loaded content, so in order to know what to strip, you need to look at the generated HTML DOM.
Still, I am guessing your side navigation will be loaded before <div class="contents">. If that's the case, you can try stripping everything before that tag. If you want to keep the HTML <head> section so metadata is extracted, you can start at <body>. Something like this:
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer">
  <stripBetween>
    <start><![CDATA[<body>]]></start>
    <end><![CDATA[<div class="contents">]]></end>
  </stripBetween>
</transformer>
I use CDATA to make it easier since you do not have to worry about encoding your XML.
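For comparison, here is a sketch of what the same transformer would look like without CDATA. The angle brackets then have to be written as XML entities, which is easy to get wrong on longer patterns. It should behave the same as the CDATA version above:

```xml
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer">
  <stripBetween>
    <!-- Same start/end values as the CDATA version, escaped by hand -->
    <start>&lt;body&gt;</start>
    <end>&lt;div class="contents"&gt;</end>
  </stripBetween>
</transformer>
```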
Hi Pascal, we are not able to avoid the header and footer unless we strip between
and. Can you suggest a way?
The page HTML header (design visible to users) and the <head> tags are two separate things. Have you tried the example I provided? If you want to skip the page header "content" you should start at <body>, not <head>.
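The footer could probably be handled the same way by adding a second stripBetween pair to the same transformer. This is only a sketch: <div class="footer"> is a hypothetical marker, since I have not seen your generated DOM, so replace it with whatever tag actually starts your footer. As far as I recall, the start and end values are treated as regular expressions, which is why <body[^>]*> is used here to also match a body tag that carries attributes:

```xml
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer">
  <!-- Page header and side navigation: from the opening body tag to the main content -->
  <stripBetween>
    <start><![CDATA[<body[^>]*>]]></start>
    <end><![CDATA[<div class="contents">]]></end>
  </stripBetween>
  <!-- Page footer: the footer marker below is an assumption, adjust it to your page -->
  <stripBetween>
    <start><![CDATA[<div class="footer">]]></start>
    <end><![CDATA[</body>]]></end>
  </stripBetween>
</transformer>
```

Check the result against one crawled document before running a full crawl, since a start or end pattern that does not match the generated markup will leave that part of the page untouched.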
@Navaminavu, can we close this issue?
yes you can close this.
Thanks for confirming.
Hi Pascal, can you help me exclude the header and footer data from a page being crawled?
Please find the config file below: <?xml version="1.0" encoding="UTF-8"?>
I am attaching the HTML file of the page and a screenshot in which the area to be removed from the crawl is highlighted.
htmlfile _l2tm.txt