Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Eliminate header and footer data from crawled data. #370

Closed Navaminavu closed 6 years ago

Navaminavu commented 7 years ago

Hi Pascal, can you help me exclude the header and footer data from a page being crawled?

Please find the config file below (see also the attached htmlfile _l2tm.txt):
<?xml version="1.0" encoding="UTF-8"?>

#set($http = "com.norconex.collector.http")
#set($core = "com.norconex.collector.core")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
#set($http = "com.norconex.collector.http")
#set($committer = "com.norconex.committer")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($sitemapFactory = "${http}.sitemap.impl.StandardSitemapResolverFactory")
2 DELETE jpg,gif,png,ico,css,js https://lmr-doxygen.rnd.ki.sw.ericsson.se/l2tm/ ./output/lmr 2 -1 ./output/lmr/progress ./output/lmr/logs C:\Users\enavami\Documents\norconex-collector-http-2.6.2\phantomjs-2.1.1-windows\bin\phantomjs.exe 5000 ^.*\.html$ <div class="title"> </div> <div class="contents"> </div> \s \n \s\n ./output/lmr/crawledFiles

I am attaching the HTML file of the page and a screenshot in which the area to be removed from the crawl is highlighted.

screenshot htmlfile _l2tm.txt

essiembre commented 7 years ago

This is somewhat related to #365, where your content is dynamically loaded/generated by JavaScript. To know what to strip, you need to look at the generated HTML DOM, since the file you shared does not contain the dynamically loaded content.

Still, I am guessing your side navigation will be loaded before <div class="contents">. If that's the case, you can try stripping everything before that tag. If you want to keep the HTML <head> section so that metadata is extracted, you can start at <body>. Something like this:

<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer">
  <stripBetween>
    <start><![CDATA[<body>]]></start>
    <end><![CDATA[<div class="contents">]]></end>
  </stripBetween>
</transformer>

I use CDATA to make it easier, since you then do not have to worry about escaping the markup inside your XML.
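
To make the effect of the configuration above easier to picture, here is a minimal Python sketch that emulates a strip-between operation on a small HTML string. It is an illustration only, not the Norconex implementation: the `strip_between` function, its `inclusive` and `case_sensitive` parameters, and the sample page (including its title and `side-nav` id) are all assumptions made for this example. The start and end markers are escaped with `re.escape` so they are matched literally.

```python
import re


def strip_between(content, start, end, inclusive=False, case_sensitive=False):
    """Remove the text found between each start/end regex match.

    Hypothetical helper for illustration; not the Norconex code.
    When inclusive is False, the matched start and end text is kept
    and only what lies between them is removed.
    """
    flags = re.DOTALL | (0 if case_sensitive else re.IGNORECASE)
    # Named groups so that groups inside start/end do not shift numbering.
    pattern = re.compile(f"(?P<s>{start}).*?(?P<e>{end})", flags)
    if inclusive:
        return pattern.sub("", content)
    return pattern.sub(lambda m: m.group("s") + m.group("e"), content)


# Hypothetical page: the <head> holds the title, the <body> starts
# with navigation that should not be indexed.
page = (
    "<html><head><title>L2TM Overview</title></head>"
    '<body><div id="side-nav">Menu</div>'
    '<div class="contents">Real content</div></body></html>'
)

start = re.escape("<body>")
end = re.escape('<div class="contents">')

cleaned = strip_between(page, start, end)
print(cleaned)
```

In this sketch the navigation block disappears while the `<title>` inside `<head>` is untouched, because the stripped range only begins at `<body>`.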

Navaminavu commented 7 years ago

Hi Pascal, we are not able to avoid the header and footer unless we strip between and

but the problem is that the page title is then stripped as well, and we are left with no title for the content, which is weird. Can you help me strip between and
and still have the title? The current config is below: <?xml version="1.0" encoding="UTF-8"?>

#set($http = "com.norconex.collector.http")
#set($core = "com.norconex.collector.core")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
#set($http = "com.norconex.collector.http")
#set($committer = "com.norconex.committer")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($sitemapFactory = "${http}.sitemap.impl.StandardSitemapResolverFactory")
2 DELETE jpg,gif,png,ico,css,js https://lmr-doxygen.rnd.ki.sw.ericsson.se/l2tm/ ./output/lmr 2 -1 ./output/lmr/progress ./output/lmr/logs C:\Users\enavami\Documents\norconex-collector-http-2.6.2\phantomjs-2.1.1-windows\bin\phantomjs.exe 5000 ^.*\.html$ ]]> ]]> \s \n \s\n ./output/lmr/crawledFiles

Navaminavu commented 7 years ago

Can you suggest a way?

essiembre commented 7 years ago

The page HTML header (design visible to users) and the <head> tags are two separate things. Have you tried the example I provided? If you want to skip the page header "content" you should start at <body>, not <head>.
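
The distinction can be checked on any page by looking at where each piece of text sits in the document. The following Python sketch uses the standard library `html.parser` module to report whether text falls under `<head>` or `<body>`. The `SectionLocator` class and the sample markup (including the `header` class name) are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser


class SectionLocator(HTMLParser):
    """Record whether each text chunk sits under <head> or <body>."""

    def __init__(self):
        super().__init__()
        self.section = None  # "head" or "body" while inside one of them
        self.found = []      # (section, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag

    def handle_endtag(self, tag):
        if tag == self.section:
            self.section = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.section:
            self.found.append((self.section, text))


# Hypothetical page: the title lives in <head>, while the visible
# page header (a banner) lives in <body>.
page = (
    "<html><head><title>Page title</title></head>"
    '<body><div class="header">Site banner</div>'
    '<div class="contents">Real content</div></body></html>'
)

locator = SectionLocator()
locator.feed(page)
locator.close()

for section, text in locator.found:
    print(section, "->", text)
```

In this sample the title is reported under `head` and the banner under `body`, which is why a strip range that begins at `<head>` would also remove the title, while one that begins at `<body>` would not.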

essiembre commented 7 years ago

@Navaminavu, can we close this issue?

Navaminavu commented 6 years ago

Yes, you can close this.

essiembre commented 6 years ago

Thanks for confirming.