paulescom closed 9 years ago
Hi Paul,
If by compressed you mean gzip, then yes: Nutch's HTTP client already decompresses gzip-encoded responses. If you mean other formats, I think you would need a custom parser that can read that format.
As for extracting outlinks from the sitemap: yes, you can do this. Configure the plugin in mode 2 (so it can read XML files) and then extract the links. The extracted links are inserted into the crawl db like any other outlinks. A fuller explanation, and a sample that extracts links from a sitemap document, is given in the "Parsing XML documents" section of the readme.
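To make the idea concrete outside of Nutch, here is a minimal standalone sketch of what "read a possibly gzipped sitemap and pull out its links" amounts to. It is not the plugin's code or configuration; the function name `sitemap_links` and the `EXAMPLE` document are made up for illustration, and it assumes a standard sitemaps.org `urlset` file.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def sitemap_links(raw: bytes) -> list[str]:
    """Return the <loc> URLs of a sitemap, given its raw bytes.

    Hypothetical helper for illustration only. If the payload is
    gzip-compressed it is decompressed first.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    # Collect the text of every <loc> element, skipping empty ones.
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Made-up sitemap used as sample input.
EXAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    # Plain and gzipped input yield the same list of links.
    print(sitemap_links(EXAMPLE))
    print(sitemap_links(gzip.compress(EXAMPLE)))
```

In Nutch itself the decompression and the insertion into the crawl db are handled for you; the sketch only shows the extraction step that the mode 2 configuration performs on the XML.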
Hi, and first of all: very good plugin! I'm using it for a project and it works very well. Can you tell me whether I can use it when the sitemap file comes in a compressed format? And are the extracted outlinks inserted into the URL db? We need to crawl a site using its sitemap as a seed.
Thanks,
Paul