BayanGroup / nutch-custom-search


sitemap protocol #10

Closed paulescom closed 9 years ago

paulescom commented 9 years ago

Hi, and first of all: very good plugin! I'm using it for a project and it works very well. Can you tell me whether I can use it when the sitemap file comes in a compressed format? And are the extracted outlinks inserted into the URL db? We need to crawl a site using its sitemap as a seed.

Thanks,

Paul

tahagh commented 9 years ago

Hi Paul,

If by compressed you mean gzip, then yes: the Nutch HTTP client already decompresses gzipped responses. But if you mean another format, I think you need a custom parser that can read that format.
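To illustrate the gzip case outside of Nutch, here is a minimal Python sketch of the general idea: check whether a fetched payload starts with the gzip magic bytes and decompress it only then. It is not how Nutch or this plugin does it internally; the function name `maybe_gunzip` is hypothetical and chosen just for this example.

```python
import gzip

# The first two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(payload: bytes) -> bytes:
    """Return the decompressed bytes if payload looks like gzip, else payload unchanged.

    Hypothetical helper for illustration only; not part of Nutch or the plugin.
    """
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload
```

Any other compression format (zip, bzip2, and so on) would pass through this check untouched, which is where a custom parser would be needed.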

And about extracting outlinks from the sitemap: yes, you can do this. Configure the plugin in mode 2 (so it can read XML files) and then extract the links. The extracted links are inserted into the crawl db like any other outlinks. A fuller explanation, with a sample that extracts links from a sitemap document, is given in the "Parsing XML documents" section of the README.
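For readers who want to see what "extracting the links" from a sitemap amounts to, here is a rough standalone sketch in Python. It is not the plugin's configuration (see the README for that); it only shows the underlying idea of collecting every `<loc>` value from a sitemap document. The function name `extract_sitemap_links` is hypothetical, and the sketch assumes the document uses the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol (assumed to be used by the document).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_sitemap_links(xml_bytes: bytes) -> List[str]:
    """Collect the text of every <loc> element in a sitemap document.

    Hypothetical helper for illustration only; not part of Nutch or the plugin.
    """
    root = ET.fromstring(xml_bytes)
    links = []
    # iter() walks the whole tree, so this also covers <sitemapindex> files.
    for loc in root.iter("{%s}loc" % SITEMAP_NS):
        if loc.text and loc.text.strip():
            links.append(loc.text.strip())
    return links
```

In a crawl, URLs collected this way play the role of the outlinks that end up in the crawl db.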