asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

Sitemaps that are gzipped are ignored #318

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a basic crawler.
2. Point the seed URL to any sitemap.gz.
3. Start the crawler.

What is the expected output? What do you see instead?
Gzipped sitemaps are just gzip-compressed XML files and should be supported by
the crawler, but they are not.
Once issue 317 gets fixed, this will happen even more often.

What version of the product are you using?
latest from trunk/master

Please provide any additional information below.

I have created a fix in my crawler by wrapping the content in a
GZIPInputStream (from java.util.zip):

if (page.getWebURL().getURL().endsWith(".gz")) {
    is = new GZIPInputStream(new ByteArrayInputStream(page.getContentData()));
} else {
    is = new ByteArrayInputStream(page.getContentData());
}
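
For anyone who wants to try the workaround outside of a crawler, here is a minimal self-contained sketch of the same idea. It is not crawler4j code: the class `SitemapStreams` and its helpers `open`, `gzip` and `readAll` are hypothetical names made up for this example, the example.com URLs are placeholders, and the `url` and `content` arguments stand in for `page.getWebURL().getURL()` and `page.getContentData()` from the snippet above. It assumes Java 9+ for `InputStream.readAllBytes()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only (not part of crawler4j).
class SitemapStreams {

    // Returns a stream over the sitemap body, decompressing when the URL ends in ".gz".
    // Same suffix check as the workaround in this issue.
    static InputStream open(String url, byte[] content) throws IOException {
        InputStream raw = new ByteArrayInputStream(content);
        if (url.endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Test helper: gzip-compresses a byte array in memory.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        }
        // The gzip trailer is written when the stream is closed above.
        return buf.toByteArray();
    }

    // Test helper: reads a stream fully and decodes it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        try (InputStream s = in) {
            return new String(s.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<urlset><url><loc>http://example.com/</loc></url></urlset>";
        byte[] plain = xml.getBytes(StandardCharsets.UTF_8);

        // Placeholder URLs; only the ".gz" suffix matters here.
        String fromGz = readAll(open("http://example.com/sitemap.xml.gz", gzip(plain)));
        String fromXml = readAll(open("http://example.com/sitemap.xml", plain));

        System.out.println(xml.equals(fromGz));
        System.out.println(xml.equals(fromXml));
    }
}
```

Note that the suffix check only looks at the URL, so a gzipped sitemap served from a URL that does not end in ".gz" would still be read as raw bytes.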

Original issue reported on code.google.com by panthro....@gmail.com on 16 Nov 2014 at 5:24

GoogleCodeExporter commented 9 years ago
Nice catch, Rafael - thank you very much!

Original comment by avrah...@gmail.com on 16 Nov 2014 at 5:49