AgenteFarron / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Robots.txt parser should not lowercase sitemap URLs #25

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
The robots.txt file at http://www.amazon.com/robots.txt contains entries such as:

Sitemap: http://www.amazon.com/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.com/sitemaps.1946f6b8171de60.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.com/sitemaps.bbb7d657c7e29fa.SitemapIndex_0.xml.gz

which the parser returns in lowercase, for example:

http://www.amazon.com/sitemaps.bbb7d657c7e29fa.sitemapindex_0.xml.gz

The trouble is that these URLs return a 404 when lowercased but work fine in
their original form.

Original issue reported on code.google.com by digitalpebble on 16 May 2013 at 9:02

GoogleCodeExporter commented 8 years ago
Issue 28 has been merged into this issue.

Original comment by digitalpebble on 10 Jun 2013 at 8:25

GoogleCodeExporter commented 8 years ago
Committed revision 76.

Lowercasing was still needed for the parsing to work, so I added a special
case for sitemap entries that uses the original case. I modified the test
data and code accordingly.

Original comment by digitalpebble on 6 Sep 2013 at 12:34
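
For readers who want to see the idea in code, the approach described in the comment above might look roughly like the sketch below. This is not the code committed in revision 76: the class name SitemapCaseSketch and the method extractSitemaps are hypothetical, and the sketch only handles Sitemap lines rather than a full robots.txt grammar. It assumes the lowercased copy of each line is used solely to recognise the directive, while the URL is cut from the original line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the crawler-commons parser.
public class SitemapCaseSketch {

    private static final String SITEMAP_PREFIX = "sitemap:";

    /** Collects Sitemap URLs from robots.txt content, preserving their original case. */
    public static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine.trim();
            // The lowercased copy is used only to recognise the directive name.
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith(SITEMAP_PREFIX)) {
                // Special case: take the value from the original line,
                // so the URL keeps the case the site published.
                String url = line.substring(SITEMAP_PREFIX.length()).trim();
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\n"
            + "Disallow: /private\n"
            + "SITEMAP: http://www.amazon.com/sitemaps.bbb7d657c7e29fa.SitemapIndex_0.xml.gz\n";
        for (String url : extractSitemaps(robotsTxt)) {
            System.out.println(url);
        }
    }
}
```

With the sample input in main, the directive is matched case-insensitively, but the printed URL still contains SitemapIndex_0 in its original mixed case.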

GoogleCodeExporter commented 8 years ago

Original comment by digitalpebble on 6 Sep 2013 at 12:36