amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

More robust parsing of sitemap index files #29

GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
http://activision.taleo.net/careersection/sitemap.jss

contains entries such as:

<sitemap>
http://activision.taleo.net/careersection/sitemap.jss?portalCode=2&lang=en
</sitemap>

i.e. the mandatory <loc> element is missing and the URL appears directly as the text of the <sitemap> element.

Ideally the site would produce content that follows the sitemap schema, but we
should make the parser more robust and produce outlinks even when the <loc>
element is missing.
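
A minimal sketch of the lenient behaviour described above, using only the JDK's DOM parser. This is not the crawler-commons implementation or the change that was eventually committed; the class name `LenientSitemapIndex` and the method `extractSitemapUrls` are hypothetical, and the fallback rule (use the text of <sitemap> when no <loc> child exists) is an assumption about what "more robust" could mean here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
class LenientSitemapIndex {

    // Returns one URL per <sitemap> entry, tolerating a missing <loc>.
    static List<String> extractSitemapUrls(String xml) {
        Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }

        List<String> urls = new ArrayList<>();
        NodeList sitemaps = doc.getElementsByTagName("sitemap");
        for (int i = 0; i < sitemaps.getLength(); i++) {
            Element sitemap = (Element) sitemaps.item(i);
            NodeList locs = sitemap.getElementsByTagName("loc");

            // Preferred: the schema-conformant <loc> child.
            // Fallback (assumption): the raw text of <sitemap> itself.
            String candidate = locs.getLength() > 0
                    ? locs.item(0).getTextContent()
                    : sitemap.getTextContent();

            candidate = candidate.trim();
            if (!candidate.isEmpty()) {
                urls.add(candidate);
            }
        }
        return urls;
    }
}
```

The fallback is deliberately naive: if an entry has other children (for example <lastmod>) but no <loc>, their text would be included too, so a real parser would probably also validate the candidate as a URL before emitting it. Note also that the input must still be well-formed XML, so a `&` in a URL has to be escaped as `&amp;` for this approach to work.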

Original issue reported on code.google.com by digitalpebble on 6 Sep 2013 at 9:28

GoogleCodeExporter commented 8 years ago
Committed revision 77.

Original comment by digitalpebble on 2 Oct 2013 at 1:41