amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Sitemap Parser to normalize entries #24

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
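The attached sitemap contains the following entry, which causes problems both
for my Firefox browser and for our Sitemap parser. The cause is the two
ampersand sequences in the <loc> URL: &Acirc; and &ordm; are HTML named
entities, and XML does not declare them by default (it predefines only &amp;,
&lt;, &gt;, &apos; and &quot;).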

<url>
  <lastmod>2011-10-28</lastmod>
  <loc>http://www.tricae.com.br//Triciclo-Meu-1&Acirc;&ordm;-Tico-Tico-Europa--Bandeirante-1302.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>

One solution would be to normalize all sitemap entries, but that first requires
porting some normalization code to crawler-commons (CC), possibly from Nutch.
In the meantime, a targeted hack in the SiteMap parser would suffice, though it
is certainly not ideal.
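
As a rough illustration of what such a targeted hack could look like, the sketch below escapes every `&` that does not start one of XML's five predefined entities or a numeric character reference, then parses the result with plain JAXP. The class and method names (`SitemapEntitySanitizer`, `escapeStrayAmpersands`, `firstLoc`) are hypothetical and are not part of crawler-commons. The regex approach is an assumption, not the fix that was actually adopted, and it would also rewrite ampersands inside CDATA sections or references to entities declared in a DOCTYPE.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of crawler-commons.
public class SitemapEntitySanitizer {

    // Matches an '&' that is NOT the start of a predefined XML entity
    // (&amp; &lt; &gt; &apos; &quot;) or a numeric character reference.
    private static final Pattern STRAY_AMPERSAND = Pattern.compile(
            "&(?!(?:amp|lt|gt|apos|quot|#[0-9]+|#x[0-9a-fA-F]+);)");

    /** Turns undeclared references such as &Acirc; into the literal text "&Acirc;". */
    public static String escapeStrayAmpersands(String xml) {
        return STRAY_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    /** Parses the XML and returns the text of the first <loc> element. */
    public static String firstLoc(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("loc").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset><url><loc>http://www.tricae.com.br//Triciclo-Meu-1"
                + "&Acirc;&ordm;-Tico-Tico-Europa--Bandeirante-1302.html</loc></url></urlset>";
        // Without the escaping step, the parse call fails on &Acirc;.
        System.out.println(firstLoc(escapeStrayAmpersands(xml)));
    }
}
```

Escaping keeps the URL exactly as the site published it, so the crawler requests the same string. Decoding the HTML entities into characters instead would be a different design choice and would need an entity table.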

Original issue reported on code.google.com by lewis.mc...@gmail.com on 9 May 2013 at 5:23

GoogleCodeExporter commented 8 years ago
The stack trace that is thrown looks like this:

[Fatal Error] :3890:57: The entity "Acirc" was referenced, but not declared.
crawlercommons.sitemaps.UnknownFormatException: Error parsing XML for http://www.tricae.com.br/sitemap.xml
at crawlercommons.sitemaps.SiteMapParser.processXml(SiteMapParser.java:212)
at crawlercommons.sitemaps.SiteMapParser.processXml(SiteMapParser.java:121)
at crawlercommons.sitemaps.SiteMapParser.parseSiteMap(SiteMapParser.java:96)
at com.indekse.Teste.main(Teste.java:51)
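
The underlying parser error can be reproduced without the full sitemap. The sketch below feeds a one-entry document to the JDK's default JAXP parser and returns the error message. It uses only standard `javax.xml.parsers` and `org.xml.sax` classes. The class name `UndeclaredEntityRepro` and the method `parseError` are hypothetical, and the code does not go through `SiteMapParser`, so it shows only the XML-level failure and not the wrapping `UnknownFormatException`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical reproduction class, not part of crawler-commons.
public class UndeclaredEntityRepro {

    /** Returns the parser's error message, or null if the XML is well-formed. */
    public static String parseError(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // An undeclared entity reference is a fatal well-formedness error.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<url><loc>http://www.tricae.com.br//Triciclo-Meu-1"
                + "&Acirc;&ordm;-Tico-Tico-Europa--Bandeirante-1302.html</loc></url>";
        System.out.println(parseError(xml));
    }
}
```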

Original comment by lewis.mc...@gmail.com on 9 May 2013 at 5:24

GoogleCodeExporter commented 8 years ago
While it would be appropriate (as Fuad noted) for the website owner to fix up 
their sitemap, we should also follow the basic rule of Internet data processing: 
be very forgiving about what you accept.

So I agree that adding some code to try to fix up bad URLs would be appropriate.

Original comment by kkrugler...@transpac.com on 12 May 2013 at 7:57