commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

NewsSiteMapParserBolt fails to parse valid XML sitemap #23

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 6 years ago

NewsSiteMapParserBolt fails to parse some valid XML sitemaps, e.g.,

2018-03-09 18:14:13.924 o.c.s.n.NewsSiteMapParserBolt Thread-30-sitemap-executor[10 11] [INFO] http://www.pjstar.com/section/google-news-sitemap detected as news sitemap based on content
2018-03-09 18:14:13.924 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [<?xml version="1.0" encoding="UTF-8"?>]
2018-03-09 18:14:13.924 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"]
2018-03-09 18:14:13.926 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [     xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [     xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [   <url>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <loc>http://www.pjstar.com/news/20180309/chosen-family-portrait-group-that-needed-each-other</loc>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                   <news:publication>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                           <news:name>Peoria Journal Star</news:name>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                           <news:language>en</news:language>]
2018-03-09 18:14:13.928 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                   </news:publication>]
2018-03-09 18:14:13.928 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                   <news:publication_date>2018-03-09</news:publication_date>]

For this sitemap the server responds with Content-Type: text/html; charset=ISO-8859-1, which seems to be the reason the content is not even tried as XML: every line of the document is rejected as a "Bad url" instead.
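To illustrate the suspected cause, here is a minimal sketch (in Python, not the project's Java code) of choosing a parser from the content itself instead of trusting the Content-Type header. The names `looks_like_xml_sitemap` and `choose_parser` are hypothetical, as are the 1024-byte sniff window and the "xml"/"text" labels; this is not the logic of NewsSiteMapParserBolt or SiteMapParser.

```python
import re

# Root elements defined by the sitemaps.org protocol. Matching them in the
# leading bytes is a heuristic, not a full XML parse.
_SITEMAP_ROOT = re.compile(rb"<(?:urlset|sitemapindex)[\s>]")


def looks_like_xml_sitemap(content: bytes, sniff_bytes: int = 1024) -> bool:
    """Guess from the first bytes whether the content is an XML sitemap."""
    return _SITEMAP_ROOT.search(content[:sniff_bytes]) is not None


def choose_parser(content_type: str, content: bytes) -> str:
    """Return "xml" or "text" (hypothetical labels for two parsing modes)."""
    # Trust an XML media type, but also sniff the content so that a
    # misleading header such as text/html does not force plain-text parsing.
    if "xml" in content_type.lower() or looks_like_xml_sitemap(content):
        return "xml"
    return "text"
```

With the header from this report, `choose_parser("text/html; charset=ISO-8859-1", content)` returns "xml" as long as the `<urlset` root appears within the sniffed bytes, while a plain list of URLs still falls through to "text".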

sebastian-nagel commented 5 years ago

The news sitemap shown above is now processed properly, although the server still responds with text/html; charset=ISO-8859-1. Since 098a38b the content is always verified, which also allows sitemaps to change from "ordinary" sitemaps to news sitemaps.
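The idea of always verifying the content can be sketched as follows (Python, for illustration only; this is not the code from 098a38b). The function name `is_news_sitemap` and the 2048-byte sniff window are assumptions; the namespace URI is the one visible in the log above. Because the check runs on every fetch and keeps no earlier verdict, a sitemap that later gains the news namespace is picked up as a news sitemap.

```python
# Namespace URI of the news sitemap extension, as it appears in the log above.
NEWS_NAMESPACE = b"http://www.google.com/schemas/sitemap-news/0.9"


def is_news_sitemap(content: bytes, sniff_bytes: int = 2048) -> bool:
    """Check the leading bytes of a fetched sitemap for the news namespace.

    Nothing is cached: the decision depends only on the bytes passed in,
    so the result can differ between two fetches of the same URL.
    """
    return NEWS_NAMESPACE in content[:sniff_bytes]
```

Checking only the leading bytes keeps the test cheap, at the cost of missing a namespace declared unusually late in the document.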