evanderkoogh / node-sitemap-stream-parser

A streaming parser for sitemap files. It can handle deeply nested sitemaps containing 100+ million URLs.
Apache License 2.0
38 stars, 18 forks

URLs in <![CDATA[...]]> not handled #36

Open WillBrindle opened 4 years ago

WillBrindle commented 4 years ago

Example of a sitemap that will not return any urls: https://www.parashop.com/1_fr_0_sitemap.xml

Extract:

<url>
  <loc><![CDATA[https://www.parashop.com/meilleures-ventes]]></loc>
  <priority>0.1</priority>  
  <changefreq>daily</changefreq>
</url>

This fails because sax does not emit a text event for content inside a CDATA section; it emits a separate cdata event instead. I believe you can just register the same handler for both the cdata and text events and the URLs will then be processed fine.
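
A minimal sketch of that idea, not the library's actual code: `collectLocs` and `onChars` are hypothetical names, and a plain Node `EventEmitter` stands in for a sax stream so the snippet runs without any dependency. It assumes sax-style `opentag`, `text`, `cdata` and `closetag` events and lowercase tag names (as sax reports them in strict mode for this sitemap).

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: collects <loc> values from any emitter that fires
// sax-style 'opentag', 'text', 'cdata' and 'closetag' events.
function collectLocs(parser, onUrl) {
  let inLoc = false;
  let buffer = '';

  // Shared handler: plain text and CDATA content are accumulated the same way.
  const onChars = (chars) => {
    if (inLoc) buffer += chars;
  };

  parser.on('opentag', (node) => {
    if (node.name === 'loc') {
      inLoc = true;
      buffer = '';
    }
  });
  parser.on('text', onChars);
  parser.on('cdata', onChars); // without this line, CDATA-wrapped URLs are dropped
  parser.on('closetag', (name) => {
    if (name === 'loc') {
      inLoc = false;
      const url = buffer.trim();
      if (url) onUrl(url);
    }
  });
}

// Stand-in for a sax stream (assumption): replays the events a parser would
// emit for one CDATA-wrapped <loc> and one plain-text <loc>.
const fakeParser = new EventEmitter();
const urls = [];
collectLocs(fakeParser, (url) => urls.push(url));

fakeParser.emit('opentag', { name: 'url' });
fakeParser.emit('opentag', { name: 'loc' });
fakeParser.emit('cdata', 'https://www.parashop.com/meilleures-ventes');
fakeParser.emit('closetag', 'loc');
fakeParser.emit('closetag', 'url');

fakeParser.emit('opentag', { name: 'url' });
fakeParser.emit('opentag', { name: 'loc' });
fakeParser.emit('text', 'https://example.com/plain');
fakeParser.emit('closetag', 'loc');
fakeParser.emit('closetag', 'url');

console.log(urls);
```

With a real parser, the same helper would presumably be passed the stream returned by `sax.createStream(true)` instead of the stand-in emitter. Buffering until `closetag` also covers the case where a parser delivers one URL in several chunks.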