GateNLP / ultimate-sitemap-parser

Ultimate Website Sitemap Parser
https://mediacloud.org/

Prevent XML parser from parsing gzipped XMLs that it's unable to decompress #6

Closed. pypt closed this issue 5 years ago

pypt commented 5 years ago

2018-11-26 12:59:27,847 INFO mediawords.util.sitemap.fetchers [194712/MainThread]: Fetching level 1 sitemap from https://www.iberlibro.com/sitemap.bdp31.xml.gz...
2018-11-26 12:59:27,848 INFO mediawords.util.sitemap.helpers [194712/MainThread]: Fetching URL https://www.iberlibro.com/sitemap.bdp31.xml.gz...
2018-11-26 12:59:28,433 ERROR mediawords.util.sitemap.helpers [194712/MainThread]: Unable to gunzip response <mediawords.util.web.user_agent.response.response.Response object at 0x7f3485abfcc8>: Unable to gunzip data: Not a gzipped file (b'<?')
2018-11-26 12:59:28,437 INFO mediawords.util.sitemap.fetchers [194712/MainThread]: Parsing sitemap from URL https://www.iberlibro.com/sitemap.bdp31.xml.gz...

probar commented 4 years ago

This issue causes some sitemaps to be fetched only partially.

Shouldn't this still be an open bug?

pypt commented 4 years ago

It should have been fixed in 3c2b076.

Please reopen / create a new issue if you still encounter this. Examples with live websites would be helpful too!
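
For anyone hitting the same log output, here is a minimal sketch of the kind of guard the issue title asks for. It is not the change made in 3c2b076 and not this library's API: `sitemap_body` and `GunzipError` are hypothetical names used only for illustration. The assumption is that a response from a `.xml.gz` URL may turn out to be plain XML (as with the `b'<?'` prefix in the log), so the sketch checks the gzip magic bytes instead of trusting the URL, and refuses to hand the data to the XML parser when decompression fails.

```python
import gzip
import zlib

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


class GunzipError(Exception):
    """Hypothetical error: data looked gzipped but could not be decompressed."""


def sitemap_body(url: str, raw: bytes) -> bytes:
    """Return bytes that are safe to pass to an XML parser.

    Hypothetical helper, not part of the library:
    - data starting with the gzip magic bytes is decompressed;
    - if that decompression fails, GunzipError is raised so the caller
      can skip parsing instead of feeding compressed bytes to the parser;
    - anything else is returned unchanged (e.g. plain XML served from a
      URL that merely ends in ".gz").
    """
    if raw[:2] == GZIP_MAGIC:
        try:
            return gzip.decompress(raw)
        except (OSError, EOFError, zlib.error) as ex:
            raise GunzipError(f"Unable to gunzip data from {url}: {ex}") from ex
    return raw
```

A caller would catch `GunzipError`, log it, and mark the sitemap as invalid instead of continuing to the parsing step.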