GateNLP / ultimate-sitemap-parser

Ultimate Website Sitemap Parser
https://mediacloud.org/

Provide a simple mechanism to parse raw sitemap content #26

Open dsoprea opened 3 years ago


I had to use XMLSitemapParser directly to accomplish this. It's more of a hack, since the project seems geared to work only with HTTP URLs, and the parser classes always want both a URL and an HTTP client object. However, the URL is only used for logging, and the client object is only used in very narrow use cases (which will never apply to me/us). So requiring HTTP seems like an arbitrary requirement most of the time.
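
To illustrate what parsing raw content without a URL or an HTTP client could look like, here is a minimal sketch using only Python's standard library. It is not this project's API: `parse_sitemap_content` and `SITEMAP_NS` are hypothetical names, and the function simply collects every `<loc>` value in the standard sitemap namespace (no gzip, RSS/Atom, or recursion into sub-sitemaps).

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol, in ElementTree's "{uri}" form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap_content(xml_text: str) -> List[str]:
    """Return the text of every <loc> element found in raw sitemap XML.

    Hypothetical helper for illustration; it takes only the content,
    with no URL and no web client involved.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]
```

Something like `parse_sitemap_content(open("sitemap.xml").read())` would then work on a local file without touching the network.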

You might just add general support for the "file:" scheme, which would resolve both issues: the validation that only allows HTTP URLs, and our having to use the internal classes directly, which don't appear to have been meant to be used that way.
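
As a rough sketch of the "file:" idea, the scheme check could accept `file` alongside `http` and `https` and read local files directly. This uses only the standard library and is an assumption about how such support might look, not the project's actual validation code; `read_sitemap_source` and `ALLOWED_SCHEMES` are hypothetical names.

```python
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import url2pathname, urlopen

# Hypothetical allow-list: the existing HTTP schemes plus "file".
ALLOWED_SCHEMES = {"http", "https", "file"}


def read_sitemap_source(url: str) -> bytes:
    """Return the raw bytes behind an http(s): or file: URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Unsupported URL scheme: {scheme!r}")
    if scheme == "file":
        # Convert the URL path (possibly percent-encoded) to a local path.
        return Path(url2pathname(parts.path)).read_bytes()
    # http/https: plain fetch with the standard library.
    with urlopen(url) as response:
        return response.read()
```

A local sitemap could then be addressed as `Path("sitemap.xml").resolve().as_uri()`, so callers keep passing a URL and never need the internal parser classes.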