DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

xmlPaths in .opml feed definition files are unescaped #96

Open agodbehere opened 6 years ago

agodbehere commented 6 years ago

Issue

When an RSS feed path contains special characters such as &, baleen fails to find posts.

I have confirmed that this can be corrected by manually escaping xmlPaths.

Resolution

For example, a URL like http://foobar.com/RSS.ashx?max=100&ContentType=1&Site=727 in an .opml file must be changed to http://foobar.com/RSS.ashx?max=100&amp;ContentType=1&amp;Site=727 in order to work correctly. baleen should escape such URLs automatically so that this manual correction is unnecessary.
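A minimal sketch of what that automatic escaping could look like, using only the Python standard library. The helper name `escape_xml_url` and the `outline`/`xmlUrl` element used for the round-trip check are illustrative assumptions, not part of baleen's code; the regex leaves ampersands that already start an entity reference untouched, so running it twice is safe.

```python
import re
import xml.etree.ElementTree as ET

# Matches an ampersand that does NOT already begin an XML entity reference
# (named entities such as &amp; or numeric ones such as &#38; / &#x26;).
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def escape_xml_url(url):
    """Hypothetical helper: escape bare ampersands for use in an XML attribute."""
    return BARE_AMPERSAND.sub("&amp;", url)


raw = "http://foobar.com/RSS.ashx?max=100&ContentType=1&Site=727"
escaped = escape_xml_url(raw)

# Round-trip check: an XML parser should hand back the original URL
# once the attribute value has been escaped.
node = ET.fromstring('<outline xmlUrl="%s"/>' % escaped)
recovered = node.get("xmlUrl")
```

Here `escaped` is the form that would be written into the .opml file, and `recovered` is what a conforming XML parser returns when reading it back.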

bbengfort commented 6 years ago

@agodbehere thanks for the bug report -- I'll take a look at this when I get the chance; we're using BeautifulSoup for the XML parsing, and that's my current theory of where the escaping is happening. Pull requests to the develop branch are welcome!