aaronpk / Watchtower

🏰 a minimal API for watching web pages for changes, roughly speaks the WebSub protocol
Apache License 2.0
42 stars 4 forks

Compare only text content of XML feeds #11

Open aaronpk opened 2 years ago

aaronpk commented 2 years ago

Currently, HTML feeds are diffed using only their text content, so that meta tags and other invisible markup don't make Watchtower think a feed has changed. However, this logic isn't yet applied to XML feeds.

This came up because the contents of someone's feed changed on every fetch, causing Watchtower to poll it every 5 minutes. However, the only change was the mailto value of an href, apparently randomized by an email obfuscation tool.

Watchtower should instead compare the text content of the XML feed to avoid things like this. Note that in this instance the content of the description element is wrapped in CDATA, so the HTML parser will need to be brought in here too.
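
A rough sketch of the idea, in Python for illustration only. This is not Watchtower's code, and the names `html_to_text` and `feed_text_fingerprint` are hypothetical. It assumes the feed is well-formed XML, and it runs every text node through an HTML parser so that markup inside a CDATA description (including attributes like the mailto href) is dropped before hashing:

```python
import hashlib
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and discards tags and their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return only the text of an HTML fragment (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def feed_text_fingerprint(xml_source):
    """Hash the visible text of an XML feed (hypothetical helper).

    XML attributes are ignored because itertext() yields only text nodes.
    CDATA sections arrive as plain text, so each text node is also passed
    through the HTML parser to strip any embedded markup.
    """
    root = ET.fromstring(xml_source)
    chunks = []
    for node_text in root.itertext():
        text = html_to_text(node_text)
        chunks.append(" ".join(text.split()))  # normalize whitespace
    joined = " ".join(c for c in chunks if c)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two fetches of a made-up feed that differ only in the obfuscated mailto.
FEED_TEMPLATE = (
    "<rss><channel><item><title>Hello</title>"
    '<description><![CDATA[<p>Contact <a href="mailto:{addr}">me</a></p>]]>'
    "</description></item></channel></rss>"
)
first_fetch = FEED_TEMPLATE.format(addr="x7f3@example.com")
second_fetch = FEED_TEMPLATE.format(addr="q9k2@example.com")

# The fingerprints match, so this pair would not count as a change.
assert feed_text_fingerprint(first_fetch) == feed_text_fingerprint(second_fetch)
```

One consequence of comparing text only: values that live solely in attributes (an Atom link's href, for example) are ignored as well, so a change there would not register.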