aaronpk / Watchtower

🏰 a minimal API for watching web pages for changes, roughly speaks the WebSub protocol
Apache License 2.0

self::strip_html not correctly stripping html tags #7

Open swentel opened 6 years ago

swentel commented 6 years ago

My timeline (https://realize.be/timeline) never updates in Aperture. I was able to install Watchtower locally, and when I look at the logs, every run ends with 'No change':

# Beginning job: Jobs\CheckFeed::poll
Checking feed 14 https://realize.be/timeline 'text/html; charset=UTF-8'
No change

# Job Complete

Even when I put completely different content for this feed in {feed_id}.txt, it still says that there is no change. While debugging, I figured out that the problem is in the self::strip_html($previous_content); function.

I've added some extra logging, and this is what it showed: $previous_content and $current_content were exactly the same, hence no change. The data in both is a script that New Relic adds at the server level.

===============================================
# Beginning job: Jobs\CheckFeed::poll
Checking feed 14 https://realize.be/timeline 'text/html; charset=UTF-8'
Hash: ef72a42ebe07c77352d8c380aaadbf52
Content type: text/html; charset=UTF-8 
Calculating changed by checking the diff between previous body and current body

window.nreum||(nreum={}),__nr_require=function(e,t,n){function
r(n){if(!t[n]){var
o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var
o=e[n][1][t];return
r(o||t)},o,o.exports)}return
t[n].exports}if("function"==typeof
__nr_require)return
__nr_require;for(var
o=0;o
---------------------------------

window.nreum||(nreum={}),__nr_require=function(e,t,n){function
r(n){if(!t[n]){var
o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var
o=e[n][1][t];return
r(o||t)},o,o.exports)}return
t[n].exports}if("function"==typeof
__nr_require)return
__nr_require;for(var
o=0;o
No change

swentel commented 6 years ago

One option to fix this is to use tidy, since strip_tags is not good at stripping malformed HTML, but that adds another server dependency, which isn't ideal.
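
Another direction that avoids an extra dependency might be to drop the contents of script and style elements before comparing text, so an injected script cannot dominate the result. The following is only a minimal sketch of that idea, written in Python for illustration rather than in Watchtower's PHP. The names `TextExtractor` and `visible_text` and the sample pages are hypothetical and not part of Watchtower, and how well this holds up on badly malformed HTML is untested here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text nodes kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # The parser hands script bodies over as plain data, so a "<"
        # inside minified JavaScript is never treated as a tag here.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace so formatting changes are ignored.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the text of an HTML document without script/style contents."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical stand-in for a script injected at the server level.
INJECTED = "<script>for(var o=0;o<n.length;o++){r(n[o])}</script>"

previous = "<html><head>" + INJECTED + "</head><body><p>Old post</p></body></html>"
current = "<html><head>" + INJECTED + "</head><body><p>New post</p></body></html>"

print(visible_text(previous))  # Old post
print(visible_text(current))   # New post
print(visible_text(previous) != visible_text(current))  # True
```

With the script contents skipped, the two sample pages compare as different even though they share the same injected script. In PHP, the equivalent step would have to run before the strip_tags call.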