Closed b5 closed 7 years ago
A quick question before I dive in -- is this fully from scratch or using some existing HTML-parsing lib (like the 'go' equivalent of BeautifulSoup or NLTK)?
definitely the latter, drop-dead-simple approach. it uses this library: https://github.com/PuerkitoBio/goquery
Function that does the "work": https://github.com/edgi-govdata-archiving/go-calc-diff/pull/2/files#diff-7590010fb56ca464189010809411b736R12
Ok I'm going to merge this just for the sake of cleanup. Text differ works, but my guess is web monitoring will be growing past this implementation in the near future ;)
Initial stab at text-based content diffing. This adds one new query param to the /diff endpoint: passing html_text=true will convert the HTML into text before performing the diff. All other params remain the same. I've pushed this to the Heroku app for testing, as it's a non-API-breaking change.

This is also an incredibly naive text extraction, and I'm expecting it'll need some tuning before it's ready to use, but the general idea is: select the body of the document, remove any "script-like" tags, print the textual content. I've written an extensible test for this function b/c I'm assuming we'll want to collect strange examples from the wild.