Text content differ

b5 commented 7 years ago

Initial stab at text-based content diffing. This adds one new query param to the /diff endpoint: passing html_text=true converts the HTML into text before performing the diff. All other params remain the same. Since this is a non-API-breaking change, I've pushed it to the Heroku app for testing.
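For anyone testing against the deployed app, here is a rough sketch of building such a request URL. Only the /diff path and html_text=true come from the description above. The host and the `a`/`b` parameter names are placeholders I made up for illustration, so substitute whatever the endpoint actually expects.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDiffURL assembles a /diff request URL.
// ASSUMPTION: the "a" and "b" parameter names are hypothetical stand-ins for
// the endpoint's existing params; only html_text is taken from this PR.
func buildDiffURL(base, a, b string, htmlText bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/diff"

	q := url.Values{}
	q.Set("a", a) // hypothetical: first document to compare
	q.Set("b", b) // hypothetical: second document to compare
	if htmlText {
		// Ask the server to convert the HTML to text before diffing.
		q.Set("html_text", "true")
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// example.invalid is a placeholder host, not the real deployment.
	u, err := buildDiffURL("https://example.invalid", "https://example.com/v1", "https://example.com/v2", true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

Leaving out html_text (or passing `false` here) yields the same request as before this change, which is why the addition is not API-breaking.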

The text extraction is also incredibly naive, and I expect it'll need some tuning before it's ready to use, but the general idea is: select the body of the document, remove any "script-like" tags, and print the textual content. I've written an extensible test for this function b/c I'm assuming we'll want to collect strange examples from the wild.
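To make the three steps concrete, here is a minimal stdlib-only sketch. It is not the function in this PR: it swaps in regular expressions for a real HTML parser so that it runs without dependencies, and the name `extractText`, the list of "script-like" tags, and the whitespace collapsing are all my own assumptions. The table of cases in `main` shows one way an extensible test could collect odd examples.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// ASSUMPTION: which tags count as "script-like" is a guess for illustration.
var scriptLike = []string{"script", "style", "noscript"}

var (
	bodyRe = regexp.MustCompile(`(?is)<body\b[^>]*>(.*?)</body\s*>`)
	tagRe  = regexp.MustCompile(`<[^>]*>`)
)

// extractText is a hypothetical, regexp-based approximation of the steps
// described above. Regexps are not a robust way to parse HTML.
func extractText(doc string) string {
	// 1. Select the body of the document (fall back to the whole input).
	if m := bodyRe.FindStringSubmatch(doc); m != nil {
		doc = m[1]
	}
	// 2. Remove script-like elements together with their contents.
	for _, tag := range scriptLike {
		re := regexp.MustCompile(`(?is)<` + tag + `\b[^>]*>.*?</` + tag + `\s*>`)
		doc = re.ReplaceAllString(doc, " ")
	}
	// 3. Strip the remaining tags, decode entities, collapse whitespace.
	doc = tagRe.ReplaceAllString(doc, " ")
	doc = html.UnescapeString(doc)
	return strings.Join(strings.Fields(doc), " ")
}

func main() {
	// Extensible table of cases: append strange examples from the wild here.
	cases := []struct{ name, in, want string }{
		{"script removed", `<html><body><p>a</p><script>b()</script><p>c</p></body></html>`, "a c"},
		{"head ignored", `<html><head><title>T</title></head><body>text</body></html>`, "text"},
		{"entities decoded", `<body><p>x &amp; y</p></body>`, "x & y"},
	}
	for _, c := range cases {
		got := extractText(c.in)
		fmt.Printf("%-18s ok=%v got=%q\n", c.name, got == c.want, got)
	}
}
```

A parser-based version would handle malformed markup, comments, and attributes containing `>` far better than this sketch does.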

danielballan commented 7 years ago

A quick question before I dive in -- is this fully from scratch or using some existing HTML-parsing lib (like the Go equivalent of BeautifulSoup or NLTK)?

b5 commented 7 years ago

Definitely the latter, a drop-dead-simple approach. It uses this library: https://github.com/PuerkitoBio/goquery

Function that does the "work": https://github.com/edgi-govdata-archiving/go-calc-diff/pull/2/files#diff-7590010fb56ca464189010809411b736R12

b5 commented 7 years ago

Ok I'm going to merge this just for the sake of cleanup. Text differ works, but my guess is web monitoring will be growing past this implementation in the near future ;)

edgi-govdata-archiving / go-calc-diff

Text content differ #2