DocNow / diffengine

track changes to the news, where news is anything with an RSS feed
MIT License

User-configurable deletions for content normalization #21

Open ryanfb opened 7 years ago

ryanfb commented 7 years ago

It might be nice for users to be able to put an array of strings or regexes in config.yaml that can be used to normalize content before diffing.

For example, I could list 'Scroll down for video' for deletion in dailymail_diff, or, with regexes, globemail_diff might be able to remove stock price changes.
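
Roughly what I have in mind, as a minimal sketch rather than anything diffengine does today. The `normalize` function and the `patterns` list are hypothetical names, the stock-price regex is only a guess at what such a pattern might look like, and every entry is treated as a regex (literal strings go through `re.escape` first):

```python
import re

def normalize(text, patterns):
    """Delete every match of each pattern, then collapse leftover whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical deletions: one literal phrase and one regex for price changes.
patterns = [
    re.escape("Scroll down for video"),
    r"[+-]\d+\.\d+%",
]

# Two versions of a story that differ only in the noise we want to ignore.
old = normalize("Scroll down for video Shares of ACME +1.25% today.", patterns)
new = normalize("Shares of ACME -0.40% today.", patterns)
```

After normalization both versions reduce to the same text, so a diff between them would come up empty and nothing would be tweeted.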

Related to #10, there might be a tradeoff in where to put such an array in the YAML hierarchy. Putting it as a top-level key would mean less repetition for people using one config per news source, while putting it as a key under each feed would let people using one config for multiple news sources have different deletions for each.
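
One way to sidestep the tradeoff would be to support both levels and combine them. This is only a sketch under assumptions: the `remove` key and the `patterns_for` helper are hypothetical, and the dict stands in for whatever a parsed config.yaml would look like:

```python
def patterns_for(config, feed):
    """Combine top-level deletions with any defined for this one feed."""
    return list(config.get("remove", [])) + list(feed.get("remove", []))

# Stand-in for a parsed config.yaml; the "remove" key is hypothetical.
config = {
    "remove": ["Scroll down for video"],
    "feeds": [
        {"name": "Feed A", "url": "https://example.com/a.rss"},
        {
            "name": "Feed B",
            "url": "https://example.com/b.rss",
            "remove": [r"[+-]\d+\.\d+%"],
        },
    ],
}

# Each feed gets the shared deletions plus its own, if it defines any.
per_feed = {feed["name"]: patterns_for(config, feed) for feed in config["feeds"]}
```

People with one config per news source would only ever need the top-level key, and people with many feeds in one config could add per-feed entries without repeating the shared ones.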

See also: #14

ruebot commented 7 years ago

Happens a lot for hockey scores on CBC and La Presse :smile:

Might kinda be related to #7 too? Or that just might be TorStar's wretched "digital platform".

edsu commented 7 years ago

I had to disable breitbart_diff because diffengine went crazy tweeting when they removed their email subscription link from the body of the story. So this is kind of an important feature to add.

ruebot commented 7 years ago

Yeah, canadaland_diff did something similar to that recently, and I have a lot of false positives.