Open struan opened 8 years ago
Thinking about this a bit more, the main problem will be creating a clean interface to this that isn't burdensome. My initial thought would be to have an ignore_diffs array that you can configure on ScrapedPagesArchive: a set of regular expressions to match against, with the page only saved if the diff has non-matching changes. E.g.:
ScrapedPagesArchive.ignore_diffs = ['class="form-actions form-wrapper', 'class="view view-representative-page']
Not really sure how the implementation of this would work under the hood though.
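One possible shape for the check, as a rough sketch only: compare the stored and freshly fetched page bodies line by line, and treat the page as changed only if some differing line matches none of the ignore_diffs patterns. The helper names changed_lines and significant_change? are hypothetical and not part of ScrapedPagesArchive, and the line comparison is a plain array difference rather than a real diff, so it ignores reordered or duplicated lines.

```ruby
# Hypothetical helper (not part of ScrapedPagesArchive): returns the lines
# that appear in only one of the two versions of a page.
def changed_lines(old_body, new_body)
  old_lines = old_body.lines.map(&:chomp)
  new_lines = new_body.lines.map(&:chomp)
  (old_lines - new_lines) + (new_lines - old_lines)
end

# Hypothetical helper: true when at least one changed line matches none of
# the ignore patterns, i.e. the page has a change worth saving.
def significant_change?(old_body, new_body, ignore_diffs)
  patterns = ignore_diffs.map { |pattern| Regexp.new(pattern) }
  changed_lines(old_body, new_body).any? do |line|
    patterns.none? { |re| re.match?(line) }
  end
end
```

With something like this, the archive would only commit the new copy when significant_change?(stored_body, fetched_body, ScrapedPagesArchive.ignore_diffs) is true, assuming ignore_diffs is exposed as in the example above.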
I'm not so sure that's much of a problem. As you say, the interface for doing this is going to be difficult, and it seems like it could carry a high risk of people discovering that they've accidentally said to ignore a lot more than they wanted. Or, from another angle, if the problem is just in browsing the diffs, it seems like that would be better solved when browsing, rather than by storing less information in the first place.
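For comparison, filtering at browse time could look something like the sketch below: everything stays in the archive, and noisy lines are only hidden when reading a diff. It assumes the input is unified diff text (such as the output of git diff), and filter_diff is a hypothetical helper, not an existing tool. Hunk headers are left untouched, so the result is only good for reading, not for applying as a patch.

```ruby
# Hypothetical helper: hides added/removed lines that match any of the
# ignore patterns, leaving file headers, hunk headers and context lines alone.
def filter_diff(diff_text, ignore_patterns)
  patterns = ignore_patterns.map { |pattern| Regexp.new(pattern) }
  diff_text.lines.reject do |line|
    # Lines starting with +++ or --- are treated as file headers, so a
    # changed line whose content begins with "--" or "++" is never hidden.
    change = line.start_with?('+', '-') && !line.start_with?('+++', '---')
    change && patterns.any? { |re| re.match?(line) }
  end.join
end
```

Since nothing is discarded from the archive, a pattern that turns out to be too broad only hides lines in the view and can be corrected later.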
E.g. on the Myanmar scraper there are pages that change every time you load them, due to what look like MAC-type form security things:
https://github.com/struan/myanmar_house_of_representatives/commit/22cdd6ab5d90ca9eace2e365d9317cd78be8da85#diff-0e75d5c1604611ca256c471f053f2084
It might be nice if there were a way to exclude these changes to make perusing history a bit easier. I guess git blame stops this from being a serious problem, but seeing how the page evolves over time is going to be hard.