everypolitician / scraped_page_archive

Create an archive of HTML pages scraped by a Ruby scraper
MIT License

Avoid storing file changes that are due to always changing page elements #20

Open struan opened 8 years ago

struan commented 8 years ago

E.g. on the Myanmar scraper there are pages that change every time you load them, due to what look like MAC-type form security tokens:

https://github.com/struan/myanmar_house_of_representatives/commit/22cdd6ab5d90ca9eace2e365d9317cd78be8da85#diff-0e75d5c1604611ca256c471f053f2084

It might be nice if there was a way to exclude these changes to make perusing the history a bit easier. I guess git blame means this isn't a serious problem, but seeing how the page evolves over time is going to be hard.

struan commented 8 years ago

Thinking about this a bit more, the main problem will be creating a clean interface that isn't burdensome. My initial thought would be to have an ignore_diffs array, configurable on ScrapedPagesArchive, containing regular expressions to match against; the page would only be saved if the diff contains changes that don't match any of them. e.g.:

ScrapedPagesArchive.ignore_diffs = ['class="form-actions form-wrapper', 'class="view view-representative-page']

Not really sure how the implementation of this would work under the hood though.
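One possible shape for it, sketched below. This is purely hypothetical: the `ignore_diffs` accessor and the `save_page?` helper are assumptions for illustration, not part of the current gem, and a real implementation would diff against the archived copy in the scraped-pages branch rather than two strings.

```ruby
# Hypothetical sketch of the proposed ignore_diffs behaviour.
# ScrapedPagesArchive.ignore_diffs and save_page? are invented names,
# not existing API in the scraped_page_archive gem.
class ScrapedPagesArchive
  class << self
    attr_accessor :ignore_diffs
  end
  self.ignore_diffs = []

  # Returns true if the difference between the previously archived page
  # and the freshly scraped page contains at least one changed line that
  # matches none of the ignore_diffs patterns, i.e. a "real" change
  # worth committing to the archive.
  def self.save_page?(old_html, new_html)
    old_lines = old_html.lines
    new_lines = new_html.lines
    # Lines present in one version but not the other (a crude line diff).
    changed = (old_lines - new_lines) + (new_lines - old_lines)
    changed.any? do |line|
      ignore_diffs.none? { |pattern| line.match?(Regexp.new(pattern)) }
    end
  end
end

# With only the security-token line changing, the page is skipped;
# once an unmatched line also changes, it is saved.
ScrapedPagesArchive.ignore_diffs = ['class="form-actions form-wrapper']
```

The weak point tmtmtmtm raises below applies directly here: an overly broad pattern silently swallows genuine changes, and the line-based diff above would also miss a real change that happens to sit on the same line as an ignored token.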

tmtmtmtm commented 8 years ago

I'm not so sure that's much of a problem. As you say, the interface for doing this is going to be difficult, and it seems to carry a high risk of people discovering that they've accidentally ignored far more than they intended. Or, from another angle, if the problem is just in browsing the diffs, it seems like that would be better solved at browsing time, rather than by storing less information in the first place.