everypolitician / scraped_page_archive

Create an archive of HTML pages scraped by a Ruby scraper
MIT License

What should happen when the same url returns different html? #33

Open chrismytton opened 8 years ago

chrismytton commented 8 years ago

Problem

In the uganda-parliament-scraper, most of the html is returned from one of 3 urls. I think this is because the site is using the session to store the parameters for the search and results, so the url doesn't change, but the html being returned does.

Proposed solution

I'm not entirely sure. This specific case might be solvable by looking at the cookies for the request, but it would be good to solve this more generally. Perhaps we need a way for users to supply a custom response class, which could return a unique identifier for the request so that each response can be written to the filesystem separately.
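To make the custom response class idea concrete, here is a minimal sketch in plain Ruby. The class names (`IdentifiedResponse`, `CookieAwareResponse`) and the `archive_id` method are hypothetical and are not part of scraped_page_archive's existing API; the cookie name in the usage lines is also just a placeholder. The default identifier depends only on the URL, and a user-supplied subclass mixes in the request cookies.

```ruby
require 'digest'

# Hypothetical base class (not the gem's real API): by default the
# identifier depends only on the URL.
class IdentifiedResponse
  attr_reader :url, :body, :cookies

  def initialize(url:, body:, cookies: {})
    @url = url
    @body = body
    @cookies = cookies
  end

  # Stable identifier used to name the archived file.
  def archive_id
    Digest::SHA1.hexdigest(url)
  end
end

# Hypothetical user-supplied subclass: also hashes the request cookies,
# sorted by name so that cookie ordering does not change the identifier.
class CookieAwareResponse < IdentifiedResponse
  def archive_id
    cookie_part = cookies.sort.map { |name, value| "#{name}=#{value}" }.join(';')
    Digest::SHA1.hexdigest([url, cookie_part].join("\n"))
  end
end

# Same URL, different session cookie (placeholder values).
first  = CookieAwareResponse.new(url: 'http://example.com/list', body: '<p>A</p>', cookies: { 'session' => 'abc' })
second = CookieAwareResponse.new(url: 'http://example.com/list', body: '<p>B</p>', cookies: { 'session' => 'def' })
puts first.archive_id == second.archive_id
```

This only helps when the cookies actually differ between requests. If the site keeps the same session cookie and changes state on the server, the identifier would need another input, such as the response body.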

Acceptance criteria

The uganda-parliament-scraper should be able to save a separate page on disk for each person page that we scrape.
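One possible way to meet this, sketched under the assumption that a digest of the response body is an acceptable part of the filename: store each page under a directory derived from the URL, with the file named after the body's digest. The helpers `archive_path` and `save_page` and the example URL are hypothetical, and this is not how the gem currently lays out files.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: one directory per URL, one file per distinct body.
def archive_path(root, url, body)
  url_key  = Digest::SHA1.hexdigest(url)
  body_key = Digest::SHA1.hexdigest(body)
  File.join(root, url_key, "#{body_key}.html")
end

# Writes the body to its content-addressed path and returns that path.
def save_page(root, url, body)
  path = archive_path(root, url, body)
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, body)
  path
end

# Two different pages served from the same (placeholder) URL end up
# as two separate files in a temporary directory.
Dir.mktmpdir do |root|
  url = 'http://example.com/members'
  save_page(root, url, '<h1>Person A</h1>')
  save_page(root, url, '<h1>Person B</h1>')
  puts Dir.glob(File.join(root, '*', '*.html')).length
end
```

The trade-off is that the path can no longer be computed from the request alone, so replaying an archived page would need an extra index from request to body digest.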

chrismytton commented 8 years ago

most of the html is returned from one of 3 urls

I should be more specific: the html with lists of people is all served from one identical url, the homepage from another, and the empty results page from a third. In the case of Uganda I think there is actually a different url for each person, but in other places there may not be.