Queens-Hacks / qcumber-scraper

Scrapes SOLUS and generates structured data

Cookie-based state #1

Closed mystor closed 10 years ago

mystor commented 10 years ago

Based on preliminary research, Phil and I have determined that SOLUS stores the current state in cookies. One cookie is identical across accounts and computers, as long as you are on the same page. Changing this cookie (along with the POST request) can take you to a page that would otherwise be unreachable from your current page.

This cookie is set on the domain "saself.ps.queensu.ca" and is called PS_PERSIST.

We should use a combination of PS_PERSIST and multiple copies of the cookiejar to allow parallel scraping (possibly 10+ concurrent connections). This would mean far fewer requests to the server, since each index page would only need to be requested once and multiple pages could be requested concurrently.
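
A rough sketch of what this proposal could look like in Python, using only the standard library. Everything except the domain and the cookie name PS_PERSIST is an assumption made for illustration: the cookie values, the page names, and the helpers `make_cookie`, `clone_jar`, `fetch_page` and `scrape_parallel` are all hypothetical. `fetch_page` is a placeholder that only reads the jar back and makes no request to SOLUS.

```python
import copy
from concurrent.futures import ThreadPoolExecutor
from http.cookiejar import Cookie, CookieJar

DOMAIN = "saself.ps.queensu.ca"  # domain named in this issue


def make_cookie(name, value, domain=DOMAIN):
    """Build a session cookie for the given domain (values are illustrative)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def clone_jar(base, persist_value):
    """Copy every cookie into a fresh jar, then override PS_PERSIST."""
    jar = CookieJar()
    for cookie in base:
        jar.set_cookie(copy.copy(cookie))
    # Same domain/path/name, so this replaces the copied PS_PERSIST cookie.
    jar.set_cookie(make_cookie("PS_PERSIST", persist_value))
    return jar


def fetch_page(page, jar):
    """Placeholder for the real request: report which state the jar carries."""
    state = {cookie.name: cookie.value for cookie in jar}
    return page, state["PS_PERSIST"]


def scrape_parallel(base, page_states, workers=10):
    """Give each page its own jar and fetch all pages concurrently."""
    jars = {page: clone_jar(base, value) for page, value in page_states.items()}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda item: fetch_page(*item), jars.items()))


# Hypothetical logged-in jar obtained after loading an index page once.
base_jar = CookieJar()
base_jar.set_cookie(make_cookie("SESSION", "abc123"))
base_jar.set_cookie(make_cookie("PS_PERSIST", "index"))

results = scrape_parallel(base_jar, {"page_a": "state_a", "page_b": "state_b"})
```

Each worker gets its own jar, so overriding PS_PERSIST for one page cannot disturb the state another worker is relying on, and the original jar is left untouched.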

mystor commented 10 years ago

Turns out this doesn't work (we found this out a while ago), so I'm closing the issue.