Queens-Hacks / qcumber-scraper

Scrapes SOLUS and generates structured data

Threaded Scraper take 1 #2

Closed mystor closed 10 years ago

mystor commented 10 years ago

I took a quick shot at making a threaded scraper. Unfortunately, I was far too optimistic about how independent each server request would be (there is too much server-side state :cry:).

So, the code on this branch doesn't completely work. Logging in and making requests both function. I will be refactoring the code over the next few days, and it should be able to scrape SOLUS relatively easily (I will probably do the scraping by-letter, with a separate session for each letter. I THINK that should work).

I have been testing with 10 concurrent request threads, and SOLUS doesn't seem to mind at all, so I will probably stick with that number.
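The by-letter threading idea above can be sketched roughly as follows. This is only an illustration: `scrape_letter` is a placeholder standing in for "log in with a fresh session, then scrape everything under this letter", and is not part of the actual scraper.

```python
import string
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 10  # SOLUS seemed fine with 10 concurrent sessions


def scrape_letter(letter):
    # Placeholder: a real implementation would log in with its own
    # session here, so server-side state never crosses between letters.
    return (letter, f"subjects starting with {letter}")


def scrape_all():
    # One job per letter; the pool caps us at MAX_THREADS sessions.
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        return dict(pool.map(scrape_letter, string.ascii_uppercase))
```

Keeping one session per letter sidesteps the shared server-side state problem at the cost of 26 logins, which is negligible next to the scrape itself.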

Also, about the commit messages, I'm sorry.

pR0Ps commented 10 years ago

The job is currently divided up by letters and (if the number of threads is large enough compared to the number of letters) subjects. Even though each job starts executing from the main page, the overhead of navigating to the start of its job is at most a single POST request. It doesn't really seem worth it to mess with implementation-level things like cookies (which could change at any time since they don't directly affect actual users). It might be better to keep the threading but just let the session handle the cookies in each thread.

EDIT: Ah, I missed a big selling point of the cookie manipulation. It would help even in a single thread. Instead of doing, say, 3 post-wait-receive-parse cycles to get from a deep scrape back to the subject page and into another course, we could just set the cookie and do the POST to go directly there. That would probably have a pretty huge impact on the runtime/bandwidth if used correctly. I still don't know if it should be used for everything, but having cookies in a stack could be extremely useful.

It shouldn't require any structural changes to add cookie storage/replay into an existing system, so maybe this is better saved for later in development.
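The "cookies in a stack" idea could look something like the sketch below. The cookie jar is modelled with a plain dict for illustration; a real session (e.g. a `requests.Session`) would snapshot and restore its actual cookie jar instead.

```python
class CookieStack:
    """Save and restore cookie snapshots so the scraper can jump
    straight back to an earlier page instead of replaying several
    post-wait-receive-parse navigation cycles."""

    def __init__(self, cookies):
        self.cookies = cookies  # dict of cookie name -> value
        self._stack = []

    def push(self):
        # Snapshot the current server-side "position".
        self._stack.append(dict(self.cookies))

    def pop(self):
        # Jump back: one cookie restore replaces a chain of requests.
        restored = self._stack.pop()
        self.cookies.clear()
        self.cookies.update(restored)
```

Usage would be: push before descending into a course's deep pages, pop (then do a single POST) to land back on the subject page.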

mystor commented 10 years ago

Yes, on Tuesday Phil and I were doing some research and it appeared as though modifying cookies was sufficient to allow for independent requests.

Through experimentation, I have since realized that this is not the case. I will instead rework the system to split into discrete jobs at the top level, as it may even be faster. On 2013-12-19 4:07 AM, "Carey Metcalfe" notifications@github.com wrote:

I'm working on this as well, check out the master branch. It takes in a job, splits it up into discrete pieces, loads all the pieces into a queue, spins up a few threads and lets them login and start processing all the pieces until the job is done.

As of now it can concurrently navigate through letters and subjects, printing the data it encounters. It doesn't store anything or go any deeper than looking at all the courses in a subject though.

I'm not sure we should really be messing around with the cookies. As long as we have a few discrete sessions, we'll see a huge increase in speed. Keep in mind this is a multi-hour job; shaving seconds doesn't really matter. What I've done is just split up the main job into a bunch of smaller ones in such a way that if a single thread executed them, it would be almost exactly the same as before, but it also works with multiple threads.

The job is currently divided up by letters and (if the number of threads is large in comparison to the number of letters) subjects. Even though each job starts executing from the main page, the overhead of navigating to the start of its job is at most a single POST request. Not really worth messing with implementation-level things like cookies (which could change at any time since they don't affect an actual user).
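The queue-based splitting described in the quoted comment can be sketched as follows. `process_piece` and the piece format are placeholders, not the real scraper's API: each worker would log in once with its own session and then drain pieces until the queue is empty.

```python
import queue
import threading

NUM_THREADS = 4


def process_piece(piece):
    # Placeholder for "navigate from the main page to this
    # letter/subject and scrape it" (at most one extra POST
    # of overhead per piece).
    return f"scraped {piece}"


def run_job(pieces):
    # Load all the discrete pieces into a queue...
    jobs = queue.Queue()
    for piece in pieces:
        jobs.put(piece)

    results = []
    lock = threading.Lock()

    def worker():
        # ...then each thread logs in and processes pieces until
        # the queue is drained.
        while True:
            try:
                piece = jobs.get_nowait()
            except queue.Empty:
                return
            result = process_piece(piece)
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the pieces are independent, one thread running them serially produces the same result as several threads running them concurrently, which is exactly the property the comment describes.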

— Reply to this email directly or view it on GitHub: https://github.com/Queens-Hacks/qcumber-scraper/pull/2#issuecomment-30914236