Queens-Hacks / qcumber-scraper

Scrapes SOLUS and generates structured data

Job Splitting #3

Open mystor opened 10 years ago

mystor commented 10 years ago

I assume that if we are going to split the job at a higher level (say letters) we would like to split them as evenly as possible.

I took a quick count of how many classes there are in each letter:

A = 327
B = 196
C = 1300
D = 196
E = 937
F = 347
G = 467
H = 546
I = 57
J = 31
K = 98
L = 702
M = 1360
N = 95
O = 98
P = 1096
Q = 4
R = 187
S = 393
T = 261
U = 31
V = 0
W = 11
X = 14
Y = 0
Z = 0
0 = 0
1 = 0
2 = 0
3 = 0
4 = 0
5 = 0
6 = 0
7 = 0
8 = 0
9 = 0

I came up with two splits, depending on whether we can split a single letter across multiple jobs:

# With letter splitting (threadcount == 18)
A + B = 523
D + F = 543
C/2 = 650
C/2 = 650
E/2 = 468.5
E/2 = 468.5
G = 467
H = 546
I + J + K + N + O = 379
L/2 = 351
L/2 = 351
M/2 = 680
M/2 = 680
P/2 = 548
P/2 = 548
Q + R + S = 584
T + U + V + W + X + Y + Z etc… = 317
# Without letter splitting (threadcount == 8)
A + B + D + F = 1,066
C = 1,300
E = 937
G + H = 1,013
I + J + K + L + N + O = 1,081
M = 1,360
P = 1,096
Q + R + S + T + U + V + W + X + Y + Z etc… = 901

I'm not sure if we should bother splitting them up like this, but if we decide to run a set of jobs by letter, it may be worth adding multiple jobs for the letters C, E, L, M, and P so that the scraper can run faster.
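
Roughly, a split like that could be generated automatically from the estimated counts. This is only a sketch with the numbers above hardcoded; nothing here exists in the scraper:

# Sketch only: greedily pack letters (or letter halves) into roughly-even jobs.
# letter_counts is just the table above; none of this is actual scraper code.
import heapq

letter_counts = {
    "A": 327, "B": 196, "C": 1300, "D": 196, "E": 937, "F": 347, "G": 467,
    "H": 546, "I": 57, "J": 31, "K": 98, "L": 702, "M": 1360, "N": 95,
    "O": 98, "P": 1096, "Q": 4, "R": 187, "S": 393, "T": 261, "U": 31,
    "V": 0, "W": 11, "X": 14, "Y": 0, "Z": 0,
}

def split_jobs(counts, num_jobs, split_letters=False):
    target = sum(counts.values()) / num_jobs
    pieces = []
    for letter, count in counts.items():
        if split_letters and count > target:
            # big letters get halved, like C/2, M/2, etc. above
            pieces += [(letter, count / 2), (letter, count / 2)]
        else:
            pieces.append((letter, count))

    # always hand the next-biggest piece to the currently-smallest job
    jobs = [(0, i, []) for i in range(num_jobs)]
    heapq.heapify(jobs)
    for letter, count in sorted(pieces, key=lambda p: p[1], reverse=True):
        total, i, letters = heapq.heappop(jobs)
        letters.append(letter)
        heapq.heappush(jobs, (total + count, i, letters))
    return sorted(jobs)

for total, _, letters in split_jobs(letter_counts, 8):
    print(total, "+".join(letters))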

Graham42 commented 10 years ago

Alternatively, if you had a way to discover all courses and throw them in a synchronized queue, you could spool up any number of threads that would each just get the next item from the synchronized queue and process it.
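
For what it's worth, the standard library makes that pattern pretty small. A minimal sketch, where scrape_course and the course list are just stubs:

# Sketch of the "one synchronized queue, N identical workers" idea.
import queue
import threading

def scrape_course(course):          # stand-in for the real per-course scrape
    print("scraping", course)

def worker(q):
    while True:
        course = q.get()
        if course is None:          # sentinel: no more work
            q.task_done()
            break
        scrape_course(course)
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()
for course in ["CISC 121", "CISC 124", "MATH 110"]:   # placeholder course list
    q.put(course)
for _ in threads:
    q.put(None)
for t in threads:
    t.join()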

mystor commented 10 years ago

Yes, that's what I was trying to do in the other branch (see #2). The problem is that SOLUS stores too much state server-side, so if we wanted to reliably retrieve a course, we would have to make the ~5 requests to log in, ~3 requests to get to the course listing page, and ~2 more requests to read the course.

My fork tried to copy cookie state and then start new requests with the old cookie state, but unfortunately that doesn't account for the server-side state which SOLUS keeps :cry:.

We simply don't know how to reliably get a single course without first navigating to the course listing page, and doing that for every course would be far too inefficient.
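
To put rough numbers on it (an illustration using the counts from the first comment, and assuming the login could at least be shared):

courses = 8754                 # sum of the per-letter counts from the first comment
per_course = 3 + 2             # re-navigate to the listing + read the course, every time
print(courses * per_course)    # ~43,770 requests before a single retry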

pR0Ps commented 10 years ago

@Graham42: The synchronized queue is exactly what's happening now, but instead of courses, the job is divided up into letters (and if the number of threads is large compared to the letters, subjects as well). The threads then take the jobs out of the queue and process them. Since the lowest level of splitting is by subject, there is a maximum of a single post request wasted per job (to get the correct letter).

Subject splitting is done with the step functionality of range(). If a letter is split between 3 threads, the ranges will look like range(0, num, 3), range(1, num, 3), range(2, num, 3). This ensures no overlap without having to know the actual number of subjects while splitting them. Of course, the actual number is needed when iterating (well, not really, but it makes it nicer), but at that point you're on the page and can easily get that information.
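
For example (num is a made-up subject count here; the real one is read off the page):

num = 10
parts = [list(range(offset, num, 3)) for offset in range(3)]
print(parts)   # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]] -- every index appears exactly once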

I was thinking of making a scout() function that would look in each letter when dividing up the jobs, but ended up scrapping it. It's still on the table; however, I'd need to see some proof that it actually speeds up the process, even with the overhead, before considering it.

The thing is that I'm not expecting the entire scrape to be done in parallel. I'm expecting that the job will be divided up into pieces that are equal enough such that if there is a large job, then a single thread will block on it, but meanwhile the other threads will tick off 5 or 10 other jobs.

The jobs don't all have to be equal for maximum efficiency, they just have to be equal enough such that the largest job is never bigger than the entire scrape job divided by the number of threads.
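
As a rough check against the 8-job split from the first comment:

jobs = [1066, 1300, 937, 1013, 1081, 1360, 1096, 901]
print(sum(jobs) / 8, max(jobs))   # ~1094 ideal share per thread vs 1360: the M job sets the floor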

mystor commented 10 years ago

Indeed. I have an idea for a simple algorithm. I have to drive to Ottawa and pick up my family today, but I'll try to put it together soon.

My algorithm should have minimal impact because it will use estimated values (the ones I got) rather than fetching the actual ones from the server.

I predict that with larger numbers of threads it could greatly increase the speed of the scrape.

pR0Ps commented 10 years ago

I'm wary about using hardcoded numbers for anything, even for an estimate. It means that if something drastically changes or we wanted to scrape a different catalog, then we have to manually count everything. Having to manually update the system when the data changes is not something I'm crazy about.

The scrape doesn't have to complete after each thread finishes a single job. As long as jumping between jobs has minimal overhead (as with splitting by letter), then it's better to split it up as much as possible and let the system figure out what thread does what. This is why I split each letter into a different job, giving 26+ jobs, even when there are only a few threads. Even if one thread is taking forever, the others can process multiple easier letters, making it roughly even out. The subject splitting is there in case it's configured to run something like 5 threads on a single letter, in which case it splits the subjects in the letter into different jobs too.
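
As an illustration of the shape of that job list (not the scraper's actual code; build_jobs is made up here):

# Illustration only: one job per letter, plus subject-level splits when there
# are more threads than letters. A job is (letter, start_offset, step).
import string

def build_jobs(num_threads, letters=string.ascii_uppercase):
    per_letter = max(1, num_threads // len(letters))
    jobs = []
    for letter in letters:
        for offset in range(per_letter):
            # each job walks subjects offset, offset + step, offset + 2*step, ...
            jobs.append((letter, offset, per_letter))
    return jobs

print(len(build_jobs(4)))    # 26 jobs: one per letter
print(len(build_jobs(52)))   # 52 jobs: each letter's subjects split two ways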

If we want to be more correct in dividing up tasks, I think the best way is to do it dynamically, either by having a scouting session that looks at each letter and figures out the number of subjects (and possibly courses) in it, which is a ton of overhead, or dynamically adding jobs as an in-progress scrape discovers things. The latter sounds more promising. This goes with what you were saying in the daemon threads pull request (adding jobs dynamically and not having the threads exit).
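
A minimal sketch of that dynamic version, where a letter job pushes the subjects it discovers back onto the queue instead of scraping them itself (discover_subjects and scrape_subject are stubs, not real scraper code):

# Sketch only: jobs re-enqueue the work they discover instead of doing it inline.
import queue
import threading

def discover_subjects(letter):
    return [letter + "1", letter + "2"]          # pretend each letter has 2 subjects

def scrape_subject(subject):
    print("scraping", subject)

jobs = queue.Queue()

def worker():
    while True:
        kind, payload = jobs.get()
        try:
            if kind == "letter":
                for subject in discover_subjects(payload):
                    jobs.put(("subject", subject))   # add work as it's discovered
            else:
                scrape_subject(payload)
        finally:
            jobs.task_done()

for _ in range(8):
    threading.Thread(target=worker, daemon=True).start()
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    jobs.put(("letter", letter))
jobs.join()    # returns once every job, including the ones added mid-run, is done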

mystor commented 10 years ago

I am not suggesting that the program has to finish after each thread finishes a single job. I know that we don't have to merge letters into single jobs.

With regard to "hard coding estimates", we could write out the count during our scrape and use it as an estimate for the next time the scraper is run.
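
That could be as simple as dumping the per-letter totals to a file at the end of a run and reading them back on the next one (counts.json is just a suggested name):

import json

# at the end of a scrape, write out what was actually seen
letter_counts = {"A": 327, "C": 1300}        # placeholder totals
with open("counts.json", "w") as f:
    json.dump(letter_counts, f)

# at the start of the next run, fall back to an even split if there's no file yet
try:
    with open("counts.json") as f:
        estimates = json.load(f)
except FileNotFoundError:
    estimates = None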

I think that making estimates is an acceptable compromise that could greatly improve performance.

pR0Ps commented 10 years ago

I guess I misunderstood. It just looked that way from the way you divided up the jobs in your examples.

With regards to writing out the file, another way to get data from previous runs would be to query the API.

I'm open to other methods if they speed up the overall performance, but I don't know if I see much of an advantage coming from having more accurate job splitting.

mystor commented 10 years ago

I suppose that could work as well. For now I will write something that connects directly to SOLUS before the jobs are run; hopefully it will work out OK. We can compare speeds with a simpler delegator and a more complex one.
