Queens-Hacks / qcumber-scraper

Scrapes SOLUS and generates structured data

Scraping from the course catalogue misses some sections #29

Open Graham42 opened 8 years ago

Graham42 commented 8 years ago

Sometimes SOLUS has sections that are viewable through the search but not from the course catalog. This is really a bug in SOLUS itself, but it would be great if we could somehow get all the data. That might mean taking a step back and thinking about how we could scrape sections from the search instead of the course catalog.

At the time of writing, one such course is CISC 101.

mystor commented 8 years ago

This has been a problem for a long time; see #27 for some context. The CISC 101 problem specifically might be related to #25, which I believe was caused by SOLUS getting confused and putting 121 only under distance studies, even though it is also offered as a course on campus.

One of the problems with scraping through the search feature, rather than the course catalog, is the 200-section limit imposed on search. Unfortunately, there is no convenient set of criteria we can choose that consistently returns fewer than 200 sections (first-year engineering, for example, often has more than 200 sections).

That being said, if you come up with a way to consistently perform search scraping instead of course catalog scraping, I'm open to hearing more. I'm just not sure that it's a practical goal to have.
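
For what it's worth, one possible direction is to narrow the search adaptively: run a query, and if it comes back at the limit, split it along another filter and retry. The sketch below is only an illustration under several assumptions. `search` is a hypothetical callable (not something in this repo) that takes a dict of criteria and returns at most `limit` hashable section identifiers, and the filter names and values are placeholders. It also assumes that a result count equal to the limit means "possibly truncated", and that each filter's values together cover every section. Queries that still hit the limit once the filters run out are reported back rather than silently trusted.

```python
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple

SEARCH_LIMIT = 200  # section cap on a single search, per the discussion above

Criteria = Dict[str, str]
Filter = Tuple[str, Sequence[str]]  # (hypothetical filter name, its possible values)


def scrape_by_splitting(
    search: Callable[[Criteria], List[Hashable]],
    filters: Sequence[Filter],
    criteria: Optional[Criteria] = None,
    limit: int = SEARCH_LIMIT,
) -> Tuple[List[Hashable], List[Criteria]]:
    """Return (sections, unresolved_criteria).

    `search` is an assumed stand-in for whatever performs one search query.
    """
    criteria = dict(criteria or {})
    results = search(criteria)

    # Fewer results than the cap: assume nothing was cut off.
    if len(results) < limit:
        return list(results), []

    # At the cap with nothing left to split on: flag the query as unresolved.
    if not filters:
        return list(results), [criteria]

    # Otherwise narrow the query by each value of the next filter.
    (name, values), rest = filters[0], filters[1:]
    sections: List[Hashable] = []
    unresolved: List[Criteria] = []
    for value in values:
        found, stuck = scrape_by_splitting(
            search, rest, {**criteria, name: value}, limit
        )
        sections.extend(found)
        unresolved.extend(stuck)

    # A section may match several values of a filter, so drop duplicates
    # while keeping first-seen order.
    return list(dict.fromkeys(sections)), unresolved
```

Whether this is practical depends on whether SOLUS exposes enough filters to get a block like first-year engineering under the cap; the `unresolved` list is there so that failure is at least visible.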

Graham42 commented 8 years ago

This has been reported to timetabling through some trusted channels. I'll update if/when I hear more.

mystor commented 8 years ago

I'm pretty sure that the problem is a technical one on SOLUS's side: courses that belong to multiple course careers are listed only once, under a single course career (presumably the first one alphabetically, "Distance Studies"), which means we don't get everything. I'm not sure how much timetabling can do to fix that without IT's help.
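
If we ever have both a catalog scrape and some search results to compare, a check along these lines could confirm or rule out that guess. It is only a sketch: the `(course, career)` pair shape and the function name are made up for illustration and are not how the scraper currently stores data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Listing = Tuple[str, str]  # (course code, course career), a hypothetical shape


def careers_missing_from_catalog(
    catalog: Iterable[Listing], search: Iterable[Listing]
) -> Dict[str, List[str]]:
    """Map each course to the careers seen via search but not via the catalog."""
    seen = defaultdict(set)
    for course, career in catalog:
        seen[course].add(career)

    missing = defaultdict(set)
    for course, career in search:
        if career not in seen[course]:
            missing[course].add(career)

    # Sort the careers so the output is stable between runs.
    return {course: sorted(careers) for course, careers in missing.items()}
```

If the hypothesis holds, the affected courses should show up here with every career except the alphabetically first one.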

Graham42 commented 8 years ago

I'm hoping it will escalate to IT.