mystor opened this issue 11 years ago
Awesome! I have heard whispers of ArtSci putting out an impossible-to-find pdf of all their courses with requisite data in a way that might not be too bad to parse. Looks to be pretty well-formatted, hopefully generated from somewhere.
Not to be too much of a downer, but I think it will be hard to get this data from solus. We'd be showing the requisite info in a more structured and consistent way right now if we could, but the silly developer who set up the display of that info on solus made it a free-form text field (as far as I can tell). In that single box, whoever is inputting the course is responsible for recording all of the prerequisites, co-requisites, exclusions, and anything else like that. So you have to be able to tell what the course code you're parsing actually is (natural language processing?).
The formatting is very inconsistent, and there are even a lot of typos in the course codes. Like, 30%.
The most frustrating part of this is that they must also be categorizing requisites as pre-reqs, co-reqs, and exclusions somewhere, because solus does throw an error if you try to register for something you don't qualify for in those terms. Why they don't use that data directly when displaying it... well, there are lots of questions like that about solus.
d3 d3 d3 d3!! This visualization would be so much fun to script in d3.js if we can get the data! It's not too heavy.
Personally, I'd like to see some wiki-like features added to a lot of places in Qcumber, from just flagging wrong content to crowd-sourcing the fixes. A user account system has been on the way forever (@ChrisCooper @ChrisCooper @ChrisCooper @ChrisCooper @ChrisCooper). So much cool stuff can be added then.
As I was saying above, other departments also put out this data in ways that might be useful...
Can we empirically test this by making the scraper enrol me in different combinations of courses and seeing which ones make Solus complain? haha...
Chris keeps updating the wiki and has welcomed contributions. Sadly I can't say I've done much there, but it seems like a decent place for this kind of brainstorm. At the very least it's worth checking out all Chris' ideas and plans that he's recorded there.
Hmm, that mysterious artsci pdf may be a good solution, but it wouldn't work for other faculties unless we can find similar documents for them, and even then we would have to manage as many as 9 different scrapers... sounds like a nightmare waiting to happen.
I still haven't gotten my scraper to work (I have to set up my VM, which I don't really want to do yet, as my computer has been acting up and virtualbox refuses to start more than half of the time), so is there any chance that you could send me a development copy of the sqlite database? (Email is on my profile page.)
I think that crowdsourcing changes could be awesome, but we can't just let people loose on the webpages wiki-style, that would be chaotic. I think something as simple as a modal which appears when you click a "report error" button could work. Just make sure that it is very clear how you submit the bug, and that it doesn't take you away from the site.
Assuming that we can do basic parsing of the freeform text, we can use crowdsourcing to fix up the problems with it. We would probably have to store these crowd-sourced fixes in a separate table from the scraped information, so that we wouldn't lose it if we scraped again, and resolving conflicts between the scraped version and the crowdsourced version could be a royal pita.
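One possible conflict rule, just a sketch (the `fix` record and its field names are made up, not anything in the Qcumber schema): keep a crowdsourced fix only as long as the scraped text it was based on hasn't changed, and flag it for re-review otherwise.

```python
def resolve_requisite_text(scraped_text, fix):
    """Pick between freshly scraped text and a crowdsourced fix.

    `fix` is a hypothetical row from the separate fixes table, holding
    the corrected text plus the scraped text it was based on. Returns
    (text_to_display, needs_review).
    """
    if fix is None:
        return scraped_text, False   # nothing crowdsourced yet
    if fix["based_on"] == scraped_text:
        return fix["text"], False    # scrape unchanged; trust the fix
    return scraped_text, True        # scrape changed; the fix may be stale
```

This keeps the scraped table authoritative while letting fixes survive a re-scrape, at the cost of occasionally re-asking users when solus text changes.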
Unfortunately I don't think solus would ever let you enrol in certain types of courses, meaning that I don't think we could do that.
I had never thought of using D3. That is a good idea, and it seems like it should be able to handle a directed graph just fine. Also the output might be loads of fun to play with which can help reduce stress during course picking (damn you silly course, why do you have to conflict with everything! -- shake --).
The api suggested in #21 could be very useful for D3.js, we could load in the data which we need asynchronously and then display it. I suppose embedding the data in the page could be an option as well, but it could be nice to kill two birds with one stone.
For an alternate source for prerequisite data, what about the course list found here? I believe it's just undergraduate courses, and I'm not sure if it's just ArtSci. I haven't had a chance to look at the data coming from SOLUS, but some regex magic on that PDF might be a better option? Also, on that Academic Calendars page there's a degree plans document which might be what we're looking for in regard to program requirements data.
I haven't tried it on very much data, but I threw together a prerequisite text parser using jison. I uploaded the code in a gist. It works for all 5 of the examples I pulled off of qcumber, but I need to test it on many more. There is also the problem that jison throws out a ton of warnings when generating the parser; I don't know exactly what is causing them, but my jison grammar is apparently ambiguous, so they could be a problem.
If we used this system, I would recommend surrounding the parse call with a try/catch block, as there is a very high chance that it will fail on some of the strings, and we don't want it to blow up too badly.
Also, I don't know if there is a jison-like thing for python; if there is, this should probably be ported to it if we decide to use something like this inside of qcumber.
Finally, please don't judge the code quality, I wrote the entire thing from scratch in about 2 hours to see if I could get some good results. I recognize that the code is ugly, and would need to be fixed up before it was used in any serious way.
Made a quick python port using ply. Check out the gist.
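For anyone skimming, the core idea can be shown without ply at all. This is only a stdlib illustration of the tokenizing step (not the actual grammar from the gist), and per the try/catch advice above, anything fancier should be wrapped defensively:

```python
import re

# Matches course codes like "CISC 121", "MATH121", or "PHYS 106A",
# tolerating a missing space. Illustration only; the real parser also
# has to handle "and"/"or" grouping, which is what the grammar is for.
COURSE_CODE = re.compile(r"\b([A-Z]{3,4})\s?(\d{3}[AB]?)\b")

def extract_course_codes(requisite_text):
    """Pull normalized course codes out of a freeform requisite string."""
    return ["{} {}".format(dept, num)
            for dept, num in COURSE_CODE.findall(requisite_text)]
```

A regex pass like this could also double as a cheap fallback when the full parser throws.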
I am going to be at a cabin, and thus with almost no access to internet, for the next 10 days, so I'll work on this stuff again when I get back.
Hey @Graham42 and @Mystor, glad you guys are interested in this! Yeah, it's a fair bit frustrating, but some basic stuff seems doable. It's odd because it seems some portion of the prerequisite blurbs are computer generated and very regular and non-ambiguous, but some, as @uniphil said, are just plain perplexing.
By the way, to the three of you and @pR0Ps, I've been scaling back my Qcumber time since starting work it seems. The discovery about the [frequency of discrepancies](http://qcumber.ca/message) between the catalog and search was a big hit to my morale lol. Anyway, if any of you want to talk about getting access to perform administration tasks like running the scraper or updating the actual site, I'd be interested!
I've done a bit of work on the prerequisite chart thing. The work I have done is available here.
I haven't actually tested it against the full data set yet, as I have had no internet and couldn't scrape...
I'll take a look at it with a full data set in a day or two.
This is my current progress on the prerequisite charts:
I generate one for each of the class types. Unfortunately, it uses Viz.js, so it has a large payload which needs to be sent with the page request. Fortunately, we only need to send this payload on the prerequisite chart page. I am hoping to have the circles link to the page for the course.
By the way, some courses have stupidly complicated prerequisite trees :stuck_out_tongue:
I need to make the graph checker check whether there is an A/B version of the course being looked up; if there is, I need to use prerequisite data from that course as well. AFAIK, no course prerequisite explicitly references an A/B part of a course. Should be a simple fix.
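The A/B fallback could look something like this (a sketch; `catalog` here is just a set of known course codes, not however Qcumber actually stores them):

```python
def find_with_ab_variants(code, catalog):
    """Resolve a referenced course code against the catalog, falling
    back to A/B split versions (e.g. 'MATH 121' -> 'MATH 121A' and
    'MATH 121B'). Returns the list of codes whose prerequisite data
    should be merged into the graph.
    """
    if code in catalog:
        return [code]
    return [c for c in (code + "A", code + "B") if c in catalog]
```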
Viz.js is currently a hard dependency of the renderer. This is a problem, as Viz.js is a very large payload to send down the line. As Viz.js is generated by emscripten, there is little point in trying to minify it further, and we need to ensure that it is served gzipped, otherwise it will make for an unusable experience. Another option is to rewrite the renderer using something like D3.js, as @uniphil suggested, however that could take a lot of work, and may not produce as nice results as GraphViz. Or, if you guys feel up to it, we could generate the SVG graphics on the server by running the native C version of GraphViz, and then serve those up. That would work, but it makes the installation a bit more complicated (GraphViz will need to be installed as well, and we will probably want a python wrapper like pydot). This may be the best solution. I have made the change to pydot on the prereq-graph-native branch.
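If we do go the native-GraphViz route, the render step itself can be tiny. A sketch (degrades to `None` when GraphViz isn't installed, so the caller could fall back to client-side Viz.js):

```python
import shutil
import subprocess

def render_svg(dot_source):
    """Render DOT source to SVG with the native GraphViz `dot` binary.

    Returns None when GraphViz isn't on the PATH, so the caller can
    fall back to client-side rendering instead of crashing.
    """
    dot = shutil.which("dot")
    if dot is None:
        return None
    result = subprocess.run([dot, "-Tsvg"], input=dot_source.encode(),
                            capture_output=True, check=True)
    return result.stdout.decode()
```

pydot wraps the same binary, so this is mostly useful to show how little the server-side path actually needs.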
I am not fully certain how to link to the prereq graph. Right now I have added a small link next to the prerequisite text; I'd appreciate other suggestions.
I haven't written any tests yet. I should probably do that at some point.
As I am getting relatively close to a usable product I think it would be good to open a pull request up. I'll probably do that tomorrow unless you would prefer that I don't until the code is ready to merge.
One of the nice things the Queen's computing department has which, AFAIK, no other department has is a visual prerequisite chart. This is nice for planning out which courses you need to take and when.
As Qcumber is already parsing the prerequisite strings, it shouldn't be too difficult (I plan to take a shot at it after work finishes in mid August) to figure out which of these courses are prerequisites & which are corequisites with decent accuracy (assuming you don't already do that, I haven't actually had a chance to look at the code yet :P).
Once we finish that, we can scan through each of the subjects and generate a directed graph of prerequisites, corequisites & recommended prerequisites. If we pipe this into a program like graphviz (through something like pydot), we should get out a nice prerequisite chart with arrows and such.
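Building the DOT source for that graph is simple enough to do by hand (pydot would produce the same thing). A sketch; the edge tuple shape and the dashed/dotted styles for co-reqs and recommended prereqs are made up here, not anything settled:

```python
def prereq_graph_dot(edges):
    """Emit GraphViz DOT for a directed prerequisite graph.

    `edges` is a list of (prereq_code, course_code, kind) tuples,
    where kind is 'pre', 'co', or 'rec' (recommended).
    """
    styles = {"pre": "solid", "co": "dashed", "rec": "dotted"}
    lines = ["digraph prereqs {", "  rankdir=BT;"]
    for src, dst, kind in edges:
        lines.append('  "{}" -> "{}" [style={}];'.format(src, dst, styles[kind]))
    lines.append("}")
    return "\n".join(lines)
```

Feeding the output to `dot -Tsvg` (or Viz.js on the client) would give us the chart with arrows and such.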
We could theoretically pregenerate these at scraping time, and then simply serve them up from a static directory whenever they are requested. I would suggest making this a separate fixture on the scrape page.
Potential Problems
There are a few potential problems with doing this:
Extra fun features