didriksi / course-search

Scrape the websites of the University of Oslo to get data on course dependencies, and make it searchable.
1 stars 0 forks source link

course-search

Scrape the websites of the University of Oslo to get data on course dependencies, and make it searchable.

Quick breakdown of the modules:

Data storage

The data is stored in a Pandas dataframe, which works reasonably well. I expect performance and readability could be increased by moving to SQL, but since performance isn't a concern due to small data sizes, and SQL set up would require people cloning this repo to set up a server, pandas does well enough.

Data structure

A repeating data pattern is nested lists for course precursors. Here, each element is required, but while some elements are strings representing a single course, others are lists. In the latter case, the courses in the list are interchangeable requirements. This covers most cases, but it is impossible to represent the occassionally occuring "two among this list of courses", so it isn't perfect.

To do the latter, the CourseList module has two classes that together allow much more complicated and nested relationships, having some list of parameters select a list of courses, specifying how many of these courses need to be taken to fulfill the requirements, and then organising these requirements in a directed acyclic graph.

Inaccuracies

Scraping the websites have shown that they are quite inconsistent. The precursor course property interchangeability is especially hard to find, because it is sometimes marked with 'or' and commas, sometimes with a header like 'One of these courses' and then a list, and sometimes implied even more indirectly. There are also at least one example of it not being indicated at all, even though it is obvious through what courses were listed that they were in fact interchangeable. This means that there are some inaccuracies in the data gathered.

Further work

I am now working on features for handling not only single courses, but sets of courses. In particular, scraping for requirements for different Master's programs, and checking if a set of courses satisfy those requirements.