Our data source is the UoN Course Catalogue. The catalogue has undergone some big changes this year, but it is still far worse than Nott Course, and people have been asking me to update the data, so here is the crawler. I personally prefer Python, so I ditched the old JS crawler and rewrote everything.
A big thanks to... `module` crawler (see issue #1).
This crawler has two parts: one for the (course) modules and one for the academic plans. They are written as two (Python) modules, `module` and `plan`.
The `module` module requires Selenium, requests and BeautifulSoup, whereas the `plan` module only relies on the latter two.
An overview of the work flow:
- For the `module` module, it will first obtain a list of schools for the three campuses and store it in a JSON file. It then obtains the list of modules of each school (like the information you see here). Next, a POST request (with session information) is sent to fetch the link to each module's page (see issue #1 for details), followed by a GET request to fetch the module details (like the information you see here). The data is stored in a SQLite database where every column is TEXT; dictionaries and lists are `json.dumps`-ed into strings.
- For the `plan` module, it will first obtain a list of academic plans for each campus and store these 'plan briefs' in a JSON file. It then fetches the details of each plan (not using Selenium this time, so faster) and again stores the data in a SQLite database.
Check `schemas` for the JSON schemas of the plan and module objects stored in the SQLite database.
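To illustrate the "every column is TEXT" convention, here is a minimal sketch of how a nested value can be written to and read back from SQLite. The table name `module`, the columns `code`, `title` and `convenor`, and the helpers `save_module` and `load_module` are made up for this example and are not taken from the crawler's real schema.

```python
import json
import sqlite3

# Hypothetical table and column names; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE module (code TEXT PRIMARY KEY, title TEXT, convenor TEXT)")


def save_module(conn, module):
    """Insert one module, serialising dict/list values to JSON strings."""
    row = {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in module.items()
    }
    conn.execute(
        "INSERT OR REPLACE INTO module (code, title, convenor) "
        "VALUES (:code, :title, :convenor)",
        row,
    )
    conn.commit()


def load_module(conn, code):
    """Read one module back and decode the JSON-encoded column."""
    cur = conn.execute(
        "SELECT code, title, convenor FROM module WHERE code = ?", (code,)
    )
    row = cur.fetchone()
    if row is None:
        return None
    code, title, convenor = row
    return {"code": code, "title": title, "convenor": json.loads(convenor)}
```

Anything that reads the database has to call `json.loads` on such columns before using them, since SQLite only sees plain strings.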
First you need a `venv` environment, which I assume you know how to set up.
Also modify other variables in `module/config.py` and `plan/config.py` if needed.
pip install -r requirements.txt
mkdir res
python -m plan.main
python -m module.main
If anything goes wrong while crawling, you can always just restart the script, and it will resume by skipping whatever has already been stored in the database. Once it finishes, you should have a `data.db` file in the `res` directory (if you didn't change the relevant `config` fields), which is used by the backend server.
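The resume behaviour can be pictured roughly as follows. This is only a sketch of the general idea (look up what is already in the database, then skip it), not the crawler's actual code. The function `crawl_missing`, the `fetch` callback and the `module` table layout are assumptions made for the example.

```python
import sqlite3


def crawl_missing(conn, codes, fetch):
    """Fetch only the codes that are not in the database yet.

    `fetch` is a hypothetical callback that downloads one item and
    returns its detail as a string.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS module (code TEXT PRIMARY KEY, detail TEXT)"
    )
    # Everything already stored counts as done and is skipped.
    done = {row[0] for row in conn.execute("SELECT code FROM module")}
    fetched = []
    for code in codes:
        if code in done:
            continue
        detail = fetch(code)
        conn.execute(
            "INSERT INTO module (code, detail) VALUES (?, ?)", (code, detail)
        )
        conn.commit()  # commit per item so a crash only loses the item in progress
        fetched.append(code)
    return fetched
```

Calling it a second time with the same codes downloads nothing, which is what makes restarting the script after a failure cheap.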
- `module` module
- `module` more stable (stable now after removing the `selenium` dependency)
The output data format has changed, so nott-course has also changed a bit.
`campus` should always be a single letter in ['C', 'M', 'U'], not the full name!!!
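If you consume the data elsewhere, a small guard like the one below can catch a full campus name slipping through. The names `VALID_CAMPUSES` and `check_campus` are hypothetical and only illustrate the rule stated above.

```python
# The three allowed campus codes, as listed above.
VALID_CAMPUSES = ("C", "M", "U")


def check_campus(campus):
    """Return the campus code unchanged, or raise if it is not a valid code."""
    if campus not in VALID_CAMPUSES:
        raise ValueError(
            f"campus must be one of {VALID_CAMPUSES}, got {campus!r}"
        )
    return campus
```

For example, `check_campus("U")` returns `"U"`, while passing a full campus name raises `ValueError`.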
You don't need to read this section now.
Change of course fields:
Change of plan fields: