EricWay1024 / nottCrawlerNew

New (as of October 2024) Crawler for University of Nottingham Course Catalogue

Our data source is the UoN Course Catalogue. The catalogue has undergone some big changes this year, but it is still far less convenient than Nott Course, and people have been asking me to update the data, so here is the crawler. I personally prefer Python, so I ditched the old JS crawler and rewrote everything.

Acknowledgements

A big thanks to...

Overview

This crawler has two parts: one for the (course) modules and one for the academic plans. They are written as two Python modules, module and plan. The module module requires Selenium, requests and BeautifulSoup, whereas the plan module relies only on the latter two.
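If you only want to run one of the two parts, a quick check like the following can tell you whether its dependencies are importable. This is just a sketch and not part of the crawler: REQUIRED and missing_packages are hypothetical names, and the only facts taken from above are the three packages (bs4 is the import name of BeautifulSoup).

```python
from importlib.util import find_spec

# Import names of the third-party packages each part needs, as listed above.
# REQUIRED and missing_packages are illustrative names, not crawler code.
REQUIRED = {
    "plan": ["requests", "bs4"],
    "module": ["selenium", "requests", "bs4"],
}


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for part, names in REQUIRED.items():
        missing = missing_packages(names)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(part, status)
```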

An overview of the workflow:

Check schemas for the JSON schemas of the plan and module objects stored in the SQLite database.

Features

Get Started!

First, you need a virtual environment (venv), which I assume you know how to set up. Also modify the variables in module/config.py and plan/config.py if needed.

pip install -r requirements.txt
mkdir res
python -m plan.main
python -m module.main

If anything goes wrong during crawling, you can simply restart the script; it will resume by skipping whatever has already been fetched into the database. When it finishes, you should have a data.db file in the res directory (unless you changed the relevant config fields), which is used by the backend server.
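To illustrate the resume behaviour, here is a minimal sketch of the skip-what-is-already-stored idea using sqlite3. It is not the crawler's actual code: the table layout, the function names (open_db, crawl, fake_fetch) and the module codes are all made up for the example, and the real schema may differ.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open a database with a hypothetical one-table layout (not the real schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS module (code TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    return conn


def crawl(conn, codes, fetch):
    """Fetch and store every code not yet in the database; return the codes fetched now."""
    done = {row[0] for row in conn.execute("SELECT code FROM module")}
    fetched = []
    for code in codes:
        if code in done:
            continue  # already stored by an earlier run, so skip it
        record = fetch(code)
        conn.execute(
            "INSERT INTO module (code, data) VALUES (?, ?)",
            (code, json.dumps(record)),
        )
        conn.commit()  # commit per record so an interrupted run keeps its progress
        fetched.append(code)
    return fetched


def fake_fetch(code):
    """Stand-in for the real network request; returns a made-up record."""
    return {"code": code, "campus": "U"}


conn = open_db()
first = crawl(conn, ["MOD001", "MOD002"], fake_fetch)
second = crawl(conn, ["MOD001", "MOD002", "MOD003"], fake_fetch)
print(first, second)
```

The second call only fetches MOD003, because the first two codes are already in the table. Committing after each record is one possible design choice; it trades some speed for not losing work when a run is interrupted.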

To-dos

The output data format has changed, so nott-course has also changed a bit.

Important note

campus should always be a single letter in ['C', 'M', 'U'], not the full name!!!
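A small guard like the one below could catch a wrong campus value before it is written to the database. This is only a sketch: check_campus and VALID_CAMPUSES are hypothetical names, and only the three letters come from the note above.

```python
# The three allowed campus letters, taken from the note above.
VALID_CAMPUSES = ("C", "M", "U")


def check_campus(value):
    """Return value unchanged if it is an allowed campus letter, else raise ValueError."""
    if value not in VALID_CAMPUSES:
        raise ValueError(
            f"campus must be one of {VALID_CAMPUSES}, got {value!r}"
        )
    return value


print(check_campus("U"))
```

Using a tuple rather than the string "CMU" matters here, because a membership test against a string would also accept substrings such as "CM".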

Change of output fields compared to the previous crawler

You don't need to read this section now.

Change of course fields:

Change of plan fields: