I created a very primitive web scraper that technically works, although inconsistently. This could be my fault, Coursebook's fault, or a combination of the two. Regardless, the scraper needs to be refactored to be both more efficient and more maintainable. Please improve and complete the web scraper while I focus on other sections of the project.
It can be found at commit c4c78f0043c27ed4eb1f2148a26df0947db5ebb9.
Some details on how the scraper works:
The initialize() function simply scrapes each input field on the Coursebook main page for all of its possible options and uses them to populate a dictionary of search arguments.
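A minimal sketch of that idea is below. The URL, the reliance on `<select>` elements, and the shape of the resulting dictionary are assumptions, not a description of the actual Coursebook markup or of the code in the linked commit.

```python
import requests
from bs4 import BeautifulSoup

COURSEBOOK_URL = "https://coursebook.utdallas.edu/"  # assumed entry point

def initialize():
    """Scrape every <select> field on the main page and collect its options."""
    page = requests.get(COURSEBOOK_URL, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    search_args = {}
    for select in soup.find_all("select"):
        field = select.get("name") or select.get("id")
        if not field:
            continue
        # Collect the value attribute of each option under this field.
        search_args[field] = [
            opt.get("value") for opt in select.find_all("option") if opt.get("value")
        ]
    return search_args
```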
The menu() function spawns a terminal menu to make interacting with the scraper slightly more bearable. There are some checks at the bottom to make sure that a term, at least one other field, and a PTGSESSID are provided.
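The validation could look roughly like the sketch below; the field names (`term`, etc.) and the dictionary layout are illustrative, not taken from the repo.

```python
def validate_selection(selection: dict, ptgsessid: str) -> bool:
    """Require a term, at least one other search field, and a PTGSESSID."""
    if not selection.get("term"):
        print("A term must be selected.")
        return False
    other_fields = [k for k, v in selection.items() if k != "term" and v]
    if not other_fields:
        print("Select at least one field besides the term.")
        return False
    if not ptgsessid:
        print("A PTGSESSID cookie value is required.")
        return False
    return True
```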
The downloadData() function executes a search query on Coursebook, waits for the download link, follows the download link, and then exports the data as a JSON file, which it saves in the data directory.
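A hedged sketch of that flow is below. The search endpoint, the `report_url` response field, the polling scheme, and the assumption that the report comes back as JSON are all placeholders; the real request flow lives in the linked commit.

```python
import json
import time
from pathlib import Path

import requests

# Placeholder endpoint; the real search URL and payload are in the repo.
SEARCH_ENDPOINT = "https://coursebook.utdallas.edu/search"

def download_data(session: requests.Session, params: dict, out_name: str) -> Path:
    """Run one search, poll for its download link, and save the result as JSON."""
    download_url = None
    for _ in range(30):  # give the report up to ~1 minute to appear
        response = session.post(SEARCH_ENDPOINT, data=params, timeout=60)
        response.raise_for_status()
        download_url = response.json().get("report_url")  # assumed response field
        if download_url:
            break
        time.sleep(2)
    if not download_url:
        raise RuntimeError("Coursebook never produced a download link")

    report = session.get(download_url, timeout=60)
    report.raise_for_status()

    out_path = Path("data") / f"{out_name}.json"
    out_path.parent.mkdir(exist_ok=True)
    out_path.write_text(json.dumps(report.json(), indent=2))  # assume a JSON payload
    return out_path
```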
The runScraper() function simply executes multiple instances of the downloadData() function across several threads to reduce the time it takes to scrape Coursebook.
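One way to structure that is a thread pool, as sketched below. This reuses the hypothetical `download_data()` from the previous sketch, and the worker count and parameter-set layout are arbitrary choices, not the repo's.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_scraper(session, search_param_sets):
    """Run download_data() for each parameter set on a small thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {
            pool.submit(download_data, session, params, name): name
            for name, params in search_param_sets.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # keep scraping even if one query fails
                print(f"{name} failed: {exc}")
    return results
```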
Finally, the PTGSESSID is a cookie that is required to make requests. The scraper only works if you sign in with your UTD NetID and then copy the PTGSESSID cookie value from your browser into the scraper.
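Assuming the scraper uses `requests`, the copied cookie could be attached to a session once so every later request carries it; the domain string below is an assumption.

```python
import requests

def make_session(ptgsessid: str) -> requests.Session:
    """Build a session that sends the copied PTGSESSID cookie on every request."""
    session = requests.Session()
    session.cookies.set("PTGSESSID", ptgsessid, domain="coursebook.utdallas.edu")
    return session
```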
@CharlieMahana can you go ahead and draft (tracking) PRs for the branches that mention this and the other issues?
Design and implement a web scraper that can extract course and section data from Coursebook for uploading to our database.
There are no restrictions or requirements on how this is to be completed so long as it accomplishes the task of extracting the requisite data from Coursebook. Python 3 has many libraries that may be of use, such as urllib, requests, and Beautiful Soup, so those are probably a good place to start.
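As a rough starting point with requests and Beautiful Soup, section data could be pulled out of a results table along the lines below. The table class and column order are guesses about Coursebook's markup, not its real structure.

```python
from bs4 import BeautifulSoup

def extract_sections(results_html: str) -> list[dict]:
    """Pull course/section rows out of a search-results table."""
    soup = BeautifulSoup(results_html, "html.parser")
    sections = []
    for row in soup.select("table.courseinfo tr"):  # assumed table class
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            sections.append({"section": cells[0], "title": cells[1]})
    return sections
```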