Queens-Hacks / qcumber-scraper

Scrapes SOLUS and generates structured data

Automated testing #23

pR0Ps opened this issue 9 years ago

pR0Ps commented 9 years ago

We have different Python versions (2.7, 3.3, 3.4, etc.) as well as different parsing libraries (lxml, html5lib) to support. It's pretty much impossible to manually test all the combinations, so we should create a test suite to do it for us.
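
For the parsing-library axis specifically, a parametrized test could cover both backends in one run. Something like this sketch (pytest assumed; `parse_subject_page` and the fixture path are made-up placeholders, and a runner like tox would handle the Python-version axis):

```python
import pytest

from scraper import parse_subject_page  # hypothetical import path

# BeautifulSoup backends we want to support
PARSERS = ["lxml", "html5lib"]

@pytest.mark.parametrize("parser", PARSERS)
def test_subject_page(parser):
    # Stored copy of a SOLUS subject listing (see the fixture idea below)
    with open("tests/fixtures/subject_page.html") as f:
        html = f.read()
    subjects = parse_subject_page(html, parser=parser)
    assert subjects, "expected at least one subject to be scraped"
```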

I'm thinking that we could dump a few examples of each page type (letter, expanded subject dropdown, course, term, etc.) into local files, then write tests that open them and make sure we can parse everything out of them properly using the scraper.

I'm also thinking we should have a file that specifies which subjects, courses, etc. we want to test, as well as an updater script to keep them current.

The updater script could use the scraper to grab the pages and store their HTML locally, along with the actual scraped data. Tests could then run the scraper against the stored HTML and check that its output matches the stored data. This way we can keep track of which pages we're testing against and have an easy way to keep them all updated. We could then add problematic pages as we come across them (ex: the CISC subject, see #18) to prevent regressions.
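
Roughly, the updater could look like this (sketch only; the manifest contents and the scraper methods are made up):

```python
import json
import os

# Hypothetical manifest of pages we want regression coverage for
MANIFEST = {
    "subjects": ["ANAT", "CISC"],
    "courses": [("CISC", "124")],
}

def update_fixtures(scraper, fixture_dir="tests/fixtures"):
    """Fetch each tracked page; store its HTML and its scraped data."""
    if not os.path.isdir(fixture_dir):
        os.makedirs(fixture_dir)
    for subject in MANIFEST["subjects"]:
        html = scraper.subject_page_html(subject)  # hypothetical method
        data = scraper.parse_subject(html)         # hypothetical method
        base = os.path.join(fixture_dir, "subject_" + subject)
        with open(base + ".html", "w") as f:
            f.write(html)
        with open(base + ".json", "w") as f:
            json.dump(data, f, indent=2, sort_keys=True)
        # courses, sections, etc. would get the same treatment
```

The tests then just re-parse each stored `.html` file and compare the result against the matching `.json` file.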

Some work will have to be done on the scraper to make sure we can pull out individual courses by name (not id). The other types (subject, section, etc.) already pull enough attributes out of the page to specify them by name, but the `all_courses` function only returns the `_unique` of the course, not anything non-volatile (like the course number). I think this was done just to avoid duplicating information that would be scraped out of the course page once it was loaded, not for any technical reason, so it should be relatively easy to change.
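
As a sketch of that change (the selector and markup are invented; the real function works off whatever SOLUS actually serves), `all_courses` could return the stable course number alongside the volatile id:

```python
def all_courses(soup):
    """Return stable attributes alongside the volatile _unique id (sketch)."""
    courses = []
    for link in soup.select("a.course-link"):  # invented selector
        # e.g. link text "124 - Intro to Computing Science II"
        courses.append({
            "_unique": link["id"],                 # volatile SOLUS identifier
            "number": link.get_text().split()[0],  # non-volatile course number
        })
    return courses
```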

Obviously, if the SOLUS HTML changes and the scraper breaks, the updater script won't be able to update the stored pages and scraped data. In those cases we can manually update the files for local testing, then re-run the updater once the problem is fixed.

I don't know if we want to actually store the HTML files in the repo, as we might run into legal issues (putting private data in a public Git repo).

A huge benefit of this is that we won't need to touch SOLUS at all except when updating the local tests, fixing authentication issues, or actually scraping. This will speed up development and let us avoid hammering SOLUS (and risking getting banned) while testing.

Anyway, this is just a big idea dump; let me know your thoughts on it.

mystor commented 9 years ago

I think that directly storing the HTML from SOLUS in this repository is a bad idea because of all the landmines related to private information on SOLUS pages. For example, if I were to drop the HTML from my SOLUS pages into the repo, you would be able to see a lot of information about my schedule, how much money the university owes me (or I owe the university), exam times, enrollment dates, etc.

That being said, good test cases aren't always real-world test cases. I think that every time there is a change to the scraper, we could create artificial example pages that have a similar structure to the ones provided by SOLUS, which would let us ensure there are no selector regressions. It'll be a lot more work than using real pages, but it'll give us much more control.
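
For example, something along these lines (everything here is invented; the point is that the markup mimics the structure of a SOLUS page without containing any real data):

```python
from bs4 import BeautifulSoup

# Hand-written markup shaped like a SOLUS subject row; no private data
SYNTHETIC_SUBJECT_ROW = """
<table>
  <tr class="subject-row">
    <td><a id="subj-link-0">ANAT - Anatomy</a></td>
  </tr>
</table>
"""

def test_subject_link_selector():
    soup = BeautifulSoup(SYNTHETIC_SUBJECT_ROW, "html5lib")
    link = soup.select_one("tr.subject-row a")  # invented selector
    assert link is not None
    assert link.get_text().startswith("ANAT")
```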

Unfortunately, I'm not convinced that the scraper code is modular enough right now for that to be practical.

(NB: I don't think that we should support all of those Python versions. I think we should only support Python 3.3+ and drop support for Python 2. It'll simplify testing & development with almost no drawbacks.)

pR0Ps commented 9 years ago

Currently, the HTML and scraped data are being dumped into some files.

The remaining work is making a fake session that reads the files instead of requesting the pages, and that knows how to transition from file to file (ex: from A -> ANAT -> ANAT 400) based on data scraped out of the downloaded files.
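
Something like this is what I have in mind (sketch; the method name and file layout are placeholders):

```python
import os

class FakeSession(object):
    """Stands in for the real SOLUS session during tests."""

    def __init__(self, fixture_dir="tests/fixtures"):
        self.fixture_dir = fixture_dir

    def get_page(self, name):
        # Where the real session would hit SOLUS, we return the stored
        # copy of the page. The data scraped out of each stored file
        # (subject abbreviations, course numbers) doubles as the file
        # naming scheme, which is what drives the A -> ANAT -> ANAT 400
        # transitions.
        path = os.path.join(self.fixture_dir, name + ".html")
        with open(path) as f:
            return f.read()
```

The scraper would then be constructed with a `FakeSession()` instead of the real one, so a full test run never has to touch SOLUS.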