cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License

Exams scraper #35

Closed: qasim closed this issue 8 years ago

qasim commented 8 years ago

Scraper should check the current year to determine what pages to scrape. Example:

etc. If a page returns a 404, just skip it and try the rest.
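A minimal Python sketch of the skip-on-404 behaviour described above. The `fetch` callable, and its convention of returning `None` when a page is missing, are assumptions made for illustration and not part of uoft-scrapers; the candidate URLs are left to the caller.

```python
from typing import Callable, Dict, Iterable, Optional


def scrape_available(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Fetch every candidate page, skipping the ones that do not exist.

    `fetch` is a hypothetical helper: it should return the page body,
    or None when the server answers with a 404.
    """
    pages = {}
    for url in urls:
        body = fetch(url)
        if body is None:
            # Missing page: skip it and carry on with the rest.
            continue
        pages[url] = body
    return pages


# Usage with a stand-in fetcher backed by a dict (no network access needed).
fake_site = {"page-a": "<html>A</html>", "page-c": "<html>C</html>"}
found = scrape_available(["page-a", "page-b", "page-c"], fake_site.get)
```

Passing the fetcher in as an argument keeps the skip logic testable without real HTTP requests.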

qasim commented 8 years ago

Also, @kshvmdn made this tool, which may help.

kashav commented 8 years ago

Yeah, the scraper is basically complete, I just have to adapt it to the Cobalt design.

Also, I think each scrape should only include schedules for December to August of the provided year.

So for 2015:

And 2016:

qasim commented 8 years ago

I agree, that makes more sense. Then, year calculating logic would be:

etc.

kashav commented 8 years ago

Started working on this. Current schema looks like:

{
  "date": String,
  "start_time": Number,
  "end_time": Number,
  "sections": [
    {
      "section": String,
      "location": String
    },
    ...
  ]
}

Currently, date is in ISO 8601 format with the time set to 00:00:00. Would it be better to merge date + time and just have start_time and end_time (both in ISO 8601)?

Here's some example output:

"MAT137Y1Y": {
  "date": "2016-04-21T00:00:00",
  "start_time": 9.0,
  "end_time": 12.0,
  "sections": [
    {
      "section": "A - H",
      "location": "EX 100"
    },
    {
      "section": "I - R",
      "location": "EX 200"
    },
    {
      "section": "S - TH",
      "location": "EX 300"
    },
    {
      "section": "TI - WO",
      "location": "EX 310"
    },
    {
      "section": "WU - YE",
      "location": "EX 320"
    },
    {
      "section": "YI - ZHAN",
      "location": "SF 2202"
    },
    {
      "section": "ZHAO - ZZ",
      "location": "SF 3202"
    }
  ]
}

Also, how should data be separated – should each course have its own output file (MAT137Y1Y.json, MAT138H1S.json, ...) or should it be split up by month (DEC15.json, APR16.json, ...) with each file containing data for all courses from that semester?

qasim commented 8 years ago

I think start_time and end_time is a better idea, which would be ISO 8601 datetimes. You could keep date as well and have it as just an ISO 8601 date (2016-04-21).
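As a rough sketch of that conversion, the Python below merges a plain ISO 8601 date with the decimal-hour times from the earlier schema (9.0, 12.0) into ISO 8601 datetimes. The function name and the explicit `utc_offset_hours` parameter are assumptions for illustration; a real scraper would more likely look the offset up from a time zone database.

```python
from datetime import date, datetime, time, timedelta, timezone


def exam_datetime(day: str, hour: float, utc_offset_hours: int) -> str:
    """Combine an ISO date ("2016-04-21") and a decimal hour (9.0)
    into an ISO 8601 datetime string with a fixed UTC offset."""
    exam_day = date.fromisoformat(day)
    # Convert the decimal hour to whole minutes, e.g. 19.5 -> 19:30.
    minutes = round(hour * 60)
    tz = timezone(timedelta(hours=utc_offset_hours))
    exam_time = time(minutes // 60, minutes % 60, tzinfo=tz)
    return datetime.combine(exam_day, exam_time).isoformat()


start_time = exam_datetime("2016-04-21", 9.0, -4)
end_time = exam_datetime("2016-04-21", 12.0, -4)
print(start_time)  # 2016-04-21T09:00:00-04:00
print(end_time)    # 2016-04-21T12:00:00-04:00
```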

As for the data, I think it's worth separating it by course. A course_id parameter could be added, like CSC165H1S20161, so that anyone using this API could look the course up in the course API to get more info.

qasim commented 8 years ago

Another reason for the course being top level is that a person using the API is more likely to search for their course's exam times than a date's exams, I'd say.

arkon commented 8 years ago

I think most people would be looking for a particular exam period (e.g. apr16), and then some course(s) within it.

qasim commented 8 years ago

In that case, the user would filter for exam_session and then their course, returning the course_id for their session and its times.

kashav commented 8 years ago

Based on the data we have access to, I think it'd be easier to have a course_id that matches a course_code from the course API (so CSC165H1S instead of CSC165H1S20161).

The way the scraper currently works is that you give it a year and it'll scrape data for each exam period of that year. So 2016 will scrape dec15, apr16, june16, and aug16 and then create an individual output file for each of them.
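For illustration, the year-to-period mapping described above (2016 gives dec15, apr16, june16, and aug16) could be written in Python like this. The function name is hypothetical and this is only a sketch of the rule, not the scraper's actual code.

```python
from typing import List


def exam_periods(year: int) -> List[str]:
    """Return the exam period labels covered by one scrape of `year`:
    December of the previous year, then April, June and August."""
    prev = (year - 1) % 100  # two-digit previous year, e.g. 15
    curr = year % 100        # two-digit current year, e.g. 16
    return [
        f"dec{prev:02d}",
        f"apr{curr:02d}",
        f"june{curr:02d}",
        f"aug{curr:02d}",
    ]


print(exam_periods(2016))  # ['dec15', 'apr16', 'june16', 'aug16']
```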

If we were to create a separate file for each course_id, how would we differentiate between the exam periods? Would adding a period key work? I guess the schema would look something like:

{
  "course_id": String,
  "date": String,
  "start_time": String,
  "end_time": String,
  "period": String,
  "sections": [
    {
      "section": String,
      "location": String
    },
    ...
  ]
}
kashav commented 8 years ago

Some full year courses have exams in both December and April, so it might be better to give each section that key instead.

{
  "course_id": String,
  "date": String,
  "start_time": String,
  "end_time": String,
  "sections": [
    {
      "section": String,
      "location": String,
      "period": String
    },
    ...
  ]
}
qasim commented 8 years ago

This is what I was picturing. The mapping for the end of the course IDs is:

So if we assume, say, that CSC165 is taught in every single semester of the school year, the IDs are:

which are all unique and belong to a single exam period.

qasim commented 8 years ago
{
  "course_id": "MAT137Y1Y20159",
  "course_code": "MAT137Y1Y",
  "period": "apr16",
  "date": "2016-04-21",
  "start_time": "2016-04-21T09:00:00-04:00",
  "end_time": "2016-04-21T12:00:00-04:00",
  "sections": [
    {
      "section": "A - H",
      "location": "EX 100"
    },
    {
      "section": "I - R",
      "location": "EX 200"
    },
    {
      "section": "S - TH",
      "location": "EX 300"
    },
    {
      "section": "TI - WO",
      "location": "EX 310"
    },
    {
      "section": "WU - YE",
      "location": "EX 320"
    },
    {
      "section": "YI - ZHAN",
      "location": "SF 2202"
    },
    {
      "section": "ZHAO - ZZ",
      "location": "SF 3202"
    }
  ]
}

So now if a user wants to find this class, they'd filter course_code:"MAT137Y1" AND period:"apr16"
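A small Python sketch of that filter over a list of exam records. It assumes a prefix match on course_code (so that "MAT137Y1" finds "MAT137Y1Y") and an exact match on period; the function name and those matching rules are assumptions, not how the Cobalt API necessarily implements filters.

```python
from typing import Dict, List


def find_exams(exams: List[Dict], course_code: str, period: str) -> List[Dict]:
    """Return the exams whose course_code starts with `course_code`
    and whose period equals `period`."""
    return [
        exam
        for exam in exams
        if exam["course_code"].startswith(course_code)
        and exam["period"] == period
    ]


# Trimmed-down records using only the fields needed for the filter.
exams = [
    {"course_id": "MAT137Y1Y20159", "course_code": "MAT137Y1Y", "period": "apr16"},
    {"course_id": "CSC165H1S20161", "course_code": "CSC165H1S", "period": "apr16"},
    {"course_id": "CSC165H1F20159", "course_code": "CSC165H1F", "period": "dec15"},
]

matches = find_exams(exams, "MAT137Y1", "apr16")
print([m["course_id"] for m in matches])  # ['MAT137Y1Y20159']
```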

kashav commented 8 years ago

Nice, that's a lot cleaner. How do we deal with December exams for full year courses?

qasim commented 8 years ago

We could introduce an id for exams? I don't know if this is the cleanest or not :P

{
  "id": "MAT137Y1Y20159DEC15",
  "course_id": "MAT137Y1Y20159",
  "course_code": "MAT137Y1Y",
  "period": "dec15",
  "date": "2015-12-17",
  "start_time": "2015-12-17T09:00:00-04:00",
  "end_time": "2015-12-17T12:00:00-04:00",
  "sections": [
    ...
  ]
}

(if MAT137 had December exams)

kashav commented 8 years ago

Made suggested changes – went with this id format for the time being (I think it looks fine, maybe an underscore between the period and the course id?)

Here's some example output:

{
  "id":"CSC165H1F20159DEC15",
  "course_id":"CSC165H1F20159",
  "course_code":"CSC165H1F",
  "period":"dec15",
  "date":"2015-12-17",
  "start_time":"2015-12-17T19:00:00-04:56",
  "end_time":"2015-12-17T22:00:00-04:56",
  "sections":[
    {
      "section":"A - KAP",
      "location":"HA 403"
    },
    {
      "section":"KAY - SH",
      "location":"UC 266"
    },
    {
      "section":"SI - Z",
      "location":"UC 273"
    }
  ]
}
{
  "id":"CSC165H1S20161APR16",
  "course_id":"CSC165H1S20161",
  "course_code":"CSC165H1S",
  "period":"apr16",
  "date":"2016-04-14",
  "start_time":"2016-04-14T19:00:00-04:56",
  "end_time":"2016-04-14T22:00:00-04:56",
  "sections":[
    {
      "section":"A - H",
      "location":"BN 2N"
    },
    {
      "section":"I - O",
      "location":"BN 2S"
    },
    {
      "section":"P - ZHAN",
      "location":"BN 3"
    },
    {
      "section":"ZHAO - ZZ",
      "location":"ST VLAD"
    }
  ]
}
kashav commented 8 years ago

Also, there seems to be a special course, SOC102STU, in the December schedule.

Apparently it's a course from the Steps to University program. It's not on CourseFinder, so it's not in the courses API. For the time being I've ignored it and just left it out.

qasim commented 8 years ago

+1 to leaving the special case out.

The id looks good in the current form. Output looks great!

qasim commented 8 years ago

We can .upper() the period key's value so that it's the same as the end of id.
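A sketch of how the id and the upper-cased period could be derived together, assuming the id is simply the course_id followed by the upper-cased period as in the example output. The helper name is hypothetical.

```python
from typing import Dict


def exam_keys(course_id: str, period: str) -> Dict[str, str]:
    """Build the id and period fields so that the period matches
    the suffix of the id (both upper-case)."""
    period = period.upper()
    return {
        "id": course_id + period,
        "course_id": course_id,
        "period": period,
    }


keys = exam_keys("CSC165H1F20159", "dec15")
print(keys["id"])      # CSC165H1F20159DEC15
print(keys["period"])  # DEC15
```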

kashav commented 8 years ago

UTSG scrapers are ready to go: https://github.com/cobalt-uoft/uoft-scrapers/pull/40

I've kind of started on the UTSC / UTM scrapers, but they only provide data for the current period (UTSC has PDFs for past schedules) and I'm not too sure what happens when it isn't exam season.

It shouldn't be hard to produce the same schema with this data though, if we decide to do so.

qasim commented 8 years ago

In the case of UTSC / UTM, we can provide current-period data only, then.

I tested the scraper out and the results look good! Thanks for the contribution :)