cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License

UTM shuttle times scraper #32

Closed qasim closed 8 years ago

qasim commented 8 years ago

https://m.utm.utoronto.ca/shuttleByDate.php?year=2016&month=04&day=10

UTM has a mobile website for their UTM <-> UTSG shuttle. This would fall under transit / transportation. This scraper should scrape the current month's shuttle times (from the first day of the current month through the last). The URL scheme makes this an easy scrape of ~30 page requests.
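
The per-day URL scheme above can be sketched as a simple generator (a sketch, assuming one GET per day is all the scraper needs; `BASE` and `month_urls` are made-up names, not part of the project):

```python
# Sketch: build one shuttleByDate.php URL per day of a given month.
import calendar

BASE = 'https://m.utm.utoronto.ca/shuttleByDate.php'

def month_urls(year, month):
    """Return one request URL per day of the given month."""
    days = calendar.monthrange(year, month)[1]  # number of days in the month
    return ['%s?year=%d&month=%02d&day=%02d' % (BASE, year, month, d)
            for d in range(1, days + 1)]
```

For April 2016 this yields 30 URLs, one per day, matching the ~30-request estimate.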

As for UTSG, UTSC, and other UTM transportation: transit there is solely TTC and GO, and they already have their own open data APIs, so we will leave it at that!

qasim commented 8 years ago

Proposed schema (per shuttle stop id):

{
  "id": String,
  "name": String,
  "dates": [{
    "start": String,
    "end": String
  }]
}
arkon commented 8 years ago

I don't really understand what the start and end fields in dates mean?

qasim commented 8 years ago

Whoops, start and end don't apply here, do they? ;P

So there's another aspect to this that I hadn't considered before. The other transit APIs I've looked at organize their data with routes as the top-level object; each route owns stops, which contain times.

So the following is the schema, with a route as the top-level object:

{
  "name": String,
  "stops": [{
    "location": String,
    "building_id": String,
    "times": [String]
  }]
}

An example with fictional data for St. George route:

{
  "name": "St. George Route",
  "stops": [
    {
      "location": "Instructional Centre Layby",
      "building_id": "334",
      "times": [
        "2016-04-13T05:55:00-04:00",
        "2016-04-13T07:55:00-04:00",
        "2016-04-14T05:55:00-04:00",
        "2016-04-14T07:55:00-04:00"
      ]
    },
    {
      "location": "Hart House",
      "building_id": "002",
      "times": [
        "2016-04-13T08:55:00-04:00",
        "2016-04-13T10:55:00-04:00",
        "2016-04-14T08:55:00-04:00",
        "2016-04-14T10:55:00-04:00"
      ]
    }
  ]
}
qasim commented 8 years ago

Note: the dates are formatted in the ISO 8601 standard, offset for the Eastern timezone. It balances human readability in a compact form, and of course remains machine readable. I think this is the standard the whole project should take, but if you have an argument for something better then we can discuss that.
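
A minimal sketch of that formatting (assuming a fixed -04:00 EDT offset purely for illustration; a real scraper should use pytz or similar to pick -04:00 vs -05:00 correctly):

```python
# Sketch: format an Eastern-local time as ISO 8601 with its UTC offset.
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4))  # assumed fixed offset for this sketch

def iso_eastern(year, month, day, hour, minute):
    """Return an ISO 8601 string like '2016-04-13T05:55:00-04:00'."""
    return datetime(year, month, day, hour, minute, tzinfo=EDT).isoformat()
```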

arkon commented 8 years ago

Would we have a gigantic list of all times for the month per stop, or would we try to split it up so it's 1 file per day?

qasim commented 8 years ago

One file per day seems appropriate, since there would be a /lot/ of times otherwise. I wish the shuttle times were a little more predictable, but on random days they like to change slightly. :/

If we split by day, then the top level would be the date:

{
  "date": "2016-04-13",
  "routes": [
      ...
  ]
}
arkon commented 8 years ago

Yeah, it's usually schedules that are consistent for Monday - Thursday, then a few are missing for Friday, and Saturday/Sunday have way less. Then there are the special schedules for exam periods, reading weeks, etc.

arkon commented 8 years ago

So it seems like the route ids aren't the same across the days, so we'll need to use the names as the identifiers. Unless you have a better idea, @qasim ?

(I'll probably take a shot at implementing this scraper.)

qasim commented 8 years ago

@arkon that works. The convention so far has been ids being all-caps alphanumeric. So you could remove the spaces/special characters and upper() the names, so the ids look like this maybe?

STGEORGE
SHERIDAN
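
That normalization could be sketched like so (`route_id` is a made-up helper name; dropping the trailing "Route" is an assumption to match the ids above):

```python
# Sketch: derive an all-caps alphanumeric id from a route name.
import re

def route_id(name):
    """'St. George Route' -> 'STGEORGE', 'Sheridan Route' -> 'SHERIDAN'."""
    base = re.sub(r'\s+Route$', '', name, flags=re.IGNORECASE)
    return re.sub(r'[^A-Z0-9]', '', base.upper())
```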

arkon commented 8 years ago

@qasim Yeah that would probably work. It should be something like:

{
  "date": "2016-04-13",
  "routes": [
    {
      "id": "STGEORGE",
      "name": "St. George Route",
      "stops": [
        {
          "location": "Instructional Centre Layby",
          "building_id": "334",
          "times": [
            "2016-04-13T05:55:00-04:00",
            "2016-04-13T07:55:00-04:00"
          ]
        },
        {
          "location": "Hart House",
          "building_id": "002",
          "times": [
            "2016-04-13T08:55:00-04:00",
            "2016-04-13T10:55:00-04:00"
          ]
        }
      ]
    },
    {
      "id": "SHERIDAN",
      "name": "Sheridan Route",
      "stops": [
        {
          "location": "Deerfield Hall North Layby",
          "building_id": "340",
          "times": [
            "2016-04-13T05:55:00-04:00",
            "2016-04-13T07:55:00-04:00"
          ]
        },
        {
          "location": "Sheridan",
          "building_id": "",
          "times": [
            "2016-04-13T08:55:00-04:00",
            "2016-04-13T10:55:00-04:00"
          ]
        }
      ]
    }
  ]
}

Note that there's no building_id for Sheridan College.

qasim commented 8 years ago

Looks good. Eventually I want the project to start referencing other scrapers' IDs as much as possible; there are a few cases where we don't right now. There's no infrastructure for that yet, though (matching building names to IDs in other scrapers). I guess for this one you'll have a manual mapping somewhere of the known stops to building IDs?
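
Such a manual mapping might be as simple as a dict (a sketch using the stop names and building ids from the example above; unknown stops like Sheridan fall back to an empty string):

```python
# Sketch: hand-maintained map of known stop names to building ids.
BUILDING_IDS = {
    'Instructional Centre Layby': '334',
    'Hart House': '002',
    'Deerfield Hall North Layby': '340',
}

def building_id(stop_name):
    """Return the known building id for a stop, or '' if unmapped."""
    return BUILDING_IDS.get(stop_name, '')
```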

arkon commented 8 years ago

Yeah, I guess the manual mapping would work. How are you doing it elsewhere right now?

qasim commented 8 years ago

If it's a map.utoronto.ca layer, chances are there is a building_id attached to things. Otherwise, nothing yet.

qasim commented 8 years ago

This should be good to close after https://github.com/cobalt-uoft/uoft-scrapers/pull/41#discussion-diff-59943113 is fixed.