cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License
48 stars 14 forks

Add LayersScraper superclass with preliminary Food scraper #21

Closed kashav closed 8 years ago

kashav commented 8 years ago

Current schema is looking something like:

{
  id: String,
  building_id: String,
  name: String,
  short_name: String,
  description: String,
  url: String,
  tags: [String],
  image: String,
  campus: String,
  lat: Number,
  lng: Number,
  address: String,
  hours: {
    sunday: {
      open: Number,
      close: Number
    },
    monday: {
      open: Number,
      close: Number
    },
    tuesday: {
      open: Number,
      close: Number
    },
    wednesday: {
      open: Number,
      close: Number
    },
    thursday: {
      open: Number,
      close: Number
    },
    friday: {
      open: Number,
      close: Number
    },
    saturday: {
      open: Number,
      close: Number
    }
  }
}
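To make the shape of the `hours` object concrete, here is a minimal sketch of a check that a scraped record has an entry for every day with numeric `open` and `close` values. The helper name `validate_hours` and the `DAYS` tuple are hypothetical and not part of the scraper; the sketch assumes all seven days are expected, as in the sample output later in this thread.

```python
# Hypothetical helper, not part of uoft-scrapers: checks the `hours`
# portion of a record against the schema sketched above.
DAYS = ("sunday", "monday", "tuesday", "wednesday",
        "thursday", "friday", "saturday")


def validate_hours(hours):
    """Return a list of problems found in a record's `hours` mapping."""
    problems = []
    for day in DAYS:
        entry = hours.get(day)
        if entry is None:
            problems.append(f"missing day: {day}")
            continue
        for key in ("open", "close"):
            value = entry.get(key)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{day}.{key} is not a number")
    return problems
```

An empty list means the `hours` mapping matches the schema; anything else names the day or field that is off.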

qasim commented 8 years ago

#15 #20

The code looks fantastic. I will have a closer look tomorrow and get this merged.

I see you took away the closed boolean you suggested. How are we handling the case where a restaurant is closed on a specific day?

kashav commented 8 years ago

When the restaurant is closed for a given day, open and close are both 0. It felt unnecessary to have a separate boolean to check whether the restaurant is open, considering we can also check that with open == close (as far as I can tell, this will only be true for closed restaurants).

Now that I'm thinking about it though, if we wanted a filter for only open restaurants or something similar, a closed key would make things a lot easier. Might be a good idea to re-add that.
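A rough sketch of what such a filter could look like, assuming a per-day boolean `closed` key is added next to `open` and `close`. The key's exact placement, the function names, and the sample values are assumptions for illustration, not the scraper's actual output.

```python
def is_open_on(restaurant, day):
    """Return True if the record is not marked closed on `day`."""
    entry = restaurant["hours"][day]
    if "closed" in entry:
        # Assumed per-day boolean key, as proposed in this thread.
        return not entry["closed"]
    # Fallback for records without the key: open == close means closed.
    return entry["open"] != entry["close"]


def open_restaurants(restaurants, day):
    """Keep only the records that are open on `day`."""
    return [r for r in restaurants if is_open_on(r, day)]


# Placeholder records; names and numbers are made up for the example.
sample = [
    {"name": "A", "hours": {"monday": {"open": 0, "close": 0, "closed": True}}},
    {"name": "B", "hours": {"monday": {"open": 11, "close": 23, "closed": False}}},
    {"name": "C", "hours": {"monday": {"open": 0, "close": 0}}},
]
open_names = [r["name"] for r in open_restaurants(sample, "monday")]
```

With an explicit key, the filter no longer has to infer anything from the `open` and `close` values themselves.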

qasim commented 8 years ago

It may be a good idea to include the closed key, since open == close could also be interpreted as open 24 hours.

I tested out the scraper; it looks good. Here are a few things I noticed:

{
  "id": "471",
  "building_id": "056",
  "name": "GSU Pub",
  "short_name": "gsu-pub",
  "description": "Located on the first floor of the GSU building. Pool tables and a big screen t.v.<br />",
  "url": "http://www.utgsu.ca/pubcafe/",
  "tags": [
    "Graduate",
    "Beer",
    "Wine",
    "Pub"
  ],
  "image": "",
  "campus": "UTSG",
  "lat": 43.66085,
  "lng": -79.40029,
  "address": "16 Bancroft Ave,  Toronto, ON M5S 1C1 ",
  "hours": {
    "sunday": {
      "open": 0,
      "close": 0
    },
    "monday": {
      "open": 0,
      "close": 0
    },
    "tuesday": {
      "open": 0,
      "close": 0
    },
    "wednesday": {
      "open": 0,
      "close": 0
    },
    "thursday": {
      "open": 0,
      "close": 0
    },
    "friday": {
      "open": 0,
      "close": 0
    },
    "saturday": {
      "open": 0,
      "close": 0
    }
  }
}

I'll definitely merge this after those 3 things. Thanks a lot for contributing.

kashav commented 8 years ago

Made changes; should be good to go. I ended up using BeautifulSoup to remove tags; it seemed cleaner than using regex (not sure how it compares in terms of efficiency, though).
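For illustration, here is a minimal sketch of tag stripping on a description like the one in the sample output above. It is not the code from this PR: to stay dependency-free it swaps in the standard library's `html.parser` for BeautifulSoup, and the names `strip_tags` and `_TextExtractor` are hypothetical. With BeautifulSoup, the rough equivalent would be `BeautifulSoup(text, "html.parser").get_text()`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &amp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    """Return `text` with HTML tags removed and outer whitespace trimmed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


cleaned = strip_tags(
    "Located on the first floor of the GSU building. "
    "Pool tables and a big screen t.v.<br />"
)
```

A parser-based approach like this also copes with attributes and entities, which is where a hand-written regex tends to get messy.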

qasim commented 8 years ago

In terms of efficiency, these scrapers' speed doesn't matter too much, since they run at most once a week at odd times of the night. Elegance over efficiency. On the other hand, something like cobalt-uoft/cobalt is where we care a little more about speed.