kashav commented 8 years ago

Current schema is looking something like:

{
  id: String,
  building_id: String,
  name: String,
  short_name: String,
  description: String,
  url: String,
  tags: [String],
  image: String,
  campus: String,
  lat: Number,
  lng: Number,
  address: String,
  hours: {
    sunday: {
      open: Number,
      close: Number
    },
    monday: {
      open: Number,
      close: Number
    },
    tuesday: {
      open: Number,
      close: Number
    },
    wednesday: {
      open: Number,
      close: Number
    },
    thursday: {
      open: Number,
      close: Number
    },
    friday: {
      open: Number,
      close: Number
    },
    saturday: {
      open: Number,
      close: Number
    }
  }
}

qasim commented 8 years ago

The code looks fantastic. I will have a closer look tomorrow and get this merged.

I see you took away the closed boolean you suggested. How are we handling the case where a restaurant is closed on a specific day?

kashav commented 8 years ago

When the restaurant is closed for a given day, open and close are both 0. It felt unnecessary to have a separate boolean to check whether the restaurant is open, considering we can also check that with open == close (as far as I can tell, this will only be true for closed restaurants).

Now that I'm thinking about it though, if we wanted a filter for only open restaurants or something similar, a closed key would make things a lot easier. Might be a good idea to re-add that.

qasim commented 8 years ago

It may be a good idea to include the closed key, since open == close could also be interpreted as open 24 hours.
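A minimal sketch of how an explicit closed key could remove that ambiguity. The helper names make_day and is_open_at are hypothetical, the times are assumed to be seconds since midnight (the schema only says Number), and treating equal times on a non-closed day as "open 24 hours" is an assumption for illustration, not something the scraper is confirmed to do:

```python
def make_day(open_time=0, close_time=0, closed=False):
    """Build one day's hours entry with an explicit closed flag (hypothetical helper)."""
    return {"closed": closed, "open": open_time, "close": close_time}


def is_open_at(day, t):
    """Return True if the location is open at time t (assumed seconds since midnight)."""
    if day["closed"]:
        return False
    if day["open"] == day["close"]:
        # Assumption: with an explicit flag, equal times can mean "open 24 hours".
        return True
    return day["open"] <= t < day["close"]


closed_day = make_day(closed=True)
all_day = make_day(0, 0)
regular_day = make_day(9 * 3600, 17 * 3600)
```

With the flag in place, a filter for open restaurants only has to call is_open_at on the current day's entry instead of guessing what open == close means.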

I tested out the scraper; it looks good. Here are a few things I noticed:

The address key has values with a space at the end in a few of the JSON files. Call trim() on that value.
The description key has values with some HTML tags still intact. You should add some sort of HTML tag stripper that can take those away.

{
  "id": "471",
  "building_id": "056",
  "name": "GSU Pub",
  "short_name": "gsu-pub",
  "description": "Located on the first floor of the GSU building. Pool tables and a big screen t.v.<br />",
  "url": "http://www.utgsu.ca/pubcafe/",
  "tags": [
    "Graduate",
    "Beer",
    "Wine",
    "Pub"
  ],
  "image": "",
  "campus": "UTSG",
  "lat": 43.66085,
  "lng": -79.40029,
  "address": "16 Bancroft Ave,  Toronto, ON M5S 1C1 ",
  "hours": {
    "sunday": {
      "open": 0,
      "close": 0
    },
    "monday": {
      "open": 0,
      "close": 0
    },
    "tuesday": {
      "open": 0,
      "close": 0
    },
    "wednesday": {
      "open": 0,
      "close": 0
    },
    "thursday": {
      "open": 0,
      "close": 0
    },
    "friday": {
      "open": 0,
      "close": 0
    },
    "saturday": {
      "open": 0,
      "close": 0
    }
  }
}
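The two cleanups mentioned above could be sketched roughly as follows. This uses the standard library's html.parser purely so the example is self-contained; it is a stand-in, not the scraper's actual code, and strip_tags and clean_address are hypothetical names. Collapsing the doubled space inside the address goes slightly beyond the trim() that was asked for and is an assumption:

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


def clean_address(address):
    """Trim the address and collapse repeated whitespace (hypothetical helper)."""
    # split() with no arguments discards leading and trailing whitespace too.
    return " ".join(address.split())
```

Applied to the GSU Pub entry above, strip_tags would drop the trailing <br /> from description and clean_address would remove the trailing space from address.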

I'll definitely merge this after those 3 things. Thanks a lot for contributing.

kashav commented 8 years ago

Made changes; should be good to go. Ended up using BeautifulSoup to remove tags, since it seemed cleaner than using regex (not sure how it compares in terms of efficiency though).

qasim commented 8 years ago

In terms of efficiency, the speed of these scrapers doesn't matter too much, since they run at most once a week at odd times of the night. Elegance over efficiency. On the other hand, something like cobalt-uoft/cobalt is where we care a little more about speed.

cobalt-uoft / uoft-scrapers

Add LayersScraper superclass with preliminary Food scraper #21
