cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License
49 stars 14 forks

Implement a food scraper #15

Closed: qasim closed this issue 8 years ago

qasim commented 8 years ago

map.utoronto.ca has food information. We can scrape that information and form something like the following:

{
  id: String,
  building_id: String,
  name: String,
  description: String,
  tags: [String],
  image: String,
  campus: String,
  lat: Number,
  lng: Number,
  address: String,
  hours: {
    sunday: {
      open: Number,
      close: Number
    },
    monday: {
      open: Number,
      close: Number
    },
    tuesday: {
      open: Number,
      close: Number
    },
    wednesday: {
      open: Number,
      close: Number
    },
    thursday: {
      open: Number,
      close: Number
    },
    friday: {
      open: Number,
      close: Number
    },
    saturday: {
      open: Number,
      close: Number
    }
  }
}

qasim commented 8 years ago

Schema is open for opinions.

kashav commented 8 years ago

I've started working on this. Current schema is looking like:

{
    id: String,
    building_id: String,
    name: String,
    short_name: String,
    description: String,
    url: String,
    tags: [String],
    image: String,
    campus: String,
    lat: Number,
    lng: Number,
    address: String,
    hours: {
        sunday: {
            closed: Boolean,
            open: String,
            close: String
        },
        monday: {
            closed: Boolean,
            open: String,
            close: String
        },
        tuesday: {
            closed: Boolean,
            open: String,
            close: String
        },
        wednesday: {
            closed: Boolean,
            open: String,
            close: String
        },
        thursday: {
            closed: Boolean,
            open: String,
            close: String
        },
        friday: {
            closed: Boolean,
            open: String,
            close: String
        },
        saturday: {
            closed: Boolean,
            open: String,
            close: String
        }
    }
}

The only problem with storing open / close as numbers is that there's no way to indicate the period (AM/PM). On some days the restaurant is closed, which is the reason for the closed boolean (open and close are empty strings in that case). I'm sure this could be simplified if need be.
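
As a rough sketch of the shape described above, one day's entry could be built like this in Python. The helper name `day_hours` and the sample times are hypothetical and only illustrate the closed / open / close convention; they are not taken from the scraper.

```python
def day_hours(open_time="", close_time=""):
    """Build one day's entry; a day without both times is marked closed."""
    closed = not (open_time and close_time)
    return {
        "closed": closed,
        # Closed days carry empty strings, as described above
        "open": "" if closed else open_time,
        "close": "" if closed else close_time,
    }


hours = {
    "sunday": day_hours(),                      # closed all day
    "monday": day_hours("8:00 AM", "5:30 PM"),  # illustrative times only
}
```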

kashav commented 8 years ago

Also, I ended up having to duplicate get_value from the buildings scraper; it might be a good idea to move that into the superclass.

qasim commented 8 years ago

@kshvmdn I looked over the class, and this looks really good.

The courses JSON uses numbers for time, converting every time to 24-hour clock format: a number in the range [0, 24), so that 8 AM is represented as 8 and 12:30 PM as 12.5.

The initial motivation behind storing time in this format is that it allows for very low-friction querying over time (you can get time greater than or less than some other time by just comparing numbers).
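
A minimal sketch of that conversion in Python, assuming the map data gives times as 12-hour strings such as "8:00 AM" or "12:30 PM" (the function name `to_decimal_hours` and the accepted input format are assumptions, not the scraper's actual code):

```python
import re

# Matches e.g. "8 AM", "8:00 am", "12:30 PM", "5:30 p.m."
_TIME_RE = re.compile(r"^\s*(\d{1,2})(?::(\d{2}))?\s*([AaPp])\.?\s*[Mm]\.?\s*$")


def to_decimal_hours(text):
    """Convert a 12-hour clock string to a number in [0, 24)."""
    match = _TIME_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    if not (1 <= hour <= 12 and 0 <= minute < 60):
        raise ValueError(f"time out of range: {text!r}")
    hour %= 12  # 12 AM becomes 0; 12 PM becomes 0 and gains 12 below
    if match.group(3).lower() == "p":
        hour += 12
    return hour + minute / 60
```

With times stored this way, checking whether one time is later than another is a plain numeric comparison, e.g. `to_decimal_hours("12:30 PM") > to_decimal_hours("8 AM")`.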

kashav commented 8 years ago

@qasim ahhh such an obvious solution -- i'll work on getting that implemented

arkon commented 8 years ago

@qasim I'm wondering why you didn't go with a string formatted as hh:mm instead? Seems more readable.

qasim commented 8 years ago

@arkon The following is a snippet from the filter endpoint. Basically, with the number format, you can treat time values as plain numbers and compare them using MongoDB's built-in $ne, $gt, $lt, $gte, and $lte query operators.

https://github.com/cobalt-uoft/cobalt/blob/master/src/api/courses/routes/filter.js#L294-L315

  if (['breadth', 'level', 'size', 'enrolment', 'start', 'end', 'duration'].indexOf(key) > -1) {
    // Integers and arrays of integers (mongo treats them the same)

    if (['size', 'enrolment', 'start', 'end', 'duration'].indexOf(key) > -1) {
      response.isMapReduce = true
      response.mapReduceData = part
    }

    if (part.operator === '-') {
      response.query[ABSOLUTE_KEYMAP[key]] = { $ne: part.value }
    } else if (part.operator === '>') {
      response.query[ABSOLUTE_KEYMAP[key]] = { $gt: part.value }
    } else if (part.operator === '<') {
      response.query[ABSOLUTE_KEYMAP[key]] = { $lt: part.value }
    } else if (part.operator === '>=') {
      response.query[ABSOLUTE_KEYMAP[key]] = { $gte: part.value }
    } else if (part.operator === '<=') {
      response.query[ABSOLUTE_KEYMAP[key]] = { $lte: part.value }
    } else {
      // Assume equality if no operator
      response.query[ABSOLUTE_KEYMAP[key]] = part.value
    }
  }
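
For the food data, the same idea might look roughly like this in Python. This is only a sketch: `build_time_filter` is a hypothetical helper, and the `hours.monday.open` field path is an assumption based on the schema discussed in this thread, not code from cobalt.

```python
# Map the filter operators used above to MongoDB comparison operators.
OPERATORS = {"-": "$ne", ">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}


def build_time_filter(field, operator, value):
    """Return a MongoDB-style filter document for a numeric time field."""
    if operator in OPERATORS:
        return {field: {OPERATORS[operator]: value}}
    # Assume equality if no operator, as in the snippet above
    return {field: value}


# Places that open on Monday at or before 8:30 AM (8.5 in decimal hours)
query = build_time_filter("hours.monday.open", "<=", 8.5)
```

The resulting dictionary has the shape MongoDB expects for a filter document, which is what makes the numeric time format convenient here.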

kashav commented 8 years ago

Added time conversion in https://github.com/kshvmdn/uoft-scrapers/commit/8e163c22b29fcafb38436585a9e575379f5b4ef8.

Had an odd case with this location, with (what seems to be) a mistyped Monday opening time. Not sure whether we should ignore that time or just keep the hacky solution that I'm currently using.

qasim commented 8 years ago

@kshvmdn let's stick with the hacky solution so we account for all the restaurants. Do you want to email the map people about that typo? Once they fix it, we can change that. 😊