cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License
48 stars, 14 forks

Library information scraper #31 #63

g3wanghc closed this 8 years ago

g3wanghc commented 8 years ago

Includes a fix for the scraper.get() 404 limbo (#62).

kashav commented 8 years ago

For hours, I think it'd be better if each day was a separate object with a closed boolean and open / close numbers, something like:

"hours": {
  "sunday": {
    "closed": Boolean,
    "open": Number,
    "close": Number
  },
  "monday": {...},
  "tuesday": {...},
  ...
}

We do this for the food scraper and it makes it a lot easier to filter data.
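
As a rough illustration of why the per-day shape is easy to filter, here is a minimal sketch. The records, the library names, and the `open_on` helper are all made up for the example, and the closing-time key is written as `close`, which is an assumption rather than something confirmed in this thread.

```python
# Hypothetical records shaped like the proposed "hours" object.
# Times are hours since midnight; the "close" key name is an assumption.
libraries = [
    {"name": "Library A",
     "hours": {"sunday": {"closed": True, "open": 0, "close": 0}}},
    {"name": "Library B",
     "hours": {"sunday": {"closed": False, "open": 10, "close": 17}}},
]


def open_on(records, day):
    """Return the names of records that are not marked closed on `day`."""
    return [r["name"] for r in records if not r["hours"][day]["closed"]]


print(open_on(libraries, "sunday"))
```

With one object per day, the filter is a single dictionary lookup and needs no string parsing.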

Other than that, I think it looks good!

g3wanghc commented 8 years ago

@kshvmdn @qasim is the use of decimals a common thing to denote time?

qasim commented 8 years ago

@g3wanghc storing time (just time, no date information) is generally done by giving a number that counts some unit of time elapsed since midnight (in this case, hours since midnight). We can move this to seconds, which would avoid the need for decimals, but it hasn't been hurting us so far.
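
To make the two representations concrete, here is a small sketch of converting between them. The function names are hypothetical and not part of the scrapers; the formatting helper only shows that a zero-padded clock string can be rebuilt from the number when needed.

```python
def hours_to_seconds(hours):
    """Convert (possibly fractional) hours since midnight to whole seconds."""
    return round(hours * 3600)


def seconds_to_clock(seconds):
    """Format seconds since midnight as a zero-padded HH:MM string."""
    hours, remainder = divmod(seconds, 3600)
    return "{:02d}:{:02d}".format(hours, remainder // 60)


print(hours_to_seconds(9.5))    # 34200
print(seconds_to_clock(34200))  # 09:30
```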

g3wanghc commented 8 years ago

@qasim Can I keep the hours as a numeric string in order to keep leading/trailing zeros (e.g. '09:30')? Personally I think storing time in terms of decimals is pretty sketch in the long run.

g3wanghc commented 8 years ago

If this is the standard time notation, I can convert the start time and end time for the Events scraper as well.

qasim commented 8 years ago

I don't know whether we can move to time being a string, mainly because of how we use the number format. Eventually, when APIs are made for the scrapers, it lets us do numerical queries on time without that much friction (in the filter endpoints). For example, you can do start:>10 to represent "start time after 10AM".

However, I do agree that the decimals are weird. Maybe this is a good time to move to "seconds since midnight" instead of "hours since midnight" so as to avoid decimals entirely.

Opinions? @arkon @kshvmdn

qasim commented 8 years ago

I'm also open to there being 2 time keys, sort of like "time":Number and "time_str":String.

g3wanghc commented 8 years ago

I would definitely prefer using seconds since midnight consistently. The risk of rounding errors on "time":Number is a bit too weird for me.
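
The rounding concern can be checked directly. The sketch below, with hypothetical helper names, compares both encodings for an arbitrary odd time such as 10:37: the decimal-hours float is not exactly 637/60, while the seconds value is a plain integer.

```python
from fractions import Fraction


def clock_to_hours(text):
    """Parse 'HH:MM' into fractional hours since midnight (a float)."""
    hours, minutes = text.split(":")
    return int(hours) + int(minutes) / 60


def clock_to_seconds(text):
    """Parse 'HH:MM' into whole seconds since midnight (an int)."""
    hours, minutes = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60


exact = Fraction(637, 60)                    # 10 hours 37 minutes, exactly
stored = Fraction(clock_to_hours("10:37"))   # exact value of the stored float

print(stored == exact)            # False: the float only approximates 637/60
print(clock_to_seconds("10:37"))  # 38220
```

Equality checks and range boundaries are where the approximation tends to matter; integers avoid the problem altogether.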

kashav commented 8 years ago

I think using hours is cleaner just for the reason that they're more natural to work with (time:10.75 vs time:38700), but I also agree that decimal hours can get messy, esp. with odd times (e.g. 10:37 AM, 4:28 PM, etc.).

It's probably better to make the switch now, since we'll probably need to work with times like these sometime in the future.

So I guess +1 for seconds.

qasim commented 8 years ago

@kshvmdn I think in the future we can work on something like time:>"10:45" and then our query parser would handle converting it to seconds to put into the filter.
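
A minimal sketch of what that conversion step might look like, purely as an illustration. The filter grammar, the regular expression, and `parse_time_filter` are assumptions made up for this example, not the actual query parser; bare numbers are assumed to already be seconds.

```python
import re

# Hypothetical filter syntax: key, colon, optional comparison operator,
# then a value that may be wrapped in double quotes.
_FILTER = re.compile(r'^(?P<key>\w+):(?P<op>>=|<=|>|<)?"?(?P<value>[^"]+)"?$')


def parse_time_filter(expression):
    """Return (key, operator, seconds since midnight) for a time filter."""
    match = _FILTER.match(expression)
    if match is None:
        raise ValueError("unrecognized filter: {!r}".format(expression))
    value = match.group("value")
    if ":" in value:
        # Clock notation such as 10:45 is converted to seconds.
        hours, minutes = value.split(":")
        seconds = int(hours) * 3600 + int(minutes) * 60
    else:
        # Assumption: a bare number is already in seconds.
        seconds = int(value)
    return match.group("key"), match.group("op") or "=", seconds


print(parse_time_filter('time:>"10:45"'))  # ('time', '>', 38700)
```

Users could then write readable clock times while the stored values and comparisons stay numeric.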

g3wanghc commented 8 years ago

@qasim @kshvmdn done :D

kashav commented 8 years ago

Other than that misspelling, I think it looks good!

Also, do you think it'd be better to use null instead of 0 for hours when the place is closed?

g3wanghc commented 8 years ago

@kshvmdn Depends on the type handling for the end user. Personally I prefer having null but I think it is simpler to keep it as a Number. Either way, those values shouldn't be trusted if the day is marked as closed. ¯\_(ツ)_/¯
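
For consumers, one way to follow that rule is to check the flag before reading the numbers, which works the same whether closed days carry 0 or null. This is a sketch only: `is_open_at` is a hypothetical helper, times are assumed to be seconds since midnight, and the `close` key name is an assumption.

```python
def is_open_at(day_hours, seconds):
    """Check the closed flag first, then compare against the open/close range."""
    if day_hours["closed"]:
        # The open/close values are ignored entirely on closed days.
        return False
    return day_hours["open"] <= seconds < day_hours["close"]


closed_with_zero = {"closed": True, "open": 0, "close": 0}
closed_with_null = {"closed": True, "open": None, "close": None}
weekday = {"closed": False, "open": 30600, "close": 82800}

print(is_open_at(closed_with_zero, 0))  # False
print(is_open_at(closed_with_null, 0))  # False
print(is_open_at(weekday, 36000))       # True
```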

qasim commented 8 years ago

LGTM

Thanks for the bug fix + a shiny new scraper!