Important dates scraper

cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.

https://pypi.python.org/pypi/uoftscrapers

MIT License

Important dates scraper #52

Open arkon opened 8 years ago

arkon commented 8 years ago

We should scrape the important dates info off of places like the Faculty of Arts & Science or UTM websites.

EDIT: This is a better list

anderson202 commented 8 years ago

I'd love to give the UTM scraper a try. Is there anything I should know or read before I start?

qasim commented 8 years ago

@anderson202 yes please! Give it a go and if you have any questions, we can answer them.

I have a very basic wiki here with information: https://github.com/cobalt-uoft/uoft-scrapers/wiki but there really isn't a lot in it. Have a look around at the other scrapers to see what's up.

For this one, UTMDates as the scraper name sounds appropriate.

We can also discuss the schema format we want to go with. Any ideas?

anderson202 commented 8 years ago

@qasim I'm definitely a newbie to this, so I'm not too sure what the format should look like.

Basic info we need would be the date and the detailed information regarding the day. Maybe we can list which academic session the date falls in as well.

A quick question: how should the scraper work? Should it scrape everything it can for upcoming dates, only a specific session, or only a specific date?

kashav commented 8 years ago

+1 on including the session, I'm thinking something like:

{
  "date":String,
  "session":String,
  "events":[String]
}

It looks like the UTM mobile site has links to two years' worth of data. I think the scraper can take a year parameter and then scrape <year>5 and <year>9 for the two sessions available.

For example (year = 2016):

Summer:
- http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20165
Fall/Winter:
- http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20169

Edit:

Looks like they actually have data since the 2010-11 school year - http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20105
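The session URL pattern above can be sketched in Python. This is only a minimal sketch, not the project's actual code: the function name `utm_session_urls` and the dictionary keys are hypothetical, and it builds the URLs without fetching or parsing anything.

```python
# Base URL taken from the links above; everything else is illustrative.
BASE_URL = "http://m.utm.utoronto.ca/importantDates.php"


def utm_session_urls(year):
    """Return the Summer and Fall/Winter URLs for a given year.

    Session codes are the year followed by 5 (Summer) or 9 (Fall/Winter).
    """
    sessions = {"summer": f"{year}5", "fall_winter": f"{year}9"}
    return {
        name: f"{BASE_URL}?mode=full&session={code}"
        for name, code in sessions.items()
    }


urls = utm_session_urls(2016)
print(urls["summer"])
print(urls["fall_winter"])
```

Keeping the URL construction in one small function would make it easy to loop over a range of years, for example from 2010 onward.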

anderson202 commented 8 years ago

Wow I didn't even think of using the mobile site. It's so much cleaner.

I'll start working on it and see if I can contribute to this. Thanks.

Edit: @kshvmdn if I follow your format, wouldn't that return a bunch of files corresponding to each day? Would it be better to alter it some way and return a file for each session instead?

For example, would this work? { "session":String, "dates": [{"date":String, "events":String}, ...] }
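To make the per-session idea concrete, here is a rough sketch of turning flat per-date records into one document per session. The helper name `group_by_session` and the sample records are made up for illustration; they are not real scraped data.

```python
from collections import defaultdict


def group_by_session(records):
    """Group flat {"date", "session", "events"} records by session."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["session"]].append(
            {"date": record["date"], "events": record["events"]}
        )
    # One document per session, in the order sessions were first seen.
    return [
        {"session": session, "dates": dates}
        for session, dates in grouped.items()
    ]


# Placeholder records, for illustration only.
sample = [
    {"date": "2016-09-06", "session": "20169", "events": ["Example event A"]},
    {"date": "2016-09-07", "session": "20169", "events": ["Example event B"]},
    {"date": "2016-05-09", "session": "20165", "events": ["Example event C"]},
]
docs = group_by_session(sample)
print(len(docs))
```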

kashav commented 8 years ago

I'll take the UTSGDates scraper!

kashav commented 8 years ago

@anderson202 That's actually what we want! Take a look at the athletics and shuttle scrapers, they work the same way.

I got started on the UTSG scraper and I found it might be better to use the following format instead:

{
  "date":String,
  "session":String,
  "events":[{
    "end_date":String, // some go on for more than a single day (e.g. winter break)
    "campus":String,
    "description":String
  }]
}

This will allow us to merge events across campuses for each date, like we do with the athletics scraper (take a look at this). The API ends up being a lot cleaner this way.
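As a rough sketch of that merge, assuming each campus scraper yields a list of documents in the format above: the function name `merge_by_date` and the sample data are hypothetical, and the session is simply taken from the first document seen for a date.

```python
def merge_by_date(*campus_docs):
    """Merge per-campus lists of date documents into one document per date."""
    merged = {}
    for docs in campus_docs:
        for doc in docs:
            # Create the entry on first sight, then append this campus's events.
            entry = merged.setdefault(
                doc["date"],
                {"date": doc["date"], "session": doc["session"], "events": []},
            )
            entry["events"].extend(doc["events"])
    # ISO 8601 date strings sort chronologically as plain strings.
    return [merged[date] for date in sorted(merged)]


# Placeholder documents, for illustration only.
utm = [{
    "date": "2016-12-22", "session": "20169",
    "events": [{"end_date": "2017-01-02", "campus": "UTM",
                "description": "Example UTM event"}],
}]
utsg = [{
    "date": "2016-12-22", "session": "20169",
    "events": [{"end_date": "2017-01-02", "campus": "UTSG",
                "description": "Example UTSG event"}],
}]
result = merge_by_date(utm, utsg)
print(len(result), len(result[0]["events"]))
```

Because `campus` lives on each event rather than on the date, a single date document can carry events from every campus.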

anderson202 commented 8 years ago

I think I have the UTM scraper done, but I'm not sure how the JSON files should be named. The ones I have currently are simply named after the date (or period) of the event as shown on the mobile site. Should I change that to a specific format before making a pull request?

kashav commented 8 years ago

We use the ISO 8601 format for dates. It isn't too hard to convert regular dates to this format; we do it in a lot of our scrapers using Python's datetime module.

The files can take this date as the name.
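A minimal sketch of that conversion with the standard `datetime` module follows. The input format `"%B %d, %Y"` (e.g. "September 6, 2016") is an assumption about how the site prints dates, and the helper names `to_iso_date` and `json_filename` are hypothetical; adjust the format string to match the actual scraped text.

```python
from datetime import datetime


def to_iso_date(text, fmt="%B %d, %Y"):
    """Parse a human-readable date and return it as an ISO 8601 string."""
    return datetime.strptime(text, fmt).date().isoformat()


def json_filename(text, fmt="%B %d, %Y"):
    """Build a file name from the ISO 8601 form of the date."""
    return to_iso_date(text, fmt) + ".json"


print(to_iso_date("September 6, 2016"))    # 2016-09-06
print(json_filename("September 6, 2016"))  # 2016-09-06.json
```

Note that `%B` matches month names in the current locale, so this assumes an English locale.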