Open arkon opened 8 years ago
I'd love to give the UTM scraper a try. Is there anything I should know or read before I start?
@anderson202 yes please! Give it a go and if you have any questions, we can answer them.
I have a very basic wiki here with information: https://github.com/cobalt-uoft/uoft-scrapers/wiki but it really isn't a lot. Have a look around at other scrapers to see what's up.
For this one, UTMDates as the scraper name sounds appropriate.
We can also discuss the schema format we want to go with. Any ideas?
@qasim I'm definitely a newbie to this so I'm not too sure what the format should look like.
Basic info we need would be the date and the detailed information regarding the day. Maybe we can list which academic session the date falls in as well.
A quick question, how should the scraper function? Should it scrape everything it can for upcoming dates, scrape only a specific session or a specific date?
+1 on including the session, I'm thinking something like:
{
"date":String,
"session":String,
"events":[String]
}
It looks like the UTM mobile site has links to two years' worth of data. I think the scraper can take a year parameter and then it'll scrape <year>5 and <year>9 for the two sessions available.
For example (year = 2016):
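A minimal sketch of how the year parameter could map to the two session pages. It assumes the URL pattern of the 2010-11 link posted in this thread holds for other years; `BASE_URL` and `session_urls` are hypothetical names, not part of the existing scrapers.

```python
from urllib.parse import urlencode

# Assumption: same endpoint and query parameters as the link shared in this thread.
BASE_URL = "http://m.utm.utoronto.ca/importantDates.php"


def session_urls(year):
    """Return the two session URLs for a year, keyed by session code."""
    urls = {}
    for suffix in ("5", "9"):  # <year>5 and <year>9 are the two sessions
        session = f"{year}{suffix}"
        query = urlencode({"mode": "full", "session": session})
        urls[session] = f"{BASE_URL}?{query}"
    return urls


for session, url in session_urls(2016).items():
    print(session, url)
```

For year = 2016 this yields the session codes 20165 and 20169.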
Edit:
Looks like they actually have data since the 2010-11 school year - http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20105
Wow I didn't even think of using the mobile site. It's so much cleaner.
I'll start working on it and see if I can contribute to this. Thanks.
Edit: @kshvmdn if I follow your format, wouldn't that return a bunch of files corresponding to each day? Would it be better to alter it some way and return a file for each session instead?
For example, would this work?
{ "session":String, "dates": [{"date":String, "events":String}, ...] }
@anderson202 That's actually what we want! Take a look at the athletics and shuttle scrapers, they work the same way.
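To make the two shapes concrete, here is a rough sketch of collapsing per-day documents into one document per session. `group_by_session` is a hypothetical helper, and it assumes dates are stored as sortable strings (for example ISO 8601), which nobody in this thread has settled on yet.

```python
from collections import defaultdict


def group_by_session(day_docs):
    """Collapse per-day documents into one document per session."""
    grouped = defaultdict(list)
    for doc in day_docs:
        grouped[doc["session"]].append(
            {"date": doc["date"], "events": doc["events"]}
        )
    # Assumption: date strings sort chronologically (e.g. "2016-09-06").
    return [
        {"session": session, "dates": sorted(dates, key=lambda d: d["date"])}
        for session, dates in grouped.items()
    ]
```

Each returned dictionary would then be written out as one file per session instead of one file per day.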
I got started on the UTSG scraper and found it might be better to use the following format instead:
{
"date":String,
"session":String,
"events":[{
"end_date":String, // some go on for more than a single day (e.g. winter break)
"campus":String,
"description":String
}]
}
This will allow us to merge events across campuses for each date, like we do with the athletics scraper (take a look at this). The API ends up being a lot cleaner this way.
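A rough sketch of what that merge could look like with the format above. This is not the athletics scraper's actual code; `merge_by_date` is a hypothetical helper, and it assumes each campus scraper returns a list of per-date documents whose date strings sort chronologically.

```python
def merge_by_date(*campus_docs):
    """Merge per-date documents from several campuses into one list."""
    merged = {}
    for docs in campus_docs:
        for doc in docs:
            # First time a date is seen, start an entry with no events.
            entry = merged.setdefault(
                doc["date"],
                {"date": doc["date"], "session": doc["session"], "events": []},
            )
            # Each event carries its own "campus", so sources stay distinguishable.
            entry["events"].extend(doc["events"])
    return [merged[date] for date in sorted(merged)]
```

Since every event keeps its own campus field, a single date document can list UTM and UTSG events side by side.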
I think I have the UTM scraper done, but I'm not sure how the JSON files should be named. Currently each file is simply named after the date (or period) of the event as shown on the mobile site. Should I change that to a specific format before making a pull request?
We should scrape the important dates info off of places like the Faculty of Arts & Science or UTM websites.
EDIT: This is a better list