cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License

Athletics scraper - merge events from different campuses that are on the same date? #66

Closed qasim closed 8 years ago

qasim commented 8 years ago

As it currently stands, the athletics scraper produces documents whose top level is a date. However, the data for the different campuses still ends up in 2 separate files (e.g. 01M and 01SC). I think it would make more sense to concatenate the two and have a schema as follows:

{
  "date":String,
  "events":[{
    "title":String,
    "location":String,
    "building_id":String,
    "campus":String,
    "start_time":String,
    "end_time":String
  }]
}
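To make the proposed shape concrete, here is a minimal sketch of one merged document for a single date, plus a small check that it has the keys listed above. Only the key names come from the schema; every value, the date and time formats, and the `matches_schema` helper are made up for illustration.

```python
# Hypothetical sample document; only the key names come from the schema above.
# The date/time formats and all values are assumptions for illustration.
doc = {
    "date": "2016-04-01",
    "events": [
        {
            "title": "Drop-in basketball",
            "location": "Gym A",
            "building_id": "000",
            "campus": "UTM",
            "start_time": "09:00",
            "end_time": "10:00",
        },
        {
            "title": "Lane swim",
            "location": "Pool",
            "building_id": "001",
            "campus": "UTSC",
            "start_time": "11:00",
            "end_time": "12:00",
        },
    ],
}

EVENT_KEYS = {"title", "location", "building_id", "campus",
              "start_time", "end_time"}


def matches_schema(candidate):
    """Return True if candidate has exactly the proposed keys, all strings."""
    if not isinstance(candidate, dict) or set(candidate) != {"date", "events"}:
        return False
    if not isinstance(candidate["date"], str):
        return False
    if not isinstance(candidate["events"], list):
        return False
    return all(
        isinstance(event, dict)
        and set(event) == EVENT_KEYS
        and all(isinstance(value, str) for value in event.values())
        for event in candidate["events"]
    )
```

The point of the single `events` list is that events from both campuses sit side by side under one date, with `campus` telling them apart.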

Looking at how we lay out scrapers, this may actually prove to be non-trivial. Any opinions on this change, and if we were to implement it, how should we go about doing so?

kashav commented 8 years ago

I like this idea – the data will be a lot cleaner and we won't be repeating ids each month.

It shouldn't be hard to implement either. If we want to preserve the feature of scraping each campus separately, we can add a Boolean parameter to each scrape method that decides whether we save the data or return it. Then in exams.__init__ we can merge the sets and save them.

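from collections import OrderedDict

# With save=False, each campus scraper returns its data instead of writing it.
utm = UTMExams.scrape(location, save=False)
utsc = UTSCExams.scrape(location, save=False)

# Merge both campuses into one document per date.
docs = OrderedDict()
for campus in (utm, utsc):
    for date in campus:
        if date not in docs:
            docs[date] = OrderedDict([
                ('date', date),
                ('events', [])
            ])
        docs[date]['events'].extend(campus[date]['events'])

# Save one JSON file per date.
for date, doc in docs.items():
    Scraper.save_json(doc, location, date)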

There might be a better solution, since this requires each campus scraper to have that same schema.
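For anyone who wants to try the merge step without the scrapers, here is a self-contained sketch of the same idea. It assumes each campus scraper returns a mapping from date to a document with an `events` list, as the snippet above implies. The `merge_by_date` name and the sample data are hypothetical, not part of the project.

```python
from collections import OrderedDict


def merge_by_date(*campuses):
    """Merge per-campus {date: {'date': ..., 'events': [...]}} mappings.

    Returns an OrderedDict keyed by date, in first-seen order, where each
    document holds the events of every campus for that date.
    """
    docs = OrderedDict()
    for campus in campuses:
        for date, doc in campus.items():
            if date not in docs:
                docs[date] = OrderedDict([('date', date), ('events', [])])
            docs[date]['events'].extend(doc['events'])
    return docs


# Hypothetical stand-ins for what scrape(location, save=False) might return.
utm = OrderedDict([
    ('2016-04-01', {'date': '2016-04-01',
                    'events': [{'title': 'Yoga', 'campus': 'UTM'}]}),
])
utsc = OrderedDict([
    ('2016-04-01', {'date': '2016-04-01',
                    'events': [{'title': 'Swim', 'campus': 'UTSC'}]}),
    ('2016-04-02', {'date': '2016-04-02',
                    'events': [{'title': 'Squash', 'campus': 'UTSC'}]}),
])

merged = merge_by_date(utm, utsc)
```

Each value in `merged` could then be passed to `Scraper.save_json` as in the loop above. Dates that appear in only one campus still get a document, and the input mappings are left unmodified because the event lists are copied into new ones.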

qasim commented 8 years ago

@kshvmdn that makes sense to me, better than what I was thinking :)

qasim commented 8 years ago

Awesome. The JSON files look even better now, since they match stuff like the shuttles scraper. Consistency is key!

Thanks again ^_^