Library information scraper

cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.

https://pypi.python.org/pypi/uoftscrapers

MIT License


Library information scraper #31

Closed qasim closed 8 years ago

qasim commented 8 years ago

https://onesearch.library.utoronto.ca/visit

Most libraries at UofT aren't whole buildings. Here's a list of libraries, including ones that /are/ buildings as well as ones inside other buildings.

Fields we could scrape: library names, location (we'd have to map these to building IDs to minimize duplicate information), operating hours, photo, website URL, description, "collection strengths", and "how to access".
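A rough sketch of what one scraped record might look like, just to make the field list concrete. The class name, field names, and the example values are all assumptions for illustration, not the scraper's actual schema; everything except the name defaults to `None` in case the page doesn't provide it.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Library:
    # Hypothetical field names, chosen only to mirror the list above.
    name: str
    building_id: Optional[str] = None  # location mapped to a building ID, if matched
    hours: Optional[dict] = None
    photo: Optional[str] = None
    website: Optional[str] = None
    description: Optional[str] = None
    collection_strengths: Optional[str] = None
    how_to_access: Optional[str] = None


# Placeholder values, not real scraped data.
lib = Library(name="Example Library", website="https://example.invalid/")
record = asdict(lib)  # plain dict, ready to be dumped as JSON
```

Storing only a building ID, rather than repeating the address, is what keeps the duplicate information down.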

g3wanghc commented 8 years ago

@qasim I'm down to work on this one.

qasim commented 8 years ago

It's yours!

g3wanghc commented 8 years ago

Invite me to cobalt? :V

g3wanghc commented 8 years ago

Just had a chance to take a look at the actual site. "All hours", "Collection strengths", and "How to access" are optional fields that may not exist. "All hours" is actually a link to a calendar; should we just provide the link?

@qasim Do we have a public API key without rate-limiting for GET buildings/search/?q={{address}}&key={{public_key}}?
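For reference, a minimal sketch of how that request URL could be built. Only the `buildings/search/` path and the `q` and `key` parameters come from the question above; the base URL is a placeholder, and the function name and example address are made up for illustration.

```python
from urllib.parse import urlencode

# Placeholder host; the real API base URL is not given in this thread.
API_BASE = "https://api.example.invalid/"


def building_search_url(address, public_key, base=API_BASE):
    """Return the GET URL for a building search by address."""
    # urlencode escapes spaces and punctuation in the address.
    query = urlencode({"q": address, "key": public_key})
    return f"{base}buildings/search/?{query}"


url = building_search_url("1 Example St", "PUBLIC_KEY")
```

The scraper would then issue a GET against that URL once per library address, which is why rate limiting matters here.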

g3wanghc commented 8 years ago

Libraries have 2 links each, e.g. "Health Science Information Consortium of Toronto":

/content/health-science-information-consortium-toronto
/library-info/HSICT

They also don't necessarily have a Description, and sometimes contain a Teaser text.
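Both paths are site-relative, so the scraper would need to resolve them before following them. A quick sketch, assuming they resolve against the same host as the /visit page linked at the top of this issue:

```python
from urllib.parse import urljoin

# Assumption: the links are relative to the page they were found on.
VISIT_URL = "https://onesearch.library.utoronto.ca/visit"

paths = [
    "/content/health-science-information-consortium-toronto",
    "/library-info/HSICT",
]

# A leading slash makes urljoin replace the whole path of VISIT_URL.
links = [urljoin(VISIT_URL, path) for path in paths]
```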

qasim commented 8 years ago

We have no way of requesting API keys for scrapers yet. It's on the roadmap though.

On average is there enough information? If not, we may just call it here and not scrape libraries.

As for hours, I saw that they had the link to the calendar. If we do continue to scrape, we could do 1 week of the calendar and use an hours structure similar to the one the Food scraper has.
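Something along these lines, maybe. This is only a sketch: the Food scraper's actual hours structure isn't shown in this thread, so the shape below (one entry per weekday with `closed`, `open`, and `close`) is an assumption, and `week_of_hours` and `lookup` are hypothetical names. `lookup` stands in for whatever reads a single day off the calendar page.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def week_of_hours(start, lookup):
    """Collect 7 consecutive days of hours, keyed by weekday name.

    lookup(day) returns an (open, close) pair of "HH:MM" strings,
    or None when the library is closed that day.
    """
    week = {}
    for offset in range(7):
        day = start + timedelta(days=offset)
        slot = lookup(day)
        week[WEEKDAYS[day.weekday()]] = {
            "closed": slot is None,
            "open": slot[0] if slot else None,
            "close": slot[1] if slot else None,
        }
    return week


# Made-up schedule: closed on Sundays, 09:00 to 17:00 otherwise.
def sample_lookup(day):
    return None if day.weekday() == 6 else ("09:00", "17:00")


hours = week_of_hours(date(2016, 4, 4), sample_lookup)
```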

g3wanghc commented 8 years ago

On average, there's usually enough information.

I will take your advice on the calendar thing. It doesn't look like too much work. ¯\_(ツ)_/¯