cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License

Add textbook information scraper from UofT Bookstore #18

Closed by qasim 8 years ago

qasim commented 8 years ago

Work has begun over at uoft-scrapers/add-textbooks

qasim commented 8 years ago

@kshvmdn @arkon @/anyone else who's interested: how's this for a Textbook schema? The scraper over at #25 is currently outputting this type of JSON.

[
  {
    "id":"9781111831776",
    "title":"Understanding Humans",
    "edition":11,
    "author":"Lewis",
    "image":"http://uoftbookstore.com/cover_image.asp?Key=9781111831776&Size=L&p=1",
    "price":140.4,
    "courses":[{
      "id":"ANT101H5S20161",
      "code":"ANT101H5S",
      "required":true,
      "meeting_sections":[
        {
          "code":"L0101",
          "instructors":["F Sherry"]
        },
        {
          "code":"L0102",
          "instructors":["F Sherry"]
        }
      ]
    }]
  }
]

id is also the ISBN of the book. courses lists each course that uses the book, the meeting section(s) where the book was listed, and whether it's actually required (vs. optional).
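For anyone reviewing the output, here is a rough sketch of a sanity check against the schema above. The field names and types come from the sample; the check_textbook helper and its messages are hypothetical and not part of the scraper.

```python
# Field names and types are read off the sample JSON above.
# The checker itself is illustrative only, not part of uoft-scrapers.
TEXTBOOK_FIELDS = {
    "id": str,
    "title": str,
    "edition": int,
    "author": str,
    "image": str,
    "price": (int, float),  # a whole-dollar price would parse as int
    "courses": list,
}

COURSE_FIELDS = {
    "id": str,
    "code": str,
    "required": bool,
    "meeting_sections": list,
}


def _check(record, fields, label):
    """Collect missing keys and unexpected types for one dict."""
    problems = []
    for key, expected in fields.items():
        if key not in record:
            problems.append("{}: missing key {}".format(label, key))
        elif not isinstance(record[key], expected):
            problems.append("{}: unexpected type for {}".format(label, key))
    return problems


def check_textbook(book):
    """Return a list of problems for one textbook dict (empty if none)."""
    problems = _check(book, TEXTBOOK_FIELDS, "textbook")
    courses = book.get("courses")
    if isinstance(courses, list):
        for course in courses:
            if isinstance(course, dict):
                problems += _check(course, COURSE_FIELDS, "course")
            else:
                problems.append("course: expected an object")
    return problems
```

An empty list means the record matches the proposed shape; anything else names the key that is off.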

kashav commented 8 years ago

Scraper looks really nice. It might be a good idea to add a url key. It looks like there's a hidden input with the product number for each book, which can be used to build the URL:

<input type="hidden" name="pf_id-1" id="pf_id-1" value="11078616" class="product-field-pf_id">

http://uoftbookstore.com/buy_book_detail.asp?pf_id=11078616

<input type="hidden" name="pf_id-2" id="pf_id-2" value="11119104" class="product-field-pf_id">

http://uoftbookstore.com/buy_book_detail.asp?pf_id=11119104

Also, I think some books have replacements, so instead of "Required" it'd say "Alternate" (check MAT137). We may need an alternate key for that.
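A minimal sketch of how the hidden pf_id inputs could be turned into detail URLs, using only the standard library. The class name product-field-pf_id and the URL pattern are taken from the snippets above; ProductIdParser and product_urls are hypothetical names, and the actual scraper may well do this differently.

```python
from html.parser import HTMLParser

# URL pattern as shown in the examples above.
DETAIL_URL = "http://uoftbookstore.com/buy_book_detail.asp?pf_id={}"


class ProductIdParser(HTMLParser):
    """Collects the value of every <input class="product-field-pf_id">."""

    def __init__(self):
        super().__init__()
        self.product_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "input" and "product-field-pf_id" in classes:
            value = attributes.get("value")
            if value:
                self.product_ids.append(value)


def product_urls(html):
    """Return one detail-page URL per product id found in the HTML."""
    parser = ProductIdParser()
    parser.feed(html)
    return [DETAIL_URL.format(pid) for pid in parser.product_ids]
```

Matching on the class rather than the numbered id (pf_id-1, pf_id-2, ...) avoids having to know how many books a page lists.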

qasim commented 8 years ago

Good catch, I was looking for a unique ID other than the ISBN but couldn't find one. We'll use that product number as the ID and have another field for ISBN. Also added the URL.

Now there's a requirement key, with values required, optional, recommended, or alternate for each course.
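As a sketch of the mapping, assuming the bookstore shows labels such as "Required" or "Alternate" that differ from the schema values only in case and surrounding whitespace (that is an assumption; normalize_requirement is a hypothetical helper, not the scraper's code):

```python
# The four values listed above for the requirement key.
REQUIREMENTS = ("required", "optional", "recommended", "alternate")


def normalize_requirement(label):
    """Map a bookstore label such as "Required" to a schema value.

    Raises ValueError on anything outside the four known values, so a
    new label on the site is noticed instead of silently passed through.
    """
    value = label.strip().lower()
    if value not in REQUIREMENTS:
        raise ValueError("unknown requirement label: {!r}".format(label))
    return value
```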

qasim commented 8 years ago

{
    "id":"12433957",
    "isbn":"9780062020444",
    "title":"Is The Internet Changing The Way You Think?",
    "edition":1,
    "author":"Brockman, John",
    "image":"http://uoftbookstore.com/cover_image.asp?Key=9780062020444&Size=L&p=1",
    "price":16.05,
    "url":"http://uoftbookstore.com/buy_book_detail.asp?pf_id=12433957",
    "courses":[  
      {  
        "id":"CCT260H5S20161",
        "code":"CCT260H5S",
        "requirement":"required",
        "meeting_sections":[  
          {  
            "code":"L0101",
            "instructors":[  
              "N Shanta"
            ]
          },
          {  
            "code":"L0102",
            "instructors":[  
              "N Shanta"
            ]
          }
        ]
      },
      {  
        "id":"CCT406H5S20161",
        "code":"CCT406H5S",
        "requirement":"required",
        "meeting_sections":[  
          {  
            "code":"L0101",
            "instructors":[  
              "N Shanta"
            ]
          }
        ]
      }
    ]
}

arkon commented 8 years ago

I think some books might have multiple authors, so author could be an array of strings. Editions might also be strings like "5th Canadian Edition" or something similar.
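One way both suggestions could look in code, as a sketch only. The thread doesn't say how the bookstore separates multiple authors, so the ";" separator below is purely an assumption (a comma can't be used, since names appear as "Brockman, John"). parse_authors and parse_edition are hypothetical helpers.

```python
def parse_authors(raw, sep=";"):
    """Split a raw author string into a list of names.

    The separator is an assumption; a single author yields a
    one-element list so the field is always an array of strings.
    """
    return [name.strip() for name in raw.split(sep) if name.strip()]


def parse_edition(raw):
    """Return the edition as an int when it is numeric.

    Anything else (e.g. "5th Canadian Edition") is kept as a string.
    """
    try:
        return int(raw)
    except (TypeError, ValueError):
        return str(raw).strip()
```

With this, "Brockman, John" stays a single author, and an edition of "11" becomes the integer 11 while a descriptive edition is left untouched.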

qasim commented 8 years ago

@arkon I'm working on the multiple author support now. 👊

I can't seem to find an example of an edition string. Have you found any?

arkon commented 8 years ago

@qasim Hmm maybe not then. I only vaguely remember it when looking for textbooks in recent years, but maybe it was just part of the title.