cobalt-uoft / uoft-scrapers

Public web scraping scripts for the University of Toronto.
https://pypi.python.org/pypi/uoftscrapers
MIT License
48 stars 14 forks source link

Improve the library's API #14

Closed qasim closed 8 years ago

qasim commented 8 years ago

The following is kind of wonky:

m = uoftscrapers.Buildings("./some/path")
m.run()

Why don't we change this to something like the following?

m = uoftscrapers.Buildings()
m.scrape("./some/path")
qasim commented 8 years ago

Open to suggestions.

qasim commented 8 years ago

Come to think of it, these scrapers don't really have to be instantiated. Can call them statically:

uoftscrapers.Buildings.scrape()
kashav commented 8 years ago

Definitely prefer this. How would it work though – scrape has to be a class method, right? Otherwise, wouldn't it still require a self argument (which requires the class to be instantiated)?

qasim commented 8 years ago

That's true. I guess the best we can get is uoftscrapers.Buildings().scrape('./some/path').

qasim commented 8 years ago

We will change the library's API to the new one upon completion of https://github.com/cobalt-uoft/uoft-scrapers/issues/15 and https://github.com/cobalt-uoft/uoft-scrapers/issues/18.

kashav commented 8 years ago

Are we still making this change? It should probably happen before any new scrapers are added, right?

qasim commented 8 years ago

@We, we should. We can use Python's @staticmethod decorator to be able to call the API like:

uoftscrapers.Buildings.scrape('./some/path')

Which means changing the structure of the scraper classes a little, as output_location is part of the superclass but shouldn't be. Also, instead of having instantiated variables (self.x = x), we will use static class variables. An example of their usage: https://github.com/cobalt-uoft/uoft-scrapers/blob/master/uoftscrapers/scrapers/textbooks/__init__.py

class Textbooks(Scraper):
    """A scraper for UofT's book store.
    UofT Book Store is located at http://uoftbookstore.com/.
    """

    host = 'http://uoftbookstore.com'
    logger = logging.getLogger('uoftscrapers')
    threads = 32

Ultimately, we should remove all cases of the __init__() function in all the classes, which means Scraper and LayerScraper will just have static methods that the actual scrapers call instead of super().__init__(). Also, in each scraper, the top level method should be:

# previously __init__(self, output_location) + run(self)
@staticmethod
def scrape(output_location='.')

I'll take care of this issue by tonight before any of the other scrapers are made.

qasim commented 8 years ago

(And since this is a breaking API change, it will be released on 0.3.0 which if I manage this issue by tonight, will also be released tonight).

qasim commented 8 years ago

Opinions?

qasim commented 8 years ago

What it would look like:

class Textbooks:
    """A scraper for UofT's book store.
    UofT Book Store is located at http://uoftbookstore.com/.
    """

    host = 'http://uoftbookstore.com'
    threads = 32

    @staticmethod
    def scrape(output_location='.'):
        Scraper.ensure_location(output_location)
        Scraper.logger.info('Textbooks initialized.')

        # ... the scraping ...

        Scraper.logger.info('Textbooks completed.')
kashav commented 8 years ago

Looks good to me