datatogether / archivertools

Python package for scraping websites into the Data Together pipeline via morph.io
GNU Affero General Public License v3.0

Port depth-first-search features to archivertools package #4

Open jeffreyliu opened 7 years ago

jeffreyliu commented 7 years ago

@chi-feng wrote a nice demo scraper for enumerating through search forms. We should extract the components of that scraper for easier reuse.

I think the main components to extract would be:

  1. Local test server which provides a toy autocomplete endpoint
  2. Depth-first-search iterator/function, which recursively generates finer-grained queries as long as a stopping condition is not met (e.g. as long as the server's response is still truncated); see the sketch after this list
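
A minimal sketch of what that depth-first-search piece could look like, assuming the autocomplete-enumeration behavior of @chi-feng's demo (names here are hypothetical: `query_fn` stands in for whatever actually hits the autocomplete endpoint, and `max_results` is the size at which the server truncates its responses):

```python
import string


def dfs_queries(query_fn, prefix="", max_results=10, alphabet=string.ascii_lowercase):
    """Depth-first enumeration of an autocomplete-style search form.

    `query_fn(prefix)` is assumed to return the suggestions for `prefix`,
    truncated to at most `max_results` entries.
    """
    results = query_fn(prefix)
    if len(results) < max_results:
        # Untruncated response: the server returned everything it has
        # for this prefix, so stop descending.
        yield from results
    else:
        # Truncated response: the prefix is too broad. Yield an exact match
        # for the prefix itself (it would otherwise be lost), then refine
        # the prefix one character at a time.
        if prefix in results:
            yield prefix
        for ch in alphabet:
            yield from dfs_queries(query_fn, prefix + ch, max_results, alphabet)
```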

My proposal for how to organize:

  1. Create a class named Server, which creates a local webserver that contains common website elements for people to test against (see the sketch after this list). First we'll just port the autocomplete server from @chi-feng's demo, but future versions could include toy versions of other tricky elements (ajax-populated tables, preformatted text, etc.)
  2. Create either a class/module that collects useful iterators/generators, such as the depth-first-search query generator, or a collection of canonical example scripts to be used for reference. Or possibly both?
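
To make item 1 concrete, here is a rough, standard-library-only sketch of what the Server class could look like; the endpoint path, toy word list, and port are placeholders rather than a proposed API:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

WORDS = ["alpha", "alpine", "beta", "betamax", "gamma"]  # toy corpus
MAX_RESULTS = 3  # small cap so the toy corpus actually triggers truncation


class _AutocompleteHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/autocomplete":
            self.send_error(404)
            return
        prefix = parse_qs(parsed.query).get("q", [""])[0]
        matches = sorted(w for w in WORDS if w.startswith(prefix))[:MAX_RESULTS]
        body = json.dumps(matches).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


class Server:
    """Local test server exposing a toy /autocomplete?q=<prefix> endpoint."""

    def __init__(self, port=8000):
        self._httpd = HTTPServer(("localhost", port), _AutocompleteHandler)
        self.url = "http://localhost:%d" % port

    def start(self):
        # Serve in a background thread so tests/scrapers can run alongside it.
        threading.Thread(target=self._httpd.serve_forever, daemon=True).start()

    def stop(self):
        self._httpd.shutdown()
        self._httpd.server_close()
```

Other tricky elements (ajax-populated tables, preformatted text, etc.) could later be added as additional routes on the same class.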

Rationale/discussion:

  1. Local server for testing keeps us from accidentally overwhelming the site we're trying to scrape during script development and testing phases. It also allows us to write controlled unit tests (see the sketch after this list).
  2. Regarding how to approach and package the iterator/other tools
    • A class/module that packages tools would be easier for people developing scrapers (just `from archivertools import useful_function`), but less flexible if people encounter sites that don't look like what we've anticipated. It would also be more work for us to abstract the tools so they apply beyond a single site.
    • The collection of canonical scripts is less work for us, but possibly more difficult for people writing scrapers, because they'd have to understand how the code works as well as how to modify it to work for their specific problem. Also, it makes things less predictable on our end for testing, because people will probably end up implementing things differently.
    • Doing both would be the best of both worlds, but it requires the most work.
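
To illustrate the controlled-unit-test point from item 1 above, a test against the local server might look roughly like this (assuming the hypothetical `Server`, `dfs_queries`, `WORDS`, and `MAX_RESULTS` names from the sketches above):

```python
import json
import unittest
from urllib.request import urlopen


class TestAutocompleteDFS(unittest.TestCase):
    def setUp(self):
        self.server = Server(port=8765)
        self.server.start()

    def tearDown(self):
        self.server.stop()

    def test_enumerates_whole_toy_corpus(self):
        def query_fn(prefix):
            # Hit the toy endpoint; the server truncates to MAX_RESULTS entries.
            url = "%s/autocomplete?q=%s" % (self.server.url, prefix)
            with urlopen(url) as resp:
                return json.loads(resp.read().decode("utf-8"))

        found = set(dfs_queries(query_fn, max_results=MAX_RESULTS))
        self.assertEqual(found, set(WORDS))


if __name__ == "__main__":
    unittest.main()
```

Because the server is local and deterministic, the test never touches the real site and can assert the exact set of expected results.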