Problem

We want to scrape lots of Twitter Lists that aggregate all the politicians in a single country (e.g. as used by Politwoops) as a way of discovering the Twitter handles for the legislators in that country.

Writing each of those scrapers currently requires a lot of boilerplate code (see, for example, https://github.com/everypolitician-scrapers/twitter-colombia-senate-list/blob/master/scraper.rb), so we want to factor that out into an installable gem, leaving each scraper only a line or two long.

This not only makes them much simpler to create in the first place, but also means that when we want to do something new with those scrapers, we should only need to update a single gem, rather than ending up with hundreds of scrapers, all slightly out of sync with each other, that each need to be updated individually.

Proposed Solution

In https://github.com/everypolitician-scrapers/twitter-argentina/blob/master/scraper.rb, I've already factored the common code out into a class. We should extract that into a separately installable gem.

Not Required

Some things we have discussed, but should ignore for now, and add later if required:

Combining multiple lists
Extra arguments for accounts to ignore or add

There are also obvious commonalities here to scraping other sorts of twitter lists — e.g. follower/friend lists. But we should resist the temptation to try to make this handle those too. We can work out later how common those actually are and create other methods, or even other gems, to handle those variations.

Acceptance Criteria

The twitter-argentina and twitter-colombia-senate-list scrapers above should each reduce to two lines (beyond the imports), roughly equivalent to:

    people = TwitterList.new('lechinoise', 'politic-arg').people
    ScraperWiki.save_sqlite([:id], people)
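
As a rough illustration of the interface those two lines imply, here is a minimal sketch of what the extracted class might look like. It is not the code from the twitter-argentina scraper. The `FIELDS` subset and the `client:` keyword are assumptions made so the sketch can run without credentials; the real gem would presumably build its own client (for example a `Twitter::REST::Client` from the `twitter` gem, which provides `list_members`) so that the call stays at two arguments. `FakeMember` and `FakeClient` are hypothetical stand-ins.

```ruby
# Hypothetical sketch of the extracted class; the real gem's API may differ.
class TwitterList
  # Fields kept for each list member (an assumed subset, for illustration).
  FIELDS = %i[id name screen_name].freeze

  # client: any object responding to list_members(owner, slug), such as a
  # configured Twitter::REST::Client. Injecting it is an assumption made
  # here so the sketch can run offline.
  def initialize(owner, slug, client:)
    @owner = owner
    @slug = slug
    @client = client
  end

  # Returns one hash per list member, in a shape that could be passed
  # to ScraperWiki.save_sqlite([:id], people).
  def people
    @client.list_members(@owner, @slug).map do |member|
      FIELDS.map { |field| [field, member.public_send(field)] }.to_h
    end
  end
end

# Stand-ins so the example runs without Twitter API credentials.
FakeMember = Struct.new(:id, :name, :screen_name)

class FakeClient
  def list_members(_owner, _slug)
    [FakeMember.new(1, 'Example Senator', 'example_senator')]
  end
end

people = TwitterList.new('lechinoise', 'politic-arg', client: FakeClient.new).people
```

Keeping the field list in one constant also gives the documentation a single place to point at when explaining which fields are returned.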

Due Date

We want to create lots of these lists in w/c 2016-07-18, so as early as possible in that week.

Design?

No

Text?

This library has no direct connection with EveryPolitician: in theory it should be usable by anyone wanting to get data on the people in a Twitter List. So we should make sure it's well enough documented for people to use for purposes entirely unrelated to ours. (That should probably include noting that we're only returning a subset of fields from the Twitter API, and that we'll happily accept PRs for others.)

Bloggable?

Once we've actually used the library to import lots of new twitter handles to EP, there's probably a good story about that, which could in passing mention the library.

If we had an EP tech blog, the library itself would probably also be a good thing to discuss on it, as other people might find it useful.

tmtmtmtm commented 8 years ago

Once this is done, we should work our way through https://github.com/everypolitician/everypolitician-data/labels/source%3A%20Twitter and the Politwoops lists.

octopusinvitro commented 8 years ago

I think this idea of extracting boilerplate code from the scrapers into a gem could also be applied to other scrapers... maybe we could have a googlesheets gem, an html gem, etc. as well?