TechAndCheck / tech-and-check-alerts

Daily tip sheet for fact checkers
MIT License
13 stars 6 forks source link

Sync Twitter account lists from Google Sheet #229

Closed reefdog closed 4 years ago

reefdog commented 4 years ago

The old system allows the team to maintain the list of desired Twitter accounts in a Google Sheet, and fetches it at scrape time. The new system stores the desired accounts locally in Postgres, and we plan to periodically update this local list from the same Sheets. (It's like a different scrape pipeline, hey neat!)

Note that the same account could be in multiple lists (say, National and North Carolina) and so any action we take (adding, deactivating, reactivating) should be specific to their presence/absence on that specific list.

reefdog commented 4 years ago

According to @chriszs on Slack, if we don't mind the core Google Sheet being publicly-readable, a much faster way of fetching data from it than negotiating Google OAuth would be to use the sheet's public CSV URL. There's some good chatter about it (and about navigating separate sheets) on this Stack Overflow question.

slifty commented 4 years ago

We have two decisions to make about the google side of this issue.

Decision 1: Nailing down the google sheet standard

Right now we have two formats for these twitter lists (one for North Carolina and one for National). It looks like North Carolina is a bit more robust than National, so I would recommend we use that format for both.

Here are the north carolina headers:

image

Here are the national headers:

image

If we go with NC, we may want to change it slightly to separate party from the candidate name, creating a new column for "party". We also might want to separate the honorific (Sen. and Rep.) from the name and put that in its own field.

The scraper really only cares about having access to:

Decision 2: what lookup approach should I use

Once the schema is locked in I have an implementation choice between looking up based on either:

  1. index (e.g. "Coumn 3 is the twitter handle")
  2. column name (e.g. "The column named Twitter is the twitter handle").

If we use the index then we can't shift rows around / we can only add rows to the end of the spreadsheet over time.

If we use column name then we can't rename column names in future / we need to be sure multiple columns don't share the same name.

(note that when I say "can't" I just mean we would have to update and deploy code to prevent the sync from failing)

chriszs commented 4 years ago

Re, decision 2: My experience of working in news where we often use spreadsheets with collaborators is that both approaches can fail, but I'd still suggest using names, so that at least if it's renamed it'll fail more decisively.

reefdog commented 4 years ago

Echo @chriszs. If someone accidentally disappears a critical column, it's easier to tell because columns["twitter"] is missing while columns[2] may still exist and have data.

Also it lets us resolve one possible complaint about Decision 1: as long as our critical columns exist and use consistent headers, each list can have additional list-specific columns that we can just ignore.

slifty commented 4 years ago

Agreed with all of the above -- I'm meeting with @markstencel on Friday to discuss in more detail and ensure that we're aligned with the needs of humans.

slifty commented 4 years ago

Note: we need the twitter scraper to be able to handle both @ and url versions of handles (and handles without @).

slifty commented 4 years ago

It occurred to me that we might want to break away from having twitter_1 and twitter_2, and instead create multiple rows for people with multiple twitter accounts.

This would allow folks with more than 2 accounts to have an account for each.

It would, however, result in some duplicate data entry on the part of the human (e.g. they would have to write the display name twice).

Thoughts on this? I can keep it with twitter_1 and twitter_2, it just struck me as potentially more flexibile to change approach.