reefdog closed this issue 4 years ago.
According to @chriszs on Slack, if we don't mind the core Google Sheet being publicly-readable, a much faster way of fetching data from it than negotiating Google OAuth would be to use the sheet's public CSV URL. There's some good chatter about it (and about navigating separate sheets) on this Stack Overflow question.
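For illustration, fetching a publicly-readable sheet that way is only a few lines of Python. The sheet ID and `gid` below are placeholders (each tab in the workbook has its own `gid`, which is how you'd navigate to a separate sheet, per the Stack Overflow discussion):

```python
import csv
import io
import urllib.request


def export_csv_url(sheet_id: str, gid: str = "0") -> str:
    """Build the public CSV export URL for one tab of a Google Sheet."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )


def fetch_rows(sheet_id: str, gid: str = "0") -> list:
    """Fetch a publicly-readable tab and parse it into dicts keyed by header."""
    with urllib.request.urlopen(export_csv_url(sheet_id, gid)) as resp:
        return list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))
```

This only works if the sheet is shared as viewable-by-anyone; otherwise the export URL redirects to a login page.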
We have two decisions to make about the Google side of this issue.
Right now we have two formats for these Twitter lists (one for North Carolina and one for National). The North Carolina format looks a bit more robust than the National one, so I would recommend we use it for both.
If we go with NC, we may want to change it slightly to separate party from the candidate name, creating a new "party" column. We also might want to separate the honorific (Sen. and Rep.) from the name and put that in its own field.
The scraper really only cares about having access to:
Once the schema is locked in, I have an implementation choice between looking up columns by either index or name.
If we use the index, then we can't shift columns around, and we can only add new columns at the end of the spreadsheet over time.
If we use the column name, then we can't rename columns in future, and we need to be sure multiple columns don't share the same name.
(Note that when I say "can't" I just mean we would have to update and deploy code to prevent the sync from failing.)
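A quick sketch of the trade-off using Python's `csv` module and a toy row (the header names here are placeholders for whatever schema we settle on):

```python
import csv
import io

CSV_TEXT = "name,party,twitter\nJane Doe,D,@janedoe\n"  # toy sheet export

# Lookup by name: survives column reordering, fails loudly (KeyError)
# if someone renames the header.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
by_name = rows[0]["twitter"]

# Lookup by index: survives renames, but silently grabs the wrong column
# if anyone reorders the sheet.
raw = list(csv.reader(io.StringIO(CSV_TEXT)))
by_index = raw[1][2]  # raw[0] is the header row
```

Both reach the same cell here; the difference is only in which sheet edit breaks which approach, and how visibly.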
Re: Decision 2, my experience working in news, where we often use spreadsheets with collaborators, is that both approaches can fail, but I'd still suggest using names, so that at least if a column is renamed it'll fail more decisively.
Echo @chriszs. If someone accidentally disappears a critical column, it's easier to tell, because `columns["twitter"]` is missing while `columns[2]` may still exist and have data.
Also it lets us resolve one possible complaint about Decision 1: as long as our critical columns exist and use consistent headers, each list can have additional list-specific columns that we can just ignore.
Agreed with all of the above -- I'm meeting with @markstencel on Friday to discuss in more detail and ensure that we're aligned with the needs of humans.
Note: we need the Twitter scraper to be able to handle both `@` and URL versions of handles (and handles without the `@`).
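A minimal normalization sketch, assuming handles only arrive in those three shapes (bare name, `@`-prefixed name, or a full twitter.com URL); the function name is just illustrative:

```python
from urllib.parse import urlparse


def normalize_handle(raw: str) -> str:
    """Reduce '@name', 'name', or a twitter.com URL to a bare lowercase handle."""
    raw = raw.strip()
    if raw.startswith(("http://", "https://")):
        # Take the first path segment: twitter.com/JaneDoe -> JaneDoe
        raw = urlparse(raw).path.strip("/").split("/")[0]
    return raw.lstrip("@").lower()
```

Normalizing at ingest also makes the "same account on multiple lists" matching discussed below more reliable, since `@JaneDoe` and `janedoe` collapse to one key.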
It occurred to me that we might want to break away from having `twitter_1` and `twitter_2`, and instead create multiple rows for people with multiple Twitter accounts.
This would allow folks with more than 2 accounts to have an account for each.
It would, however, result in some duplicate data entry on the part of the human (e.g. they would have to write the display name twice).
Thoughts on this? I can keep it with `twitter_1` and `twitter_2`; it just struck me as potentially more flexible to change approach.
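For what it's worth, even if the sheet keeps the wide `twitter_1`/`twitter_2` shape for the humans, the scraper could melt it into one row per account internally; a toy sketch (column names assumed):

```python
# One wide row as it might come out of the sheet.
wide = {
    "name": "Jane Doe",
    "party": "D",
    "twitter_1": "@janedoe",
    "twitter_2": "@senjanedoe",
}

# Melt the twitter_N columns into one record per account,
# skipping any that were left blank.
long_rows = [
    {"name": wide["name"], "party": wide["party"], "twitter": handle}
    for key, handle in wide.items()
    if key.startswith("twitter_") and handle
]
```

That would sidestep the duplicate-data-entry problem while still allowing more than two accounts later (just add a `twitter_3` column).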
The old system allows the team to maintain the list of desired Twitter accounts in a Google Sheet, and fetches it at scrape time. The new system stores the desired accounts locally in Postgres, and we plan to periodically update this local list from the same Sheets. (It's like a different scrape pipeline, hey neat!)
Note that the same account could be in multiple lists (say, National and North Carolina) and so any action we take (adding, deactivating, reactivating) should be specific to their presence/absence on that specific list.
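A sketch of that list-scoped sync logic. The names here (`sync_list`, a state dict keyed by `(handle, list_name)`) are invented for illustration; the point is that every action carries the list it applies to, so deactivating someone on one list leaves their other memberships alone:

```python
def sync_list(list_name: str, desired: set, current: dict) -> dict:
    """Diff a sheet's handles against stored state for ONE list.

    current maps (handle, list_name) -> active flag.
    Returns actions keyed the same way, scoped to list_name only.
    """
    actions = {}
    for handle in desired:
        key = (handle, list_name)
        if key not in current:
            actions[key] = "add"
        elif not current[key]:
            actions[key] = "reactivate"
    for (handle, lname), active in current.items():
        if lname == list_name and active and handle not in desired:
            actions[(handle, lname)] = "deactivate"
    return actions
```

The same shape works whether the state lives in Postgres or in memory; the essential part is that `(handle, list)` is the unit of membership, not `handle` alone.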