Love the postgres loader, but we're going to try to keep this repo clean and simple, so I won't merge it into master. Glad to see some cool forks with great new features and tools for analysis though. Thank you for your work.
Of course, it's easy to tell you where the integrity violations are when you're self-hosting and have a simple system to ensure integrity. Now you still have to remove the duplicates again, and after you do that you'll have to push up a totally new copy of the data files.
If you need examples, look for these tweet IDs:
psql:load.psql:11: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(612233279027064832) already exists.
CONTEXT: COPY tweets, line 99245
COPY 233540
psql:load.psql:13: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(626060785005953025) already exists.
CONTEXT: COPY tweets, line 20505
psql:load.psql:14: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(669263743784583168) already exists.
CONTEXT: COPY tweets, line 200111
psql:load.psql:15: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(614858663417802752) already exists.
CONTEXT: COPY tweets, line 103929
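For anyone who wants to find these without a database, here is a minimal sketch that counts repeated IDs across the data files. It assumes the files are CSVs with a header row and that the ID column is named `tweet_id` (taken from the key name in the errors above; the actual CSV header may differ). The function name `find_duplicate_ids` is hypothetical.

```python
import csv
from collections import Counter


def find_duplicate_ids(csv_paths, id_column="tweet_id"):
    """Return {id: count} for every id seen more than once across the files.

    Assumption: each file is a CSV with a header row containing `id_column`.
    IDs are kept as strings, exactly as they appear in the files.
    """
    counts = Counter()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                counts[row[id_column]] += 1
    # Keep only the ids that occur more than once.
    return {tweet_id: n for tweet_id, n in counts.items() if n > 1}
```

Passing it the list of data file paths returns a dictionary of the repeated IDs and how often each occurs, which can be checked against the keys reported in the log.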
The schema that ensures this never happens is tucked away in a little folder for anyone who wants to use it, and it is only 40k.
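To illustrate the idea, here is a rough sketch of how a primary key on `tweet_id` rejects a repeated ID at load time. It uses Python's built-in `sqlite3` as a stand-in for Postgres, and the two-column table and the `load_rows` helper are hypothetical; this is not the schema from the PR.

```python
import sqlite3


def load_rows(rows):
    """Insert (tweet_id, content) rows into a table keyed on tweet_id.

    Returns the tweet_ids that the primary-key constraint rejected.
    Assumption: an in-memory SQLite table stands in for the real schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, content TEXT)"
    )
    rejected = []
    for tweet_id, content in rows:
        try:
            conn.execute(
                "INSERT INTO tweets (tweet_id, content) VALUES (?, ?)",
                (tweet_id, content),
            )
        except sqlite3.IntegrityError:
            # The database refuses a second row with the same key.
            rejected.append(tweet_id)
    conn.commit()
    conn.close()
    return rejected
```

Unlike this row-by-row sketch, the `COPY` commands in the log stop at the first violation in each file, which is why each failing file reports a single duplicate key.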
But this is the problem: your repo isn't clean. It's massive, and it is very hard to account for integrity errors. Take duplicate tweets, for instance. I cleaned them all up, and #29 has the commit: https://github.com/fivethirtyeight/russian-troll-tweets/pull/29/commits/fb5979762dca592109f919e4c805d0fb985aa9a9
GitHub won't render it, but try,