patrick-lee-warren closed this issue 6 years ago
Do you have any experience with a database? I can get you set up with PostgreSQL if you'd like. You should consider giving it a shot. Your version has some improvements; I can help you make it better.
BTW, you should not be distributing text as zips or otherwise compressed on GitHub; it gets in the way of delta creation.
No experience with DBs. Tough to see if it's worth the startup cost. Re: .zips, it's odd that GitHub doesn't have a utility to unpack them. From my home connection, it's just a huge pain to send big text files. I'll change them to uncompressed .csvs tomorrow at work.
I can do better. I'll just create a chunked CSV and a schema for you. If you want to use PostgreSQL you can; if not, you can do whatever. If they're all in the database, you would simply dump them to a single file and use split.
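For concreteness, a minimal sketch of that dump-and-split step, assuming a table named tweets in a database called troll_tweets (both names are placeholders, not taken from the actual schema):

```sh
# Dump the whole table to one CSV, then cut it into pieces that stay
# under GitHub's 100 MB per-file limit.
psql troll_tweets -c "\copy tweets TO 'tweets_all.csv' WITH (FORMAT csv, HEADER)"

# -C splits on line boundaries so no row is cut in half (GNU split);
# output files look like tweets_chunk_aa.csv, tweets_chunk_ab.csv, ...
split -C 95M --additional-suffix=.csv tweets_all.csv tweets_chunk_
```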
I just work with the whole file, but some folks asked in a prior issue that it be cut into pieces under 100 MB, so I tried to keep it that way.
I've got my version going up now.
https://github.com/EvanCarroll/russian-troll-tweets
This is self-hosted: there are dump files in the base of the project that people can load into PostgreSQL. Those files can be dumped back out again using dump.sh. I'm clustering by date.
If you install PostgreSQL and want to load the database, just jump into the directory and run the run.psql script. It'll set up the schema, load the data, and configure the indexes.
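Roughly, the whole setup looks like this (a sketch only: the database name is a placeholder, the directory run.psql lives in may differ, and the script may handle database creation itself):

```sh
# Grab the repo and load everything into a fresh database.
git clone https://github.com/EvanCarroll/russian-troll-tweets
cd russian-troll-tweets

createdb troll_tweets             # placeholder name; adjust to taste
psql -d troll_tweets -f run.psql  # sets up the schema, loads the data, builds the indexes
```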
In this dump there are 28,105 duplicate tweets, such as:
593916934229340161
593917011974950912
593918224883941376
593918272791310337
593918312431669248
593918330626510848
593918374264119296
593918387157336065
593918435081461761
593918450155786240
593918485736116224
593918533450473472
593918546473803776
593918573459943424
593918593735229440
593918637175615488
593918685053591552
593918733061595136
593918772496506880
593918803525926913
593918854935490560
593918904856125440
593918923348824064
593918950074937345
593918963010142208
593919059499962370
593925710051344384
593926455676968961
593927023514537984
593927058612428800
593927089008549888
593927102921101312
593927116716187650
593927160391475200
593927241043734529
593927246848630784
593927253693747200
Can you explain that? Or should I clean it up and delete one of each duplicate pair?
One of each pair should be dropped. The duplicates probably arose from the 50k/day download limit: when we cut into chunks by date, we may have accidentally overlapped our windows by a minute or two.
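A sketch of that cleanup in PostgreSQL, again assuming placeholder names (a tweets table with a tweet_id column; the real schema may differ):

```sh
psql troll_tweets <<'SQL'
-- how many tweet ids occur more than once
SELECT count(*)
FROM (SELECT tweet_id FROM tweets GROUP BY tweet_id HAVING count(*) > 1) AS dupes;

-- build a copy that keeps exactly one row per tweet_id
CREATE TABLE tweets_dedup AS
SELECT DISTINCT ON (tweet_id) *
FROM tweets
ORDER BY tweet_id;
SQL
```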
I tried to change the .zips to .csvs, but GitHub won't let me upload files above 25 MB.
@patrick-lee-warren I converted the zips to CSVs for you; just pull from https://github.com/patrick-lee-warren/russian-troll-tweets/pull/1
You need a script similar to this one to dump the text files out in chunks under 100 MB.
https://github.com/EvanCarroll/russian-troll-tweets/blob/version_2/PostgreSQL/dump.sh
I'm new to GitHub, but I hope I did this right. I forked the original repository and created a new version that fixes many of the problems pointed out, including the rounding problem, and adds several new variables. I put in a pull request asking 538 to update the main branch with it, but ???
It's https://github.com/patrick-lee-warren/russian-troll-tweets/tree/Version_2 if you want to play around with it.