DisasterMasters / TweetAnalysis

Repo for analyzing tweets from the TweetScraper repo

The DATA directory

This directory contains florida_data, which holds all of the raw data collected by the TweetScraper. The data is divided into subdirectories according to whether the tweets come from a government, media, utility, or nonprofit source.

The FLORIDA directory

The top level of this directory contains all of the code that is actively being used for machine learning and visualization. The programs must be run separately for every tweet source (gov, media, utility, or nonprofit), since we run a separate analysis on each source.

florida/doc2vecmodels

Stores the models generated by doc2vec.py

florida/results

Stores the results from the supervised and unsupervised methods, the graphs generated from those results, and the indices from the TF-IDF matrix used for NMF

florida/tmp

Miscellaneous programs that might be useful later on. Contains programs that randomize output files for manual coding, along with some smaller files used to test code

florida/training_data

Contains the files used for supervised/unsupervised prediction, generated by lex_dates.py; these are the files we predict categories on. Also contains the original CrisisLex txt file.

florida/training_data/supervised_data

Contains the manually classified files sent from Xiaojing. Make sure the tweets are in the first column and the manual code is in the second column! (You may have to rearrange the columns in the file, or adapt the code, depending on the format of the file sent; see the sketch below.) The manually coded files are separated into subdirectories according to their source.
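
If a coded file arrives with the columns in the wrong order, a small one-off script can swap them before the file goes into supervised_data. The sketch below is only an illustration: the file names are hypothetical, and it assumes a plain CSV with the manual code in the first column and the tweet text in the second.

```python
import csv

# Hypothetical file names -- replace with whatever file was actually sent.
src = "coded_tweets_original.csv"
dst = "coded_tweets_swapped.csv"

with open(src, newline="", encoding="utf-8") as fin, \
     open(dst, "w", newline="", encoding="utf-8") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        # Swap the first two columns so the tweet text ends up first
        # and the manual code second, as the supervised code expects.
        if len(row) >= 2:
            row[0], row[1] = row[1], row[0]
        writer.writerow(row)
```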

florida/useless

Some useless files that I just didn't have the heart to rm... They are not being used right now, but I kept them around just in case.

florida/webhome

Contains the index.html file I use on my EECS web directory.

Recommendations/Workflow

Right now I am running these files manually, but I would suggest writing a shell script to automate typing the command-line arguments (a rough driver sketch is included after the workflow steps below).

With Puerto Rico data coming soon, I would recommend reworking the directory structure into something that makes more sense and makes it easier to reach both the Florida and the Puerto Rico tweets.
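
One possible layout (just an illustration, nothing in the repo is organized this way yet) would put each disaster at the same level, with the same source subdirectories underneath:

```
DATA/
  florida/
    gov/  media/  utility/  nonprofit/
  puerto_rico/
    gov/  media/  utility/  nonprofit/
```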

When receiving new data, put it into the DATA directory under its appropriate source.

Then, run lex_dates.py on the source you want to analyze.

After running lex_dates.py, select which method you want (supervised/unsupervised), and run it.

If you chose the unsupervised method, you will also have to run represent.py.

Then, depending on whether you chose the supervised or the unsupervised method, run sup_graphs.py or un_graphs.py to get your visualizations.
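
As a starting point for automating the steps above, here is a rough Python driver for the unsupervised path of the pipeline (a shell script would work just as well). The script names come from this README, but their command-line interfaces are not documented here, so passing the source name as a single argument is an assumption, and unsupervised.py is a placeholder for whatever unsupervised-method script you actually run.

```python
#!/usr/bin/env python3
"""Sketch: run the unsupervised pipeline for one tweet source."""
import subprocess
import sys


def run(script, *args):
    # Run one pipeline step and stop everything if it fails.
    print(f"== running {script} {' '.join(args)}")
    subprocess.run([sys.executable, script, *args], check=True)


if __name__ == "__main__":
    # gov, media, utility, or nonprofit
    source = sys.argv[1] if len(sys.argv) > 1 else "gov"
    run("lex_dates.py", source)     # build the prediction files for this source
    run("unsupervised.py", source)  # placeholder: the unsupervised-method script
    run("represent.py", source)     # needed only for the unsupervised path
    run("un_graphs.py", source)     # visualizations of the unsupervised results
```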