DisasterMasters / TweetAnalysis

Repo for analyzing tweets from the TweetScraper repo

The DATA directory

This directory contains florida_data, which holds all of the raw data collected by the TweetScraper. The data is divided into subdirectories according to whether the tweets come from a government, media, utility, or nonprofit source.

The FLORIDA directory

The top level of this directory contains all of the code that is actively being used for machine learning and visualization. The programs must be run separately for every tweet source (gov, media, utility, or nonprofit), since we run a separate analysis on each source.

florida/doc2vecmodels

Stores the models generated by doc2vec.py

florida/results

Stores the results from the supervised and unsupervised methods, the graphs generated from those results, and the indices from the TF-IDF matrix used for NMF

florida/tmp

Miscellaneous programs that might be useful later on. Contains programs that randomize output files for manual coding, along with some smaller files used to test code

florida/training_data

Contains the files used for supervised/unsupervised prediction, generated by lex_dates.py; these are the files we predict categories on. Also contains the original CrisisLex txt file.

florida/training_data/supervised_data

Contains the manually classified files sent from Xiaojing. Make sure the tweets are in the first column and the manual code is in the second column! (You may have to rearrange the columns in the file, or adapt the code, depending on the format of the file sent; see the sketch below.) The manually coded files are separated into subdirectories according to their source.
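
If a coded file arrives with the columns in the wrong order, a small one-off script can swap them before the file goes into supervised_data. The sketch below is only an illustration: the file names are hypothetical, and it assumes a plain CSV with the manual code in the first column and the tweet text in the second.

```python
import csv

# Hypothetical file names -- replace with whatever file was actually sent.
src = "coded_tweets_original.csv"
dst = "coded_tweets_swapped.csv"

with open(src, newline="", encoding="utf-8") as fin, \
     open(dst, "w", newline="", encoding="utf-8") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        # Swap the first two columns so the tweet text ends up first
        # and the manual code second, as the supervised code expects.
        if len(row) >= 2:
            row[0], row[1] = row[1], row[0]
        writer.writerow(row)
```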

florida/useless

Some useless files that I just didn't have the heart to rm... They are not being used right now, but I kept them around just in case.

florida/webhome

Contains the index.html file I use on my EECS web directory.

Recommendations/Workflow

Right now I am running these files manually, but I would suggest writing a shell script to automate typing the command-line arguments (a rough driver sketch is included after the workflow steps below).

With Puerto Rico data coming soon, I would recommend reworking the directory structure into something that makes more sense and makes it easier to reach both the Florida and the Puerto Rico tweets.
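
One possible layout (just an illustration, nothing in the repo is organized this way yet) would put each disaster at the same level, with the same source subdirectories underneath:

```
DATA/
  florida/
    gov/  media/  utility/  nonprofit/
  puerto_rico/
    gov/  media/  utility/  nonprofit/
```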

When receiving new data, put it into the DATA directory under its appropriate source.

Then, run lex_dates.py on the source you want to analyze.

After running lex_dates.py, select which method you want (supervised/unsupervised), and run it.

If you chose the unsupervised method, you will also have to run represent.py.

Then, depending on whether you chose the supervised or the unsupervised method, run sup_graphs.py or un_graphs.py to get your visualizations.
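
As a starting point for automating the steps above, here is a rough Python driver for the unsupervised path of the pipeline (a shell script would work just as well). The script names come from this README, but their command-line interfaces are not documented here, so passing the source name as a single argument is an assumption, and unsupervised.py is a placeholder for whatever unsupervised-method script you actually run.

```python
#!/usr/bin/env python3
"""Sketch: run the unsupervised pipeline for one tweet source."""
import subprocess
import sys


def run(script, *args):
    # Run one pipeline step and stop everything if it fails.
    print(f"== running {script} {' '.join(args)}")
    subprocess.run([sys.executable, script, *args], check=True)


if __name__ == "__main__":
    # gov, media, utility, or nonprofit
    source = sys.argv[1] if len(sys.argv) > 1 else "gov"
    run("lex_dates.py", source)     # build the prediction files for this source
    run("unsupervised.py", source)  # placeholder: the unsupervised-method script
    run("represent.py", source)     # needed only for the unsupervised path
    run("un_graphs.py", source)     # visualizations of the unsupervised results
```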