Prepare dataset for public release

abenton commented 4 years ago

Deliverables

Directory containing all data needed to reproduce the paper. All files should be under /exp/abenton/twitter_brand_workspace_20190417/ (I no longer have access to the grid).
- Table with static user metadata: crowd-sourced fields (both reconciled labels and individual crowdworker labels) and fields from the initial sample of Twitter profiles.
- Dynamic table with all features in our analysis. Each row should contain the follower count change for each horizon and each covariate for each history window.
- Release tweet IDs for each user along with timestamps. If we can figure out how to have users download up to 50K tweets per day, then we can also publish the raw statuses (roughly 600K tweets in total in our data, which should take a user about 12 days to download). Relevant Twitter terms of service: https://developer.twitter.com/en/developer-terms/agreement-and-policy#id8
- README describing what each column in each file means and how it relates to the paper.
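
To sanity-check the 12-day estimate in the tweet ID item above, here is a minimal Python sketch of splitting the released IDs into daily download batches. It assumes the 50K-per-day cap mentioned above; `daily_batches`, `days_needed`, and `DAILY_CAP` are hypothetical names for illustration and are not part of any existing release script.

```python
from typing import Iterator, List, Sequence

# Assumption: per-user daily download cap taken from the issue text (50K tweets/day).
DAILY_CAP = 50_000


def days_needed(n_tweets: int, cap: int = DAILY_CAP) -> int:
    """Number of days needed to fetch n_tweets at `cap` tweets per day."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (n_tweets + cap - 1) // cap


def daily_batches(tweet_ids: Sequence[int], cap: int = DAILY_CAP) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `cap` tweet IDs, one chunk per day."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    for start in range(0, len(tweet_ids), cap):
        yield list(tweet_ids[start:start + cap])


if __name__ == "__main__":
    # Roughly 600K tweets at 50K per day.
    print(days_needed(600_000))
```

A user would pass each day's chunk to whatever hydration tool they use. The sketch only plans the schedule and does not call the Twitter API.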

Reproducibility Bonus:

Include code to rerun the analyses on the table of extracted features. A reader of the paper should be able to pull the data from GitHub, run a shell script we provide, and reproduce our results table.

bellecarrell commented 4 years ago

Update:

I no longer have a CLSP grid account. I emailed the new admin to see about getting mine restored so I have COE grid access.
I thought GitHub had a size limit on repos and therefore was theoretically and practically best only for source code and not data. Has this changed? If not, where will we host the data?

bellecarrell commented 4 years ago

The content is almost finished. I have attached the current draft. README.txt