bellecarrell closed 5 years ago
@abenton questions:
Storing and manipulating this data is a hard problem. I would write all tweets to a single table with user ID, date, and tweet text, and keep user information in a separate table to join when necessary. If the tweet table is too big to deal with easily, then I would split it into separate tables by user ID, all sitting in one directory.
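A minimal sketch of the two-table layout, assuming SQLite as the store (the thread does not say which backend to use). The table names, column names, and sample rows are hypothetical and only there for illustration.

```python
import sqlite3

# Hypothetical schema: table/column names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id     INTEGER PRIMARY KEY,
        screen_name TEXT
    );
    CREATE TABLE tweets (
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        created_at TEXT    NOT NULL,  -- ISO 8601, so text order is time order
        text       TEXT    NOT NULL
    );
    CREATE INDEX idx_tweets_user_date ON tweets (user_id, created_at);
""")

conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [
        (1, "2018-01-02T10:00:00", "second"),
        (2, "2018-01-01T09:00:00", "hello"),
        (1, "2018-01-01T08:00:00", "first"),
    ],
)

# Join user info onto one user's tweets, ordered by time.
rows = conn.execute(
    """
    SELECT u.screen_name, t.created_at, t.text
    FROM tweets AS t
    JOIN users AS u ON u.user_id = t.user_id
    WHERE t.user_id = ?
    ORDER BY t.created_at
    """,
    (1,),
).fetchall()
```

If the single tweet table gets too large, the same schema could be repeated in one database file per user ID inside a single directory.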
Once we have extracted features, the time series can be written to a single compressed NumPy file with `numpy.savez_compressed`: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.savez_compressed.html This will be the easiest format to work with.
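A rough sketch of the save/load round trip with `numpy.savez_compressed`. The array names, shapes, and the file name `timeseries.npz` are assumptions made up for this example, not the project's actual layout.

```python
import os
import tempfile

import numpy as np

# Hypothetical arrays: 2 users, 3 days, 4 features per day.
user_ids = np.array([1, 2])
follower_counts = np.array([[100, 110, 121], [50, 50, 45]], dtype=np.int64)
features = np.zeros((2, 3, 4), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "timeseries.npz")  # illustrative file name

    # Each keyword argument becomes a named array inside the .npz archive.
    np.savez_compressed(
        path,
        user_ids=user_ids,
        follower_counts=follower_counts,
        features=features,
    )

    # Arrays are decompressed lazily, on first access by name.
    with np.load(path) as data:
        loaded_keys = sorted(data.files)
        loaded_counts = data["follower_counts"]
```

Keeping all series in one archive keyed by name means downstream code only needs the path and the array names.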
Worked on test data. Now running a job on the grid over all data under /exp/abenton/twitter_brand_data/
Collect the following parallel time series for each user: (1) the tweets ego made, ordered by time, and (2) a daily success metric (unnormalized follower count delta, % change in follower count, Lampos success metric) with a horizon of $\{1, 2, 7, 30\}$ days. Hopefully this is generic enough that we can bin tweets in different ways.
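A sketch of the first two success metrics, assuming one follower count per day and a forward-looking horizon (the change from day `t` to day `t + horizon`). The direction of the horizon and the function name `follower_deltas` are assumptions. The Lampos success metric is left out because its definition is not given here.

```python
import numpy as np

HORIZONS = (1, 2, 7, 30)  # days, as listed above


def follower_deltas(counts, horizon):
    """Return (raw delta, percent change) from day t to day t + horizon.

    Days without a count `horizon` days ahead are filled with NaN.
    Assumes horizon >= 1 and one follower count per day.
    """
    counts = np.asarray(counts, dtype=float)
    delta = np.full(counts.shape, np.nan)
    pct = np.full(counts.shape, np.nan)
    if horizon < len(counts):
        delta[:-horizon] = counts[horizon:] - counts[:-horizon]
        # A zero starting count gives inf/NaN instead of raising a warning.
        with np.errstate(divide="ignore", invalid="ignore"):
            pct[:-horizon] = 100.0 * delta[:-horizon] / counts[:-horizon]
    return delta, pct


daily_counts = [100, 110, 121, 121]  # made-up follower counts for one user
metrics = {h: follower_deltas(daily_counts, h) for h in HORIZONS}
delta1, pct1 = metrics[1]
```

Because each metric stays aligned to the day index, the tweets can be binned by day, week, or any other window afterwards without recomputing the metric.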