Build table for fitting regression models

abenton commented 5 years ago

Need to build tables of independent and dependent variables, where each column is either a feature the regression model is trained on or a dependent variable we are trying to predict. Each row in these tables corresponds to a single user measured on a specific day, with features computed over the preceding N days (the aggregation window). Since we want to vary the aggregation window and possibly the frequency at which we sample users, we'll need to generate several of these tables. To start, the aggregation window can be {1, 7, 14, 21, 28} days and the sampling frequency can be daily.

==Independent variables==

The controls and hypotheses are listed in issue #102.

==Dependent variables==

Follower count change over the next {1, 2, 3, ..., 7, 14, 21, 28} days (the horizon). We may have to adjust this by predicting % change in follower count rather than raw follower count change.
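As a rough illustration of the two candidate targets, here is a minimal sketch that computes both the raw and the % change for one user and one horizon. The function name `follower_change` and the date-to-count dict are hypothetical, not something that exists in the repo, and the follower counts are made-up numbers.

```python
from datetime import date, timedelta


def follower_change(counts, day, horizon_days):
    """Return (raw_change, pct_change) from `day` to `day + horizon_days`.

    `counts` maps datetime.date -> follower count (hypothetical structure).
    Returns None if either date is missing. pct_change is None when the
    starting count is 0, since % change is undefined there.
    """
    end = day + timedelta(days=horizon_days)
    if day not in counts or end not in counts:
        return None
    start_count, end_count = counts[day], counts[end]
    raw = end_count - start_count
    pct = 100.0 * raw / start_count if start_count else None
    return raw, pct


# Made-up counts for a single user.
counts = {date(2019, 4, 15): 200, date(2019, 4, 17): 210}
result = follower_change(counts, date(2019, 4, 15), 2)  # (10, 5.0)
```

Predicting % change keeps large and small accounts on a comparable scale, but it needs a rule for users who start at 0 followers; the sketch simply returns None for that case.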

It should be easy to train models if the data is formatted this way.

We should be able to compute these tables from the big table you currently generate (where each row is a tweet with relevant features). We just need to make sure this big table contains all the information needed to compute the features the models will be trained on.
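A minimal sketch of that aggregation step, assuming the tweet-level table has been loaded as a list of dicts. The keys `user_id`, `date`, and `sentiment` are hypothetical stand-ins for whatever columns the real table has, and skipping user-days with no tweets in the window is an assumption of this sketch, not a decision made in this thread.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def build_feature_rows(tweets, sample_days, window_days):
    """Aggregate tweet-level rows into one row per (user, sample day).

    A tweet counts toward a sample day if
    sample_day - window_days <= tweet date <= sample_day.
    """
    by_user = defaultdict(list)
    for t in tweets:
        by_user[t["user_id"]].append(t)

    rows = []
    for user_id in sorted(by_user):
        for day in sample_days:
            start = day - timedelta(days=window_days)
            in_window = [t for t in by_user[user_id] if start <= t["date"] <= day]
            if not in_window:
                continue  # assumption: skip user-days with no tweets in the window
            rows.append({
                "user_id": user_id,
                "date": day,
                "window_days": window_days,
                "tweet_count": len(in_window),
                "mean_sentiment": mean(t["sentiment"] for t in in_window),
            })
    return rows


# Made-up tweets for one user; the April 1 tweet falls outside the window.
tweets = [
    {"user_id": "u1", "date": date(2019, 4, 9), "sentiment": 0.5},
    {"user_id": "u1", "date": date(2019, 4, 14), "sentiment": -0.5},
    {"user_id": "u1", "date": date(2019, 4, 1), "sentiment": 1.0},
]
rows = build_feature_rows(tweets, [date(2019, 4, 15)], window_days=7)
```

Calling this once per aggregation window gives one table per window, and any other per-tweet feature can be aggregated the same way as `sentiment`.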

bellecarrell commented 5 years ago

/exp/acarrell/twitter_brand/promoting_users/timeline

abenton commented 5 years ago

Tweets filtered and joined with self-promoting users are written here:

/exp/abenton/twitter_brand_workspace_20190417/promoting_user_tweets.merged_with_user_info.noduplicates.tsv.gz

abenton commented 5 years ago

Important things to keep in mind when fitting models:

Need to ensure that our examples contain only users that are still active. We should talk about what criteria to use for declaring a user inactive.
We can create separate experiments for predicting when a blogger will become inactive, or for predicting those users who have bought followers (and look for clues as to why they did not retain them).

abenton commented 5 years ago

Example:

We are computing features with a 7-day aggregation window for April 15, with a horizon of 2 days:

Independent -- compute average sentiment etc. from April 8 - April 15
Dependent -- compute change in followers from April 15 - April 17

We will build models to predict the dependent variable (change in future follower count) from past behavior (independent variables).
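The date arithmetic in this example can be sketched as follows. The helper name `window_and_horizon` is hypothetical, and the year 2019 is an assumption made only so the dates can be constructed.

```python
from datetime import date, timedelta


def window_and_horizon(sample_day, window_days, horizon_days):
    """Return (feature_start, sample_day, target_end) for one table row."""
    feature_start = sample_day - timedelta(days=window_days)
    target_end = sample_day + timedelta(days=horizon_days)
    return feature_start, sample_day, target_end


# 7-day aggregation window and 2-day horizon for April 15 (year assumed).
feature_start, sample_day, target_end = window_and_horizon(date(2019, 4, 15), 7, 2)
# feature_start is April 8 and target_end is April 17
```

Whether the endpoints (April 8 and April 15) are both included in the feature window is not settled by the example, so that convention should be fixed before generating tables.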

bellecarrell / twitter_brand

Build table for fitting regression models #106