bellecarrell / twitter_brand

In developing a brand on Twitter (and social media in general), how does what you say and how you say it correspond to positive results (more followers, for example)?

Predicting follower count #87

Closed. abenton closed this issue 5 years ago

abenton commented 5 years ago

Training models to predict follower count

abenton commented 5 years ago

@annabelle: I have started on high vs. low follower count by specialization, but I was using the percentile value and a cutoff of 70%. Should I calculate the values you had in mind instead (<33, 33-66, 66+)?

abenton commented 5 years ago

In the Lampos paper, they build a binary prediction task to discriminate between the bottom 25% and the top 10%. It would be great if we could do the same, but we don't have enough users for that.

What were the cutoffs for each of the follower count bins you had?

abenton commented 5 years ago

@annabelle: "What did you have in mind for the train/dev/test split? I wasn't going to stratify by specialization and count unless the prediction task was."

abenton commented 5 years ago

I was thinking of a 60% train / 20% dev / 20% test split. You should stratify the sample by the follower count bins you had already formed.
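
A minimal sketch of one way to do that split, assuming a pandas DataFrame with one row per user and a precomputed follower-bin column, and using scikit-learn's stratified split (all names here are illustrative, not from the repo):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_60_20_20(users: pd.DataFrame, bin_col: str = "follower_bin", seed: int = 42):
    """Stratified 60/20/20 train/dev/test split on a user table.

    `users` is assumed to have a column with the precomputed
    follower-count bin to stratify on.
    """
    # First carve off 60% for train, stratified by follower bin.
    train, rest = train_test_split(
        users, test_size=0.4, stratify=users[bin_col], random_state=seed
    )
    # Split the remaining 40% in half for dev and test.
    dev, test = train_test_split(
        rest, test_size=0.5, stratify=rest[bin_col], random_state=seed
    )
    return train, dev, test
```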

bellecarrell commented 5 years ago

@abenton I was just using the 70th percentile and up as "high" and everything else as "low"

abenton commented 5 years ago

@bellecarrell: No, the original 11 bins we picked

bellecarrell commented 5 years ago

@abenton does that mean stratify before splitting?

bellecarrell commented 5 years ago

Oh, for the original bins? 20, 30, ..., 90, 95, 99, 100

abenton commented 5 years ago

Yeah, what was the follower count for each of those thresholds?

abenton commented 5 years ago

@bellecarrell: Yes, stratify before splitting. For example:

Do similarly for all other combinations of follower count and specialization bins
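
A minimal sketch of what such a per-(SPECIALIZATION, FOLLOWER_COUNT_BIN) split could look like, assuming a pandas DataFrame with one row per user (column and helper names are illustrative, not from the thread):

```python
import numpy as np
import pandas as pd

def assign_folds(users: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Assign each user to train/dev/test, stratified by
    (specialization, follower_bin): gather each pair, shuffle,
    then split that group 60/20/20."""
    rng = np.random.RandomState(seed)
    users = users.copy()
    users["fold"] = None
    for _, idx in users.groupby(["specialization", "follower_bin"]).groups.items():
        idx = rng.permutation(np.asarray(idx))   # shuffle this group's row labels
        n_train = int(0.6 * len(idx))
        n_dev = int(0.2 * len(idx))
        users.loc[idx[:n_train], "fold"] = "train"
        users.loc[idx[n_train:n_train + n_dev], "fold"] = "dev"
        users.loc[idx[n_train + n_dev:], "fold"] = "test"
    return users
```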

bellecarrell commented 5 years ago

@abenton

| Percentile | log_follower | follower |
| --- | --- | --- |
| 20 | 0.9542425094393248 | 8.999999999999998 |
| 30 | 1.2304489213782739 | 17.0 |
| 40 | 1.4771212547196626 | 30.00000000000001 |
| 50 | 1.6989700043360185 | 49.999999999999964 |
| 60 | 1.9242792860618814 | 83.99999999999994 |
| 70 | 2.181843587944772 | 151.99999999999986 |
| 80 | 2.469822015978163 | 295.0 |
| 90 | 2.8686444383948255 | 738.9999999999997 |
| 95 | 3.170364385489145 | 1480.3499231704395 |
| 99 | 3.70418050312626 | 5060.34938470123 |
| 100 | 5.287759193851589 | 193980.99999999994 |
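
For reference, a sketch of how thresholds like these could be derived from raw follower counts (the function and the toy counts below are illustrative, not the project's actual code):

```python
import numpy as np

def follower_thresholds(follower_counts,
                        percentiles=(20, 30, 40, 50, 60, 70, 80, 90, 95, 99, 100)):
    """Return {percentile: (log10 threshold, follower threshold)} for the given counts."""
    log_followers = np.log10(np.asarray(follower_counts, dtype=float))
    cuts = np.percentile(log_followers, percentiles)
    # Exponentiating the log-space cutoff explains the floating-point noise above.
    return {p: (c, 10.0 ** c) for p, c in zip(percentiles, cuts)}

# Toy usage with made-up follower counts.
print(follower_thresholds([3, 9, 17, 30, 50, 84, 152, 295, 739, 1480, 5060]))
```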

bellecarrell commented 5 years ago

> Do similarly for all other combinations of follower count and specialization bins

@abenton

1. Does that mean that in the first step ("gather all X") we'd gather, e.g., (gastronomy, 30) and do the split?
2. Do we only need to stratify based on what's being predicted? For example, if we're predicting main specialization, do we need to stratify by follower count?

abenton commented 5 years ago

@bellecarrell: Wow, I did not remember the follower count being that skewed! Let's build the following bins for predicting follower count:

Those below the 0.7 quantile (<152 followers) vs. those above the 0.9 quantile (>739 followers). I think these correspond to two natural groups of users: those with around a hundred followers, and those with thousands. Users between the 0.7 and 0.9 quantiles will be excluded from prediction.
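
A small sketch of how that binarization could look, using the cutoffs from this comment (the DataFrame and column names are assumptions):

```python
import numpy as np
import pandas as pd

LOW_CUT = 152   # ~0.7 quantile of the follower count distribution -> "low"
HIGH_CUT = 739  # ~0.9 quantile -> "high"

def binarize_followers(users: pd.DataFrame, count_col: str = "follower_count") -> pd.DataFrame:
    """Keep only clearly low / clearly high users and label them.

    Users whose follower count falls between the 0.7 and 0.9 quantiles
    are dropped from the prediction task entirely.
    """
    keep = (users[count_col] < LOW_CUT) | (users[count_col] > HIGH_CUT)
    labeled = users[keep].copy()
    labeled["follower_label"] = np.where(labeled[count_col] < LOW_CUT, "low", "high")
    return labeled
```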

abenton commented 5 years ago

@bellecarrell:

  1. Yes, exactly: "gather all X" where X is a (SPECIALIZATION, FOLLOWER_COUNT_BIN) pair, then split.
  2. No, I don't think so. This splitting into train/dev/test will be done once, and we'll use the same folds for all prediction tasks. I like this because it ensures that we are training and testing on people who started with different initial popularity -- models we train should generalize across blogger popularity.

bellecarrell commented 5 years ago

@abenton Those thresholds were computed across all users we collected, not just the promoting ones. They are what I used to create the original sample of the first 400 users from each bin, which we then annotated. I didn't remember it being that skewed either.

bellecarrell commented 5 years ago

@abenton Okay, I'll get started on the split shortly.

abenton commented 5 years ago

@bellecarrell: RE the follower count bins -- yes, that makes sense (find the follower bins based on the follower count distribution over all users mentioning blog in their description).