abenton closed 5 years ago
@annabelle: I have started on high vs. low follower count by specialization, but I was using the percentile value and a cutoff of 70%. Should I calculate the values you had in mind instead (<33, 33-66, 66+)?
In the Lampos paper, they build a binary prediction task to discriminate between the bottom 25% and the top 10%. It would be great if we could do the same, but we don't have enough users for that.
What were the cutoffs for each of the follower count bins you had?
@annabelle: "what did you have in mind for train dev test split? I wasn't going to stratify by specialization and count unless the prediction task was"
I was thinking of splitting into 60% train / 20% dev / 20% test. You should stratify the sample by the follower count bins you had already formed.
@abenton I was just using the 70th percentile and up as "high" and everything else as "low"
@bellecarrell: No, the original 11 bins we picked
@abenton does that mean stratify before splitting?
Oh for the original bins? 20,30,...,90,95,99,100
Yeah, what was the follower count for each of those thresholds?
@bellecarrell: Yes, stratify before splitting. For example:
Do similarly for all other combinations of follower count and specialization bins
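A minimal sketch of what stratifying before splitting could look like, assuming each user is a tuple of (user id, specialization, follower bin). The function name `stratified_split`, the toy data, and the "travel" specialization are hypothetical and not taken from the project code; only the 60/20/20 fractions and the idea of splitting within each (specialization, follower bin) combination come from the thread.

```python
import random
from collections import defaultdict

def stratified_split(users, key, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split users into train/dev/test separately within each stratum."""
    rng = random.Random(seed)

    # Group users by stratum, e.g. (specialization, follower_bin).
    strata = defaultdict(list)
    for user in users:
        strata[key(user)].append(user)

    train, dev, test = [], [], []
    for stratum in sorted(strata):  # sorted so the result is reproducible
        members = strata[stratum]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_dev = round(fractions[1] * len(members))
        train.extend(members[:n_train])
        dev.extend(members[n_train:n_train + n_dev])
        test.extend(members[n_train + n_dev:])  # remainder goes to test
    return train, dev, test

# Hypothetical toy data: 10 users in each of 4 strata.
combos = [(s, b) for s in ("gastronomy", "travel") for b in (30, 70)] * 10
users = [(i, spec, fbin) for i, (spec, fbin) in enumerate(combos)]

train, dev, test = stratified_split(users, key=lambda u: (u[1], u[2]))
print(len(train), len(dev), len(test))
```

Because the split is done inside each stratum, every combination of specialization and follower bin appears in train, dev, and test in roughly the same proportion. Very small strata will not divide evenly, so the rounding rule used here is just one possible choice.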
@abenton
20: log_follower: 0.9542425094393248 follower: 8.999999999999998
30: log_follower: 1.2304489213782739 follower: 17.0
40: log_follower: 1.4771212547196626 follower: 30.00000000000001
50: log_follower: 1.6989700043360185 follower: 49.999999999999964
60: log_follower: 1.9242792860618814 follower: 83.99999999999994
70: log_follower: 2.181843587944772 follower: 151.99999999999986
80: log_follower: 2.469822015978163 follower: 295.0
90: log_follower: 2.8686444383948255 follower: 738.9999999999997
95: log_follower: 3.170364385489145 follower: 1480.3499231704395
99: log_follower: 3.70418050312626 follower: 5060.34938470123
100: log_follower: 5.287759193851589 follower: 193980.99999999994
"Do similarly for all other combinations of follower count and specialization bins"
@abenton
1) Does that mean that for the first step, "gather all X", we'd gather (gastronomy, 30) and do the split?
2) Do we only need to stratify based on what's being predicted? For example, if we're predicting main specialization, do we need to stratify by follower count?
@bellecarrell: Wow, I did not remember the follower count being that skewed! Let's build the following bins for predicting follower count:
Those less than the 0.7 quantile (<152 followers) vs. those with more followers than the 0.9 quantile (>739 followers). I think these correspond to two natural groups of users, those with 100 followers or so, and those with thousands of followers. Users in between the 0.7 and 0.9 quantile will be excluded from prediction.
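The binning rule above can be sketched as a small labeling function. The cutoffs 152 and 739 are the ones quoted in the thread; the names `LOW_MAX`, `HIGH_MIN`, and `follower_label`, and the sample counts, are hypothetical. Treating users exactly at a cutoff as excluded is an assumption that follows the strict `<` and `>` in the description.

```python
LOW_MAX = 152   # 0.7 quantile of follower count, as quoted in the thread
HIGH_MIN = 739  # 0.9 quantile of follower count, as quoted in the thread

def follower_label(follower_count):
    """Return "low", "high", or None for users excluded from prediction."""
    if follower_count < LOW_MAX:
        return "low"
    if follower_count > HIGH_MIN:
        return "high"
    return None  # between the 0.7 and 0.9 quantiles: excluded

# Hypothetical follower counts, for illustration only.
counts = [9, 150, 152, 500, 739, 740, 5060]
labeled = [(c, follower_label(c)) for c in counts
           if follower_label(c) is not None]
print(labeled)
```

Dropping the middle group leaves a gap between the two classes, which is the point of the design: the task becomes separating users with around a hundred followers from users with thousands.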
@bellecarrell:
@abenton That was across all users we collected, not just promoting ones. Those thresholds are what I used to create the original sample of the first 400 users from each bin, from which we then annotated users. I didn't remember it being that skewed either.
@abenton Okay, I'll get started on the split shortly.
@bellecarrell: RE follower count bins -- yes, that makes sense (find follower bins based on follower count distribution over all users mentioning blog in their description)
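As a rough illustration of deriving bins from the follower count distribution over all users, the sketch below computes cutoffs at the percentiles listed earlier in the thread (20, 30, ..., 90, 95, 99, 100) and assigns each user to one of the resulting 11 bins. The helper names `percentile` and `assign_bin`, the linear interpolation rule, the inclusive upper bin edges, and the toy counts are all assumptions; the thread does not say how the original thresholds were computed.

```python
import bisect
import math

PERCENTILES = [20, 30, 40, 50, 60, 70, 80, 90, 95, 99, 100]

def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def assign_bin(count, cutoffs):
    """Index of the first cutoff that is >= count (upper edges inclusive)."""
    return bisect.bisect_left(cutoffs, count)

# Hypothetical follower counts for ALL collected users, not only annotated ones.
all_counts = sorted(range(101))
cutoffs = [percentile(all_counts, q) for q in PERCENTILES]

for q, cutoff in zip(PERCENTILES, cutoffs):
    print(q, cutoff)
print(assign_bin(21, cutoffs))
```

Computing the cutoffs once over the full user population and then reusing them for the annotated subset keeps the bin definitions independent of which users happened to be annotated.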
Training models to predict follower count