bellecarrell opened this issue 5 years ago
From Lampos:
From reviewers and other sources (and not already mentioned):
@abenton list above. Let me know if you need anything else for ranking.
Adrian Benton (Sat, Apr 6, 2019, 7:15 PM):
Most of these features would be things to control for. "Words" and "topics" are not necessarily hypotheses we can test; stating that we believe that posting more messages about topic "X", or using more of a certain class of words, attracts more followers is a hypothesis. To start, we can look at the following controls and hypotheses to test.
== Control Features ==
- Current log(follower count)
- Current log(friend count)
- Current "user impact score"
- Whether the user has enabled geolocation
- Main domain of specialization
- Any other features you can think of to control for
== Hypotheses to test ==
- posting at specific times of the day -> more followers
- fewer RTs (a proxy for more original content) -> more followers
- an appropriate posting frequency (e.g. 3-5) -> more followers
- more direct messages and @-mentions (a proxy for engaging with the audience) -> more followers
- a higher % of days with a post (posting regularly) -> more followers
- including more URLs in tweets (a proxy for introducing original blog content) -> more followers [Twitter user impact paper]
- "interactivity, defined by an intersection of accounts that tweet regularly, do many @-mentions and @-replies, but also mention many different users" -> more followers [Twitter user impact paper]
- more positive-sentiment posts -> more followers
- a diverse range of topics -> more followers [cite Twitter user impact paper]
For each of these hypotheses we should cite existing advice, ideally from papers; webpages are fine if we cannot find any formal articles.
Models can be built varying the number of control features and hypotheses, the history used to compute the features, and the time period over which the change in follower count is measured.
See Tables 3 and 5 of https://faculty.wharton.upenn.edu/wp-content/uploads/2013/02/SpoilerEffect.pdf for one way to present weights and significance for multiple regression analyses. Table 3, in particular, is a good example: five different regression models are fit, varying the number of hypotheses tested.
Annabelle Carrell:
This is a good list to start with. Are these ranked, or are they all at the same priority level?
In terms of tasks, I'm thinking about:
Adrian Benton (Sun, Apr 7, 2019, 9:20 AM):
-1. Before these tasks, you should update the data with the full 7 months of tweets and user information.
-1.1. Before computing features for each of these hypotheses, attach a citation to an article and a snippet to support the hypothesis.
- Time-of-day posting shouldn't be hard. We need to see what the recommended time to post tweets is, and then attach a binary feature to each tweet indicating whether or not it falls in this time range (either w.r.t. Eastern time or the user's time zone).
Annabelle Carrell:
Okay, I'll start on -1 tomorrow.
I don't think I have anything to meet about tomorrow. I'll probably run into issues trying to add the new data since it's in a new directory, so I might ask questions about that.
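The time-of-day binary feature could be sketched as below. The 9:00-15:00 Eastern window is only a placeholder assumption until the actually recommended posting time is looked up:

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Placeholder window -- the recommended posting window still needs to be
# looked up; 9:00-15:00 Eastern is an assumption, not a sourced value.
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)
EASTERN = ZoneInfo("America/New_York")

def posted_in_window(created_at_utc: datetime) -> int:
    """1 if the tweet's local (Eastern) time falls inside the window, else 0."""
    local = created_at_utc.astimezone(EASTERN)
    return int(WINDOW_START <= local.time() <= WINDOW_END)

# 14:00 UTC on 2019-04-07 is 10:00 EDT, inside the placeholder window.
print(posted_in_window(datetime(2019, 4, 7, 14, 0, tzinfo=ZoneInfo("UTC"))))  # 1
```

Swapping `EASTERN` for the user's own time zone covers the other variant mentioned in the comment.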
Anatomy of a tweet object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html
@bellecarrell Features to extract for hypotheses:
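Several of the per-tweet indicators can be read straight out of the raw tweet JSON. A sketch follows; the field names come from the v1.1 tweet-object data dictionary linked above, but the exact feature set chosen here is illustrative:

```python
def tweet_features(tweet: dict) -> dict:
    """Per-tweet indicators for several hypotheses, from a v1.1 tweet object.

    Illustrative subset -- the final feature list is up to the project.
    """
    entities = tweet.get("entities", {})
    return {
        "is_retweet": int("retweeted_status" in tweet),        # fewer-RTs hypothesis
        "n_mentions": len(entities.get("user_mentions", [])),  # engagement hypothesis
        "n_urls": len(entities.get("urls", [])),               # URL hypothesis
        "is_reply": int(tweet.get("in_reply_to_status_id") is not None),
    }

# Example on a minimal tweet dict:
t = {"entities": {"user_mentions": [{"screen_name": "abenton"}], "urls": []},
     "in_reply_to_status_id": None}
print(tweet_features(t))  # {'is_retweet': 0, 'n_mentions': 1, 'n_urls': 0, 'is_reply': 0}
```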
@bellecarrell Also save the # of tweets made overall. When computing entropy, make sure to smooth the distribution -- you can just do add-\delta smoothing where \delta is smallish (e.g. 0.1). In case we need to go back and recompute entropy, I would also write out the distributions you compute entropy over, so we can try different smoothing schemes.
I am worried about cases where the blogger may have just posted a single tweet, in which case entropy will be 0 if unsmoothed.
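The add-\delta smoothing suggested above can be sketched as follows; the vocabulary argument and the \delta default of 0.1 are illustrative:

```python
import math
from collections import Counter

def smoothed_entropy(counts: Counter, vocab, delta: float = 0.1) -> float:
    """Entropy (nats) of an add-delta smoothed distribution over vocab."""
    total = sum(counts.values()) + delta * len(vocab)
    probs = [(counts.get(w, 0) + delta) / total for w in vocab]
    return -sum(p * math.log(p) for p in probs)

# A blogger with a single post gets entropy 0 unsmoothed;
# smoothing keeps it strictly positive.
single = smoothed_entropy(Counter({"topic_a": 1}), ["topic_a", "topic_b"])
uniform = smoothed_entropy(Counter({"topic_a": 5, "topic_b": 5}), ["topic_a", "topic_b"])
print(single > 0.0, abs(uniform - math.log(2)) < 1e-9)  # True True
```

Writing out the raw `counts` alongside the entropy, as suggested above, means `delta` can be changed later without recomputing the distributions.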
@bellecarrell Another corner case to consider: when computing the binary feature "had post on Friday", you should leave this feature null if there was no Friday in the past aggregation window. That way, we will drop examples where we cannot test this hypothesis.
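The null-when-untestable rule above might look like this (function name and signature are hypothetical):

```python
from datetime import date, timedelta
from typing import Optional

def had_post_on_friday(post_dates, window_start: date, window_end: date) -> Optional[int]:
    """1/0 if any post fell on a Friday in the window; None if the window has no Friday."""
    n_days = (window_end - window_start).days + 1
    days = [window_start + timedelta(d) for d in range(n_days)]
    fridays = {d for d in days if d.weekday() == 4}  # Monday == 0, Friday == 4
    if not fridays:
        return None  # no Friday in the aggregation window: drop this example
    return int(any(d in fridays for d in post_dates))

# Mon 2019-04-01 through Wed 2019-04-03 contains no Friday.
print(had_post_on_friday([], date(2019, 4, 1), date(2019, 4, 3)))  # None
```

Returning `None` rather than 0 lets the downstream regression code distinguish "did not post on Friday" from "the hypothesis was untestable in this window".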
Assemble the list of features to evaluate using statsmodels.