bellecarrell / twitter_brand

In developing a brand on Twitter (and social media in general), how does what you say and how you say it correspond to positive results (more followers, for example)?

feature list for adrian #102

Open bellecarrell opened 5 years ago

bellecarrell commented 5 years ago

assemble list of features to evaluate using statsmodels

bellecarrell commented 5 years ago

From Lampos:

[image: feature list from the Lampos et al. user impact paper]

From reviewers and other sources (and not already mentioned):

bellecarrell commented 5 years ago

@abenton list above. Let me know if you need anything else for ranking

abenton commented 5 years ago

Most of these features would be things to control for. "Words" and "topics" are not necessarily hypotheses we can test; stating that we believe that posting more messages about topic "X", or using more of a certain class of words, attracts more followers is a hypothesis. To start, we can look at the following controls and hypotheses to test.

== Control Features ==

  • Current log(follower count)
  • Current log(friend count)
  • Current "user impact score"
  • User enabled geolocation
  • Main domain of specialization
  • Any other features you can think of to control for

== Hypotheses to test ==

  • posting at specific times of the day -> more followers
  • fewer RTs (proxy for more original content) -> more followers
  • appropriate frequency of posting (e.g. 3-5) -> more followers
  • more direct messages, @-mentions (proxy for engaging with audience) -> more followers
  • higher % of days with posting (posts regularly) -> more followers
  • including more URLs in tweets (proxy for introducing original blog content) -> more followers [Twitter user impact paper]
  • "interactivity, defined by an intersection of accounts that tweet regularly, do many @-mentions and @-replies, but also mention many different users" -> more followers [Twitter user impact paper]
  • more positive sentiment posts -> more followers
  • diverse range of topics -> more followers [Twitter user impact paper]

For each of these hypotheses we should cite existing advice, ideally from papers. Citations can just be webpages if we cannot find any formal articles.

Models can be built varying the number of control features and hypotheses, the history used to compute features, and the time period over which the change in follower count is measured.

See Tables 3 and 5 of https://faculty.wharton.upenn.edu/wp-content/uploads/2013/02/SpoilerEffect.pdf for one way to present weights + significance for multiple regression analyses. Table 3, in particular, is a good example: five different regression models are fit, varying the number of hypotheses tested.
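
A minimal sketch of what these nested regressions might look like with statsmodels (synthetic data; the column names are hypothetical placeholders, not our actual schema):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the real per-user table; column names are hypothetical.
df = pd.DataFrame({
    "follower_change": rng.normal(size=n),
    "log_followers": rng.normal(size=n),
    "log_friends": rng.normal(size=n),
    "impact_score": rng.normal(size=n),
    "pct_rts": rng.uniform(size=n),
    "posts_per_day": rng.poisson(4, size=n).astype(float),
})

controls = ["log_followers", "log_friends", "impact_score"]
hypotheses = ["pct_rts", "posts_per_day"]

# Fit nested models: controls only, then add hypothesis features one at a
# time, mirroring the one-column-per-model layout of Table 3.
for k in range(len(hypotheses) + 1):
    formula = "follower_change ~ " + " + ".join(controls + hypotheses[:k])
    fit = smf.ols(formula, data=df).fit()
    print(formula)
    print(pd.DataFrame({"coef": fit.params, "p": fit.pvalues}).round(3))
```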

bellecarrell commented 5 years ago

This is a good list to start with. Are these ranked, or are they all at the same priority level?

In terms of tasks, I'm thinking about:

  1. Add values to user table for controls not already handled (location, friends, which might actually already be handled)
  2. Calculate feature values for each hypothesis for each user. Some of these may be more time-intensive than others (for time-of-day posting, for example, will we look at the distribution across all posts and compute a particular value? See the sketch below.) I'll get started on these and ask questions as I go along.
  3. Run tests with statsmodels
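
A rough sketch of one option for the time-of-day value in item 2: aggregate each user's posting-hour distribution and reduce it to a scalar (toy data; the 12:00-18:00 window is a placeholder):

```python
import pandas as pd

# Toy stand-in: one row per tweet, with user id and (naive UTC) timestamp.
tweets = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "created_at": pd.to_datetime([
        "2019-04-01 13:05", "2019-04-01 18:30", "2019-04-02 14:10",
        "2019-04-01 02:00", "2019-04-03 23:45",
    ]),
})

# Share of each user's posts whose hour falls in a hypothetical 12:00-18:00
# window; hours 12..17 cover 12:00:00 through 17:59:59.
tweets["in_window"] = tweets["created_at"].dt.hour.between(12, 17)
per_user_share = tweets.groupby("user_id")["in_window"].mean()
print(per_user_share)  # user 1: 2/3 of posts in window; user 2: 0/2
```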


abenton commented 5 years ago

-1. Before these tasks, you should update the data with the full 7 months of tweets and user information.

-1.1. Before computing features for each of these hypotheses, attach a citation to an article, and a snippet from it, to support the hypothesis.

  1. Time-of-day posting shouldn't be hard. We need to see what the recommended time to post tweets is, and then attach a binary feature to each tweet indicating whether or not it falls in this time range (either w.r.t. Eastern time or the user's time zone).
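
A sketch of that binary feature, assuming a placeholder 12:00-15:00 Eastern window until we find a citable recommendation:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

EASTERN = ZoneInfo("America/New_York")

def posted_in_window(created_at_utc: datetime,
                     start_hour: int = 12, end_hour: int = 15) -> bool:
    """Binary feature: does the tweet fall in the recommended posting window?

    Evaluated w.r.t. Eastern time here; swap in the user's zone when known.
    The 12:00-15:00 window is a placeholder until we have a citable one."""
    local = created_at_utc.astimezone(EASTERN)
    return start_hour <= local.hour < end_hour

# 17:30 UTC on 2019-04-07 is 13:30 Eastern (EDT), so this prints True.
print(posted_in_window(datetime(2019, 4, 7, 17, 30, tzinfo=timezone.utc)))
```
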
bellecarrell commented 5 years ago

Okay, I'll start on -1 tomorrow.

I don't think I have anything to meet about tomorrow. I'll probably run into issues trying to add the new data since it's in a new directory, so I might ask questions about that.


abenton commented 5 years ago

Anatomy of a tweet object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html
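
For example, a sketch pulling the hypothesis-relevant fields out of one tweet object (field names follow the data dictionary linked above):

```python
def tweet_hypothesis_fields(tweet: dict) -> dict:
    """Pull out the fields the hypotheses need from one tweet object."""
    entities = tweet.get("entities", {})
    return {
        "is_rt": "retweeted_status" in tweet,                  # RT hypothesis
        "n_urls": len(entities.get("urls", [])),               # URL hypothesis
        "n_mentions": len(entities.get("user_mentions", [])),  # engagement
        "created_at": tweet.get("created_at"),                 # time of day / regularity
    }

# Toy example with the minimal fields:
print(tweet_hypothesis_fields({
    "created_at": "Sun Apr 07 13:30:00 +0000 2019",
    "entities": {"urls": [{"expanded_url": "https://example.com"}],
                 "user_mentions": []},
}))
```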

abenton commented 5 years ago

@bellecarrell Features to extract for hypotheses:

abenton commented 5 years ago

@bellecarrell also save # of tweets made overall. When computing entropy, make sure to smooth the distribution -- can just do add-\delta smoothing where \delta is something smallish (e.g. 0.1). In case we need to go back and recompute entropy, I would also write out the distributions you compute entropy over, so we can try different smoothing schemes.

I am worried about cases where the blogger may have just posted a single tweet, in which case entropy will be 0 if unsmoothed.
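
A minimal sketch of the smoothed entropy computation (add-\delta with \delta = 0.1 over a fixed topic support):

```python
import numpy as np

def smoothed_entropy(counts, delta=0.1):
    """Entropy (nats) of a count vector after add-delta smoothing.

    Smoothing keeps a one-tweet user from getting entropy exactly 0, and
    saving the raw counts lets us rerun with a different delta later."""
    counts = np.asarray(counts, dtype=float) + delta
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# One tweet on one topic out of five: unsmoothed entropy would be exactly 0.
print(smoothed_entropy([1, 0, 0, 0, 0]))         # ~0.95 with delta = 0.1
print(smoothed_entropy([1, 0, 0, 0, 0], 1e-4))   # near 0 with a tiny delta
```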

abenton commented 5 years ago

@bellecarrell Another corner case to consider: when computing the binary feature "had post on Friday", you should leave this feature null if there was no Friday in the past aggregation window. That way, we will drop examples where we cannot test this hypothesis.
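
A sketch of that guard, assuming posts and the aggregation window are given as datetime.date values:

```python
from datetime import date, timedelta

def had_friday_post(post_dates, window_start: date, window_end: date):
    """'Had post on Friday' over [window_start, window_end], inclusive.

    Returns None when the window contains no Friday, so those examples get
    dropped instead of being scored as a 0."""
    n_days = (window_end - window_start).days + 1
    window = [window_start + timedelta(days=d) for d in range(n_days)]
    fridays = {d for d in window if d.weekday() == 4}  # Mon=0 ... Fri=4
    if not fridays:
        return None
    return any(d in fridays for d in post_dates)

# Mon-Wed window contains no Friday -> None (feature left null).
print(had_friday_post([date(2019, 4, 9)], date(2019, 4, 8), date(2019, 4, 10)))
# Full week contains a Friday, and there was a post that Friday -> True.
print(had_friday_post([date(2019, 4, 12)], date(2019, 4, 8), date(2019, 4, 14)))
```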