bellecarrell opened this issue 5 years ago
From Lampos:
From reviewers and other sources (and not already mentioned):
@abenton list above. Let me know if you need anything else for ranking.
Adrian Benton (Sat, Apr 6, 2019, 7:15 PM):
Most of these features would be things to control for. "Words" and "topics" are not necessarily hypotheses we can test; stating that we believe that posting more messages about topic "X", or using more of a certain class of words, attracts more followers is a hypothesis. To start, we can look at the following controls and hypotheses to test.
== Control Features ==
- Current log(follower count)
- Current log(friend count)
- Current "user impact score"
- Whether the user has enabled geolocation
- Main domain of specialization
- Any other features you can think of to control for
== Hypotheses to test ==
- posting at specific times of the day -> more followers
- fewer RTs (a proxy for more original content) -> more followers
- an appropriate posting frequency (e.g. 3-5) -> more followers
- more direct messages and @-mentions (a proxy for engaging with the audience) -> more followers
- a higher % of days with a post (posting regularly) -> more followers
- including more URLs in tweets (a proxy for introducing original blog content) -> more followers [Twitter user impact paper]
- "interactivity, defined by an intersection of accounts that tweet regularly, do many @-mentions and @-replies, but also mention many different users" -> more followers [Twitter user impact paper]
- more positive-sentiment posts -> more followers
- a diverse range of topics -> more followers [cite Twitter user impact paper]
For each of these hypotheses we should cite existing advice, ideally from papers; webpages are fine if we cannot find any formal articles.
Models can be built varying the number of control features and hypotheses, the history used to compute the features, and the time period over which the change in follower count is measured.
See Tables 3 and 5 of https://faculty.wharton.upenn.edu/wp-content/uploads/2013/02/SpoilerEffect.pdf for one way to present weights and significance for multiple regression analyses. Table 3, in particular, is a good example: five different regression models are fit, varying the number of hypotheses tested.
Annabelle Carrell:
This is a good list to start with. Are these ranked, or are they all at the same priority level?
In terms of tasks, I'm thinking about:
Adrian Benton (Sun, Apr 7, 2019, 9:20 AM):
-1. Before these tasks, you should update the data with the full 7 months of tweets and user information.
-1.1. Before computing features for each of these hypotheses, attach a citation to an article and a snippet to support the hypothesis.
- Time-of-day posting shouldn't be hard. We need to see what the recommended time to post tweets is, and then attach a binary feature to each tweet indicating whether or not it falls in this time range (either w.r.t. Eastern time or the user's time zone).
Annabelle Carrell:
Okay, I'll start on -1 tomorrow.
I don't think I have anything to meet about tomorrow. I'll probably run into issues trying to add the new data since it's in a new directory, so I might ask questions about that.
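The time-of-day binary feature could be sketched as below. The 9:00-15:00 Eastern window is only a placeholder assumption until the actually recommended posting time is looked up:

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Placeholder window -- the recommended posting window still needs to be
# looked up; 9:00-15:00 Eastern is an assumption, not a sourced value.
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)
EASTERN = ZoneInfo("America/New_York")

def posted_in_window(created_at_utc: datetime) -> int:
    """1 if the tweet's local (Eastern) time falls inside the window, else 0."""
    local = created_at_utc.astimezone(EASTERN)
    return int(WINDOW_START <= local.time() <= WINDOW_END)

# 14:00 UTC on 2019-04-07 is 10:00 EDT, inside the placeholder window.
print(posted_in_window(datetime(2019, 4, 7, 14, 0, tzinfo=ZoneInfo("UTC"))))  # 1
```

Swapping `EASTERN` for the user's own time zone covers the other variant mentioned in the comment.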
Anatomy of a tweet object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html
@bellecarrell Features to extract for hypotheses:
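Several of the per-tweet indicators can be read straight out of the raw tweet JSON. A sketch follows; the field names come from the v1.1 tweet-object data dictionary linked above, but the exact feature set chosen here is illustrative:

```python
def tweet_features(tweet: dict) -> dict:
    """Per-tweet indicators for several hypotheses, from a v1.1 tweet object.

    Illustrative subset -- the final feature list is up to the project.
    """
    entities = tweet.get("entities", {})
    return {
        "is_retweet": int("retweeted_status" in tweet),        # fewer-RTs hypothesis
        "n_mentions": len(entities.get("user_mentions", [])),  # engagement hypothesis
        "n_urls": len(entities.get("urls", [])),               # URL hypothesis
        "is_reply": int(tweet.get("in_reply_to_status_id") is not None),
    }

# Example on a minimal tweet dict:
t = {"entities": {"user_mentions": [{"screen_name": "abenton"}], "urls": []},
     "in_reply_to_status_id": None}
print(tweet_features(t))  # {'is_retweet': 0, 'n_mentions': 1, 'n_urls': 0, 'is_reply': 0}
```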
@bellecarrell Also save the # of tweets made overall. When computing entropy, make sure to smooth the distribution -- you can just do add-\delta smoothing where \delta is smallish (e.g. 0.1). In case we need to go back and recompute entropy, I would also write out the distributions you compute entropy over, so we can try different smoothing schemes.
I am worried about cases where the blogger may have just posted a single tweet, in which case entropy will be 0 if unsmoothed.
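The add-\delta smoothing suggested above can be sketched as follows; the vocabulary argument and the \delta default of 0.1 are illustrative:

```python
import math
from collections import Counter

def smoothed_entropy(counts: Counter, vocab, delta: float = 0.1) -> float:
    """Entropy (nats) of an add-delta smoothed distribution over vocab."""
    total = sum(counts.values()) + delta * len(vocab)
    probs = [(counts.get(w, 0) + delta) / total for w in vocab]
    return -sum(p * math.log(p) for p in probs)

# A blogger with a single post gets entropy 0 unsmoothed;
# smoothing keeps it strictly positive.
single = smoothed_entropy(Counter({"topic_a": 1}), ["topic_a", "topic_b"])
uniform = smoothed_entropy(Counter({"topic_a": 5, "topic_b": 5}), ["topic_a", "topic_b"])
print(single > 0.0, abs(uniform - math.log(2)) < 1e-9)  # True True
```

Writing out the raw `counts` alongside the entropy, as suggested above, means `delta` can be changed later without recomputing the distributions.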
@bellecarrell Another corner case to consider: when computing the binary feature "had post on Friday", you should leave this feature null if there was no Friday in the past aggregation window. That way, we will drop examples where we cannot test this hypothesis.
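The null-when-untestable rule above might look like this (function name and signature are hypothetical):

```python
from datetime import date, timedelta
from typing import Optional

def had_post_on_friday(post_dates, window_start: date, window_end: date) -> Optional[int]:
    """1/0 if any post fell on a Friday in the window; None if the window has no Friday."""
    n_days = (window_end - window_start).days + 1
    days = [window_start + timedelta(d) for d in range(n_days)]
    fridays = {d for d in days if d.weekday() == 4}  # Monday == 0, Friday == 4
    if not fridays:
        return None  # no Friday in the aggregation window: drop this example
    return int(any(d in fridays for d in post_dates))

# Mon 2019-04-01 through Wed 2019-04-03 contains no Friday.
print(had_post_on_friday([], date(2019, 4, 1), date(2019, 4, 3)))  # None
```

Returning `None` rather than 0 lets the downstream regression code distinguish "did not post on Friday" from "the hypothesis was untestable in this window".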
Assemble the list of features to evaluate using statsmodels.