bellecarrell / twitter_brand

In developing a brand on Twitter (and social media in general), how does what you say and how you say it correspond to positive results (more followers, for example)?

Add IV types to feature table #113

Open bellecarrell opened 5 years ago

bellecarrell commented 5 years ago

@bellecarrell also save the # of tweets made overall. When computing entropy, make sure to smooth the distribution -- add-δ smoothing, where δ is something smallish (e.g. 0.1), is fine. In case we need to go back and recompute entropy, I would also write out the distributions you compute entropy over, so we can try different smoothing schemes.

I am worried about cases where the blogger may have just posted a single tweet, in which case entropy will be 0 if unsmoothed.
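A minimal sketch of the add-δ smoothed entropy described above (the four-bin count vector is illustrative only; the real features may bin tweets differently):

```python
import numpy as np

def smoothed_entropy(counts, delta=0.1):
    """Entropy (in nats) of a count distribution with add-delta smoothing.

    `counts` holds per-category counts (e.g. tweets per time-of-day bin);
    `delta` is added to every cell before normalizing, so a user with a
    single tweet no longer yields a degenerate, zero-entropy distribution.
    """
    counts = np.asarray(counts, dtype=float) + delta
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log(probs)))

# One tweet in one bin: unsmoothed entropy would be 0; smoothing keeps it positive.
print(smoothed_entropy([1, 0, 0, 0], delta=0.1))
```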

bellecarrell commented 5 years ago

Current status: checked items have code but still need testing; unchecked items have no code yet.

bellecarrell commented 5 years ago

The last 3 sections of features -- time-of-day posting, topic, and sentiment -- aren't in the current timeline table (sentiment, topic, and time-of-day all live in separate tables that still need to be joined in). Details for topic and sentiment:

Sentiment features can be found here on the COE grid:

/exp/abenton/twitter_brand_workspace_20190417/sentiment/promoting_user_tweets.with_lexiconbased_sentiment.noduplicates.tsv.gz

Columns: "tweet_sentiment_score" and "tweet_sentiment_class"
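For reference, a sketch of loading these columns with pandas; the join key (`tweet_id`) is a guess and should be checked against the actual table schema:

```python
import pandas as pd

SENT_PATH = ("/exp/abenton/twitter_brand_workspace_20190417/sentiment/"
             "promoting_user_tweets.with_lexiconbased_sentiment.noduplicates.tsv.gz")

# pandas infers gzip compression from the .gz suffix.
sent = pd.read_csv(SENT_PATH, sep="\t")
print(sent[["tweet_sentiment_score", "tweet_sentiment_class"]].head())

# Joining into the timeline table -- "tweet_id" is an assumed key column:
# timeline = timeline.merge(
#     sent[["tweet_id", "tweet_sentiment_score", "tweet_sentiment_class"]],
#     on="tweet_id", how="left")
```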

Sentiment was inferred using a lexicon that also accounts for words appearing in negated spans (i.e., preceded by a negation word with no intervening punctuation). See the following reference for a description (S140 AffLex and S140 NegLex); a toy sketch of the negation-marking convention follows the reference.

@Article{kiritchenko2014sentiment,
  title   = {Sentiment analysis of short informal texts},
  author  = {Kiritchenko, Svetlana and Zhu, Xiaodan and Mohammad, Saif M.},
  journal = {Journal of Artificial Intelligence Research},
  volume  = {50},
  pages   = {723--762},
  year    = {2014}
}
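A toy illustration of that negated-span convention -- not the actual S140 NegLex code; the negator list and the `_NEG` tagging scheme here are assumptions:

```python
import re

NEGATORS = {"not", "no", "never", "n't", "cannot"}  # illustrative list, not the S140 one

def mark_negated_spans(tokens):
    """Append a _NEG tag to tokens that follow a negation word,
    up to (not including) the next punctuation token."""
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]+", tok):
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATORS:
            negating = True
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negating else tok)
    return out

print(mark_negated_spans("i do not like this movie .".split()))
# ['i', 'do', 'not', 'like_NEG', 'this_NEG', 'movie_NEG', '.']
```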

Topic weights per tweet can be found here:

/exp/abenton/twitter_brand_workspace_20190417/topic_modeling/promoting_user_tweets.with_topic_dist_inferred_by_nmf-k50_userlevel.noduplicates.tsv.gz

under column "topics_per_tweet".
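A sketch of reading the per-tweet topic weights. How the 50-dim vector is serialized inside `topics_per_tweet` is an assumption (space-separated floats here), so inspect a row before relying on this:

```python
import numpy as np
import pandas as pd

TOPIC_PATH = ("/exp/abenton/twitter_brand_workspace_20190417/topic_modeling/"
              "promoting_user_tweets.with_topic_dist_inferred_by_nmf-k50_userlevel"
              ".noduplicates.tsv.gz")

topics = pd.read_csv(TOPIC_PATH, sep="\t")

# Assumed serialization: space-separated floats. Verify against an actual row.
first = str(topics["topics_per_tweet"].iloc[0])
weights = np.array(first.split(), dtype=float)
print(weights.shape)  # expect (50,) for the k=50 model
```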

Trained NMF model: /exp/abenton/twitter_brand_workspace_20190417/topic_modeling/nmf-k50-alpha0.0.model.pickle

Representative words per topic: /exp/abenton/twitter_brand_workspace_20190417/topic_modeling/nmf-k50-alpha0.0.topics.txt

Used a rank-50 non-negative matrix factorization to infer "topics". The NMF was fit on a document-term matrix where I treat all of a user's tweets as a single document; I then applied this model to each individual tweet to infer its weighting over topics. A toy sketch of this fit/transform flow is below.
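A sketch of the fit-at-user-level / transform-per-tweet flow with scikit-learn. The vectorizer settings, toy documents, and tiny rank are placeholders; the real model is the pickled rank-50 NMF linked above:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins: in the real pipeline each "document" is the concatenation
# of all tweets by one user (vectorizer settings here are assumptions).
user_docs = [
    "coffee shop opening new menu latte",
    "gym workout protein fitness routine",
    "coffee beans roast espresso brew",
    "fitness gym training cardio protein",
]
tweets = ["new latte on the menu today", "leg day at the gym"]

vec = TfidfVectorizer()
X_users = vec.fit_transform(user_docs)

# The real model is rank 50 (nmf-k50); a tiny rank is used so the toy data fits.
nmf = NMF(n_components=2, random_state=0)
nmf.fit(X_users)

# Applying the user-level model to individual tweets yields per-tweet topic
# weights, matching the "topics_per_tweet" column.
tweet_topics = nmf.transform(vec.transform(tweets))
print(tweet_topics)  # shape: (n_tweets, 2) here; (n_tweets, 50) in the real setup
```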