Implement FastText topics

Sotera / watchman

Watchman: An open-source social-media event-detection system

GNU General Public License v2.0

Implement FastText topics #171

Open lukewendling opened 7 years ago

lukewendling commented 7 years ago

Overview

Run a daily batch process that trains a fastText model on labeled tweets (a labeled tweet is one with exactly one hashtag) and then uses the model to classify every tweet from that day into one of X hashtags (topics), where X is the number of most popular hashtags kept. The final output groups the tweets by QCR campaign.
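To make the "labeled tweet" definition concrete, here is a minimal sketch of selecting the top X hashtags and building training lines. It rests on several assumptions that are not in this issue: hashtags are matched with a simple `#\w+` regex, labels are lowercased, popularity is counted over single-hashtag tweets only, and the hashtag is stripped from the training text. The helper names (`single_hashtag`, `top_hashtags`, `to_training_line`) and the sample tweets are hypothetical. The `__label__` prefix is fastText's default label marker for supervised training files.

```python
import re
from collections import Counter

# Simplified hashtag pattern (assumption): '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")

def single_hashtag(text):
    """Return the lowercased hashtag if the tweet has exactly one, else None."""
    tags = HASHTAG_RE.findall(text)
    return tags[0].lower() if len(tags) == 1 else None

def top_hashtags(texts, x):
    """Return the set of the x most frequent labels among single-hashtag tweets."""
    counts = Counter(tag for tag in map(single_hashtag, texts) if tag)
    return {tag for tag, _ in counts.most_common(x)}

def to_training_line(text, popular):
    """Format a labeled tweet as '__label__<tag> <text without hashtag>', or None."""
    tag = single_hashtag(text)
    if tag is None or tag not in popular:
        return None
    # Drop the hashtag from the body so the label is not also a feature (assumption).
    body = " ".join(HASHTAG_RE.sub(" ", text).split())
    return f"__label__{tag} {body}"

# Illustrative data only.
tweets = [
    "Flooding downtown #storm",
    "Power is out again #storm",
    "Great game tonight #playoffs",
    "#storm and #playoffs in one tweet",  # two hashtags: not a labeled tweet
    "No hashtag at all",                  # no hashtag: not a labeled tweet
]
popular = top_hashtags(tweets, x=1)
lines = [l for l in (to_training_line(t, popular) for t in tweets) if l]
```

Writing `lines` to a text file, one per line, would give a file in the format fastText's supervised mode expects.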

Process

1. Collect the list of most popular hashtags for the period.
2. Train a fastText model on all single-hashtag tweets for the period whose hashtag is in the popular list.
3. Classify all tweets for the period using the model, and include a confidence score with each prediction.
4. Filter results by confidence score to prune poor matches.
5. Group results by QCR campaign.
6. Final output is weighted topics by campaign, by day.
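The filtering, grouping, and weighting steps could look roughly like the sketch below. It assumes each classified tweet is available as a `(campaign, day, topic, confidence)` tuple and that a topic's "weight" is its share of the retained tweets for that campaign and day. The issue does not define the weighting or the threshold, so the `0.5` cutoff, the function name `weighted_topics`, and the sample data are all placeholders.

```python
from collections import defaultdict

def weighted_topics(predictions, min_confidence=0.5):
    """Filter predictions by confidence, then compute topic shares.

    predictions: iterable of (campaign, day, topic, confidence) tuples.
    Returns {(campaign, day): {topic: weight}}, with weights summing to 1
    within each (campaign, day) group.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for campaign, day, topic, confidence in predictions:
        # Prune poor matches (threshold is an assumed placeholder).
        if confidence >= min_confidence:
            counts[(campaign, day)][topic] += 1

    result = {}
    for key, topics in counts.items():
        total = sum(topics.values())
        result[key] = {topic: n / total for topic, n in topics.items()}
    return result

# Illustrative data only.
predictions = [
    ("campaign_a", "2017-06-01", "storm", 0.9),
    ("campaign_a", "2017-06-01", "storm", 0.8),
    ("campaign_a", "2017-06-01", "playoffs", 0.7),
    ("campaign_a", "2017-06-01", "playoffs", 0.2),  # dropped: below threshold
    ("campaign_b", "2017-06-01", "playoffs", 0.95),
]
result = weighted_topics(predictions)
```

Summing confidence scores instead of counting tweets would be an equally plausible weighting. Whichever is chosen should be noted alongside the other data-cleansing decisions.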

drJAGartner commented 7 years ago

Yes, one note: the PM used more or less the same technique, but his nuance was to collect all tweets with individual tags and simply dedupe. I don't think there's a wrong way, but note what you do for data cleansing so we can present it that way.

lukewendling commented 7 years ago

@drJAGartner

Process questions

The current Jupyter notebook excludes retweets from training. Should we do the same for quoted tweets?
Should we classify retweets, or ignore them in the final output?

lukewendling commented 7 years ago

De-dupe: use one retweet as a stand-in for the original tweet.

lukewendling commented 7 years ago

fyi, de-duped retweets like this, mostly thanks to DataFrame#dropDuplicates():

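        # Assumes `df` is an existing Spark DataFrame with `featurizer` and
        # `broadcast_post_id` columns; a missing id may be a real null or the string "null".
        df_hashtag = df.where('featurizer == "hashtag"')
        is_original = 'broadcast_post_id is null or broadcast_post_id == "null"'

        # Retweets carry the id of the post they rebroadcast; keep one row per original.
        df_retweets = df_hashtag\
            .where('not ({})'.format(is_original))\
            .dropDuplicates(['broadcast_post_id'])

        # Rows without a broadcast id are not retweets and are kept as-is.
        df_no_retweets = df_hashtag.where(is_original)

        df_hash = df_retweets.union(df_no_retweets)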