Sotera / watchman

Watchman: An open-source social-media event-detection system
GNU General Public License v2.0
Implement FastText topics #171

Open lukewendling opened 7 years ago

lukewendling commented 7 years ago

Overview

Run a daily batch process that uses labeled tweets (a labeled tweet is one with exactly 1 hashtag) to train a fasttext model, then uses that model to classify every tweet from that day into 1 of X hashtags (topics), where X is the number of most popular hashtags we keep. The final output groups the tweets by QCR campaign.
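
To make the "labeled tweet" definition concrete, here is a minimal sketch of turning a tweet into a fastText-style supervised training line (`__label__<tag> <text>`). The helper name `to_training_line`, the hashtag regex, the lowercasing, and the choice to strip the hashtag from the text are all assumptions for illustration, not something settled in this issue.

```python
import re

# Assumed hashtag pattern; real tweets may need the hashtag entities from the API instead.
HASHTAG_RE = re.compile(r"#(\w+)")


def to_training_line(text):
    """Return a '__label__<tag> <text>' line, or None if the tweet is not labeled.

    A tweet counts as labeled only when it contains exactly one hashtag.
    """
    tags = HASHTAG_RE.findall(text)
    if len(tags) != 1:
        return None
    label = tags[0].lower()
    # Assumption: drop the hashtag from the body so the model cannot simply memorize it.
    body = HASHTAG_RE.sub(" ", text)
    body = " ".join(body.lower().split())
    return f"__label__{label} {body}"


if __name__ == "__main__":
    print(to_training_line("Big win tonight #WorldCup"))
    print(to_training_line("#a and #b"))
```

Tweets with zero or several hashtags return `None` here, so they are left out of training but can still be classified later.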

Process

  1. Collect list of most popular hashtags for the period.
  2. Train a fasttext model on all single-hashtag tweets for the period whose hashtag is in the popular list.
  3. Classify all tweets for the period using the model, and include a confidence score.
  4. Filter results by confidence score to prune poor matches.
  5. Group results by QCR campaign.
  6. Final output is weighted topics by campaign, by day.
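
A rough sketch of steps 1 and 4 through 6 in plain Python, to pin down the data flow. Steps 2 and 3 (training and prediction) are left out because they depend on the fastText binding in use. The function names, the record fields (`campaign`, `day`, `topic`, `confidence`), the 0.5 threshold, and the idea that a topic's weight is its share of retained tweets within a campaign and day are all assumptions, not decisions made in this issue.

```python
from collections import Counter, defaultdict


def popular_hashtags(labeled, x):
    """Step 1: return the x most frequent labels among (label, text) pairs."""
    counts = Counter(label for label, _ in labeled)
    return [label for label, _ in counts.most_common(x)]


def weighted_topics(predictions, min_confidence=0.5):
    """Steps 4-6: prune low-confidence rows, then weight topics per (campaign, day).

    Each prediction is a dict with 'campaign', 'day', 'topic' and 'confidence'.
    The weight of a topic is its share of the retained tweets for that key.
    """
    counts = defaultdict(Counter)
    for p in predictions:
        if p["confidence"] >= min_confidence:  # step 4: filter poor matches
            counts[(p["campaign"], p["day"])][p["topic"]] += 1  # step 5: group

    result = {}
    for key, topic_counts in counts.items():  # step 6: normalize to weights
        total = sum(topic_counts.values())
        result[key] = {topic: n / total for topic, n in topic_counts.items()}
    return result


if __name__ == "__main__":
    preds = [
        {"campaign": "c1", "day": "day1", "topic": "a", "confidence": 0.9},
        {"campaign": "c1", "day": "day1", "topic": "b", "confidence": 0.2},
    ]
    print(weighted_topics(preds))
```

The threshold would need tuning against real model output; a fixed 0.5 is only a placeholder.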

drJAGartner commented 7 years ago

Yes, one note: the PM used more or less the same technique, but his nuance was to collect all tweets with individual tags and simply dedupe. I don't think there's a wrong way, but note what you do for data cleansing so we can present it that way.

lukewendling commented 7 years ago

@drJAGartner

Process questions

  1. The current Jupyter notebook excludes retweets in training. Should we do the same for quoted tweets?
  2. Should we classify retweets, or ignore them in final output?

lukewendling commented 7 years ago

De-dupe: use 1 retweet as fill-in for original tweet.

lukewendling commented 7 years ago

fyi, de-duped retweets like this, mostly thanks to DataFrame#dropDuplicates():

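        # Hashtag rows with a real broadcast_post_id are treated as retweets.
        # The string "null" is excluded so these rows do not also land in df_no_retweets.
        df_retweets = df\
            .where('featurizer == "hashtag"')\
            .where('broadcast_post_id is not null and broadcast_post_id != "null"')

        # Hashtag rows that are not retweets: the id is a real null or the string "null".
        df_no_retweets = df\
            .where('featurizer == "hashtag"')\
            .where('broadcast_post_id == "null" or broadcast_post_id is null')

        # Keep one retweet per original post as a stand-in for that post.
        df_retweets = df_retweets.dropDuplicates(['broadcast_post_id'])

        df_hash = df_retweets.union(df_no_retweets)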