Open lukewendling opened 7 years ago
Yes, one note: the PM used more or less the same technique, but his nuance was to collect all tweets with individual tags and simply dedupe. I don't think there's a wrong way; just note what you do for data cleansing so we can present it that way.
@drJAGartner
De-dupe: use 1 retweet as fill-in for original tweet.
fyi, de-duped retweets like this, mostly thanks to DataFrame#dropDuplicates():
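# Retweets: rows that point at an original tweet via broadcast_post_id.
df_retweets = df\
    .where('featurizer == "hashtag"')\
    .where('broadcast_post_id is not null and broadcast_post_id != "null"')
# Everything else: broadcast_post_id is missing or is the literal string "null".
df_no_retweets = df\
    .where('featurizer == "hashtag"')\
    .where('broadcast_post_id == "null" or broadcast_post_id is null')
# Keep one retweet per original tweet, then recombine.
df_retweets = df_retweets.dropDuplicates(['broadcast_post_id'])
df_hash = df_retweets.union(df_no_retweets)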
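For sanity-checking the rule on a small sample without a Spark session, here is a minimal plain-Python sketch of the same idea: keep one row per `broadcast_post_id`, and keep every row that is not a retweet. The `post_id` field and the sample rows are made up for illustration. It also keeps the first retweet it sees, whereas `dropDuplicates()` makes no promise about which row survives.

```python
def dedupe_retweets(rows):
    """Keep one row per broadcast_post_id; keep every non-retweet row."""
    seen = set()
    kept = []
    for row in rows:
        original_id = row.get("broadcast_post_id")
        # Treat both a real None and the literal string "null" as "not a retweet".
        if original_id is None or original_id == "null":
            kept.append(row)
        elif original_id not in seen:
            # First retweet of this original tweet stands in for the original.
            seen.add(original_id)
            kept.append(row)
    return kept


# Hypothetical sample rows; "post_id" is an illustrative field name.
sample = [
    {"post_id": "a", "broadcast_post_id": "t1"},
    {"post_id": "b", "broadcast_post_id": "t1"},
    {"post_id": "c", "broadcast_post_id": None},
    {"post_id": "d", "broadcast_post_id": "null"},
    {"post_id": "e", "broadcast_post_id": "t2"},
]
deduped = dedupe_retweets(sample)
print([row["post_id"] for row in deduped])
```

Row `b` is dropped because it retweets the same original as `a`; the two non-retweet rows pass through untouched.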
Overview
Run a daily batch process that uses labeled tweets (a labeled tweet is one with exactly 1 hashtag) to train a fastText model, then use that model to classify all of that day's tweets into 1 of X hashtags (topics), where X is the number of most popular hashtags to keep. The final output groups the tweets by QCR campaign.
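A rough sketch of how the labeled training set could be built, in plain Python. The `__label__` prefix is fastText's default label marker for supervised training files. Everything else is an assumption for illustration: the hashtag regex, lowercasing the tags, stripping the hashtag from the text so the label does not leak into the features, and the names `labeled_examples`, `training_lines`, and `num_topics` (X above).

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def labeled_examples(tweets):
    """Yield (hashtag, text) for tweets that contain exactly one hashtag."""
    for text in tweets:
        tags = HASHTAG_RE.findall(text)
        if len(tags) == 1:
            yield tags[0].lower(), text


def training_lines(tweets, num_topics):
    """Build fastText-style lines for the num_topics most popular hashtags."""
    examples = list(labeled_examples(tweets))
    counts = Counter(tag for tag, _ in examples)
    top_tags = {tag for tag, _ in counts.most_common(num_topics)}
    lines = []
    for tag, text in examples:
        if tag in top_tags:
            # Assumption: drop the hashtag from the text and collapse whitespace.
            body = " ".join(HASHTAG_RE.sub("", text).split())
            lines.append(f"__label__{tag} {body}")
    return lines


# Hypothetical tweets for illustration only.
tweets = [
    "Flooding downtown #harvey",
    "Stay safe everyone #Harvey",
    "Game night #nba",
    "Two tags #nba #harvey",
    "No tags here",
]
lines = training_lines(tweets, num_topics=1)
print("\n".join(lines))
```

The tweet with two hashtags and the tweet with none are left out of training; they are the ones the trained model would classify afterwards.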
Process