Why did I choose this paper? Because it analyzes the effect of tweet length on topic modeling methods.
Main problem:
Which model performs best for topic detection in short text (tweets)?
Does the length of tweets affect the performance of topic detection methods?
Existing work:
The main issue with most common topic detection methods is that they are designed and trained to extract topics from text of regular length (in terms of the number of words).
Traditional methods: count-based. Gap: sparsity and vocabulary mismatch. Subsequent attempts:
Tweet-specific feature extraction
Topic memory networks
Topically enriched word embeddings
RNN and LSTM
CNN for Twitter sentiment analysis
BERT
Inputs:
Tweets
Outputs:
The performance of different methods on the topic detection task in short text (tweets)
An analysis of the effect of tweet length on topic detection performance
Method:
Preprocessing (data cleaning):
removing:
retweets
multiple and various hashtags
very short tweets
very short and very long words
stop words
lowercasing
lemmatization
selecting 10% of the whole dataset as out-of-topic tweets and adding them to the dataset
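The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the stop-word list, the length thresholds, the "RT @" retweet check, the reading of "multiple and various hashtags" as "more than one distinct hashtag", and the stripping of hashtags from the text are all assumptions, and lemmatization is left out because it needs an external library.

```python
import re

# Illustrative values only; the notes do not state the paper's actual settings.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for", "on"}
MIN_WORD_LEN, MAX_WORD_LEN = 2, 15   # assumed bounds for "very short/long words"
MIN_TWEET_WORDS = 3                  # assumed bound for "very short tweets"
MAX_HASHTAGS = 1                     # assumed reading of "multiple and various hashtags"

def preprocess(tweet):
    """Return a list of cleaned tokens, or None if the tweet is filtered out."""
    # Drop retweets (assumed here to start with "RT @").
    if tweet.startswith("RT @"):
        return None
    # Drop tweets that carry more than MAX_HASHTAGS distinct hashtags.
    hashtags = {h.lower() for h in re.findall(r"#\w+", tweet)}
    if len(hashtags) > MAX_HASHTAGS:
        return None
    # Lowercase, strip URLs, mentions, and hashtags, then keep alphabetic tokens.
    text = re.sub(r"https?://\S+|[@#]\w+", " ", tweet.lower())
    tokens = re.findall(r"[a-z]+", text)
    # Remove stop words and very short or very long words.
    tokens = [t for t in tokens
              if MIN_WORD_LEN <= len(t) <= MAX_WORD_LEN and t not in STOP_WORDS]
    # Drop tweets that are too short after cleaning.
    if len(tokens) < MIN_TWEET_WORDS:
        return None
    return tokens
```

For example, preprocess("Stay safe during the pandemic everyone #covid") keeps five tokens, while a retweet or a tweet with two different hashtags returns None.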
The preprocessed tweets are given to the models, and the F-measure is calculated to determine performance. The tweets are then pooled by length, and the top-4 models are trained on each pool. The results show whether the length of a text affects the performance of topic detection methods.
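The evaluation described here can be sketched as below. The per-topic F-measure is the standard harmonic mean of precision and recall; how the paper averages it across topics is not stated in these notes. The pooling function and its bin width of 5 words are hypothetical, chosen only to show how tweets could be grouped by length.

```python
from collections import defaultdict

def f_measure(y_true, y_pred, label):
    """F1 score for one topic label: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pool_by_length(tokenized_tweets, bin_width=5):
    """Group tweet indices into pools keyed by the lower bound of their word-count bin.

    bin_width is an assumed value, not taken from the paper.
    """
    pools = defaultdict(list)
    for i, tokens in enumerate(tokenized_tweets):
        pools[(len(tokens) // bin_width) * bin_width].append(i)
    return dict(pools)
```

Each pool can then be used to train and score a model separately, so that F-measure values can be compared across length ranges.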
Experimental Setup:
Dataset: 100 million tweets, collected from April 7, 2020 to June 15, 2020, using hashtags
Top-6 extracted topics: Covid, Black Lives Matter, Korean music, Bollywood, Games, US politics.
Baselines:
Boolean Search: inverted index
Topic Modeling: LDA
Bag-of-Words: TF-IDF
Word Embeddings: FastText (the successor of Word2Vec and GloVe)
Neural Networks: CNN
Transformer-Based Language Models: DistilBERT
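To make the two simplest baselines concrete, here is a small sketch of an inverted index with Boolean OR search and of TF-IDF weighting. It is an illustration under assumptions: the notes do not say which query operators or which TF-IDF variant the paper uses, so the OR matching and the tf * log(N / df) formula are stand-ins, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for term in tokens:
            index[term].add(doc_id)
    return index

def boolean_or_search(index, keywords):
    """Return ids of documents containing at least one keyword (Boolean OR)."""
    hits = set()
    for k in keywords:
        hits |= index.get(k, set())
    return hits

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors
```

With a Boolean baseline, a tweet would be assigned to a topic when it matches that topic's keywords; with TF-IDF, the resulting vectors would feed a separate classifier.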
Results:
Performance comparison
best models:
CNN
BERT
BOW
Word Embedding
LDA
Boolean Search
Effect of length
The length of a tweet matters for the effectiveness of a topic detection method, in both evaluation and training. Best performance: tweets of 25-30 words on average.
Code:
The code for this paper is unavailable. The dataset is available at: https://github.com/avaapm/ECIR2021
Presentation:
There is no available presentation for this paper.