ECIR 2021. Tweet Length Matters: A Comparative Analysis on Topic Detection in Microblogs

soroush-ziaeinejad commented 2 years ago

Why did I choose this paper? Because it analyzes the effect of tweet length on topic modeling methods.

Main problem:

Which model performs best for topic detection in short texts (tweets)? Does the length of a tweet affect the performance of topic detection methods?

Existing work:

The main issue with most common topic detection methods is that they are designed and trained to extract topics from regular-length text (in terms of the number of words).

Traditional methods: counting-based. Gap: sparsity and vocabulary mismatch. Subsequent attempts:
Tweet-specific feature extraction
Topic memory networks
Topically enriched word embeddings
RNN and LSTM
CNN for Twitter sentiment analysis
BERT

Inputs:

Tweets

Outputs:

The performance of different methods on the topic detection task for short texts (tweets)
An analysis of the effect of tweet length on topic detection performance

Method:

Preprocessing (data cleaning):

removing:
- retweets
- multiple and various hashtags
- very short tweets
- very short and very long words
- stop words
Lowercasing
Lemmatization
Select 10% of the whole dataset as out-of-topic tweets and add them to the dataset.
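
The cleaning steps above could be sketched roughly as follows. This is only an illustration, not the authors' code: the thresholds, the stop-word list, the "RT " retweet marker, the reading of "multiple and various hashtags" as "more than one distinct hashtag", and the identity `lemmatize` placeholder are all assumptions, since the summary does not give the paper's actual values.

```python
import re

# Hypothetical thresholds and stop-word list (not taken from the paper).
MIN_TOKENS = 3
MIN_WORD_LEN, MAX_WORD_LEN = 2, 15
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "for", "on"}

HASHTAG = re.compile(r"#(\w+)")
TOKEN = re.compile(r"[a-z]+")


def lemmatize(word):
    """Placeholder: plug in a real lemmatizer here (identity by default)."""
    return word


def preprocess(tweet):
    """Return a list of cleaned tokens, or None if the tweet is filtered out."""
    text = tweet.strip()
    # Drop retweets (assumed to be marked with a leading "RT ").
    if text.startswith("RT "):
        return None
    # Drop tweets that carry more than one distinct hashtag (assumed reading).
    hashtags = {h.lower() for h in HASHTAG.findall(text)}
    if len(hashtags) > 1:
        return None
    # Lowercase, strip hashtags from the text, and tokenize.
    text = HASHTAG.sub(" ", text.lower())
    tokens = [
        lemmatize(t)
        for t in TOKEN.findall(text)
        # Drop very short / very long words and stop words.
        if MIN_WORD_LEN <= len(t) <= MAX_WORD_LEN and t not in STOP_WORDS
    ]
    # Drop very short tweets.
    if len(tokens) < MIN_TOKENS:
        return None
    return tokens
```

For example, `preprocess("Vaccines are arriving in the city #covid")` keeps only the content words, while a retweet or a tweet with two different hashtags returns `None`.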

The preprocessed tweets are given to the models, and the F-measure is calculated to determine performance. After that, tweets are pooled by length, and the top-4 models are trained on each pool. The results show whether the length of a text really affects the performance of topic modeling methods.
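
To make the length-pooling idea concrete, here is a minimal sketch that groups tweets by word count and computes a per-pool F-measure for one topic. It only covers the evaluation side (the paper also retrains the models per pool), and the bin width, function names, and single-class F1 are assumptions; the summary does not say how the paper bins lengths or averages the F-measure.

```python
from collections import defaultdict


def f_measure(gold, predicted, positive):
    """F1 for one topic label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pool_by_length(tweets, bin_width=5):
    """Group tweet indices by word count; bin_width is a hypothetical choice."""
    pools = defaultdict(list)
    for i, tweet in enumerate(tweets):
        n_words = len(tweet.split())
        lower = (n_words // bin_width) * bin_width
        pools[(lower, lower + bin_width - 1)].append(i)
    return dict(pools)


def f_measure_per_pool(tweets, gold, predicted, positive, bin_width=5):
    """Compute the F-measure separately inside each length pool."""
    pools = pool_by_length(tweets, bin_width)
    return {
        bounds: f_measure(
            [gold[i] for i in idx], [predicted[i] for i in idx], positive
        )
        for bounds, idx in pools.items()
    }
```

Comparing the values returned by `f_measure_per_pool` across pools is one simple way to see whether a model does better on longer or shorter tweets.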

Experimental Setup:

Dataset: 100 million tweets collected from April 7, 2020 to June 15, 2020, using hashtags
Top-6 extracted topics: Covid, Black Lives Matter, Korean music, Bollywood, Games, US politics.

Baselines:

Boolean Search: inverted index
Topic Modeling: LDA
Bag-of-Words: TF-IDF
Word Embeddings: FastText (the successor of Word2Vec and GloVe)
Neural Networks: CNN
Transformer-Based Language Models: DistilBERT
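
As an illustration of the simplest baseline in this list, the sketch below builds an inverted index over tweets and answers Boolean AND/OR keyword queries. How the paper actually maps Boolean queries to topics is not stated in this summary, so the function names and the whitespace tokenization are assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_or(index, terms):
    """Ids of documents containing at least one of the terms."""
    result = set()
    for term in terms:
        result |= index.get(term.lower(), set())
    return result


def boolean_and(index, terms):
    """Ids of documents containing all of the terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()
```

A tweet could then be assigned to a topic if it matches that topic's (hypothetical) keyword query, for example `boolean_or(index, ["covid", "lockdown"])`.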

Results:

Performance comparison (best models first):
1. CNN
2. BERT
3. BOW
4. Word Embedding
5. LDA
6. Boolean Search
Effect of length: The length of a tweet matters for the effectiveness of a topic-detection method in both evaluation and training. Best performance: 25-30 words on average.

Code:

The code for this paper is unavailable. The dataset is available at: https://github.com/avaapm/ECIR2021

Presentation:

There is no available presentation for this paper.

hosseinfani commented 2 years ago

@soroush-ziaeinejad where is the body?!

soroush-ziaeinejad commented 2 years ago

Will be added today. I just wanted to put them on the to-do list for now.
