UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine-tuning SBERT for Document Clustering #542

Open gniewus opened 3 years ago

gniewus commented 3 years ago

Hello,

In my Master's thesis, I am aiming to use BERT for topic modeling / document clustering. As a dataset, I'm using a large corpus of over 100k news articles from a German newspaper (short headline + kicker + article text). The goal is to cluster the articles and later extract topics from the clusters. One extra challenge is to do this dynamically, so that not only the old historic articles are clustered but also freshly scraped new ones.

Currently, I'm using "T-systems/bert-german-dbmdz-uncased-sentence-stsb" to generate a document embedding for every article (usually short texts, around 250 words and 16 sentences on average). This, combined with HDBSCAN, performs quite well; however, I wonder how to improve it. I'm facing a few questions and I'd appreciate any feedback or guidance :)
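For readers following along, here is a minimal sketch of the pipeline described above, assuming the sentence-transformers and hdbscan packages are installed; the article list, the min_cluster_size value, and the exact Hub model id are placeholders:

```python
from sentence_transformers import SentenceTransformer
import hdbscan

# Placeholder corpus; in practice this is the ~100k scraped articles.
articles = [
    "Schlagzeile. Kicker. Artikeltext ...",
    "Another headline. Kicker. Article text ...",
]

# Model name as written in this thread; the exact Hub id may differ.
model = SentenceTransformer("T-systems/bert-german-dbmdz-uncased-sentence-stsb")
embeddings = model.encode(articles, show_progress_bar=True)

# Density-based clustering; min_cluster_size is a guess and needs tuning.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
labels = clusterer.fit_predict(embeddings)  # label -1 marks noise/outliers
```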

  1. What would be the best way to embed a longer text? Should I, for example, summarize the article first, or average the three embeddings for title, kicker, and article text? (This option is sketched after these questions.)

  2. Because I am using density-based clustering (HDBSCAN), I was wondering what the best way is to reduce the dimensionality of the embeddings for clustering. Should I use UMAP, or rather restrict the number of feature dimensions in the pooling layer? (The UMAP option is also sketched below.)

  3. I was wondering how to approach fine-tuning for my dataset.
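Hedged sketches of the averaging option from question 1 and the UMAP option from question 2; umap-learn and hdbscan are assumed installed, and the placeholder articles and every hyperparameter are illustrative only:

```python
import numpy as np
import umap
import hdbscan
from sentence_transformers import SentenceTransformer

# Placeholder data: (title, kicker, body) triples; the real corpus is much larger.
articles = [
    ("Schlagzeile A", "Kicker A", "Artikeltext A ..."),
    ("Schlagzeile B", "Kicker B", "Artikeltext B ..."),
]

model = SentenceTransformer("T-systems/bert-german-dbmdz-uncased-sentence-stsb")  # name as in the thread

# Question 1, averaging option: embed title, kicker, and body separately, then average.
def embed_article(title, kicker, body):
    parts = model.encode([title, kicker, body])
    return parts.mean(axis=0)

embeddings = np.vstack([embed_article(t, k, b) for t, k, b in articles])

# Question 2, UMAP option: reduce the embeddings before handing them to HDBSCAN.
# n_components, n_neighbors, and min_cluster_size are placeholder values.
reduced = umap.UMAP(n_components=10, n_neighbors=15, metric="cosine").fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
```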

PS: Thanks for the great implementation of SBERT and all your work!

nreimers commented 3 years ago

Hi @gniewus, as is so often the case in machine learning, the answer is not known before you have tested it.

  1. There is no clear answer; you have to test all three options.
  2. Sadly, I am not sure. I used PCA, and it worked quite well.
  3. Yes, continuing training from a previously trained model helps. Finding news articles from different outlets on the same story sounds reasonable as training data.
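A minimal sketch of the PCA option mentioned in point 2, assuming scikit-learn; the random placeholder embeddings and the 128-dimensional target are arbitrary examples, not recommendations:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the (n_articles, 768) embedding matrix from model.encode(...).
embeddings = np.random.rand(1000, 768).astype(np.float32)

# Project the embeddings down before clustering; 128 dimensions is an arbitrary choice.
pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)
print(pca.explained_variance_ratio_.sum())  # variance retained by the 128 components
```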

PhilipMay commented 3 years ago

Currently, I'm using "T-systems/bert-german-dbmdz-uncased-sentence-stsb" to generate a document embedding for every article

We have released a much improved model: https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer - I suggest that you use that model.
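For reference, switching to the suggested model is a one-line change; a minimal sketch with placeholder inputs:

```python
from sentence_transformers import SentenceTransformer

# The cross-lingual EN/DE model suggested above.
model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")
embeddings = model.encode(["Erste Schlagzeile ...", "Second headline ..."])
```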

PhilipMay commented 3 years ago

@gniewus here is an interesting discussion about clustering where sklearn_extra's KMedoids is mentioned. K-medoids is the median-like counterpart of centroid-based k-means: each cluster center is an actual data point (a medoid) rather than a mean...

See here: https://github.com/UKPLab/sentence-transformers/issues/320#issuecomment-669823924
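A minimal sketch of the k-medoids alternative discussed in the linked thread, assuming the scikit-learn-extra package; the placeholder embeddings, the number of clusters, and the cosine metric are illustrative choices only:

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

# Placeholder for the article embeddings.
embeddings = np.random.rand(1000, 768).astype(np.float32)

# K-medoids uses actual data points as cluster centers, so every cluster
# can be represented by a real article. n_clusters and the metric are guesses.
kmedoids = KMedoids(n_clusters=50, metric="cosine", random_state=0).fit(embeddings)
labels = kmedoids.labels_
representative_articles = kmedoids.medoid_indices_  # indices of the medoid articles
```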

petulla commented 3 years ago

Hi @gniewus, as is so often the case in machine learning, the answer is not known before you have tested it.

  1. There is no clear answer; you have to test all three options.
  2. Sadly, I am not sure. I used PCA, and it worked quite well.
  3. Yes, continuing training from a previously trained model helps. Finding news articles from different outlets on the same story sounds reasonable as training data.

What do you think about:

  1. Trying a model like Reformer, Linformer, Performer, etc., that can handle longer inputs. Trying a learned meta-embedding in the fine-tuning task by having several models in parallel (for point 3); a simple concatenation baseline is sketched below.
  2. Using the distillation script.
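The "learned meta-embedding" idea above would involve training a combiner on top of several encoders; as a rough, untrained baseline one can simply concatenate the outputs of two models. A hedged sketch, with both model names taken from earlier in this thread and placeholder inputs:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["Erste Schlagzeile ...", "Zweite Schlagzeile ..."]  # placeholder articles

# Two encoders mentioned in this thread; any set of models could be combined.
model_a = SentenceTransformer("T-systems/bert-german-dbmdz-uncased-sentence-stsb")  # name as written above
model_b = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# Naive meta-embedding: concatenate the two embedding spaces.
# A *learned* meta-embedding would instead train, e.g., a projection layer on top.
meta_embeddings = np.concatenate([model_a.encode(texts), model_b.encode(texts)], axis=1)
```
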
matinashtiani commented 3 years ago

Hi @petulla, would you please point me to some references regarding the "learned meta-embedding in the fine-tuning task"? It seems like a very interesting concept to me and I think it will help me in my research. Does SBERT provide such a meta-embedding?

PS: I really appreciate all your efforts in implementing such a helpful framework.

zari-sudo commented 3 years ago

Hello,

In my Master's thesis, I am aiming to use BERT for topic modeling / document clustering. As a dataset, I'm using a large corpus of over 100k news articles from a German newspaper (short headline + kicker + article text). The goal is to cluster the articles and later extract topics from the clusters. One extra challenge is to do this dynamically, so that not only the old historic articles are clustered but also freshly scraped new ones.

Currently, I'm using "T-systems/bert-german-dbmdz-uncased-sentence-stsb" to generate a document embedding for every article (usually short texts, around 250 words and 16 sentences on average). This, combined with HDBSCAN, performs quite well; however, I wonder how to improve it. I'm facing a few questions and I'd appreciate any feedback or guidance :)

  1. What would be the best way to embed a longer text? Should I, for example, summarize the article first, or average the three embeddings for title, kicker, and article text?
  2. Because I am using density-based clustering (HDBSCAN), I was wondering what the best way is to reduce the dimensionality of the embeddings for clustering. Should I use UMAP, or rather restrict the number of feature dimensions in the pooling layer?
  3. I was wondering how to approach fine-tuning for my dataset.
  • I wanted to keep on training the model that is already fine-tuned on the German STSb data, instead of fine-tuning from scratch. Is that right?
  • I thought about using DeepL or AWS to translate the SemEval headlines and manually paraphrasing some of the headlines from my own dataset to generate a fine-tuning dataset. As my input is currently the whole article text, I wonder if that is the right move. Shouldn't I instead find hundreds of pairs of very similar news articles from, for example, different media outlets?

PS: Thanks for the great implementation of SBERT and all your work!
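The bullets in the quoted post ask whether to keep training the German STSb model and how to build training pairs, which matches the earlier advice about pairing articles on the same story from different outlets. A hedged sketch of that idea using the sentence-transformers training API; the starting model, the example pairs, and all hyperparameters are placeholders, and the choice of MultipleNegativesRankingLoss is one possible option, not a recommendation from this thread:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the already fine-tuned German model instead of a raw BERT checkpoint.
model = SentenceTransformer("T-systems/bert-german-dbmdz-uncased-sentence-stsb")  # name as in the thread

# Hypothetical pairs of articles from different outlets covering the same story.
train_examples = [
    InputExample(texts=["Artikel von Outlet A ...", "Artikel von Outlet B zum selben Thema ..."]),
    InputExample(texts=["Another story, outlet A ...", "Same story, outlet B ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss only needs positive pairs; the other pairs in a
# batch act as negatives, which fits the "same story, different outlet" setup.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```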


Hi Tomasz Tkaczyk, I would like to ask whether you have finished your implementation. Could you please help me in this regard, as I have almost the same project, with the extra requirement of hierarchical clustering for topic modeling? It would be a great help if I could learn the idea behind it.

zari-sudo commented 3 years ago

The dataset I have been using is the already available 20 Newsgroups dataset.