gniewus opened this issue 3 years ago (status: Open)
Hi @gniewus, as so often in machine learning, the answer is not known until you have tested it.
1. There is no clear answer; you must test all three options.
2. Sadly, I'm not sure. I used PCA, and it worked quite well.
3. Yes, continuing training from a previously trained model helps. Finding news articles from different outlets covering the same story sounds reasonable as training data.
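The PCA step mentioned above can be sketched as follows. This is a minimal example with random stand-in data; `n_components=50` is an assumption you would tune on your own corpus.

```python
# Sketch: reducing sentence-embedding dimensionality with PCA before clustering.
# `embeddings` stands in for an (n_docs, 768) array from a sentence-transformers model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # stand-in for real document embeddings

pca = PCA(n_components=50)  # 50 is just a starting point; tune on your data
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (1000, 50)
```

The reduced array can then be fed to a density-based clusterer such as HDBSCAN.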
Currently, I'm using "T-systems/bert-german-dbmdz-uncased-sentence-stsb" to generate document embedding for every article
We have since released a much improved model: https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer. I suggest that you use that model.
@gniewus here is an interesting discussion about clustering, where `sklearn_extra`'s KMedoids is mentioned. k-medoids is, roughly, the median-style counterpart of k-means centroids: each cluster centre is an actual data point (a medoid) rather than a mean.
See here: https://github.com/UKPLab/sentence-transformers/issues/320#issuecomment-669823924
What do you think about:
Hi @petulla, could you please point me to some references regarding the "learned meta-embedding in the fine-tuning task"? It seems like a very interesting concept, and I think it would help me in my research. Does SBERT provide such a meta-embedding?
PS: I really appreciate all your efforts in implementing such a helpful framework.
Hello,
in my Master's thesis I am aiming to use BERT for topic modeling / document clustering. As a dataset, I'm using a large corpus of over 100k news articles from a German newspaper (short headline + kicker + article text). The goal is to cluster the articles and later extract topics from the clusters. One extra challenge is to do this dynamically, so that not only the old historic articles are clustered but also freshly scraped new ones.
Currently, I'm using "T-systems/bert-german-dbmdz-uncased-sentence-stsb" to generate a document embedding for every article (usually short texts, about 250 words and 16 sentences on average). Combined with HDBSCAN this performs quite well; however, I wonder how to improve it. I'm facing a few questions and I'd appreciate any feedback or guidance :)
- What would be the best way to embed a longer text? Should I, e.g., summarize the article first? Or average three embeddings, for title, kicker, and article text?
- Because I am using density-based clustering (HDBSCAN), I was wondering what the best way is to reduce the dimensionality of the embeddings for clustering. Should I use UMAP, or rather restrict the number of feature dimensions in the pooling layer?
- I was wondering how to approach fine-tuning for my dataset.
- I wanted to keep training the model already fine-tuned on German STSb, instead of fine-tuning from scratch. Right?
- I thought about using DeepL or AWS to translate the SemEval headlines, and about manually paraphrasing some of the headlines from my own dataset, to generate a fine-tuning dataset. As my input is currently the whole article text, I wonder if that's the correct move. Shouldn't I instead find hundreds of pairs of very similar news articles from, e.g., different media outlets?
PS: Thanks for the great implementation of SBERT and all your work!
Hi Tomasz Tkaczyk, I would like to ask whether you have finished your implementation. Could you please help me in this regard? I am working on almost the same project, with the extra challenge of hierarchical clustering for topic modeling, and it would be a great help to learn the idea behind your approach.
The dataset I have been using is the publicly available 20 Newsgroups dataset.
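For the 20 Newsgroups setup with hierarchical clustering, a minimal sketch could look like this. TF-IDF is used here only as a stand-in featurizer so the example stays self-contained; for the BERT-based variant discussed in this thread you would swap in sentence-transformer embeddings. `fetch_20newsgroups` downloads the corpus on first use, and the feature and cluster counts are assumptions.

```python
# Sketch: hierarchical (agglomerative) clustering on the 20 Newsgroups corpus.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# A small slice keeps the example fast; drop the slice for the full corpus.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:500]

# Stand-in featurizer; replace with SentenceTransformer embeddings for the BERT version.
X = TfidfVectorizer(max_features=2000, stop_words="english").fit_transform(docs).toarray()

# Ward-linkage agglomerative clustering provides the hierarchical structure.
labels = AgglomerativeClustering(n_clusters=20, linkage="ward").fit_predict(X)
print(labels.shape)  # (500,)
```

The dendrogram implied by the agglomerative merges is what gives you the topic hierarchy; `scipy.cluster.hierarchy` can be used to inspect or cut it at different depths.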