bnosac / doc2vec

Distributed Representations of Sentences and Documents

test #10

Open · jwijffels opened this issue 3 years ago

jwijffels commented 3 years ago

@pprablanc would you be interested in testing out this package, which provides document vectors?

pprablanc commented 3 years ago

Sure, I can test this package. Are there other implementations of PV-DM/PV-DBOW you'd like to compare with yours?

jwijffels commented 3 years ago

I don't think there are any others in R, maybe just gensim. But the main test would be comparing against your other examples where you added an SVM/NB on top of a set of embeddings (I saw you did averaged embeddings, SIF, and weighting by tf-idf or BM25) to classify something. I still sometimes get crashes due to a C stack overflow, but I'm working on finding the cause. Feel free to put comments here.
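For reference, that kind of baseline could look roughly like this (a minimal, self-contained sketch with random stand-in embeddings, not pprablanc's actual test code):

## tf-idf-weighted average of word embeddings as a document vector.
## The embeddings are random placeholders; substitute real word vectors.
docs  <- list(d1 = c("good", "service"), d2 = c("bad", "service"))
vocab <- unique(unlist(docs))
emb   <- matrix(rnorm(length(vocab) * 5), nrow = length(vocab),
                dimnames = list(vocab, NULL))
idf   <- log(length(docs) / sapply(vocab, function(w)
           sum(vapply(docs, function(d) w %in% d, logical(1)))))
doc_vector <- function(tokens) {
  tf <- table(tokens) / length(tokens)
  w  <- as.numeric(tf) * idf[names(tf)]       # tf-idf weight per word
  colSums(emb[names(tf), , drop = FALSE] * w) / sum(w)
}
X <- t(sapply(docs, doc_vector))              # feature matrix for an SVM/NB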

jwijffels commented 3 years ago

I've pushed the package to CRAN today. Maybe you are interested as well in building this https://github.com/ddangelov/Top2Vec by

* Tokenising text using sentencepiece or tokenizers.bpe
* Embedding this tokenised text using doc2vec
* Clustering the resulting embeddings with uwot and dbscan
* Weighting topics a bit with traditional tf-idf
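The tokenisation step of that pipeline might look like this (a rough sketch with tokenizers.bpe; the coverage and vocab_size values are illustrative and the corpus path is a placeholder):

## Train a byte-pair-encoding model on the raw texts, then tokenise them.
library(tokenizers.bpe)
txt       <- readLines("corpus.txt")   # your raw documents, one per line
bpe_model <- bpe(txt, coverage = 0.999, vocab_size = 5000)
subwords  <- bpe_encode(bpe_model, x = txt, type = "subwords")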

michalovadek commented 3 years ago

> I've pushed the package to CRAN today. Maybe you are interested as well in building this https://github.com/ddangelov/Top2Vec by
>
> * Tokenising text using sentencepiece or tokenizers.bpe
> * Embedding this tokenised text using doc2vec
> * Clustering the resulting embeddings with uwot and dbscan
> * Weighting topics a bit with traditional tf-idf

Great idea. I gave it a very quick and rough go last night and all the pieces seem to be more or less in place:

library(doc2vec)   # paragraph2vec(), paragraph2vec_similarity()
library(dplyr)
library(tibble)

## Train PV-DBOW embeddings; x is a data.frame with columns doc_id and text
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 300, iter = 40, hs = TRUE, window = 15,
                       negative = 0, sample = 0.00001, min_count = 50, lr = 0.05, threads = 4)

embeddings_docs  <- as.matrix(model, which = "docs")
embeddings_words <- as.matrix(model, which = "words")

## Reduce the document embeddings to 5 dimensions before clustering
docs_umap <- uwot::umap(embeddings_docs, n_neighbors = 15, n_components = 5, metric = "cosine")

cl <- dbscan::hdbscan(docs_umap, minPts = 15)

## Create topic vectors as centroids of the document vectors in each cluster
centroids <- cbind(embeddings_docs, cl$cluster) %>%
  as_tibble(rownames = "id") %>%
  rename(cluster = `V301`) %>%              # column 301 holds the cluster id
  mutate(cluster = as.character(cluster)) %>%
  group_by(cluster) %>%
  summarise_if(is.numeric, mean)

## Label a topic with its most similar words based on cosine similarity of
## the topic vector and the word vectors (iterate k over all topics)
k <- 30
topic <- centroids[k, ] %>%
  select(-cluster) %>%
  as.numeric() %>%
  matrix(ncol = 300, nrow = 1)
rownames(topic) <- deframe(centroids[k, 1])

paragraph2vec_similarity(y = embeddings_words, x = topic, top_n = 10)

Excuse the messy code, I wrote it up in a rush with dplyr. I just wanted to support the notion that top2vec in R is well within reach. The code is also reasonably fast, on par with the Python implementation.

jwijffels commented 3 years ago

Great. Thanks for testing it out.

jwijffels commented 3 years ago

@michalovadek are you planning to create an R package implementing top2vec?

michalovadek commented 3 years ago

Yes, I will start pushing some very initial code to this repo (hopefully soon): https://github.com/michalovadek/top2vecr

I am still thinking about the API, namely how to give the user control over the various parameters (both of your functions as well as those of umap and hdbscan) without completely overwhelming them.

I will probably also start with some inefficient code and only optimize it later, depending on the time available.
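One way to keep the top-level signature small while still exposing everything could be pass-through control lists, roughly like this (a hypothetical sketch, not the actual top2vecr API; names and defaults are illustrative only):

## Hypothetical API sketch: each control.* list is passed through unchanged
## to the underlying function, so every parameter stays reachable.
top2vecr <- function(x,
                     control.doc2vec = list(type = "PV-DBOW", dim = 300, iter = 40),
                     control.umap    = list(n_neighbors = 15, n_components = 5, metric = "cosine"),
                     control.dbscan  = list(minPts = 15)) {
  model      <- do.call(doc2vec::paragraph2vec, c(list(x = x), control.doc2vec))
  embeddings <- as.matrix(model, which = "docs")
  reduced    <- do.call(uwot::umap, c(list(X = embeddings), control.umap))
  clusters   <- do.call(dbscan::hdbscan, c(list(x = reduced), control.dbscan))
  list(model = model, embedding = reduced, clustering = clusters)
}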

jwijffels commented 3 years ago

Ah great. I'll follow your repository.

michalovadek commented 3 years ago

I pushed a very early implementation to the repo mentioned. I haven't had as much time to work on this as I would like, but we will see in the future. Test it out if you can.

I think all the various components of the main top2vecr function (doc2vec, umap, hdbscan, centroids/medoids, similarity) should be compartmentalized as separate functions, but I am not yet sure whether it makes sense to expose them to the user as well. The main function should in the future also return more data about how the topics were obtained. Any suggestions are welcome.

It should be possible to further apply hierarchical clustering to the default hdbscan output, so that the user can fix the number of topics K to be returned instead of the "optimal" K that hdbscan finds (with default presets); see the sketch below. This would then resemble other topic modelling techniques like LDA, where K needs to be chosen upfront.
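That merging step could look roughly like this (a hypothetical sketch reusing the `centroids` object from the snippet earlier in this thread; K = 20 is illustrative):

## Merge the hdbscan topics down to a fixed K by hierarchically clustering
## the topic centroids; cluster "0" is hdbscan noise, so it is dropped.
K    <- 20                               # must not exceed the number of topics
keep <- centroids$cluster != "0"
mat  <- as.matrix(centroids[keep, -1])   # drop the cluster id column
rownames(mat) <- centroids$cluster[keep]
hc     <- hclust(dist(mat), method = "ward.D2")
merged <- cutree(hc, k = K)              # maps each original topic to one of K merged topics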

jwijffels commented 3 years ago

Thanks for sharing. I will address remarks at the repository instead of here.

pprablanc commented 3 years ago

> I don't think there are any others in R, maybe just gensim. But the main test would be comparing against your other examples where you added an SVM/NB on top of a set of embeddings (I saw you did averaged embeddings, SIF, and weighting by tf-idf or BM25) to classify something. I still sometimes get crashes due to a C stack overflow, but I'm working on finding the cause. Feel free to put comments here.

You can find a little test comparing your PV-DM and PV-DBOW implementations with average embeddings on a classification task here: https://github.com/pprablanc/test_doc2vec. I didn't have any crashes. PV-DBOW works well, but there's something odd with PV-DM: the results are pretty low. I don't think there should be a great difference between PV-DM and PV-DBOW? What do you think?

jwijffels commented 3 years ago

Interesting dataset. So a graph dataset where the nodes have text alongside them, in order to eventually classify the nodes. That made me wonder if I could use this alongside the R package https://github.com/jwijffels/deepwalker to get graph embeddings as well. But that's another story.

jwijffels commented 3 years ago

@michalovadek I've been testing out your implementation of top2vecr and it gives really nice results (as in semantically coherent topics). The only issue I've encountered is that when calling hdbscan, it uses dist on the result of umap, and that fails for larger data, presumably an integer overflow: 50,000 points imply 50000 * 49999 / 2, roughly 1.25 billion pairwise distances, and the intermediate product already exceeds the 32-bit integer range, hence the "negative length vectors" error below.

> cl <- dbscan::hdbscan(head(docs_umap, 50000), minPts = 15L)
Error in dist(x, method = "euclidean") : 
  negative length vectors are not allowed
> cl <- dbscan::hdbscan(head(docs_umap, 10000), minPts = 15L)
> str(cl$cluster)
 num [1:10000] 96 50 0 0 0 56 74 0 0 0 ...

michalovadek commented 3 years ago

Thanks for testing, this is a pretty important limitation, as I imagine in many situations the embeddings really benefit from large corpora. Let's see whether the hdbscan maintainers can shed some light on this. I will consider an alternative clustering method in the meanwhile, but I doubt we would achieve the same quality with another method.
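One candidate that avoids the full distance matrix is plain dbscan::dbscan(), which clusters via a kd-tree instead of dist() (a rough sketch; the eps value is illustrative and would need tuning per corpus):

## dbscan() never materialises the n x n distance matrix, so it scales past
## the dist() limit; eps is usually read off the kNN-distance "elbow".
dbscan::kNNdistplot(docs_umap, k = 15)
cl_alt <- dbscan::dbscan(docs_umap, eps = 0.5, minPts = 15)
table(cl_alt$cluster)   # cluster 0 is noise, as with hdbscan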

jwijffels commented 3 years ago

Yes, that's exactly what I thought as well. There is also currently no predict.hdbscan (https://github.com/mhahsler/dbscan/issues/32) if we want to be able to assign new documents to topics. Hopefully the dbscan authors can help. Note there is a predict.hdbscan in this pull request: https://github.com/mhahsler/dbscan/pull/33/files
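In the meantime, one workaround could be to bypass hdbscan for new documents entirely: infer their embeddings with the trained paragraph2vec model and assign each to the most similar topic centroid. A hypothetical sketch reusing `model` and `centroids` from earlier in this thread; it assumes predict() accepts a list of tokenised sentences for type = "embedding", as in the doc2vec README:

## Embed a new (tokenised) document and look up its closest topic centroid.
newdoc  <- list(new_1 = c("words", "of", "the", "new", "document"))
emb_new <- predict(model, newdata = newdoc, type = "embedding", which = "docs")
topic_mat <- as.matrix(centroids[, -1])       # one row per topic centroid
rownames(topic_mat) <- centroids$cluster
doc2vec::paragraph2vec_similarity(x = emb_new, y = topic_mat, top_n = 1)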