dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Topic coherence #241

Closed manuelbickel closed 6 years ago

manuelbickel commented 6 years ago

With reference to our discussion in #229, I am sending this pull request to discuss/review my initial approach for calculating topic coherence scores. I hope I have opened the request correctly; I have never done this before. Let me know if I have to change anything.

Since I am not a professional programmer, kindly excuse any unusual parts of the code. I see the code as a starting point, which can certainly be improved in many directions (some of the comments will also have to be removed in later versions). Hence, please do not hesitate to criticize/suggest/change...

Please let me know if any information required for understanding is missing and I will try to add it, or if I should prepare any interim results (e.g., with small demonstrative sample data) to demonstrate the output of individual parts of the code, so that it is easier for you to follow and I spare your time. As already mentioned in #229, some initial examples are provided here.

codecov[bot] commented 6 years ago

Codecov Report

Merging #241 into master will decrease coverage by 3.1%. The diff coverage is 9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #241      +/-   ##
==========================================
- Coverage   79.91%   76.81%   -3.11%     
==========================================
  Files          38       39       +1     
  Lines        2330     2454     +124     
==========================================
+ Hits         1862     1885      +23     
- Misses        468      569     +101
Impacted Files Coverage Δ
R/coherence.R 0% <0%> (ø)
R/utils.R 74.28% <100%> (+10.28%) :arrow_up:
src/Vocabulary.cpp 75% <0%> (ø) :arrow_up:
R/tcm.R 91.01% <0%> (ø) :arrow_up:
R/vectorizers.R 93.87% <0%> (ø) :arrow_up:
R/vocabulary.R 88.96% <0%> (+0.07%) :arrow_up:
R/utils_matrix.R 92.72% <0%> (+0.13%) :arrow_up:
src/VocabCorpus.cpp 92.45% <0%> (+0.29%) :arrow_up:
src/Vocabulary.h 88.88% <0%> (+1.19%) :arrow_up:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 8742b36...fa24161.

manuelbickel commented 6 years ago

Seems like I have done something wrong, sorry... I have never used Codecov; I will try to find out where I went wrong...

dselivanov commented 6 years ago

Don't worry, we will sort everything out. I'm traveling at the moment. Will be back on Saturday.

manuelbickel commented 6 years ago

Alright, thanks. Have a safe and enjoyable trip.

manuelbickel commented 6 years ago

Thanks for reviewing and for your excellent comments; I hope it did not cost you too much time. I will try to update the code accordingly next week and hope that I can implement everything as suggested. I will also prepare a movie_review example.

I did not fully understand the Codecov checks, some of which were reported as unsuccessful. Do I have to change anything in this regard?

dselivanov commented 6 years ago

It just reports that the percentage of code covered by tests has decreased. We need to add tests that cover the new function.

manuelbickel commented 6 years ago

Not yet done with all changes, but to guide further improvements I wanted to note already that the paper by Röder mentioned in #229 has also been implemented in Python, which might be easier to understand/transfer than the Java implementation by the paper's authors; see https://radimrehurek.com/gensim/models/coherencemodel.html and, for the code, http://pydoc.net/gensim/3.2.0/gensim.topic_coherence/.

manuelbickel commented 6 years ago

Added code to ensure symmetry of the tcm (after reduction to the top-word space). An input tcm might have the format shown below (the diagonal is irrelevant for the example). In such cases the information of both triangles needs to be combined to generate a symmetric matrix (or at least one complete triangle). As always, the code might be improved, but at least this aspect is on the list now.

9 1 0
0 9 0
2 3 9
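
A minimal sketch of such a combination step using the Matrix package (whether this matches the committed code is an assumption on my part):

library(Matrix)
# toy tcm: the counts of each pair are split across the two triangles
m = Matrix(c(9, 1, 0,
             0, 9, 0,
             2, 3, 9), nrow = 3, byrow = TRUE, sparse = TRUE)
# adding the transpose merges both triangles; the diagonal is doubled, so halve it
m_sym = m + t(m)
diag(m_sym) = diag(m_sym) / 2
# m_sym now holds 9 1 2 / 1 9 3 / 2 3 9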

Via 052b196 (might be moved to utils...?)

manuelbickel commented 6 years ago

TODO

Implement the most interesting coherence measures identified in the paper by Röder et al. (see palmetto_coherences). This requires:

manuelbickel commented 6 years ago

@dselivanov Quick question that would help me proceed with coherence: does the current text2vec functionality somehow allow counting boolean co-occurrence of terms per skip-gram window? I could not figure out how this could be done (some details about this issue are included in the TODO list above). Thanks in advance!

dselivanov commented 6 years ago

@manuelbickel I can't fully understand what a boolean co-occurrence matrix is. Could you give an example of how to calculate it?

manuelbickel commented 6 years ago

@dselivanov Sorry for not being clear. Some of the coherence measures use this kind of co-occurrence counting, which is why I am asking. Here is an example: during create_tcm with skip_grams_window = 5, one of the windows would be "one term and term again". The resulting co-occurrence c for "one" and "term" would be c = 2.

What is used in coherence measures is something like sign(c); hence, c for term pairs should only be 1 or 0 per window.

I know I can control the weights in create_tcm, but if I understand it correctly, these cover the distance between terms and cannot be used for controlling the actual count.

I hope I have explained this clearly enough.
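
To make the intended counting concrete, a tiny plain-R illustration (window taken from the example above):

window = c("one", "term", "and", "term", "again")
# "one" co-occurs with each of the two occurrences of "term" -> c = 2
c_raw = sum(window == "one") * sum(window == "term")
# boolean co-occurrence: the pair either appears in the window or it does not
c_bool = as.integer(any(window == "one") && any(window == "term"))
c_raw   # 2
c_bool  # 1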

dselivanov commented 6 years ago

So the difference is that it should only take into account the first co-occurrence? If so, I think I can add such a feature to the C++ code.

manuelbickel commented 6 years ago

Yes, I think that should be it: only the first co-occurrence of each term pair per window. I could more or less follow the logic of your code in VocabCorpus.cpp, but my knowledge of C++ is only rudimentary and I think I would have ended up in a mess trying it myself. Hence, thank you for considering such a feature - three remarks in this context:

Please let me know if I can support this process in any way, even if it is only updating documentation...

dselivanov commented 6 years ago

Ok, I will address 1 and 3. Regarding 2 - the tcm contains the diagonal: we count co-occurrences of the context word with the "main" word (the comments in the source code are a bit misleading).

manuelbickel commented 6 years ago

Nice, thank you! I will go on and finalize a first implementation of coherence on this basis, including examples, tests, etc., and will let you know as soon as it is finished.

Side note: I have tried a vectorized approach for coherence using triangles but have not yet succeeded in reproducing the results, e.g., of the stm package (UMass measure). Maybe this can be approached as soon as the first implementation is ready...

manuelbickel commented 6 years ago

...forgot, sorry for misunderstanding your code and the comments concerning diagonal of tcm...

manuelbickel commented 6 years ago

Hm, I think I am doing/understanding something wrong regarding the diagonal of the tcm. In the mini example below, the diagonal is filled with 0's, whereas I would expect it to be filled with 1's. Could you give me a hint where I went wrong? Thanks in advance.

x <- c("some example text")
tokens = word_tokenizer(x)
it = itoken(tokens)
v = create_vocabulary(it)
create_tcm(it, vocab_vectorizer(v), skip_grams_window_context = "symmetric"
           , weights = rep(1, 3)
           , skip_grams_window = 3L) 

#          example some text
# example       .    1    1
# some          .    .    1
# text          .    .    .
dselivanov commented 6 years ago

Consider another example with text = "some example text some":

library(text2vec)
x <- c("some example text some")
tokens = word_tokenizer(x)
it = itoken(tokens)
v = create_vocabulary(it)
tcm = create_tcm(it, vocab_vectorizer(v), skip_grams_window_context = "symmetric"
           , weights = rep(1, 3)
           , skip_grams_window = 3L) 
tcm
        text example some
text       .       1    2
example    .       .    2
some       .       .    1

Also note tcm word_count attribute:

attr(tcm, "word_count")
# 1 1 2
manuelbickel commented 6 years ago

...ouch, sorry, that was an unnecessary question - I had overlooked the attributes. I should use str more often to get an overview. Thank you for the clarification.

dselivanov commented 6 years ago

@manuelbickel I was thinking about the implementation of the "sliding window vocabulary". It seems it won't be very efficient to create it on top of create_tcm. It makes sense to adapt create_vocabulary... (it will take some time to do that).

manuelbickel commented 6 years ago

@dselivanov Thank you for updating me. It would be interesting to understand the depths of text2vec and which ways are efficient (or less so), but that will also take some time :-). Hence, unfortunately, I am currently not of much help in this process, but as I already said, I will provide support where possible and implement the coherence measures as soon as we can use a sliding window in text2vec.

Off-topic: What kind of articles are of potential interest to be published in the context of text2vec? Only articles with a methodological focus, or also disciplinary applications? I am asking because I used text2vec for a simple heatmap network analysis. It included a lot of manual coding, hence it is not really machine-learning-like, but rather a basic introduction of the idea of text mining in the context of energy systems, see here. At least text2vec is cited therein. I am not sure whether this kind of content would be positive or negative advertisement from your perspective, but if you are interested, we could integrate the article (I could write an additional summary or so). In any case, the next one might be of higher interest, since it is a pure machine learning approach... I will let you know as soon as it is published (it will take some time still).

My intention with this question is not self-marketing, since others like you can certainly do better methodology-wise, but rather promoting the use of text2vec. Hence, maybe I can contribute to a certain extent in this direction.

dselivanov commented 6 years ago

@manuelbickel please check the following code from the latest commit:

library(text2vec)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer)
v = create_vocabulary(it, window_size = 50)

Is this what was needed (according to http://pydoc.net/gensim/3.2.0/gensim.topic_coherence.probability_estimation/ it should be)?

Or do we need co-occurrence and I haven't fully understood the requirement?

manuelbickel commented 6 years ago

Unfortunately, we do need co-occurrence. The basic idea is to take some top topic words and then calculate the coherence (with different options for formulas, e.g., PMI), which is based on the probability and the joint probability of terms within a specific window. Some measures take whole documents as the window, but some perform better when using, e.g., a window of 10, which would be the range of about one sentence. Some intrinsic measures (e.g., UMass) use the original corpus to calculate probabilities, while some extrinsic measures (e.g., PMI) perform better when using an external corpus such as Wikipedia. Check, e.g., page 4, section 3.2 (maybe also 3.3), in the associated publication - the section is not a long read.
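
For reference, the standard definitions behind these measures (probabilities estimated from window co-occurrence counts; the smoothing term eps and its exact placement vary between implementations):

PMI(wi, wj)   = log( (P(wi, wj) + eps) / (P(wi) * P(wj)) )
NPMI(wi, wj)  = PMI(wi, wj) / -log( P(wi, wj) + eps )
UMass(wi, wj) = log( (D(wi, wj) + 1) / D(wj) ),  with D = document-level co-occurrence counts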

Your new commit already introduces a nice feature that allows calculating, e.g., sentence counts - thank you for this! Something like the code below would already be fine for calculating coherence; however, create_dtm accesses the original documents via the iterator and hence returns the documents, whereas we would have to iterate over the virtual window docs (2 in the example). The issue here is probably that storing all virtual docs might be too memory intensive. I am not sure how complicated it would be for you to implement this.

I hope my explanations are good enough so that you can follow them easily. Please let me know if there are any open questions.

text = c("this is a text")
it = itoken(text)
v = create_vocabulary(it, window_size = 3)
# document-term matrix; here we would need the virtual windows as documents (2 in this example)
dwtm = create_dtm(it, vocab_vectorizer(v))
# 1 x 4 sparse Matrix of class "dgCMatrix"
#   text this a is
# 1    1    1 1  1
# term-window co-occurrence matrix (boolean co-occurrence via sign)
twcm = Matrix::crossprod(sign(dwtm))
# total term counts over all windows (-> marginal probabilities)
diag(twcm) = v$term_count
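
If dwtm contained the virtual windows as rows, a PMI-style coherence for a topic's top words could then be derived from twcm roughly as sketched below (top_words, the eps smoothing, and the averaging over pairs are illustrative assumptions, not a fixed design):

top_words = c("text", "this")       # hypothetical top words of one topic
n_windows = nrow(dwtm)              # would be the number of virtual window-documents
tw = as.matrix(twcm[top_words, top_words])
p_joint = tw / n_windows            # joint probabilities of top-word pairs
p_marg  = diag(tw) / n_windows      # marginal probabilities (diagonal holds term counts)
eps = 1e-12                         # avoids log(0) for pairs that never co-occur
pmi = log((p_joint + eps) / outer(p_marg, p_marg))
# coherence of the topic as the mean PMI over distinct top-word pairs
mean(pmi[upper.tri(pmi)])
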
dselivanov commented 6 years ago

ok, thanks. Now I understand. Will try to implement this week.

dselivanov commented 6 years ago

@manuelbickel sorry, won't be able to work on this till next weekend (will be on a trip till March 16).

manuelbickel commented 6 years ago

Thanks for letting me know. Sorry that I am not yet able to implement this myself... wish you a good journey!

manuelbickel commented 6 years ago

I just wanted to share some (hopefully) motivating interim results after your implementation of binary co-occurrence (if you are short on time and only want to read about a tested and working version, please stop reading here and wait for the next update).

Two of the coherence measures seem to produce halfway acceptable results for my dataset, which consists of about 30,000 scientific abstracts, with about 1,700 thematically similar Wikipedia articles serving as the reference corpus for coherence calculation via a tcm with binary_cooccurrence and skip_grams_window = 10. I have made an initial plot, which may be accessed here.

The best measure for selecting a suitable number of topics (n_topics) currently seems to be the one proposed in the textmineR package (prob_diff) (this would not necessarily need binary co-occurrence, but I will have to experiment with the size of the sliding window). The second useful one is cosim_npmi, i.e., the NPMI of each top word with all other top words per topic, and then the mean cosine similarity between these vectors and the sum of these vectors - i.e., the C_V coherence described for palmetto. I am not sure whether I made a mistake with the implementation of PMI, since it seems worthless in my case, but at least the vectorized approach to calculating PMI seems to work. Apart from the LDA model with WarpLDA, I have also set up an LDA model with topicmodels (not exactly the same data, but sufficiently similar; it takes too long with topicmodels to run all the models with the adaptations necessary for perfect comparability), which had a peak of the log-likelihood at around 250 topics (not shown in the picture). This is very roughly in the range of the optimum number of topics proposed by the coherence measures. Therefore, I currently judge the results of the coherence measures as halfway usable.

Just for interim documentation, I have committed the latest working version via 1ba570b, but I will have to make additional checks on the correctness of the implementation, implement one additional measure based on context vectors, and work on the efficiency of the calculation. I wanted to let you know already that coherence seems to be moving towards a halfway usable version on the basis of your support. Thank you so far!

dselivanov commented 6 years ago

@manuelbickel great news, great work! I really appreciate your contributions - they will allow experimenting with different approaches to topic modeling. I have in mind non-negative matrix factorization from the reco package. Also, ISLE looks very promising (we have almost everything needed to implement a highly efficient ISLE within the R environment).

manuelbickel commented 6 years ago

Sorry for the long off-time - I was moving house. I am more or less done with the first round of coherence measures (the code still needs to be cleaned a bit, better names might have to be assigned, the example needs to be improved, etc.).

However, I am struggling to implement a vectorized version of NPMI. My implementation of PMI works and produces the same results as an mapply loop. In the latest commit fa24161 I have started with a general example and visualized the behaviour of different measures via lines 19-70 (of course, you need to load the coherence function). There is something wrong with the normalization step to calculate NPMI; maybe you have an idea where I am going wrong. The non-vectorized and vectorized versions can be found in lines 380 to 418. The working version of the vectorized PMI can be found in lines 361-377; it was used as a blueprint for NPMI.

If we cannot identify the error quickly, I might pose a question on SO...
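
For reference, the elementwise definition I am trying to vectorize, written out in R (names hypothetical; a typical pitfall here is normalizing by -log of the marginal product instead of -log of the joint probability):

# P: matrix of joint probabilities, p: vector of marginal probabilities
npmi_vectorized = function(P, p, eps = 1e-12) {
  pmi = log((P + eps) / outer(p, p))
  # normalize each entry by -log of its *own* joint probability
  pmi / -log(P + eps)
}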

dselivanov commented 6 years ago

@manuelbickel thanks for the update. I will try to take a look this weekend.

dselivanov commented 6 years ago

Hi @manuelbickel. I think that in order to simplify the PR review we can start with some simple coherence measures we are confident in. This way we can incrementally improve the coherence calculation.

I've created a simple template - around 80 lines of code which calculates 2 metrics. I believe the calculation itself is wrong (I tried to adapt your code but dropped the reordering for simplicity). Could you please check the template and send another PR with the calculation for the vectorized metrics you are confident in (with roxygen docs and comments - they look very good!)?

So I can merge it and then we will keep adding more.

manuelbickel commented 6 years ago

@dselivanov Thank you for your template, which is certainly more elegant than my initial approach (sorry that you had to write so many lines). I will use the template as suggested and set up a simpler PR that we may use as a start, so that this one here can be closed.