Closed manuelbickel closed 6 years ago
Merging #241 into master will decrease coverage by 3.1%. The diff coverage is 9%.
@@ Coverage Diff @@
## master #241 +/- ##
==========================================
- Coverage 79.91% 76.81% -3.11%
==========================================
Files 38 39 +1
Lines 2330 2454 +124
==========================================
+ Hits 1862 1885 +23
- Misses 468 569 +101
Impacted Files | Coverage Δ | |
---|---|---|
R/coherence.R | 0% <0%> (ø) | |
R/utils.R | 74.28% <100%> (+10.28%) | :arrow_up: |
src/Vocabulary.cpp | 75% <0%> (ø) | :arrow_up: |
R/tcm.R | 91.01% <0%> (ø) | :arrow_up: |
R/vectorizers.R | 93.87% <0%> (ø) | :arrow_up: |
R/vocabulary.R | 88.96% <0%> (+0.07%) | :arrow_up: |
R/utils_matrix.R | 92.72% <0%> (+0.13%) | :arrow_up: |
src/VocabCorpus.cpp | 92.45% <0%> (+0.29%) | :arrow_up: |
src/Vocabulary.h | 88.88% <0%> (+1.19%) | :arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8742b36...fa24161. Read the comment docs.
Seems like I have done something wrong, sorry...have never used Codecov, will try to find out where I was going wrong...
Don't worry, we will sort everything out. I'm traveling at the moment. Will be back on Saturday.
Alright, thanks. Have a safe and enjoyable trip.
Thanks for reviewing and for your excellent comments; I hope it did not cost you too much time. I will try to update the code accordingly next week and hope that I can implement everything as suggested. I will also prepare a movie_review example.
I did not fully understand the Codecov checks, some of which were reported as unsuccessful. Do I have to change anything in this regard?
It just reports that the percentage of code covered by tests was reduced. We need to add tests which cover the new function.
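For illustration, a minimal test sketch (assuming the testthat setup used in text2vec; the coherence() argument names, in particular n_docs_tcm, are placeholders taken from the discussion below, not a final API):

library(testthat)
library(Matrix)

test_that("coherence returns a finite score per topic", {
  # two toy topics, three top words each (columns = topics)
  top_words = matrix(c("a", "b", "c",
                       "b", "c", "d"), ncol = 2)
  # toy symmetric co-occurrence counts over 20 sliding windows
  tcm = Matrix(c(10, 4, 3, 1,
                  4, 8, 2, 1,
                  3, 2, 6, 2,
                  1, 1, 2, 5), nrow = 4, sparse = TRUE,
               dimnames = list(letters[1:4], letters[1:4]))
  scores = coherence(top_words, tcm, n_docs_tcm = 20)
  expect_true(all(is.finite(scores)))
})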
I am not yet done with all changes, but I wanted to note already, for approaching further improvements, that the paper by Röder mentioned in #229 was also implemented in Python, which might be easier to understand/transfer than the Java implementation of the paper's authors; see https://radimrehurek.com/gensim/models/coherencemodel.html and, for the code, http://pydoc.net/gensim/3.2.0/gensim.topic_coherence/.
Added code to ensure symmetry of the tcm (after reduction to the top word space). The input tcm might have the format shown below (the diagonal is irrelevant for the example). In such cases the information of both triangles needs to be combined to generate a symmetric matrix (or at least one complete half of it). As always, the code might be improved, but at least this aspect is on the list now.
9 1 0
0 9 0
2 3 9
Via 052b196 (might be moved to utils...?)
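The idea in a minimal sketch (not the exact code from 052b196): combine both triangles into one complete half and mirror it.

library(Matrix)

# toy tcm whose co-occurrence information is split across both triangles
# (diagonal irrelevant for the example)
tcm = Matrix(c(9, 1, 0,
               0, 9, 0,
               2, 3, 9), nrow = 3, byrow = TRUE, sparse = TRUE)

# merge both triangles into one complete upper triangle, then mirror it
upper = triu(tcm, k = 1) + t(tril(tcm, k = -1))
tcm_sym = upper + t(upper)
diag(tcm_sym) = diag(tcm)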
TODO
n_windows_tcm? - we do have to count the number of windows (compare gensim's prob_estimation with skip_grams_window = 3L).
# number of skip gram windows would be used as n_docs_tcm in coherence()
get_n_skip_gram_windows <- function(tokens, skip_grams_window) {
  sum(sapply(tokens, function(x) ceiling(length(x) / skip_grams_window)))
}
tcm = create_tcm(itoken(tokens), vectorizer,
                 skip_grams_window = skip_grams_window,
                 weights = rep(1, skip_grams_window)) # equal weights...but not boolean...
I have not learnt C and Rcpp sufficiently yet, but as far as I understand your code, we might introduce an option in create_tcm to use boolean co-occurrence and add something like the following here in VocabCorpus.cpp:
if (boolean_cooccurrence == TRUE && tcm[term_index, context_term_index] == 1) {
  // do nothing
} else {
  // ...existing code
}
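On the R side the call might then look roughly like this (binary_cooccurrence is only a working name for the proposed option, not an existing create_tcm argument):

tcm = create_tcm(itoken(tokens), vectorizer,
                 skip_grams_window = 10L,
                 weights = rep(1, 10L),       # equal weights
                 binary_cooccurrence = TRUE)  # proposed option: count each pair at most once per window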
Implement the most interesting coherence measures as identified in the paper by Röder et al. (see palmetto_coherences). This requires:
@dselivanov A quick question which would help me to proceed with coherence: does the current text2vec functionality somehow allow counting boolean co-occurrence of terms per skip gram window? I could not figure out how this could be done (some details about this issue are included in the TODO list above). Thanks in advance!
@manuelbickel I can't fully understand what a boolean co-occurrence matrix is. Could you give an example of how to calculate it?
@dselivanov Sorry for not being clear. Some of the coherence measures use this kind of co-occurrence counting, which is why I am asking. Here is an example:
During create_tcm with skip_grams_window = 5, one of the windows would be "one term and term again". The resulting co-occurrence c for "one" and "term" would be c = 2.
What is used in coherence measures is something like sign(c); hence, c for term pairs should only be 1 or 0 per window.
I know I can control the weights in create_tcm, but if I understand it correctly, this covers the distance between terms and cannot be used for controlling the actual count.
I hope I have explained this clearly enough.
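A toy illustration of the difference (plain R, not text2vec code):

# single window "one term and term again"
window = c("one", "term", "and", "term", "again")

# standard counting: every occurrence of "term" around "one" contributes
c_standard = sum(window == "one") * sum(window == "term")                # 2

# boolean counting: sign(c), i.e. at most 1 per window
c_boolean = as.integer(any(window == "one") && any(window == "term"))    # 1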
So the difference is that it should only take into account the first co-occurrence? If so, I think I can add such a feature to the C++ code.
Yes, I think this should be it: only the first co-occurrence of each term pair per window. I could more or less follow the logic of your code here in VocabCorpus.cpp, but my knowledge of C++ is only rudimentary and I think I would have ended up in a mess by trying it myself. Hence, thank you for potentially considering such a feature - three remarks in this context:
1. weight = rep(1, skip_grams_window) (equal weights).
2. An additional output of create_tcm (e.g. the number of windows) might be skipped, because the information could be retrieved from this exemplary workaround:
   get_n_skip_gram_windows <- function(tokens, skip_grams_window) {
     sum(sapply(tokens, function(x) ceiling(length(x) / skip_grams_window)))
   }
3. Please let me know if I can support this process in any way, even if it is only updating documentation...
Ok, I will address 1 and 3. Regarding 2 - the tcm contains the diagonal - we count co-occurrences of the context word with the "main" word (the comments in the source code are a bit misleading).
Nice, thank you! I will go on and finalize a first implementation of coherence on this basis, including examples, tests, etc., and will let you know as soon as it is finished.
Side note: I have tried a vectorized approach for coherence using triangles but have not yet succeeded in reproducing results of, e.g., the stm package (UMass measure). Maybe this can be approached as soon as the first implementation is ready...
...forgot, sorry for misunderstanding your code and the comments concerning the diagonal of the tcm...
Mh, I think I am doing/understanding something wrong regarding the diagonal of the tcm. In the mini example below, the diagonal is filled with 0's, whereas I would expect it to be filled with 1's. Could you give me a hint where I went wrong? Thanks in advance.
x <- c("some example text")
tokens = word_tokenizer(x)
it = itoken(tokens)
v = create_vocabulary(it)
create_tcm(it, vocab_vectorizer(v), skip_grams_window_context = "symmetric"
, weights = rep(1, 3)
, skip_grams_window = 3L)
# example some text
# example . 1 1
# some . . 1
# text . . .
Consider another example with text = "some example text some":
library(text2vec)
x <- c("some example text some")
tokens = word_tokenizer(x)
it = itoken(tokens)
v = create_vocabulary(it)
tcm = create_tcm(it, vocab_vectorizer(v), skip_grams_window_context = "symmetric",
                 weights = rep(1, 3),
                 skip_grams_window = 3L)
tcm
#         text example some
# text       .       1    2
# example    .       .    2
# some       .       .    1
Also note the word_count attribute of the tcm:
attr(tcm, "word_count")
# 1 1 2
..ouch, sorry, that was an unnecessary question; I had overlooked the attributes. I should use str more often to get an overview, thank you for the clarification.
@manuelbickel I was thinking about the implementation of the "sliding window vocabulary". It seems it won't be very efficient to create it on top of create_tcm. It makes sense to adapt create_vocabulary... (it will take some time to do that).
@dselivanov Thank you for the update. It would be interesting to understand the depths of text2vec and which ways are efficient (or less efficient), but that will also take some time :-). Hence, unfortunately, I am currently not very helpful in this process, but as I said already, I will provide support where possible and implement coherence measures as soon as we can use a sliding window in text2vec.
Off-topic: What kind of articles are of potential interest to be published in the context of text2vec? Only articles with a methodological focus, or also disciplinary applications? I am asking because I used text2vec for a simple heatmap network analysis. It included a lot of manual coding, hence it is not really machine-learning-like, but rather a basic introduction of the idea of text mining in the context of energy systems, see here. At least text2vec is cited therein. I am not sure if this kind of content would be positive or negative advertisement from your perspective, but if you are interested, we could integrate the article (I could write an additional summary or so). In any case, the next one might be of higher interest, since it is a pure machine learning approach... I will let you know as soon as it is published (it will still take some time).
My intention with this question is not self-marketing, since others like you can certainly do better methodology-wise, but rather promoting the use of text2vec. Hence, maybe I can contribute to a certain extent in this direction.
@manuelbickel please check the following code from the latest commit:
library(text2vec)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer)
v = create_vocabulary(it, window_size = 50)
Is this what was needed (according to http://pydoc.net/gensim/3.2.0/gensim.topic_coherence.probability_estimation/ it should be)?
Or do we need co-occurrence and I haven't fully understood the requirement?
Unfortunately, we do need co-occurrence. The basic idea is to take some top topic words and then calculate the coherence (with different options for formulas, e.g., PMI), which is based on the probability and joint probability of terms within a specific window. Some measures take whole documents as the window, but some perform better when using, e.g., a window of 10, which would be roughly the range of one sentence. Some intrinsic measures (e.g. UMass) use the original corpus to calculate probabilities, but some extrinsic measures (e.g. PMI) perform better when using an external corpus such as Wikipedia. Check, e.g., page 4, section 3.2 (maybe also 3.3), in the associated publication - the section is not a long read.
Your new commit already introduces a nice feature that allows calculating, e.g., sentence counts, thank you for this! It would already be fine for calculating coherence if something like the example below were possible; however, create_dtm accesses the original documents via the iterator and hence produces one row per original document, whereas we would have to iterate over the virtual window docs. The issue here is probably that storing all virtual docs might be too memory intensive. I am not sure how complicated it would be for you to implement this.
I hope my explanations are good enough so that you can follow them easily. Please let me know if there is anything I should clarify further.
text = c("this is a text")
it = itoken(text)
v = create_vocabulary(it, window_size = 3)
#documet window term matrix
dwtm = create_dtm(it, vocab_vectorizer(v))
# 1 x 4 sparse Matrix of class "dgCMatrix" #<- here we would need windows as docs, (2 in the used example)
# text this a is
# 1 1 1 1 1
#term window co-occurrence matrix
twcm = Matrix::crossprod(sign(dwtm))
#total term counts over all windows (-> marginal probability)
diag(twcm) = v$term_count
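Building on the twcm above, a PMI-style coherence for a set of top words could then be sketched roughly as follows (toy code, not the PR implementation; top_words and n_windows are placeholders):

# rough sketch of an average-PMI coherence on top of a term window co-occurrence matrix
pmi_coherence = function(top_words, twcm, n_windows, eps = 1e-12) {
  m    = as.matrix(twcm[top_words, top_words])
  p_i  = diag(m) / n_windows               # marginal probabilities per window
  p_ij = m / n_windows + eps               # joint probabilities per window
  pmi  = log2(p_ij / outer(p_i, p_i))
  mean(pmi[upper.tri(pmi)])                # average over all top word pairs
}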
ok, thanks. Now I understand. Will try to implement this week.
@manuelbickel sorry, won't be able to work on this till next weekend (will be on a trip till March 16).
Thanks for letting me know. Sorry that I am not yet able to implement this myself... wish you a good journey!
I just wanted to share some (hopefully) motivating interim results after your implementation of binary co-occurrence (if you are short on time and only want to read about a tested and working version, please do not read further and wait for the next update).
Two of the coherence measures seem to produce halfway acceptable results for my dataset, which consists of about 30000 scientific abstracts and about 1700 thematically similar Wikipedia articles as a reference corpus for coherence calculation via a tcm with binary_cooccurrence and skip_grams_window = 10. I have made an initial plot, which may be accessed here.
The best measure to select a suitable number of n_topics currently seems to be the one proposed in the textmineR package (prob_diff) (this would not necessarily need binary co-occurrence, but I will have to experiment with the size of the sliding window). The second useful one is cosim_npmi, i.e., the NPMI of each top word with all other top words per topic and the mean cosine similarity between these vectors and the sum of these vectors, i.e., the C_V coherence described for palmetto (a rough sketch of this idea follows below). I am not sure if I made a mistake with the implementation of pmi, since it seems worthless in my case, but at least the vectorized approach to calculate PMI works. Apart from the LDA model with WarpLDA, I have also set up an LDA model with topicmodels (not exactly the same data, but sufficiently similar; it takes too long with topicmodels to run all the models with the necessary adaptions for perfect comparability), which had a peak of loglik at around 250 topics (not shown in the picture). This is very roughly in the range of the optimum number of topics proposed by the coherence measures. Therefore, I currently judge the results of the coherence measures as halfway usable.
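A rough sketch of the cosim_npmi / C_V idea for a single topic (toy code, not the committed implementation); npmi is assumed to be the symmetric matrix of NPMI values between the topic's top words:

cosine = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cv_coherence = function(npmi) {
  ref = colSums(npmi)                     # sum of all top word vectors
  mean(apply(npmi, 1, cosine, b = ref))   # mean cosine similarity to this sum vector
}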
Just for interim documentation, I have committed the latest working version via 1ba570b, but I will have to make additional checks on the correctness of the implementation, implement one additional measure based on context vectors, and also work on the efficiency of the calculation. However, I wanted to let you know already that coherence seems to be moving towards a halfway usable version on the basis of your support. Thank you so far!
@manuelbickel great news, great work! I very much appreciate your contributions - they will allow experimenting with different approaches to topic modeling. I have in mind non-negative matrix factorization from the reco package. Also, ISLE looks very promising (we have almost everything needed to implement a highly efficient ISLE within the R environment).
Sorry for the long off-time, I was moving house. I am more or less done with the first round of coherence measures (the code still needs to be cleaned a bit, better names might have to be assigned, the example needs to be improved, etc.).
However, I am struggling to implement a vectorized version of NPMI. My implementation of PMI works and produces the same results as an mapply loop. In the latest commit fa24161 I have started with a general example and visualized the behaviour of different measures via lines 19-70 (of course, you need to load the coherence function). There is something wrong with the normalization step to calculate NPMI; maybe you have an idea where I am going wrong. The non-vectorized and vectorized versions can be found in lines 380 to 418. The working version of the vectorized PMI can be found in lines 361-377; it was used as a blueprint for NPMI.
If we cannot identify the error quickly, I might post a question on SO...
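For comparison, here is the normalization step in a minimal form (a sketch under my own assumptions, not the commit's code), which might help locate the discrepancy:

# minimal non-vectorized NPMI for one pair (i, j) of top words, given window
# co-occurrence counts in `tcm`, per-term window counts in `word_count`,
# the total number of windows `n_windows`, and a smoothing constant
npmi_pair = function(i, j, tcm, word_count, n_windows, eps = 1e-12) {
  p_ij = tcm[i, j] / n_windows + eps
  p_i  = word_count[i] / n_windows
  p_j  = word_count[j] / n_windows
  pmi  = log2(p_ij / (p_i * p_j))
  pmi / -log2(p_ij)                 # NPMI = PMI normalized by -log(p_ij)
}

# vectorized over all pairs (same quantities as matrices/vectors)
npmi_matrix = function(tcm_top, word_count_top, n_windows, eps = 1e-12) {
  p_ij = as.matrix(tcm_top) / n_windows + eps
  p_i  = word_count_top / n_windows
  log2(p_ij / outer(p_i, p_i)) / -log2(p_ij)
}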
@manuelbickel thanks for the update. I will try to take a look this weekend.
Hi @manuelbickel. I think that in order to simplify the PR review we can start with some simple coherence measures we are confident with. This way we can incrementally improve the coherence calculation.
I've created a simple template - around 80 lines of code which calculate 2 metrics. I believe the calculation itself is wrong (I tried to adopt your code but dropped the reordering for simplicity). Could you please check the template and send another PR with the calculation for the vectorized metrics you are confident in (with roxygen docs and comments - they look very good!)?
So that I can merge it, and then we will keep adding more.
@dselivanov Thank you for your template, which is certainly more elegant than my initial approach (sorry that you had to write so many lines). I will use the template as suggested and will set up a simpler PR that we may use as a start, so that this one here may be closed.
With reference to our discussion in #229, I am sending this pull request for discussing/reviewing my initial approach for calculating topic coherence scores. I hope I have opened the request correctly; I have never done this before. Let me know if I have to change anything.
Since I am not a professional programmer, kindly excuse any unusual parts of the code. I see the code as a starting point, which might certainly be improved in many directions (also, some of the comments will have to be removed in later versions). Hence, please do not hesitate to criticize/suggest/change...
Please let me know if any information is missing that might be required for understanding, and I will try to add it; or let me know if I should prepare any interim results (e.g., with small demonstrative sample data) to demonstrate the output of individual parts of the code for clarification, so that it is easier for you to follow and I spare your time. As already mentioned in #229, some initial examples are provided here.