Thanks for reporting, I will have a look...
You receive these `Inf` and `NaN` values because the tcm you are feeding to `coherence` has several zero entries in the diagonal where there should be non-zero entries in the denominator. The result is that you divide by zero in several instances.
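The arithmetic behind this is plain R division semantics; a minimal sketch with toy numbers (not the internals of `coherence`):

```r
# hypothetical counts: joint co-occurrence 5, marginal (diagonal) entry 0
log(5 / 0)  # Inf -- non-zero numerator divided by zero
0 / 0       # NaN -- zero divided by zero
```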
I am not aware of all the details of how `create_tcm` operates, but it tends to output an upper triangular matrix. In the current version of text2vec, it seems that not all entries in the diagonal are empty, but most of them are. When I created the `coherence` function, the diagonal of the output was always zero; I am not sure if this was the case for the test examples by chance...
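If you want to inspect both properties yourself, a quick sketch (assuming a `tcm` built via `create_tcm()`, e.g., as in the reprex further below):

```r
library(Matrix)
isTriangular(tcm, upper = TRUE)  # TRUE when only the upper triangle is filled
mean(diag(tcm) == 0)             # share of diagonal entries that are zero
```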
Anyway, the solution to your problem is one line in the example section of the coherence documentation that is admittedly not sufficiently prominent. You need to re-assign the marginal counts of the individual terms to the diagonal, i.e., their total occurrence, or in other words "the number of times a term co-occurs with itself", as follows:
```r
diag(tcm_from_tcm_10) = attributes(tcm_from_tcm_10)$word_count
```
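To make concrete what this re-assignment does, a minimal toy sketch (hypothetical 3x3 matrix and counts, not the actual output of `create_tcm`):

```r
library(Matrix)
# toy upper triangular co-occurrence matrix with an empty diagonal
tcm_toy <- sparseMatrix(
  i = c(1, 1, 2), j = c(2, 3, 3), x = c(4, 1, 2),
  dims = c(3, 3), dimnames = list(c("a", "b", "c"), c("a", "b", "c"))
)
diag(tcm_toy)                 # 0 0 0 -> zero denominators in coherence()
diag(tcm_toy) <- c(10, 7, 5)  # hypothetical total occurrence of each term
diag(tcm_toy)                 # 10 7 5 -> valid marginal counts
```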
Please also note that, so far, coherence metrics are usually computed on the basis of binary co-occurrence counts, which means that the reference tcm only stores the information whether two terms co-occur at all in, e.g., a sentence, but not how often. At least, this is the way coherence scores have been reported in the literature; see, e.g., the paper by Röder mentioned in the documentation for coherence.
Therefore, you should turn on the binary co-occurrence option in `create_tcm` (@dselivanov implemented this as an extra option specifically to make the coherence metrics available) and, furthermore, set all weights equal, as follows:

```r
create_tcm(iterator, vectorizer,
           skip_grams_window = window_size,
           weights = rep(1, window_size),
           binary_cooccurence = TRUE)
```
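For contrast, a sketch of the default call (same placeholder `iterator`/`vectorizer`; hedged, based on my reading of the defaults): `create_tcm` normally applies decreasing `1 / distance` weights and keeps raw counts rather than binary indicators, which is not what the literature-style coherence metrics expect.

```r
# default behaviour (hedged): weights = 1 / seq_len(skip_grams_window),
# binary_cooccurence = FALSE -> distance-weighted raw counts
tcm_weighted <- create_tcm(iterator, vectorizer,
                           skip_grams_window = window_size)
```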
Please let us know if this solves your issue so we can close.
By the way, I have been working on a function that automatically creates tcms with the standard settings required by individual coherence metrics (each metric is thought to operate with different window sizes; also, there are internal and external metrics). It is not documented very well at the moment (also be aware that it writes files to disk), but it might still help to improve your understanding. Maybe we can integrate an advanced version into text2vec some time... You may find it here: `create_ref_tcm.R`. You will also need this: `tcm_specs_standard()`
@manuelbickel: thanks for the prompt and detailed explanation.
I did come across `diag(TCM) = attributes(TCM)$word_count` and wondered what it was for. Upon inspecting the TCM from `create_tcm()`, I do see that at times `diag(TCM)` contains plenty of zeroes. This solution solves the issue I have; however, I suspect there may be other reasons for the observation than this.
In the example below, although `diag(tcm)` still contains a bunch of zeroes, `coherence()` doesn't return `Inf`/`NaN`:
```r
library(text2vec)
library(data.table)
library(Matrix)

data(movie_review)
setDT(movie_review)
setkey(movie_review, id)

set.seed(2016L)
all_ids <- movie_review$id
train_ids <- sample(all_ids, 1000)
test_ids <- setdiff(all_ids, train_ids)
train <- movie_review[J(train_ids)]
test <- movie_review[J(test_ids)]

# define preprocessing function and tokenization function
prep_fun <- tolower
tok_fun <- word_tokenizer

it_train <- itoken(
  train$review,
  preprocessor = prep_fun,
  tokenizer = tok_fun,
  ids = train$id,
  progressbar = FALSE
)
it_test <- itoken(
  test$review,
  preprocessor = prep_fun,
  tokenizer = tok_fun,
  ids = test$id,
  progressbar = FALSE
)

vocab <- create_vocabulary(it_train)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)

# use window of 5 for context words
window_size <- 5L
tcm <- create_tcm(
  it_train,
  vectorizer,
  skip_grams_window = window_size,
  weights = rep(1, window_size),
  binary_cooccurence = TRUE
)

dtm <- create_dtm(it_train, vectorizer)
lda_model <- text2vec::LDA$new(
  n_topics = 10,
  doc_topic_prior = 0.1,
  topic_word_prior = 0.05
)

tcm_test <- create_tcm(
  it_test,
  vectorizer,
  skip_grams_window = window_size,
  weights = rep(1, window_size),
  binary_cooccurence = TRUE
)

doc_topic_distr <- lda_model$fit_transform(
  x = dtm,
  n_iter = 1000,
  convergence_tol = 0.001,
  n_check_convergence = 25,
  progressbar = FALSE
)
#> INFO [2019-01-03 15:20:42] iter 25 loglikelihood = -1226432.208
#> INFO [2019-01-03 15:20:43] iter 50 loglikelihood = -1211310.901
#> INFO [2019-01-03 15:20:43] iter 75 loglikelihood = -1207539.934
#> INFO [2019-01-03 15:20:44] iter 100 loglikelihood = -1205680.388
#> INFO [2019-01-03 15:20:44] iter 125 loglikelihood = -1205488.941
#> INFO [2019-01-03 15:20:44] early stopping at 125 iteration

tw <- lda_model$get_top_words(n = 10, lambda = 1)

sum(diag(tcm) == 0) / length(diag(tcm))
#> [1] 0.8720486
sum(diag(tcm_test) == 0) / length(diag(tcm_test))
#> [1] 0.7346526

coherence(tw, tcm, n_doc_tcm = attr(vocab, "document_count"))
## all real values returned
coherence(tw, tcm_test, n_doc_tcm = attr(vocab, "document_count"))
## all real values returned
```

Created on 2019-01-03 by the reprex package (v0.2.1)
Thanks for the link to `create_reference_tcm()`; I'll check it out.
Within the `coherence` function, the "final" reference tcm is created by subsetting to the top words. In your example, the top words (which define the terms/diagonal of the final tcm used for calculation) do not intersect with the terms that have zero entries in the diagonal of the tcm. Therefore, your example still works.
```r
intersect(as.vector(tw), colnames(tcm)[which(diag(tcm) == 0)])
#> character(0)
```
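Conversely, a sketch of how the problem would resurface (assuming the `tcm`, `tw`, and `vocab` objects from your reprex; `tcm_bad` is a hypothetical modified copy): if one of the top words has a zero diagonal entry, the division by zero returns.

```r
tcm_bad <- tcm
d <- diag(tcm_bad)
d[colnames(tcm_bad) == as.vector(tw)[1]] <- 0  # zero out one top word's marginal
diag(tcm_bad) <- d
coherence(tw, tcm_bad, n_doc_tcm = attr(vocab, "document_count"))
# expected, per the explanation above: Inf/NaN for topics containing that word
```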
Noted; thanks again for your patience in explaining!
Hi,

I'm replicating an example from the textmineR vignette, but the same observation is seen using the `movie_review` data in `text2vec`.

Observation: `coherence()` in `text2vec` tends to give `Inf`/`NaN` when passing in a TCM created from `create_tcm()` and when the argument `skip_grams_window` is set to a low value (e.g., 10L for sentence-level); setting `skip_grams_window` to high values (e.g., 150L for paragraph-level) overcomes this and outputs real numbers.

I'm not sure whether it should be interpreted that the topic is highly coherent when the score is `Inf`, and that it should be dropped when it is `NaN`.

Created on 2019-01-03 by the reprex package (v0.2.1)
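A minimal sketch of the reported contrast (assuming the `it_train`, `vectorizer`, `tw`, and `vocab` objects from the reprex earlier in the thread; window sizes illustrative):

```r
tcm_low  <- create_tcm(it_train, vectorizer, skip_grams_window = 10L)   # sentence-level
tcm_high <- create_tcm(it_train, vectorizer, skip_grams_window = 150L)  # paragraph-level
coherence(tw, tcm_low,  n_doc_tcm = attr(vocab, "document_count"))  # Inf/NaN observed
coherence(tw, tcm_high, n_doc_tcm = attr(vocab, "document_count"))  # real values
```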