dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

transformer_tfidf not found #187

Closed targetnull closed 7 years ago

targetnull commented 7 years ago

I'm using R 3.4.0 and have installed text2vec, but when I follow the example in library/text2vec/html/create_dtm.html, R reports that transformer_tfidf is not found.

Here is the code

## Not run: 
data("movie_review")
N = 1000
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
v = create_vocabulary(it)
#remove very common and uncommon words
pruned_vocab = prune_vocabulary(v, term_count_min = 10,
 doc_proportion_max = 0.5, doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(v)
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# get tf-idf matrix from bag-of-words matrix
dtm_tfidf = transformer_tfidf(dtm)

## Example of parallel mode
# set to number of cores on your machine
N_WORKERS = 1
doParallel::registerDoParallel(N_WORKERS)
splits = split_into(movie_review$review, N_WORKERS)
jobs = lapply(splits, itoken, tolower, word_tokenizer, chunks_number = 1)
vectorizer = hash_vectorizer()
dtm = create_dtm(jobs, vectorizer, type = 'dgTMatrix')

## End(Not run)

It is strange that the code has "Not run" at both the beginning and the end. Is there a runnable example? Thanks.

targetnull commented 7 years ago

The same goes for RWMD in library/text2vec/html/RelaxedWordMoversDistance.html.

Here is the code

## Not run: 
data("movie_review")
tokens = movie_review$review %>%
  tolower %>%
  word_tokenizer
v = create_vocabulary(itoken(tokens)) %>%
  prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.5)
corpus = create_corpus(itoken(tokens), vocab_vectorizer(v, skip_grams_window = 5))
dtm = get_dtm(corpus)
tcm = get_tcm(corpus)
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv = glove_model$fit(tcm, n_iter = 10)
rwmd_model = RWMD(wv)
rwmd_dist = dist2(dtm[1:10, ], dtm[1:100, ], method = rwmd_model, norm = 'none')

## End(Not run)

I'm using text2vec version 0.4.0

dselivanov commented 7 years ago

Thanks for reporting. These are documentation errors. The `## Not run:` marker means the example is not run during CRAN checks. Please use these functions in the following way:

tfidf = TfIdf$new()
dtm_transformed = tfidf$fit_transform(dtm)
rwmd = RWMD$new(word_vectors_matrix)
rwmd$dist2(dtm_1, dtm_2)

The documentation errors are already fixed in the dev version, so I'm closing the issue.
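
A minimal sketch of the first example with the `TfIdf` replacement applied, assuming a recent text2vec release is installed. The `make_it()` helper is hypothetical, and passing `pruned_vocab` rather than `v` to `vocab_vectorizer()` is an assumption about what the help example intends:

```r
library(text2vec)

data("movie_review")
N = 1000

# Hypothetical helper: builds a fresh iterator over the first N reviews.
make_it = function() {
  itoken(movie_review$review[1:N], preprocess_function = tolower,
         tokenizer = word_tokenizer)
}

v = create_vocabulary(make_it())
# Remove very common and very uncommon words.
pruned_vocab = prune_vocabulary(v, term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
# Assumption: the pruned vocabulary is the one meant to be vectorized.
vectorizer = vocab_vectorizer(pruned_vocab)
dtm = create_dtm(make_it(), vectorizer)

# TF-IDF via the model object instead of the removed transformer_tfidf().
tfidf = TfIdf$new()
dtm_tfidf = tfidf$fit_transform(dtm)
```

The transformed matrix keeps the shape of `dtm`; only the cell weights change.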

aushev commented 3 years ago

I still see the same snippet (with `dtm_tfidf = transformer_tfidf(dtm)`, where `transformer_tfidf()` is not defined) in the help for ?create_dtm in text2vec version 0.6. Is that expected?

dselivanov commented 3 years ago

@aushev could you provide a reproducible example of what doesn't work?

aushev commented 3 years ago

  1. Install text2vec package version 0.6.
  2. Call ?create_dtm (help page "Document-term matrix construction").
  3. Copy-paste and run the example from the help page:
    
    data("movie_review")
    N = 1000
    it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
    v = create_vocabulary(it)
    #remove very common and uncommon words
    pruned_vocab = prune_vocabulary(v, term_count_min = 10,
    doc_proportion_max = 0.5, doc_proportion_min = 0.001)
    vectorizer = vocab_vectorizer(v)
    it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
    dtm = create_dtm(it, vectorizer)
    # get tf-idf matrix from bag-of-words matrix
    dtm_tfidf = transformer_tfidf(dtm)

  4. The last line causes an error: `could not find function "transformer_tfidf"`

dselivanov commented 3 years ago

OK, it looks like the line `transformer_tfidf = TfIdf$new()` is missing from the docs.

aushev commented 3 years ago

Yes, it seems so.