dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org
Other
849 stars 135 forks source link

Uncatched extra arguments for foreach in create_vocabulary for itoken_parallel. #307

Closed yihe9063 closed 4 years ago

yihe9063 commented 4 years ago

Hi,

Version: 0.5.1, OS: Windows

Is it possible that the extra arguments "..." for create_vocabulary are not catched?

As I could see in the source code for 0.6:


create_vocabulary = function(it, ngram = c("ngram_min" = 1L, "ngram_max" = 1L),

stopwords = character(0), sep_ngram = "_", window_size = 0L)

Maybe it should automatically use the class method create_vocabulary.itoken_parallel which has the "...". However I couldnt pass the ".export" in any way to create_vocabulary with an initialized itoken_parallel, and got the "unused argument..." error.

Also for create_dtm I could pass .export to it without any problem.

Just saw the update notes for 0.6 could it be related?

Best regards, Yiyang (y-he2)

dselivanov commented 4 years ago

Seems you are right - and we may need to add ... to generic create_vocabulary. PR is welcome.

yihe9063 commented 4 years ago

Just started my vacation, can take a look when I got back.

dselivanov commented 4 years ago

@yihe9063 any chance for you to send PR to fix this?

y-he2 commented 4 years ago

Hi, (This will be my main account.)

Sry for a very late reply, I got caught in two separate vacation trips, and the new virus situation.

I can try to fix it in the upcoming weekends, since I have never collab on a open source project before, it may take some time to get into it.

By the mean time I discovered that itoken_parallel was disabled for Windows in https://github.com/dselivanov/text2vec/issues/293 Any reason why "itoken_par" (not "ifiles_par") would not work in Windows?

itoken_parallel = function(iterable, ...) {
  if(.Platform$OS.type != "unix") {
    warning("`ifiles_parallel` is not supported on windows - falling back to `itoken`")
    UseMethod("itoken")
  } else {
    UseMethod("itoken_parallel")
  }
}

Also the warning msg was wrong for the itoken_parallel, still using the same msg for "ifiles_par" I could try to fix the warning msg at the same time.

But since direct parallel processing was disabled on Windows, is the best practice now to use some future_map alike functionality to do distribute the load on Windows?

Update: I just read the docs for create_dtm, looks like if we split the load to a few iterators, the function should still compute the DTM in parallel, will test whether this is the case.

Update2: got an error when tried that way:

Error in UseMethod("create_dtm") : 
  no applicable method for 'create_dtm' applied to an object of class "list"

Maybe I should open another issue?

Best regards, Ian (Yiyang)

dselivanov commented 4 years ago

But since direct parallel processing was disabled on Windows, is the best practice now to use some future_map alike functionality to do distribute the load on Windows?

Update: I just read the docs for create_dtm, looks like if we split the load to a few iterators, the function should still compute the DTM in parallel, will test whether this is the case.

I theory yes, but I don't recommend to do that.

I'm closing as the issue is not relevant anymore.