Closed pommedeterresautee closed 4 years ago
While the first issue is solvable (we can make the interface to Vocabulary similar to other models), the second one is purely infrastructural. It should rely on some framework for out-of-core or distributed data manipulation. I personally think that dask is interesting, and it is doable to make an R interface, but I don't have time for it and I don't feel such things would be appreciated by anyone in the community.
For the second point, I agree: dask should be the answer, but it won't be, because no one will use complex infrastructure in R (it's not in the spirit of rapid prototyping, and I would prefer to switch to Scala + Spark if I wanted to go that way). But don't you think it would make sense to implement some transformations by reference? I know it is not the R way at all, but it seems to make sense to me. Maybe it's very specific, though: it seems no one is working with datasets >> 100 GB in the R world, judging by Stack Overflow ...
No, I don't think so. Manipulating by reference is a solution to a tiny problem that requires a lot of effort. It is easier to rent a larger machine.
Sometimes the dataset doesn't fit in memory. Crucial parts:

`DTM`: when the `DTM` occupies > 50% of RAM, it becomes difficult to do anything, because R objects are immutable and any transformation implies a copy. Two solutions: apply a transformation by reference at the C++ level, OR apply the transformation to part of the object at a time.

Other comment: sometimes `itoken_parallel` throws a `segfault` related to an `unserialize(a)` operation. According to the message it is because of `mclapply`, but I am not sure why or how. It doesn't happen with `itoken`, but that is much slower. It started happening a few days ago (maybe an update in some package).
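
For the DTM memory problem, the second option mentioned above (transforming only part of the object at a time) could be sketched roughly as follows. This is a minimal illustration using only the Matrix package, not anything provided by text2vec: `transform_in_chunks` and `l1_normalize` are hypothetical helper names, the chunk size is arbitrary, and the random matrix merely stands in for a real DTM. It assumes the matrix has at least one row and that the transformation works row by row, so blocks can be processed independently.

```r
library(Matrix)

# Hypothetical helper: apply `fun` to consecutive row blocks of a sparse
# matrix and save each transformed block to disk, so that only one
# block-sized copy has to exist in memory at any time.
transform_in_chunks <- function(dtm, fun, chunk_rows = 1000L, out_dir = tempdir()) {
  starts <- seq.int(1L, nrow(dtm), by = chunk_rows)
  files <- character(length(starts))
  for (k in seq_along(starts)) {
    rows <- starts[k]:min(starts[k] + chunk_rows - 1L, nrow(dtm))
    block <- fun(dtm[rows, , drop = FALSE])
    files[k] <- file.path(out_dir, sprintf("dtm_chunk_%05d.rds", k))
    saveRDS(block, files[k])
  }
  files
}

# Example row-wise transformation: scale every non-empty row to sum to 1.
l1_normalize <- function(m) {
  s <- rowSums(m)
  s[s == 0] <- 1  # leave empty rows untouched instead of dividing by zero
  Diagonal(x = 1 / s) %*% m
}

# Stand-in for a real document-term matrix (random positive counts).
set.seed(1)
dtm <- rsparsematrix(2500, 50, density = 0.05,
                     rand.x = function(n) rpois(n, 2) + 1)

files <- transform_in_chunks(dtm, l1_normalize, chunk_rows = 1000L)
first <- readRDS(files[1])
```

The blocks can later be read back one at a time for downstream steps. Note that row slicing a column-compressed sparse matrix has its own cost, so the chunk size is a trade-off between peak memory and slicing overhead.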