Manage very large dataset

dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

http://text2vec.org

Manage very large dataset #263

Closed: pommedeterresautee closed this 4 years ago

pommedeterresautee commented 6 years ago

Sometimes a dataset doesn't fit in memory. The crucial parts are:

Vocabulary building: when using n-grams, Zipf's law no longer holds, so the vocabulary has to be built on parts of the data, pruned lightly, and then merged.
Applying a transformation to the DTM: when the DTM occupies more than 50% of RAM, it becomes difficult to do anything, because R objects are immutable and any transformation implies a copy. Two solutions: apply a transformation by reference at the C++ level, OR apply the transformation to parts of the object.
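
To make the two points concrete, here is a rough sketch in base R plus the Matrix package. It is not text2vec code: every function name below (count_ngrams, prune_light, merge_counts, transform_by_blocks) is made up for illustration, the tokenizer is a naive whitespace split, and it assumes the Matrix package is installed. The first part builds a unigram + bigram vocabulary per chunk, prunes lightly, and merges; the second part transforms a sparse matrix block by block and writes each result to disk.

```r
library(Matrix)  # assumption: Matrix (a recommended R package) is installed

# Part 1: per-chunk vocabulary, light pruning, then merge (hypothetical helpers).
count_ngrams <- function(docs) {
  tokens <- strsplit(tolower(docs), "\\s+")
  grams <- unlist(lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]
    n <- length(tok)
    bigrams <- if (n > 1) paste(tok[-n], tok[-1], sep = "_") else character(0)
    c(tok, bigrams)
  }))
  counts <- table(grams)
  setNames(as.integer(counts), names(counts))
}

# Drop terms seen fewer than `min_count` times within one chunk.
prune_light <- function(counts, min_count = 2L) counts[counts >= min_count]

# Sum the counts of identical terms across chunks.
merge_counts <- function(count_list) {
  all_counts <- unlist(unname(count_list))
  if (length(all_counts) == 0) return(integer(0))
  merged <- tapply(all_counts, names(all_counts), sum)
  setNames(as.integer(merged), names(merged))
}

docs <- c("the cat sat", "the cat ran", "a dog sat", "the dog ran")
chunks <- split(docs, c(1, 1, 2, 2))
vocab <- merge_counts(lapply(chunks, function(ch) prune_light(count_ngrams(ch))))
# Merged counts are lower bounds: "the" occurs 3 times overall, but its single
# occurrence in the second chunk was pruned before the merge.
print(vocab)

# Part 2: apply `f` to blocks of rows and save each result to its own file,
# so only one transformed block sits in memory next to the original matrix.
transform_by_blocks <- function(m, f, block_rows, dir = tempdir()) {
  starts <- seq(1L, nrow(m), by = block_rows)
  vapply(starts, function(s) {
    rows <- s:min(s + block_rows - 1L, nrow(m))
    path <- file.path(dir, sprintf("dtm_block_%06d.rds", s))
    saveRDS(f(m[rows, , drop = FALSE]), path)
    path
  }, character(1))
}

dtm <- sparseMatrix(i = c(1, 2, 3, 4), j = c(1, 2, 1, 2),
                    x = c(1, 2, 3, 4), dims = c(4, 2))
files <- transform_by_blocks(dtm, function(b) b * 2, block_rows = 2L)
```

The per-chunk pruning threshold has to stay low, otherwise terms that are rare in every chunk but frequent overall get lost. The block-wise version only bounds the size of the temporary copy; reading all blocks back and binding them together still needs memory for the full result, so it helps mainly when downstream steps can also consume the blocks one at a time.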

Other comment: sometimes itoken_parallel throws a segfault related to an unserialize(a) operation. According to the message it's because of mclapply, but I am not sure why or how. It doesn't happen with itoken, but that is much slower. It started happening a few days ago (maybe because of an update in some package).

dselivanov commented 6 years ago

The first issue is solvable: we can make the interface to Vocabulary similar to other models.

The second one is purely infrastructural. It should rely on some framework for out-of-core or distributed data manipulation. I personally think that dask is interesting and that it is doable to make an R interface, but I don't have time for it and don't feel such things are appreciated by anyone in the community.

pommedeterresautee commented 6 years ago

For the second point, I agree: dask should be the answer, but it won't be, because no one will use complex infrastructure in R (it's not in the spirit of rapid prototyping, and I would prefer to switch to Scala + Spark if I wanted to go that way). But don't you think it would make sense to implement some transformations by reference? I know it is not the R way at all, but it seems to make sense to me. Maybe it's very specific, though: it seems no one is working with >> 100 GB datasets in the R world when I check Stack Overflow ...

dselivanov commented 6 years ago

No, I don't think so. Manipulating by reference is a solution to a tiny problem, and it requires a lot of effort. It is easier to rent a larger machine.

In reply to Michaël Benesty's comment of Tue, 29 May 2018, 10:49 (quoted in full above).
