dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Impossible to build very large DTM #297

Closed pommedeterresautee closed 5 years ago

pommedeterresautee commented 5 years ago

I am working on a large DTM right now with plenty of RAM on my server.

During DTM building, the dgTMatrix (30 GB) is converted to a dgCMatrix. However, it seems that for a matrix this large the conversion fails. I get this error (translated from French: "long vectors are not supported yet"):

Error in asMethod(object) : 
  les vecteurs longs ne sont pas encore supportés : memory.c:3486

This seems related to https://stackoverflow.com/questions/24335692/large-matrices-in-r-long-vectors-not-supported-yet

Matrix size:

Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:2151930306] 516 631 1637 651 3846 2919 2893 1891 3229 705 ...
  ..@ j       : int [1:2151930306] 49167 98912 107700 79854 107097 107777 99145 25536 64576 92035 ...
  ..@ Dim     : int [1:2] 1969658 426743
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:1969658] REMOVED ...
  .. ..$ : chr [1:426743] REMOVED ...
  ..@ x       : num [1:2151930306] 1 1 1 1 5 1 1 1 1 1 ...
  ..@ factors : list()
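For reference, the failure can be predicted from the slot lengths alone: the i, j, and x slots above already exceed R's 32-bit integer limit, so Matrix's C-level conversion code would need long-vector support. A minimal check (the nnz value is taken from the str() output above):

```r
nnz <- 2151930306             # length of the @i / @j / @x slots above
.Machine$integer.max          # 2147483647, i.e. 2^31 - 1
nnz > .Machine$integer.max    # TRUE: the index slots are already "long vectors"
```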

Any idea of a workaround? Here you are writing you were working on something: https://github.com/dselivanov/text2vec/issues/4

dselivanov commented 5 years ago

I think long vectors are not supported by the underlying Matrix package.

pommedeterresautee commented 5 years ago

OK, thanks, that's what I was starting to discover... Is there a reason why you use dgTMatrix rather than dgCMatrix directly?

Because in my case:

> dim(full_mat)
[1] 1969658  426743

so both dimensions are well below 2^31.

pommedeterresautee commented 5 years ago

The more I think about it, the more I realize I will never be able to build this DTM in R with these data, whatever I try: I have 2,151,930,306 non-zero values, so even in a dgCMatrix the i vector would be a long vector... This is a huge limitation :-(
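Indeed, switching formats would not help here: in a dgCMatrix only the column-pointer slot @p shrinks (to ncol + 1 entries), while @i and @x still hold one entry per stored value. A small illustration:

```r
library(Matrix)
# A tiny 3x2 sparse matrix with 3 stored values
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2), x = c(1, 1, 5))
length(m@p)  # ncol + 1 = 3: compressed column pointers
length(m@i)  # 3: one row index per stored value, same as length(m@x)
# With 2,151,930,306 stored values, @i alone would have to be a long
# vector, so the change of format does not avoid the limit.
```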

dselivanov commented 5 years ago

Try writing to Martin Mächler, the maintainer of the Matrix package. For me it has never been a limitation; most of the time it is possible to process the dataset in chunks.

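One way to follow the chunking suggestion with text2vec (a hedged sketch, not the library's prescribed method; `tokens` and `vectorizer` are assumed to already exist from the usual itoken / create_vocabulary / vocab_vectorizer workflow, and the chunk size is illustrative):

```r
library(text2vec)
# Split document indices into chunks small enough that each DTM stays
# well below 2^31 - 1 non-zero values.
chunk_ids <- split(seq_along(tokens), ceiling(seq_along(tokens) / 100000))
dtm_chunks <- lapply(chunk_ids, function(idx) {
  it <- itoken(tokens[idx], progressbar = FALSE)
  create_dtm(it, vectorizer)
})
# Process each chunk in turn (fit/transform a model, aggregate counts, ...)
# instead of materializing one matrix with >2^31 non-zero entries.
```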

pommedeterresautee commented 5 years ago

I am closing this, as it is not related to text2vec.