dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org
Other
851 stars 136 forks source link

Change default version of DTM matrix? #84

Closed lmullen closed 8 years ago

lmullen commented 8 years ago

In get_dtm() and create_dtm() the default type of matrix is a dgCMatrix. But under the hood text2vec uses a dgTMatrix itself for create_dtm() then coerces that dgCMatrix if asked. Why not just keep make dgTMatrix the default and provide an option to go to dgCMatrix? For myself, I prefer the dgTMatrix format because in some applications I like to convert it to a data frame, and it is trivial to do so with a dgTMatrix.

I can send a PR if this change is okay with you.

dselivanov commented 8 years ago

This is an interesting idea. The reason I picked dgCMatrix as default is because it is natural format for linear algebra operations. Most other packages which work with sparse matrices usually operate on dgCMatrix objects. But as you correctly noted, natively we keep matrices in triplet form. And it is trivial to convert dgTMatrix to dgCMatrix: as(M, "dgCMatrix"). So this sounds like sensible idea: construct dtm as fast as we can, without additional coercions by default. @zachmayer, @TommyJones what do you think?

zachmayer commented 8 years ago

Personally, I like dgCMatrix for the reason you said (default format for linear algebra)

Maybe there could be an option in the constructor?

dselivanov commented 8 years ago

@zachmayer, such option exists from the very beginning... But what should be the default, that is the question. I tend to make dgTMatrix as default format, but want to hear arguments against this decision.

zachmayer commented 8 years ago

I prefer dgCMatrix as the default, because I primarily use the dtm for linear algebra, e.g. svd via irlba, penalized regression via glmnet and cosine similarity via dot products.

TommyJones commented 8 years ago

I am less familiar with a dgTMatrix, but I know good questions to consider in deciding if it's a good idea to use a dgTMatrix instead of a dgCMatrix.

  1. Is a dgTMatrix compatible with all of the methods/functions used by a dgCMatrix within R's Matrix library?
  2. If yes, is that because every time one makes a function call with a dgTMatrix, it must first be converted to a dgCMatrix?

In the first case, my concern is for users. They may not be the best programmers and an advantage of "Matrix" matrices is that they appear to behave the same as standard R dense matrices (%*%, nrow, ncol, colSums, etc.). This makes it easier for new users to use text2vec and related libraries. (Compare this to the matrices used by slam/tm, which are comparably difficult to use.)

In the second case, if a conversion is happening, then any computation time saved by not doing the conversion from dgTMatrix to dgCMatrix in get_dtm will be lost many times over by function calls on a dgTMatrix.

Bottom line: if a dgTMatrix won't change how users interact with a DTM (for the most part) and using a dgTMatrix as a DTM doesn't slow things down, then go for it. If, however, users have to learn new methods or it slows down execution downstream, I'd say keep it as a dgCMatrix.

On Fri, Mar 25, 2016 at 8:47 AM Zach Mayer notifications@github.com wrote:

I prefer dgTMatrix as the default, because I primarily use the dtm for linear algebra, e.g. svd via irlba, penalized regression via glmnet and cosine similarity via dot products.

— You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub https://github.com/dselivanov/text2vec/issues/84#issuecomment-201266830

dselivanov commented 8 years ago

@lmullen , @zachmayer , @TommyJones thanks for your suggestions.

I think, we need to leave dgCMatrix as default, because this will involve less headache for less experienced users. Main format for sparse matrices is dgCMatrix - for almost all operations, Matrix package coerces all other types to dgCMatrix.

One thing to do in future - is to support coercion to column sparse matrices at c++ level in get_dtm function.