Closed lmullen closed 8 years ago
This is an interesting idea. The reason I picked dgCMatrix as the default is that it is the natural format for linear algebra operations. Most other packages that work with sparse matrices operate on dgCMatrix objects.
But as you correctly noted, natively we keep matrices in triplet form, and it is trivial to convert a dgTMatrix to a dgCMatrix: as(M, "dgCMatrix"). So this sounds like a sensible idea: construct the dtm as fast as we can, without additional coercions by default.
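A minimal sketch of that coercion, using only the Matrix package. The toy matrix M is made up for illustration and is not text2vec output. The sketch coerces to the virtual class "CsparseMatrix", which yields a dgCMatrix for a dgTMatrix input; recent Matrix releases prefer this spelling over naming "dgCMatrix" directly.

```r
library(Matrix)

# Toy counts in triplet (i, j, x) form; the i and j slots are 0-based.
M <- new("dgTMatrix",
         i = c(0L, 0L, 1L, 2L),
         j = c(0L, 2L, 1L, 2L),
         x = c(1, 2, 3, 4),
         Dim = c(3L, 3L))

# Coerce the triplet matrix to compressed sparse column form.
C <- as(M, "CsparseMatrix")

class(M)  # triplet class
class(C)  # column-compressed class
```

The values are unchanged by the coercion; only the storage layout differs.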
@zachmayer, @TommyJones what do you think?
Personally, I like dgCMatrix for the reason you said (it is the default format for linear algebra).
Maybe there could be an option in the constructor?
@zachmayer, such an option has existed from the very beginning. But what should the default be? That is the question. I am inclined to make dgTMatrix the default format, but I want to hear arguments against this decision.
I prefer dgCMatrix as the default, because I primarily use the dtm for linear algebra, e.g. svd via irlba, penalized regression via glmnet, and cosine similarity via dot products.
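To illustrate the last of these uses, here is a hedged sketch of cosine similarity via dot products on a dgCMatrix. The toy dtm and the variable names are assumptions for illustration; this is one common way to do it with the Matrix package, not necessarily how any commenter does it.

```r
library(Matrix)

# Toy document-term matrix: 3 documents (rows) by 3 terms (columns).
# sparseMatrix() returns a column-compressed matrix by default.
dtm <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                    j = c(1, 2, 2, 1, 3),
                    x = c(1, 2, 4, 3, 4),
                    dims = c(3, 3))

# Scale every row to unit Euclidean length.
row_norms  <- sqrt(rowSums(dtm^2))
normalized <- Diagonal(x = 1 / row_norms) %*% dtm

# Dot products between unit-length rows are cosine similarities.
cos_sim <- tcrossprod(normalized)
```

The sketch assumes no document is empty; a row of all zeros would have norm 0 and would need special handling.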
I am less familiar with a dgTMatrix, but I know good questions to consider in deciding if it's a good idea to use a dgTMatrix instead of a dgCMatrix.
In the first case, my concern is for users. They may not be the best programmers and an advantage of "Matrix" matrices is that they appear to behave the same as standard R dense matrices (%*%, nrow, ncol, colSums, etc.). This makes it easier for new users to use text2vec and related libraries. (Compare this to the matrices used by slam/tm, which are comparably difficult to use.)
In the second case, if a conversion is happening, then any computation time saved by not doing the conversion from dgTMatrix to dgCMatrix in get_dtm will be lost many times over by function calls on a dgTMatrix.
Bottom line: if a dgTMatrix won't change how users interact with a DTM (for the most part) and using a dgTMatrix as a DTM doesn't slow things down, then go for it. If, however, users have to learn new methods or it slows down execution downstream, I'd say keep it as a dgCMatrix.
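A quick way to check the first concern is to run the same everyday calls on both formats. This is only a sketch on a made-up matrix: it shows that the calls give the same answers, and says nothing about whether a coercion happens internally or how long each call takes.

```r
library(Matrix)

# The same toy counts stored in triplet form and in column-compressed form.
triplet <- new("dgTMatrix",
               i = c(0L, 0L, 1L),
               j = c(0L, 2L, 1L),
               x = c(2, 1, 3),
               Dim = c(2L, 3L))
column <- as(triplet, "CsparseMatrix")

# Hypothetical per-term weights, used only to exercise %*%.
w <- c(1, 0.5, 2)

dims_match    <- identical(dim(triplet), dim(column))
colsums_match <- isTRUE(all.equal(colSums(triplet), colSums(column)))
product_match <- isTRUE(all.equal(as.matrix(triplet %*% w),
                                  as.matrix(column %*% w)))
```

Answering the speed question would need a benchmark on a realistically sized dtm.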
On Fri, Mar 25, 2016 at 8:47 AM Zach Mayer notifications@github.com wrote:
I prefer dgTMatrix as the default, because I primarily use the dtm for linear algebra, e.g. svd via irlba, penalized regression via glmnet and cosine similarity via dot products.
@lmullen, @zachmayer, @TommyJones thanks for your suggestions.
I think we need to leave dgCMatrix as the default, because this will cause fewer headaches for less experienced users. The main format for sparse matrices is dgCMatrix: for almost all operations, the Matrix package coerces all other types to dgCMatrix.
One thing to do in the future is to support coercion to column sparse matrices at the C++ level in the get_dtm function.
In get_dtm() and create_dtm() the default type of matrix is a dgCMatrix. But under the hood text2vec uses a dgTMatrix itself for create_dtm(), then coerces that to a dgCMatrix if asked. Why not just make dgTMatrix the default and provide an option to go to dgCMatrix? For myself, I prefer the dgTMatrix format because in some applications I like to convert it to a data frame, and it is trivial to do so with a dgTMatrix.
I can send a PR if this change is okay with you.
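For readers unfamiliar with the triplet layout, a rough sketch of the data frame conversion mentioned above. The toy matrix, its dimnames, and the column names doc, term, and count are made up for illustration; the only facts relied on are that a dgTMatrix stores 0-based row and column indices in its i and j slots and the values in x.

```r
library(Matrix)

# Toy triplet matrix standing in for a document-term matrix.
dtm <- new("dgTMatrix",
           i = c(0L, 0L, 1L),
           j = c(0L, 2L, 1L),
           x = c(2, 1, 3),
           Dim = c(2L, 3L),
           Dimnames = list(c("doc1", "doc2"),
                           c("apple", "banana", "cherry")))

# One row per stored entry; add 1 to turn the 0-based slots into R indices.
df <- data.frame(doc   = rownames(dtm)[dtm@i + 1L],
                 term  = colnames(dtm)[dtm@j + 1L],
                 count = dtm@x,
                 stringsAsFactors = FALSE)
```

A triplet matrix may store the same (i, j) pair more than once, in which case the entries are meant to be summed, so aggregating df by doc and term may be needed on real data.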