JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

svd and eig on DataMatrix #329

ViralBShah closed this issue 5 years ago

ViralBShah commented 11 years ago

What are the benefits of the svd and eig implementations for DataMatrix, over using the ones from Base?

The comment in the code suggests that this one can handle NA. If not already done, it would be useful to document this, along with the algorithm used and a citation.

This also makes me wonder how to handle sparse datasets with missing data. Often people just use svds in MATLAB, where the missing data is treated as 0, which is not what one wants.

johnmyleswhite commented 11 years ago

This algorithm needs a lot of documentation and will not be the default going forward. Its value is that you get an imputed SVD that gives the lowest possible error on the non-missing values.

It would be very slow for sparse matrices, I suspect.

StefanKarpinski commented 11 years ago

Not only slow, but horribly unstable. This approach, as I recall, is to assume each missing value is the mean of its row and/or column, do a truncated SVD, replace each missing value by its low-rank approximation, and repeat until convergence. For sparse matrices, we might want to see if we can get one of the low-rank matrix completion algorithms to work.
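For concreteness, a minimal sketch of that iterative scheme, assuming the missing entries are passed as a Boolean mask rather than as NAs (the function name, arguments, and defaults here are illustrative, not the DataMatrix implementation):

```julia
using LinearAlgebra, Statistics

# Sketch of the iterative imputation scheme described above: seed each missing
# entry with its column mean, take a rank-k truncated SVD, overwrite the missing
# entries with their low-rank reconstruction, and repeat until convergence.
function impute_svd(X::AbstractMatrix, miss::AbstractMatrix{Bool};
                    rank::Int=2, maxiter::Int=100, tol::Real=1e-6)
    A = float.(X)
    for j in axes(A, 2)                       # initialize with column means
        obs = .!miss[:, j]
        A[miss[:, j], j] .= any(obs) ? mean(A[obs, j]) : 0.0
    end
    for _ in 1:maxiter
        F = svd(A)
        k = min(rank, length(F.S))
        Ak = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
        δ = norm(A[miss] .- Ak[miss])         # change on the imputed entries only
        A[miss] .= Ak[miss]
        δ < tol && break
    end
    return A
end
```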

johnmyleswhite commented 11 years ago

I'm happy to implement better algorithms. This algorithm is very simple and works reasonably well for small data sets.

nfoti commented 11 years ago

Should other matrix factorizations that are commonly used with missing data go here as well, e.g. non-negative matrix factorization, and common regularized versions of the "SVD" and NMF, etc.? Or should there be a separate package for that stuff that can depend on DataFrames?

johnmyleswhite commented 11 years ago

Including additional imputation options sounds like a good idea. But that would add a lot of functionality that arguably doesn't belong in the basic DataFrames package. Maybe all the linear algebra stuff I started should go elsewhere, so that we can combine it with things like regularized SVDs and NMF more easily.

abhijithch commented 11 years ago

I had the same issue while implementing a recommender system (Netflix) using svds in MATLAB, so I used the alternating least squares method instead and got noticeably better results.
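As an illustration only, a rough sketch of that alternating least squares approach, assuming the observed ratings are given as (row, column, value) triples (all names and defaults here are hypothetical):

```julia
using LinearAlgebra

# Minimal ALS for completion of a partially observed m×n matrix: alternately
# solve a small ridge regression for each row factor with V fixed, then for
# each column factor with U fixed, using only the observed entries.
function als_complete(obs::Vector{Tuple{Int,Int,Float64}}, m::Int, n::Int;
                      k::Int=10, λ::Real=0.1, iters::Int=20)
    U = randn(m, k); V = randn(n, k)
    rows = [Tuple{Int,Float64}[] for _ in 1:m]   # observed (j, value) per row
    cols = [Tuple{Int,Float64}[] for _ in 1:n]   # observed (i, value) per column
    for (i, j, v) in obs
        push!(rows[i], (j, v)); push!(cols[j], (i, v))
    end
    for _ in 1:iters
        for i in 1:m                             # row factors, V fixed
            isempty(rows[i]) && continue
            Vi = V[first.(rows[i]), :]; yi = last.(rows[i])
            U[i, :] = (Vi' * Vi + λ * I) \ (Vi' * yi)
        end
        for j in 1:n                             # column factors, U fixed
            isempty(cols[j]) && continue
            Uj = U[first.(cols[j]), :]; yj = last.(cols[j])
            V[j, :] = (Uj' * Uj + λ * I) \ (Uj' * yj)
        end
    end
    return U, V          # predicted entry (i, j) ≈ dot(U[i, :], V[j, :])
end
```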

johnmyleswhite commented 11 years ago

Maybe we should just have a MatrixCompletion package for handling this? I can easily move my code there and give others a forum for implementing other (probably better) algorithms. I've even got code for the SGD algorithm described in some of the Netflix papers that I can share easily.
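Not that code, but a hedged sketch of the general SGD factorization style used in the Netflix-prize papers (biases omitted; names and hyperparameters are illustrative):

```julia
using LinearAlgebra

# SGD matrix factorization: for each observed (i, j, rating) triple, nudge the
# row and column factors along the gradient of the squared error with L2
# regularization.
function sgd_factorize(obs::Vector{Tuple{Int,Int,Float64}}, m::Int, n::Int;
                       k::Int=20, η::Real=0.01, λ::Real=0.05, epochs::Int=30)
    U = 0.1 .* randn(m, k)
    V = 0.1 .* randn(n, k)
    for _ in 1:epochs
        for (i, j, r) in obs
            e = r - dot(U[i, :], V[j, :])              # prediction error
            Ui = U[i, :]                               # keep old row factor
            U[i, :] .+= η .* (e .* V[j, :] .- λ .* Ui)
            V[j, :] .+= η .* (e .* Ui .- λ .* V[j, :])
        end
    end
    return U, V          # predicted rating ≈ dot(U[i, :], V[j, :])
end
```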

ViralBShah commented 11 years ago

MatrixCompletion does seem like a good idea. @abhi123link I just happened to look at this svd implementation after our discussion about missing values for the Netflix problem, and that led me to file this issue. Good to see you here!

ViralBShah commented 11 years ago

I wonder where the various non-negative matrix factorizations would live? MatrixCompletion does seem like a relevant place, but perhaps a little bit of a mismatch. @StefanKarpinski and I have a bunch of these in MATLAB from a few years back, which would be nice to bring into Julia.

Any alternative package names that can capture all these various algorithms?

ViralBShah commented 11 years ago

@nfoti I missed your NNMF comment above. Any thoughts? BTW, I am not sure if GitHub sends notifications about this, but I did add you to the list of Julia packagers, which should give you commit access to all the packages that are in JuliaLang.

ViralBShah commented 11 years ago

Also cc: @lindahua

StefanKarpinski commented 11 years ago

Non-negative matrix "factorization" is really about matrix approximation since you can't generally factor non-negative matrices into non-negative factors, so perhaps a MatrixApproximation package. Or maybe a MatrixFactorization package for "advanced" or unusual factorizations. A separate MatrixCompletion package makes sense for algorithms that impute missing values – which would naturally go well with DataArrays since they are precisely Arrays augmented with the ability to indicate missing values. I would love to see some really scalable, reliable, fast implementations of matrix completion stuff since I've never been able to get these algorithms to work at a really large scale – they tend to work well at small scale and then just fail to converge at larger scales.

nfoti commented 11 years ago

Thanks for adding me to the list @ViralBShah. I've been following Julia for a while and have just started to have time to start implementing things.

Note: Stefan's comment appeared as I was writing this. I am ok with the name MatrixCompletion for a package to contain these algorithms and think NMF probably belongs there. But people may not think to look there if they want to use NMF for something other than completion. I was tossing some other names around, e.g. MatrixFactorization or MatrixDecomposition, but either of those names may make people think LU, QR and such live in the package as well. Despite this, I'm somewhat biased towards MatrixFactorization, MatrixDecomposition or something similar over MatrixCompletion because the algorithms we're talking about including in the package don't necessarily need to be used for completion. Also, things like probabilistic matrix factorization and friends probably belong here as well. In addition, I think basic methods to impute missing entries belong in the package too, for completeness. If imputing missing entries is your goal, having to use two packages is tedious.

As an aside, would it be useful to start making topics pages, like the R community does, to collect packages that are useful for different topics? For example, there could be a machine learning topic page with Clustering, MatrixWhateverItsNamed, MLBase, etc. There could also be a finance page with the trading packages, and others for the various things that people want to use Julia for. I feel like this may help people with the problem of overlooking functionality within packages because the name didn't specifically reflect the fact that a package implemented a feature.

Lastly, one thing we haven't mentioned is that there are two types of missing data here. The first is "dense" missing data, which is what you usually see in machine learning papers where we hold out 20% of the observations, but most of it is still observed. For this type of problem, algorithms that work on DataArrays from the DataFrames package seem like the way to go. The other type is like the Netflix problem, where most of the matrix is missing. In this case we're obviously going to need to implement algorithms on sparse matrices. A lot of these algorithms need quick row- and column-access to the entries of the matrix. Is it worth creating an extension of SparseMatrixCSC that allows this? You're blowing the storage cost up by about 2x, but the algorithms are much faster if the problem still fits in memory. Any thoughts? Obviously for very big problems we would have to just suck it up and deal with accessing the rows being slow.
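One possible shape for such an extension, shown only to illustrate the 2x-storage idea (this wrapper type is hypothetical, not an existing part of SparseArrays or DataFrames):

```julia
using SparseArrays

# Keep both the matrix and its transpose in CSC form: columns of `csc` give
# fast column access, columns of `csr` give fast access to rows of the
# original, at roughly twice the storage cost.
struct DualSparse{Tv,Ti<:Integer}
    csc::SparseMatrixCSC{Tv,Ti}
    csr::SparseMatrixCSC{Tv,Ti}   # transpose of `csc`, stored as CSC
end

DualSparse(A::SparseMatrixCSC) = DualSparse(A, copy(transpose(A)))

getcol(A::DualSparse, j::Integer) = A.csc[:, j]   # column j as a SparseVector
getrow(A::DualSparse, i::Integer) = A.csr[:, i]   # row i as a SparseVector
```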

StefanKarpinski commented 11 years ago

NMF is not a matrix completion technique. NMF assumes all input entries are valid and tries to find a factorization into two or more low-rank non-negative matrices whose product approximates the original non-negative matrix. That's why I proposed putting NMF in a MatrixApproximation package. (The NMF problem doesn't even make sense as an actual factorization without additional constraints since the input times the identity is always a valid exact non-negative factorization of the input matrix; only the low-rank approximation version is sensible.)

StefanKarpinski commented 11 years ago

You're quite right about distinguishing the "some missing data" cases from the "some available data" cases – they may even call for different packages, and they certainly call for rather different data representations: for the "some missing" case a dense DataArray is perfect; for the "some available" case, it's a huge waste of space. In fact, for the really serious completion algorithms, it is crucial that they accept a sparse representation of the available data and never explicitly construct the dense completion since it would typically be far too large to store.

nfoti commented 11 years ago

There are versions of NMF that can handle missing data (for example), usually by taking a probabilistic perspective, in which case you can use the decomposition to impute missing entries. This is analogous to Koren's "SVD" for the Netflix problem. I actually do not know much about NMF for completely observed data as I have only ever used it for problems with missing data. Regardless, I think both types of algorithms belong in this MatrixApproximation package.

nfoti commented 11 years ago

On second thought, every other decomposition that assumes complete data is in Base.linalg. The version of NMF that takes a Matrix{T} where all the entries are assumed known could just go in Base. The factorizations that handle missing data could go in a separate package.

StefanKarpinski commented 11 years ago

NMF algorithms should definitely not go in Base. First, there are lots of very different NMF algorithms, none of which is accepted as "the" algorithm, and they will all give you different results – since most of them are initialized randomly, the same algorithm will generally give different results on different runs. Beyond that, NMF is a field of very active research, so new algorithms are likely to be added. This is quite different from every factorization in Base.

StefanKarpinski commented 11 years ago

That Tan & Févotte paper provides more of a weighted approximation algorithm than an actual matrix completion algorithm. Certainly not in the "most of the data is missing" sense, although it could be used for the "some of the data is missing" kind of problems. My experience is that the original Lee and Seung iterated descent technique works very well and so does alternating non-negative least squares, while almost everything else is pretty lousy. The bigger problem, which seems to be barely studied, is initialization. Random initialization is terrible, while various tricks using clustering and/or SVD seem to provide much, much better results.
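For reference, a minimal sketch of those Lee and Seung multiplicative updates under the Frobenius objective (random initialization shown only because it is the simplest thing to write down; the function name and defaults are illustrative):

```julia
using LinearAlgebra

# Lee & Seung multiplicative updates for ‖V - W*H‖² with non-negative V:
# both updates preserve non-negativity because they only multiply by
# non-negative ratios. `ϵ` guards against division by zero.
function nmf_multiplicative(V::AbstractMatrix; k::Int=5, iters::Int=200, ϵ::Real=1e-9)
    m, n = size(V)
    W = rand(m, k)
    H = rand(k, n)
    for _ in 1:iters
        H .*= (W' * V) ./ (W' * W * H .+ ϵ)
        W .*= (V * H') ./ (W * H * H' .+ ϵ)
    end
    return W, H          # V ≈ W * H with W, H ≥ 0
end
```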

nfoti commented 11 years ago

That is a very good point about there not being "the" NMF algorithm. Suggestion to go into Base retracted.

Initialization is super important and no one ever admits it. Additionally, the algorithms in this package also suggest the need for cross-validation and grid-search packages, since there are a number of model and algorithm parameters that need to be tuned.

I think that a MatrixApproximation package that contains algorithms for advanced factorizations for completely observed matrices, and for factorizations that handle missing data would be very useful.

ViralBShah commented 11 years ago

I do like the name MatrixApproximation to have all these various algorithms.

What extensions to SparseMatrixCSC are you thinking of? I am certainly happy to refactor the current implementation in any way that makes it easy for other sparse representations to plug in. @dmbates is doing that with RSC sparse matrices, for example.

As for machine learning related packages, it seems to me that JuliaStats is a good home for those.

johnmyleswhite commented 11 years ago

Even though I think the terminology is kind of nuts, maybe we should call these methods MatrixFactorization to follow the trend in machine learning?

I already have one crappy NMF implementation in the DimensionalityReduction package, which needs a lot of work.

nalimilan commented 5 years ago

DataArrays haven't lived here for a long time.