amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

LDA #48

Closed etrain closed 9 years ago

etrain commented 9 years ago

Linear Discriminant Analysis - as opposed to the topic modeling kind.

tomerk commented 9 years ago

@etrain @shivaram Do you want my local implementation or my distributed implementation?

etrain commented 9 years ago

In the actual speech featurization pipeline, which would you expect to do? E.g. is it the kind of thing that you can estimate on a small sample and then apply to everything, or does it make sense to actually do it distributed?

tomerk commented 9 years ago

LDA is applied to all the labeled data, but unlike PCA it isn't necessarily applied to the full corpus.

etrain commented 9 years ago

I'm confused:

1) LDA Estimator - do you have a distributed version of this, or does it work on samples?
2) LDA Transformer - would this be applied to the entire dataset/is there a distributed version of this?

In either case, can you give us a sense of the data size (n and d) that you're running the estimator and the transformer on in the case of TIMIT?

tomerk commented 9 years ago

1) LDA Estimator - I have a version of this that works on an RDD of data by doing distributed operations, and one that works by collecting a sample locally and then doing operations (sort of like GMM)
2) LDA Transformer - it's just a linear mapper

n = number of labeled frames in the TIMIT training set = I think around 1 or 2 million?
d = 17 * 13 = 221
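The "estimate on a local sample, then apply as a linear map" pattern above can be sketched with a minimal NumPy LDA. All sizes and names here are illustrative stand-ins (the real data is ~2M frames of d = 17 * 13 = 221 features), not KeystoneML's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a labeled sample collected locally on the driver.
n, d, k = 500, 6, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
X += y[:, None] * 0.5  # shift class means so there is something to discriminate

mu = X.mean(axis=0)
Sw = np.zeros((d, d))  # within-class scatter
Sb = np.zeros((d, d))  # between-class scatter
for c in range(k):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    diff = (mc - mu)[:, None]
    Sb += len(Xc) * (diff @ diff.T)

# Top k-1 eigenvectors of Sw^-1 Sb form the projection the estimator fits.
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(evals.real)[::-1]
W = evecs[:, order[: k - 1]].real  # d x (k-1) projection matrix

# The "LDA Transformer" is then just a matrix multiply applied to every row.
Z = X @ W
print(Z.shape)  # (500, 2)
```

Once `W` is estimated (locally or not), applying it to the full RDD is an embarrassingly parallel map, which is why only the estimator's placement is really in question.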

etrain commented 9 years ago

Alright, let's just do the local version as a first cut, and create an issue for the distributed version. In many cases we know there's a decent distributed implementation of these things, but for the data size we're dealing with (like this one) it doesn't make sense to distribute the computation.

shivaram commented 9 years ago

2M by 221 is around 4G even if we use doubles. I think the local one is fine for now.
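That estimate checks out as a back-of-envelope:

```python
# 2M rows x 221 columns of 8-byte doubles, collected into one local matrix.
n, d = 2_000_000, 221
bytes_per_double = 8
gib = n * d * bytes_per_double / 2**30
print(round(gib, 2))  # ~3.29 GiB, i.e. "around 4G"
```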

tomerk commented 9 years ago

Hmm so having a single DenseMatrix hold a 2M by 221 matrix just to get the column means isn't a problem?

shivaram commented 9 years ago

No it shouldn't - for example we got PCA on 1M by 128 in the ImageNet pipeline

etrain commented 9 years ago

The limit you will run into is Java's maximum array length of about 2^31 (~2.1B) elements, because Breeze backs its matrices with a single flat Array[Double].
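A quick check shows this dataset sits comfortably under that JVM limit:

```python
# A Breeze DenseMatrix is one flat array, so the element count is what matters.
n, d = 2_000_000, 221
elements = n * d
java_max = 2**31 - 1  # JVM arrays are indexed by a 32-bit signed int
print(elements, elements < java_max)  # 442000000 True
```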


tomerk commented 9 years ago

So, it needs to compute the gramian of this 2M by 221 matrix. Is that still okay to do locally instead of distributed?

shivaram commented 9 years ago

The gramian here is At * A? The output should be 221x221, which should be fine. Basically this needs 2M * 221 * 221 flops to run -- about the same order as the SVD AFAIK. I think it'll still be fine (might take 2-3 mins with 16 threads is my guess).

We can also just compute the gramian in distributed fashion and do the rest of the work locally ?
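The hybrid suggested above can be sketched as follows: each partition computes its small local gramian, and only the d x d partials are combined on the driver (in Spark this would be a treeAggregate over the RDD's partitions; the NumPy version here just simulates that with splits, and the sizes are illustrative stand-ins for 2M x 221):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 8  # stand-ins for 2M x 221
A = rng.normal(size=(n, d))

# Distributed pattern: each partition computes its local A_p^T A_p;
# the d x d partials are summed, so only a small matrix reaches the driver.
partials = [Ap.T @ Ap for Ap in np.array_split(A, 4)]
gramian = sum(partials)

print(gramian.shape)  # (8, 8)
```

The remaining eigen/solve work on the 221x221 gramian is tiny and can be done locally either way.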