Closed etrain closed 9 years ago
@etrain @shivaram Do you want my local implementation or my distributed implementation?
In the actual speech featurization pipeline, which would you expect to do? E.g. is it the kind of thing that you can estimate on a small sample and then apply to everything, or does it make sense to actually do it distributed?
LDA is applied to all the labeled data, but unlike PCA it isn't necessarily applied to the full corpus.
I'm confused: 1) LDA Estimator - do you have a distributed version of this, or does it work on samples? 2) LDA Transformer - would this be applied to the entire dataset/is there a distributed version of this?
In either case, can you give us a sense of the data size (n and d) that you're running the estimator and the transformer on in the case of TIMIT?
1) LDA Estimator - I have a version of this that works on an RDD by doing distributed operations, and one that works by collecting a sample locally and then operating on it (sort of like GMM). 2) LDA Transformer - it's just a linear mapper.
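Since the transformer is just a linear mapper, a minimal NumPy sketch of that half is straightforward (class and method names here are hypothetical, not the actual KeystoneML API):

```python
import numpy as np

class LDATransformer:
    """Hypothetical sketch: apply a fixed linear map, projecting
    d-dimensional feature vectors onto k discriminant directions."""

    def __init__(self, W):
        # W is the d x k projection matrix produced by the LDA estimator
        self.W = W

    def apply(self, X):
        # X is an n x d matrix of row-vector features; output is n x k
        return X @ self.W

# Toy usage: project 4 samples from d = 3 down to k = 2
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
X = np.ones((4, 3))
Y = LDATransformer(W).apply(X)
print(Y.shape)  # (4, 2)
```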
n = number of labeled frames in the TIMIT training set, I think around 1 or 2 million? d = 17 * 13 = 221
Alright, let's just do the local version as a first cut, and create an issue for the distributed version. In many cases we know there's a decent distributed implementation of these things, but for the data sizes we're dealing with (like this one) it doesn't make sense to distribute the computation.
2M by 221 is around 4G even if we use doubles. I think the local one is fine for now.
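A quick back-of-the-envelope check on that size, assuming doubles (8 bytes each):

```python
n, d = 2_000_000, 221
bytes_needed = n * d * 8           # one double per entry
print(bytes_needed)                # 3536000000
print(round(bytes_needed / 2**30, 2))  # ~3.29 GiB, so "around 4G" checks out
```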
Hmm so having a single DenseMatrix hold a 2M by 221 matrix just to get the column means isn't a problem?
No it shouldn't - for example we got PCA on 1M by 128 in the ImageNet pipeline
The limit you will run into is the Java maximum array length of about 2^31 - 1 elements (because Breeze backs each matrix with a single flat Array[Double]).
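A quick sanity check that a 2M x 221 matrix stays under the JVM's Int-indexed array limit:

```python
n, d = 2_000_000, 221
MAX_JAVA_ARRAY_LEN = 2**31 - 1     # JVM arrays are indexed by a signed 32-bit int
elements = n * d                   # one flat backing array holds the whole matrix
print(elements)                    # 442000000, comfortably under the limit
assert elements < MAX_JAVA_ARRAY_LEN
# Largest row count you could back this way at d = 221:
print(MAX_JAVA_ARRAY_LEN // d)     # 9717120 rows
```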
So, it needs to compute the gramian of this 2M by 221 matrix. Is that still okay to do locally instead of distributed?
The gramian here is At * A? The output should be 221 x 221, which should be fine. Basically this needs roughly 2M * 221 * 221 flops to run -- the same order as the SVD AFAIK. I think it'll still be fine (might take a couple of minutes with 16 threads is my guess).
We can also just compute the gramian in distributed fashion and do the rest of the work locally ?
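The gramian does split cleanly across partitions, since At * A is the sum of the per-block gramians over row blocks. A local NumPy simulation of that map-reduce (in Spark this would be an aggregate/treeAggregate over the RDD's partitions; sizes here are toy stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))     # stand-in for the 2M x 221 RDD

# "map": each partition computes its own small d x d gramian
partitions = np.array_split(A, 4)
partials = [p.T @ p for p in partitions]

# "reduce": sum the d x d partials on the driver, then finish locally
gram = sum(partials)
assert np.allclose(gram, A.T @ A)      # matches the single-machine result
```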
LDA here means Linear Discriminant Analysis, as opposed to the topic-modeling kind (Latent Dirichlet Allocation).