gdkrmr / dimRed

A Framework for Dimensionality Reduction in R
https://www.guido-kraemer.com/software/dimred/
GNU General Public License v3.0

data depth methods #9

Open topepo opened 7 years ago

topepo commented 7 years ago

You might consider adding some of Tukey's data depth methods. R has a few packages that you could wrap, including ddalpha (this paper gives a pretty good description of it).

gdkrmr commented 7 years ago

This is the first time I have heard about this, and it sounds quite interesting! I only gave it a quick read, so please correct me if I am doing something wrong.

In the ddalpha package I need a training sample with already known classes to train a classifier, so it is not unsupervised.

Were you thinking of something like this?

library(ddalpha)
library(rgl)

# example 1: rows are already ordered by species, 50 per class
ds <- depth.space.Mahalanobis(as.matrix(iris[1:4]), c(50, 50, 50))
plot3d(ds, col = as.numeric(iris[[5]]))

# example 2: rows are shuffled, so the blocks of 50 no longer match the species
perm <- sample(150)
ds2 <- depth.space.Mahalanobis(as.matrix(iris[perm, 1:4]), c(50, 50, 50))
plot3d(ds2, col = as.numeric(iris[[5]][perm]))

# example 3: use k-means clusters as classes; rows are sorted by cluster so
# that each class is a contiguous block with the sizes given by table()
clusters <- kmeans(scale(iris[1:4]), 3)
c.ord <- order(clusters$cluster)
ds3 <- depth.space.Mahalanobis(as.matrix(iris[c.ord, 1:4]), as.vector(table(clusters$cluster)))
plot3d(ds3, col = as.numeric(iris[[5]][c.ord]))

The first one is really cool, the second one not so much. One would have to supply a class vector as a parameter, or use some unsupervised clustering method like k-means, as in the third example.

What do you think @topepo ? Is there an entirely unsupervised version of this?

topepo commented 7 years ago

caret has a function that computes the distances of a new sample to the class centroids. I was thinking of something along the same lines, although you could certainly just have an interface to generate the depths for all the data.

dimRed has a nice interface to other dimension reduction methods and (supervised or not) these metrics would be great to include. ddalpha is pretty good, but I find the API more complex than I think it should be.

gdkrmr commented 7 years ago

I think something like

embed(data, "DataDepth", classes = cl, ...)

where classes can be either a vector of class labels or a function that returns such a vector from the data, should be possible. It could also accept a character string like "kmeans" that takes the number of classes from ndim and does some standard clustering. I like the idea, but it will probably take me a while to get to it (after v0.1.0) because I am busy with other stuff at the moment. If you want it in soon, I would accept a pull request. There should probably also be a predict function, but I am not sure what it should look like; it will probably have to accept some additional arguments.
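To make the proposal concrete, here is a minimal sketch of how such a `classes` argument could be resolved into one label per row. Nothing in it is dimRed code: the helper name `resolve_classes`, the keyword `"kmeans"`, and the default `ndim = 3` are assumptions made for illustration only.

```r
# Hypothetical helper, not part of dimRed: turn the `classes` argument into
# a vector of class labels, one per row of `data`.
resolve_classes <- function(data, classes, ndim = 3) {
  if (is.function(classes)) {
    # user-supplied function mapping the data to labels
    labels <- classes(data)
  } else if (identical(classes, "kmeans")) {
    # assumed keyword: cluster the scaled data into `ndim` groups
    labels <- stats::kmeans(scale(data), centers = ndim)$cluster
  } else {
    # anything else is taken to be the labels themselves
    labels <- classes
  }
  if (length(labels) != nrow(data)) {
    stop("need exactly one class label per row of `data`")
  }
  labels
}

x <- as.matrix(iris[1:4])

# 1. labels given directly
cl_vec <- resolve_classes(x, iris$Species)

# 2. labels computed by a user-supplied function
cl_fun <- resolve_classes(x, function(d) cutree(hclust(dist(d)), k = 3))

# 3. labels from the assumed "kmeans" keyword
set.seed(1)
cl_km <- resolve_classes(x, "kmeans", ndim = 3)
```

The resulting labels could then be used to sort the rows and derive the cardinalities that depth.space.Mahalanobis expects, as in the third example earlier in the thread.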

topepo commented 7 years ago

No problem on time.

For predict, you'll have to just save the original data (as you do in the other methods) and pass it as an argument to depth.X.
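A rough sketch of that fit/predict idea, under stated assumptions: the training data are stored per class at fit time, and the depths of new points are computed against each stored class sample. To keep the example free of dependencies, the Mahalanobis depth is written out in base R as 1 / (1 + squared Mahalanobis distance) with moment estimates, standing in for the corresponding ddalpha depth function. The names `maha_depth`, `fit_depth`, and `predict_depth` are hypothetical and not part of dimRed or ddalpha.

```r
# Mahalanobis depth of the rows of `x` with respect to the sample `data`,
# using the sample mean and covariance (moment estimates).
maha_depth <- function(x, data) {
  1 / (1 + stats::mahalanobis(x, colMeans(data), stats::cov(data)))
}

# "Fit" step (hypothetical): just keep the training rows, split by class.
fit_depth <- function(data, classes) {
  idx <- split(seq_len(nrow(data)), classes)
  list(by_class = lapply(idx, function(i) data[i, , drop = FALSE]))
}

# "Predict" step (hypothetical): depth of each new row with respect to each
# stored class sample; returns one column per class.
predict_depth <- function(fit, newdata) {
  do.call(cbind, lapply(fit$by_class, function(train) maha_depth(newdata, train)))
}

x <- as.matrix(iris[1:4])
train_idx <- seq(1, 150, by = 2)  # every other row as training data

fit <- fit_depth(x[train_idx, ], iris$Species[train_idx])
new_depths <- predict_depth(fit, x[-train_idx, ])
dim(new_depths)  # one row per new point, one column per class
```

Storing the raw training data keeps the sketch short; a real implementation might store only the per-class means and covariances when the depth notion allows it.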

Also, I'll send you an invite to a repo that I'll be making public soon, in case you are interested in what I've been doing with regard to my previous requests. I have some of the depth parts worked out already, but your interface would be better than my do.call's.

gdkrmr commented 7 years ago

The recipes are quite a nice idea. Why not simply make a dimRed recipe? This would be interesting because I did not really consider data preprocessing in my package. One method you might want to add is t-SNE, which is very good for visualizing complex data structures. Also, the R package Rtsne is based on a very efficient implementation that can be used on relatively large data sets, which is not the case for Isomap and kPCA.

topepo commented 7 years ago

I've used t-SNE a lot (back when I used to actually analyze data for a living) and like it. However, I'm constrained to using methods where the projection can be applied to new data sets (based on estimates from the old/training data).

I didn't think to make a general dimRed step but did something similar for the depth methods. I'll put that on the list.

gdkrmr commented 7 years ago

t-SNE works by gradient descent, and in theory one can hold the old points fixed and optimize only the new points, but as far as I know no one has implemented this. Here is a cool package for different SNE variants: https://github.com/jlmelville/sneer (I think it is not on CRAN).