amueller / scipy-2016-sklearn

Scikit-learn tutorial at SciPy2016
Creative Commons Zero v1.0 Universal

Add explanation for why t-SNE is not a good feature preprocessor for models #78

Closed: rhiever closed this issue 7 years ago

rhiever commented 7 years ago

Notebook 22 makes a really important point that t-SNE is only for visualization, yet doesn't explicitly explain why that is the case. We should add a brief explanation for why that is.

amueller commented 7 years ago

Well, unsupervised learning will always throw away discriminative information...

rhiever commented 7 years ago

I think the point of the exercise was to show that that effect was particular to t-SNE. E.g., if you apply PCA or Isomap, you can oftentimes improve---or at least not negatively affect---your model accuracy.

amueller commented 7 years ago

really? I would imagine that PCA down to two dimensions will heavily impact accuracy.

rasbt commented 7 years ago

really? I would imagine that PCA down to two dimensions will heavily impact accuracy.

I'd say it really depends on a lot of factors (the model, the dataset, the explained-variance ratio, ...). I can imagine that for small datasets and models that tend to overfit (e.g., k-NN with small k or so), it could be really helpful. Or, in more general terms, I think it may be a useful way to improve performance (the 'curse of dimensionality') as an alternative to feature selection and/or if you can't regularize your model.
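
Something like the following would be a quick way to check that intuition (just a sketch; digits is only an example, and n_components=20 / n_neighbors=1 are arbitrary choices, so whether PCA actually helps will depend on the dataset):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# 1-NN on the raw 64-dimensional pixels (a model that tends to overfit)
knn = KNeighborsClassifier(n_neighbors=1)
print(cross_val_score(knn, X, y, cv=5).mean())

# the same model on a moderately reduced representation
pca_knn = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=1))
print(cross_val_score(pca_knn, X, y, cv=5).mean())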

amueller commented 7 years ago

Yeah it depends on many things, that's true. It is certainly a form of regularization. But reducing digits to 2d is probably too much, no matter what method. The point of the exercise was more "don't use manifold learning for supervised tasks". PCA might be helpful in certain situations.

rhiever commented 7 years ago

Just to explore your suspicions, @amueller:

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print('KNeighborsClassifier accuracy: {}'.format(clf.score(X_test, y_test)))

pca = PCA(n_components=2)
digits_pca_train = pca.fit_transform(X_train)
digits_pca_test = pca.transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_pca_train, y_train)
print('KNeighborsClassifier accuracy with PCA: {}'.format(clf.score(digits_pca_test, y_test)))

tsne = TSNE(random_state=42)
digits_tsne_train = tsne.fit_transform(X_train)
digits_tsne_test = tsne.fit_transform(X_test)

clf = KNeighborsClassifier()
clf.fit(digits_tsne_train, y_train)
print('KNeighborsClassifier accuracy with t-SNE: {}'.format(clf.score(digits_tsne_test, y_test)))

KNeighborsClassifier accuracy: 0.9933333333333333
KNeighborsClassifier accuracy with PCA: 0.6266666666666667
KNeighborsClassifier accuracy with t-SNE: 0.0022222222222222222

t-SNE is orders of magnitude worse.

rasbt commented 7 years ago

Oh, wow, the minimum expected (chance-level) accuracy would be 10%; that's really, really bad then! But I see that you have an error here:

pca.fit_transform(X_test)
digits_tsne_test = tsne.fit_transform(X_test)

It should be

pca.transform(X_test) and digits_tsne_test = tsne.transform(X_test)

rhiever commented 7 years ago

Ah, you're right @rasbt. You can't use the transform method on TSNE, and I just copy-and-pasted from the TSNE code. :-)
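
For reference, this is easy to check: at least in current scikit-learn releases, TSNE only exposes fit and fit_transform, and there is no out-of-sample transform method.

from sklearn.manifold import TSNE

print(hasattr(TSNE(), 'fit_transform'))  # True
print(hasattr(TSNE(), 'transform'))      # False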

I updated the code and results above with that fix. PCA actually performs MUCH better than t-SNE now. So why is t-SNE so bad for classification?

rasbt commented 7 years ago

Oh yeah, good point ... mentioning one problem and introducing another :P

amueller commented 7 years ago

What's your code now? Going down to two dimensions I would expect t-SNE to perform better than PCA, but I'd expect both to be worse than not doing anything.

rasbt commented 7 years ago

What's your code now? Going down to two dimensions I would expect t-SNE to perform better than PCA, but I'd expect both to be worse than not doing anything.

Hm, yeah, I'd also naturally expect t-SNE to perform better on this particular dataset; however, I think the comparison in the code above is not entirely fair. You can't do a fit_transform separately on training and test data, since the embedding depends on the order of the samples, right? I.e., the "position" of the "clusters" is arbitrary, isn't it? I think you would at least need to use something like

from sklearn.metrics import adjusted_rand_score

y_predict = clf.predict(digits_tsne_test)  # kNN predictions on the separately embedded test set
print('KNeighborsClassifier adjusted Rand score with t-SNE: {}'.format(adjusted_rand_score(y_predict, y_test)))

for t-SNE if you fit_transform train and test data separately.

rhiever commented 7 years ago

@rasbt is right on this one. The reason t-SNE doesn't work here is that t-SNE is fit separately on the training data and then on the testing data, so the clusters end up falling in different areas of the embedding.
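
A rough way to see that this is the problem (just a sketch reusing X_train/X_test/y_train/y_test from the code above; note that embedding train and test together peeks at the unlabeled test points, so this is only an illustration, not a recommended workflow):

import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

# embed train and test jointly so both sets share one 2D coordinate system
X_all_tsne = TSNE(random_state=42).fit_transform(np.vstack([X_train, X_test]))
digits_tsne_train = X_all_tsne[:len(X_train)]
digits_tsne_test = X_all_tsne[len(X_train):]

clf = KNeighborsClassifier()
clf.fit(digits_tsne_train, y_train)
print('KNeighborsClassifier accuracy with joint t-SNE: {}'.format(clf.score(digits_tsne_test, y_test)))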

fingoldo commented 6 years ago

Is there absolutely no way to add a pure .transform method to TSNE, like Isomap already has? In 2D, the separation t-SNE gives on the MNIST dataset is much, much better; it's a pity it can't be used as a regular transformer...

amueller commented 6 years ago

There is a way to implement this, I think, but it's not implemented in sklearn right now. Not sure if there's a PR. @fingoldo you might also be interested in UMAP: https://github.com/lmcinnes/umap
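
The practical difference is that UMAP does implement a transform method for new samples, so it can be used like a regular scikit-learn transformer. A minimal sketch (assumes the umap-learn package is installed, e.g. pip install umap-learn):

from umap import UMAP
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True), random_state=1)

umap_2d = UMAP(n_components=2, random_state=42)
X_train_2d = umap_2d.fit_transform(X_train)
X_test_2d = umap_2d.transform(X_test)  # unlike TSNE, this works for unseen samples

clf = KNeighborsClassifier().fit(X_train_2d, y_train)
print(clf.score(X_test_2d, y_test))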

fingoldo commented 6 years ago

Thank you so much, Andreas, for this great suggestion! The features added by UMAP proved to be useful indeed :-) Quick & dirty assessment:


from datetime import datetime
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from umap import UMAP

digits = load_digits()

# note: the original snippet did not show the split; this is an assumed setup
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)

def EstimateClassifier(model, transformer=None):
    startTime = datetime.now()
    if transformer:
        transformer.fit(x_train)

        # append the low-dimensional embedding to the original features
        dp = transformer.transform(x_train)
        x_train_new = np.concatenate((x_train, dp), axis=1)

        dp = transformer.transform(x_test)
        x_test_new = np.concatenate((x_test, dp), axis=1)
    else:
        x_train_new, x_test_new = x_train, x_test
    model.fit(x_train_new, y_train)
    timeElapsed = datetime.now() - startTime
    print("Test Accuracy: %s" % (accuracy_score(y_test, model.predict(x_test_new))))

EstimateClassifier(GaussianNB())
Test Accuracy: 0.833333333333
time: 7.51 ms

EstimateClassifier(GaussianNB(),PCA(n_components=2))
Test Accuracy: 0.855555555556
time: 17.5 ms

EstimateClassifier(GaussianNB(),Isomap(n_components=2))
Test Accuracy: 0.893333333333
time: 1.5 s

EstimateClassifier(GaussianNB(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.917777777778
time: 2.67 s

EstimateClassifier(RandomForestClassifier())
Test Accuracy: 0.951111111111
time: 38.5 ms

EstimateClassifier(RandomForestClassifier(),PCA(n_components=2))
Test Accuracy: 0.935555555556
time: 49.5 ms

EstimateClassifier(RandomForestClassifier(),Isomap(n_components=2))
Test Accuracy: 0.971111111111
time: 1.53 s

EstimateClassifier(RandomForestClassifier(),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.973333333333
time: 2.68 s

amueller commented 6 years ago

Set n_estimators to 100 in the random forest and it will be better, and probably better without umap.


rasbt commented 6 years ago

Haven't read up on UMAP yet -- heard from an attendee that the recent talk at PyData Ann Arbor was really good: https://www.youtube.com/watch?v=YPJQydzTLwQ&t=521s -- but I think it's meant more as a technique for visualizing training examples (or clusters thereof) in low dimensions rather than as a feature extraction technique (if you are not using generalized linear models, maybe), in a similar vein to t-SNE? So in that case it would be interesting to add an evaluation of the random forest on the raw features, like Andreas suggested.

Set n_estimators to 100 in the random forest and it will be better, and probably better without umap.

In practice, it could come in handy for huge datasets though, as it is already much faster than t-SNE.

[screenshot attached]

fingoldo commented 6 years ago

Here we go, guys.

It still seems to be helpful, but now I think I should have used cross_val_score from the beginning, as Isomap's result seems to be a bit of an outlier and affected by the particular split...


EstimateClassifier(RandomForestClassifier(n_estimators=100))
Test Accuracy: 0.977777777778
time: 346 ms

EstimateClassifier(RandomForestClassifier(n_estimators=100),PCA(n_components=2))
Test Accuracy: 0.98
time: 372 ms

EstimateClassifier(RandomForestClassifier(n_estimators=100),Isomap(n_components=2))
Test Accuracy: 0.977777777778
time: 1.93 s

EstimateClassifier(RandomForestClassifier(n_estimators=100),UMAP(n_components=2,n_neighbors=5,min_dist=0.3,metric='correlation'))
Test Accuracy: 0.986666666667
time: 2.98 s

fingoldo commented 6 years ago

Added cross-validation to get a more definitive answer.

import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.model_selection import cross_val_score
from umap import UMAP

digits = load_digits()

def EstimateClassifier(model, transformer=None):
    startTime = time.time()
    if transformer:
        # original features (passed through SelectKBest(k='all')) plus the low-dimensional embedding
        pipe = Pipeline([('VarianceThreshold', VarianceThreshold()),
                         ('union', FeatureUnion([('AsIs', SelectKBest(k='all')), ('transformer', transformer)])),
                         ('classifier', model)])
    else:
        pipe = Pipeline([('classifier', model)])
    accuracies = cross_val_score(pipe, digits.data, digits.target, cv=10)
    timeElapsed = time.time() - startTime
    print("Model: %s, Transformer: %s, avg.accuracy: %0.3f +- %0.3f, time=%0.3fs"
          % (type(model).__name__, type(transformer).__name__, np.mean(accuracies), np.std(accuracies), timeElapsed))

for model in (GaussianNB(), RandomForestClassifier(n_estimators=100)):
    for transformer in (None, PCA(n_components=2), Isomap(n_components=2), UMAP(n_components=2, n_neighbors=5, min_dist=0.3, metric='correlation')):
        EstimateClassifier(model, transformer)

Model: GaussianNB, Transformer: NoneType, avg.accuracy: 0.810 +- 0.057, time=0.065s
Model: GaussianNB, Transformer: PCA, avg.accuracy: 0.843 +- 0.051, time=0.198s
Model: GaussianNB, Transformer: Isomap, avg.accuracy: 0.883 +- 0.046, time=16.367s
Model: GaussianNB, Transformer: UMAP, avg.accuracy: 0.921 +- 0.028, time=51.394s
Model: RandomForestClassifier, Transformer: NoneType, avg.accuracy: 0.953 +- 0.020, time=3.750s
Model: RandomForestClassifier, Transformer: PCA, avg.accuracy: 0.948 +- 0.023, time=4.002s
Model: RandomForestClassifier, Transformer: Isomap, avg.accuracy: 0.964 +- 0.017, time=19.914s
Model: RandomForestClassifier, Transformer: UMAP, avg.accuracy: 0.969 +- 0.017, time=55.271s

Do you think the benefit of adding the new features will hold if we add proper hyperparameter tuning, or will it become negligible?
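
One way to check would be to tune the transformer and the classifier jointly (a sketch only, reusing the imports and pipeline layout from the snippet above; the grid values here are arbitrary):

from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('VarianceThreshold', VarianceThreshold()),
                 ('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                         ('transformer', UMAP(n_components=2))])),
                 ('classifier', RandomForestClassifier(n_estimators=100))])

param_grid = {'union__transformer__n_neighbors': [5, 15, 30],
              'union__transformer__min_dist': [0.1, 0.3],
              'classifier__max_features': ['sqrt', 0.5]}

search = GridSearchCV(pipe, param_grid, cv=5)  # this will take a while with UMAP in the loop
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)

For an unbiased comparison against the raw-feature forest, the search itself would have to sit inside an outer cross-validation loop.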

rhiever commented 6 years ago

I think it's more meant as a technique for visualizing training examples (or clusters thereof) in low dim rather than using it as a feature extraction technique

Not so -- UMAP can be used as a visualization mapping technique similar to t-SNE, but also works fine as a feature construction technique (as shown by @fingoldo). I was going to link the SciPy talk, but it seems you already found it. 👍

@fingoldo, I think your initial explorations show that UMAP can potentially be useful as a feature construction technique. It will have to be evaluated further on more benchmarks, perhaps on PMLB.
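
A rough template for that kind of sweep, using a few built-in scikit-learn datasets as stand-ins for a proper PMLB run (sketch only; the dataset choice and default UMAP settings are arbitrary):

from sklearn.datasets import load_digits, load_wine, load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest
from umap import UMAP

for name, loader in [('digits', load_digits), ('wine', load_wine), ('breast_cancer', load_breast_cancer)]:
    X, y = loader(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    raw = cross_val_score(rf, X, y, cv=5).mean()
    augmented = Pipeline([('union', FeatureUnion([('AsIs', SelectKBest(k='all')),
                                                  ('umap', UMAP(n_components=2))])),
                          ('classifier', rf)])
    with_umap = cross_val_score(augmented, X, y, cv=5).mean()
    print('%s: raw=%.3f, +UMAP=%.3f' % (name, raw, with_umap))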

fingoldo commented 6 years ago

@rhiever Randy, will UMAP be included in the TPOT pipeline? :-)

rhiever commented 6 years ago

It's possible!