decurtoydiaz / learning_with_signatures

Learning with Signatures
56 stars 10 forks

100% accuracy on images generated with np.random.random() #3

Closed GillesVandewiele closed 2 years ago

GillesVandewiele commented 2 years ago

Title speaks for itself.

#Compute a class representative for each category using (0:n_signatures) from train.
#e.g. In AFHQ we use 100 signatures per class, that is a total of 300 train samples.

import pickle
import cv2
import os
import iisignature
import matplotlib.pyplot as plt
import numpy as np

categories = 3
labels = ['0', '1', '2']
N_truncated = 2

#Compute signature.
def signature_cyz(image):
    if image is not None:
        image = np.reshape(image,(image.shape[0],image.shape[1] * image.shape[2])) * 255.
        image = iisignature.sig(image, N_truncated)
        return image

supermeanA = np.empty(categories, dtype='object') 
for c in range(0, categories):
    dataA = []
    for _ in range(100):
        img = np.random.random(size=(16, 16, 3))
        dataA.append(signature_cyz(img))

    supermeanA[c] = np.mean(dataA, axis=0)

#Load validation instances from train (begin:end) and compute signatures to tune the weights.
#e.g. In AFHQ we use 500 signatures per class, that is a total of 1500 validation samples.

for c in range(0, categories):
    featuresAA = []
    for _ in range(100):
        img = np.random.random(size=(16, 16, 3))
        featuresAA.append(signature_cyz(img))

    #Estimate optimal \lambda_{*}
    #e.g. In AFHQ we solve the inverse problem lambda * supermeanA = featuresAA[z] z:0..500
    c_0 = supermeanA[c]
    c_0[c_0==0] = 1
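    #Element-wise ratio of each validation signature to the class representative; the mean of these ratios is the per-class scale factor.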
    l = (1. / c_0) * featuresAA
    globals()['supermeanl_' + str(c)] = np.mean(l, axis=0)

#Compute RMSE Signature and print accuracy. Load test instances inside the loop, compute signatures and evaluate.
#e.g. We use the full AFHQ validation set as test, that is a total of 1500 samples.

from sklearn.metrics import mean_squared_error

count = np.zeros(categories, dtype='object')

for c2 in range(0,categories):
    for _ in range(1000):
        rmse_c = np.empty(categories, dtype='object')
        img = np.random.random(size=(16, 16, 3))
        for c in range(0, categories):
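            #Note: the scale factor is looked up via c2, i.e. the true class of the test image.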
            rmse_c[c] = mean_squared_error(globals()['supermeanl_' + str(c2)] * supermeanA[c], 
                                           signature_cyz(img), squared=False)
        min_rmse = np.argmin(rmse_c)
        if(min_rmse != c2): 
            count[c2] += 1

    print('RMSE ' + labels[c2])
    print('# of errors:', count[c2])
    print('Accuracy:', 1 - count[c2] / 1000)
    print('\n')
decurtoydiaz commented 2 years ago

Title speaks for itself.

Now please answer with: "Here scale factors are computed optimally assuming you can resolve at test time the ambiguity, using for example the same criteria used to derive the weights"

The code is only a generalisation of our previous work: https://github.com/decurtoydiaz/signatures Please check it out.

Here we show that there exist some optimal weights, tuned on validation, that can be used to classify without overfitting on test, if you are able to resolve the ambiguity of which one to use at test time. The ambiguity is a geometric constraint, so it can be resolved (in several ways). But you can forget about that definition and try to find the weights yourself using some other method. They can also be constant factors, or be found using grid search, Bayesian analysis or k-fold cross-validation. The method is indeed general. We only show that the method can achieve perfect accuracy if weights are correctly chosen.

De Curtò y DíAz.

GillesVandewiele commented 2 years ago

We only show that the method can achieve perfect accuracy if weights are correctly chosen.

You mean when you use the label information from the test set?

decurtoydiaz commented 2 years ago

We only show that the method can achieve perfect accuracy if weights are correctly chosen.

You mean when you use the label information from the test set?

We assume we can resolve at test time the correct scale factor to use, not the label (please check the code). These probably optimal scale factors are n-dimensional tensors derived using Definition 4, which could be resolved at test time as they are computed optimally from geometric constraints. If we resolve them correctly, then accuracy is perfect. But again, you can forget about the probably optimal factors computed by the definition and try to derive your own scale factors, by grid search, k-fold cross-validation or some other criteria.

De Curtò y DíAz.

decurtoydiaz commented 2 years ago

Please check the new updated code. The ambiguity at test time is resolved using one-vs-all, fixing the proper lambda. You get n classifiers, one per class. Accuracy 100%.

De Curtò y DíAz.

GillesVandewiele commented 2 years ago

And how do you determine which classifier to use for a given unlabeled test sample?

decurtoydiaz commented 2 years ago

All of them. This was the de facto approach before Deep Learning. Please revise your notes on data science.

De Curtò y DíAz.

GillesVandewiele commented 2 years ago

So you make n predictions per test sample and just output all of them? That doesn't make any sense now, does it?

I am asking you: "You now have n predictions from n classifiers; how do you decide, from these n predictions, what the actual class is?" You could take an argmax, but that's not what you are doing in your notebook. Can you evaluate the approach you propose for reducing these n predictions to 1 class?

You can't just make N models that have 100% accuracy for one class and 0% for the other classes (which would be your "all" in a one-vs-all scheme), which your notebook now does, and think this would work.
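
For reference, a minimal sketch of such a label-free reduction (reusing the variables and imports from the snippet at the top of this thread; illustrative only, not code from the repository):

lambdas = [globals()['supermeanl_' + str(c)] for c in range(categories)]

def predict(img):
    sig = signature_cyz(img)
    #Score the image against every class's scaled representative and take the best match; no test label is used.
    rmse = [mean_squared_error(lambdas[c] * supermeanA[c], sig, squared=False)
            for c in range(categories)]
    return int(np.argmin(rmse))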

decurtoydiaz commented 2 years ago

Yes, all of them. You should use all of them. One vs all. Look at wikipedia hell. Use all the classifiers. This was the de facto approach before Deep Learning. Please revise your notes on data science.

You have a perfect classifier of cats, another of dogs and another of wild. It's binary. Think of it like that. When you try a wild image, the cat classifier will say no, the dog classifier no, and the wild classifier yes. This was the traditional vision approach before Deep Learning and commonly used in robotics.

De Curtò y DíAz.

GillesVandewiele commented 2 years ago

But that's NOT what you are doing. All your models will predict "yes" on their respective class. They are just giving constant class predictions.

decurtoydiaz commented 2 years ago

Please try the code, for god's sake. You have three classifiers, each with a fixed lambda. If you look at the code, we go through ALL the test data, and each classifier only gets its corresponding class classified correctly.

De Curtò y DíAz.

decurtoydiaz commented 2 years ago

Please be respectful and considerate in your discourse.

It's late in Hong Kong. Please think carefully. Code is correct.

GillesVandewiele commented 2 years ago

I was extremely respectful in my discourse, especially given some of the comments you have already made in this thread. I also do not appreciate you removing/censoring my comments, so I am repeating the gist of it:

Your code is not reproducible since your "lambda" supermeanl_0 is not assigned anywhere. But that is actually not needed. It is easy to infer that in the cell where you use supermeanl_0, you are always predicting 0 (or cat) based on the following results:

RMSE cat
# of errors: 0
Accuracy: 1.0

RMSE dog
# of errors: 500
Accuracy: 0.0

RMSE wild
# of errors: 500
Accuracy: 0.0

Since you predict all of the cats correctly but NONE of the other samples correctly, you are most likely always predicting cat (the only other, very unlikely, situation is where you always mistake dog for wild and wild for dog).

The same can be said about your cell that uses supermeanl_1, which always predicts 1 (or dog)

This would be equivalent to the following piece of code:

class CatModel:
    def predict(self):
        return 'cat'

class DogModel:
    def predict(self):
        return 'dog'

class WildModel:
    def predict(self):
        return 'wild'

You have not shown that you can achieve 100% accuracy (or anything remotely close to that) without using the "lambda" of the class label, which is unknown at test time (you use a c2 coefficient that corresponds to the true label in your cells above it). This is called leakage.

decurtoydiaz commented 2 years ago

One classifier for each lambda. For god's sake. It's very late in Hong Kong. One vs all. Please revise your notes on data science.

Code is in the repository. You can try it yourself. When you do one vs all, you fix lambda and go through all the test data. And each classifier only gets the instances from its corresponding class correct. For example, in AFHQ this means you have a perfect classifier of cats, dogs and wild. And as in traditional vision and robotics, when you do one vs all you have to use all the classifiers if you are given an unlabeled test instance. There is no leakage. 100% on all tasks. Code is correct. Learning with Signatures rocks.

De Curtò y DíAz.

-- https://www.decurto.tw/

GillesVandewiele commented 2 years ago

So given an image (with true, but unknown label cat), your 3 models will predict: M1: "cat" M2: "dog" M3: "wild"

Because that is EXACTLY what they do (cfr my post above). How do you now know it is actually a cat?

decurtoydiaz commented 2 years ago

Please check the code. When doing one vs all, we go through ALL the test set, and each classifier only gets the images of its own class correct. When trying a wild image, it will say: no, no, yes. And this is exactly what happens, because in the loop we go over all the categories with a fixed lambda each time. It's correct. We get 100% accuracy.

De Curtò y DíAz.

-- https://www.decurto.tw/

GillesVandewiele commented 2 years ago

I checked the code; that's where the results I pasted above come from.

When trying a wild image, it will say: yes, yes, yes.

Please check/run the code yourself and print min_rmse in your loop (e.g. in the cell with supermeanl_0). You will see that it will always print 0 (so basically "yes" to every dog and wild image you show it)

decurtoydiaz commented 2 years ago

Hell, in the example there are only classes 0 and 1, to illustrate how it works. You can do class 2 yourself: with lambda 2. Read the comments, they are clear. It's an explanatory Python notebook, not the full code.

De Curtò y DíAz.

-- https://www.decurto.tw/

GillesVandewiele commented 2 years ago

I didn't say anything about class three or lambda 3? I am saying your cells with supermeanl_0 and supermeanl_1 always predict the same thing (you can print min_rmse, which is the prediction). So you get a "yes" from every model, every time.

Anyway, let's leave it at this.

decurtoydiaz commented 2 years ago

If you want class 2 (wild), you need to use supermeanl_2, as I have explained repeatedly. The comments are clear. I attach the code.

#Compute RMSE Signature and print accuracy. Load test instances inside the loop, compute signatures and evaluate.
#e.g. We fix lambda_0 per class 0 and get a perfect classifier at test time.

#One-vs-all.
#We fix supermeanl_2 and go through all the test set.

from sklearn.metrics import mean_squared_error

count = np.zeros(categories, dtype='object')

for c2 in range(0, categories):
    a = os.listdir(folder[c2])
    for z in range(0, len(a)):
        rmse_c = np.empty(categories, dtype='object')
        for c in range(0, categories):
            rmse_c[c] = mean_squared_error(supermeanl_2 * supermeanA[c],
                                           signature_cyz(folder[c2], a[z]), squared=False)
        min_rmse = np.argmin(rmse_c)
        if(min_rmse != c2):
            count[c2] += 1

    print('RMSE ' + labels[c2])
    print('# of errors:', count[c2])
    print('Accuracy:', 1 - count[c2] / len(a))
    print('\n')

Same for other datasets with more classes, one classifier for each class in a one-vs-all fashion. For instance in CIFAR-10 you'll get 10 binary classifiers.

De Curtò y DíAz.

-- https://www.decurto.tw/

decurtoydiaz commented 2 years ago

Again, please be respectful and considerate in your discourse. Code is correct. Issue is solved. There is no leakage. Ambiguity is resolved at test time, as explained in the notebook, by doing one-vs-all, which indeed was the de facto way to do things in many domains such as robotics before Deep Learning emerged. No more active participation in this thread.

De Curtò y DíAz.

-- https://www.decurto.tw/

TechnikEmpire commented 2 years ago

So you make n predictions per test sample and just output all of them? That doesn't make any sense now, does it?

Please contact Facebook research and tell them that fastText makes no sense, since it also makes available a loss function that employs the same method of running multiple binary classifiers in parallel.
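
For what it's worth, that is the ova loss in fastText's Python bindings; a minimal example, assuming a labelled train.txt in fastText's __label__ format:

import fasttext

#One-vs-all loss: one independent binary classifier per label.
model = fasttext.train_supervised(input='train.txt', loss='ova')
#k=-1 with a threshold returns every label scored above the threshold.
print(model.predict('some text to classify', k=-1, threshold=0.5))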

GillesVandewiele commented 2 years ago

So you make n predictions per test sample and just output all of them? That doesn't make any sense now, does it?

Please contact Facebook research and tell them that fastText makes no sense, since it also makes available a loss function that employs the same method of running multiple binary classifiers in parallel.

These binary classifiers don't have constant predictions (i.e. predict the same class every time) like the ones proposed by the author. It's a 1-vs-all scheme with 0% recall and 0% precision for the "other" class. I'd recommend reading the code yourself before making sarcastic comments such as these.

TechnikEmpire commented 2 years ago

So you make n predictions per test sample and just output all of them? That doesn't make any sense now, does it?

Please contact Facebook research and tell them that fastText makes no sense, since it also makes available a loss function that employs the same method of running multiple binary classifiers in parallel.

These binary classifiers don't have constant predictions (i.e. predict the same class every time) like the ones proposed by the author. It's a 1-vs-all scheme with 0% recall and 0% precision for the "other" class. I'd recommend reading the code yourself before making sarcastic comments such as these.

I was just pointing out that your unqualified objection to the existence of one-vs-all is unfounded. But I see what you're saying; I'm not taking a side here, I'm reviewing it myself. I did also notice that the author pulled links to published results, so I'm waiting to hear back on that. I also notice that people are making observations about mislabeled elements in some of the datasets without factoring in that the claimed results learn from a very small subset of the split, and without anyone verifying that the mislabeled elements are within the prediction set. 100% neutral here.

TechnikEmpire commented 2 years ago

I also find it unproductive for everyone to take to twitter and mock. If the author did screw up, that happens. Research is hard. When people default to mockery it's evidence that they have no idea.

decurtoydiaz commented 2 years ago

Thanks guys for your interest. This is only a draft. What the current notebook shows is that if we are able to select among the correct lambdas, we have a set of optimal solutions that generalize very well. Those lambdas come from a geometric constraint, for example a set of 9 equations in AFHQ, so we should be able to determine the correct one to use at test time, be it by optimization, geometric intuition, or some other way (several methods could be useful for this).

Code is under development. Please wait until acceptance. Will keep updating the repository.

De Curtò y DíAz.

GillesVandewiele commented 2 years ago

So you make n predictions per test sample and just output all of them? That doesn't make any sense now, does it?

Please contact Facebook research and tell them that fastText makes no sense, since it also makes available a loss function that employs the same method of running multiple binary classifiers in parallel.

These binary classifiers don't have constant predictions (i.e. predict the same class every time) like the ones proposed by the author. It's a 1-vs-all scheme with 0% recall and 0% precision for the "other" class. I'd recommend reading the code yourself before making sarcastic comments such as these.

I was just pointing out that your unqualified objection to the existence of one-vs-all is unfounded. But I see what you're saying; I'm not taking a side here, I'm reviewing it myself. I did also notice that the author pulled links to published results, so I'm waiting to hear back on that. I also notice that people are making observations about mislabeled elements in some of the datasets without factoring in that the claimed results learn from a very small subset of the split, and without anyone verifying that the mislabeled elements are within the prediction set. 100% neutral here.

I know how one-vs-all works and I never objected to it. I was asking the author questions so he could realise the flaws in his logic himself.

TechnikEmpire commented 2 years ago

I know how one-vs-all works and I never objected to it. I was asking the author questions so he could realise the flaws in his logic himself.

But this:

These binary classifiers don't have constant predictions (i.e. predict the same class every time) like the ones proposed by the author.

That is exactly what a "binary classifier" does. I object to the terminology in this context ("unary" would be more appropriate), but this is the language that was settled on. That is precisely what this kind of classifier does: it predicts how well any input fits into its purposely single-domain knowledge.

Also this:

It's a 1-vs-all scheme with 0% recall and 0% precision for the "other" class.

Makes no sense. I would expect exactly that: a perfect variant of such a classifier would have 0% recall for the other classes, but in that case its precision would be 100%, not 0%.

All I am saying, as a neutral observer, is that if the mislabeled data is not learned from in either the train or validation sets and does not appear in the test set, then neither of the objections I've seen on twitter/reddit/here is validated. Also, based on these comments, the alleged bugs in the code could be people simply not understanding OVA. So either side of the debate is unproven, that's my point. I'm currently syncing the datasets and will try some code here shortly. What I'm not gonna do is jump on twitter and get into a bash fest, which is entirely unproductive, especially from a place of misunderstanding and with unfounded allegations.

rasbt commented 2 years ago

I'm currently syncing the datasets and will try some code here shortly.

To save you some time @TechnikEmpire, you can download the mnist png dataset from here (due to the large numbers of files, downloading the files from Google Drive would take forever): https://github.com/myleott/mnist_png

You can reproduce the results (100% test accuracy) with this dataset. You can even randomize the labels in the test set and still get 100% test accuracy. So this dataset should be sufficient for your experiments.
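
The shuffled-label check is a few lines; a rough sketch, where evaluate, test_images and test_labels are placeholders for whatever pipeline you end up with:

import numpy as np

#Sanity check: re-run the same evaluation with shuffled test labels.
#A leak-free classifier should drop to roughly chance accuracy; staying at 100% means the labels are feeding into the predictions.
rng = np.random.default_rng(0)
shuffled_labels = rng.permutation(test_labels)
print('original labels:', evaluate(test_images, test_labels))
print('shuffled labels:', evaluate(test_images, shuffled_labels))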

TechnikEmpire commented 2 years ago

I'm currently syncing the datasets and will try some code here shortly.

To save you some time @TechnikEmpire, you can download the mnist png dataset from here (due to the large numbers of files, downloading the files from Google Drive would take forever): https://github.com/myleott/mnist_png

You can reproduce the results (100% test accuracy) with this dataset. You can even randomize the labels in the test set and still get 100% test accuracy. So this dataset should be sufficient for your experiments.

Thanks, I'll take a look. Don't get me wrong I completely understand the healthy skepticism.

rasbt commented 2 years ago

Curious to see what you'll find. I would still need an answer to the question of how a method without data leakage can achieve 100% test accuracy when the labels in the test set are flipped.

TechnikEmpire commented 2 years ago

Curious to see what you'll find. I would still need an answer to the question of how a method without data leakage can achieve 100% test accuracy when the labels in the test set are flipped.

Yep that certainly sounds like a problem.

GillesVandewiele commented 2 years ago

I know how one-vs-all works and I never objected to it. I was asking the author questions so he could realise the flaws in his logic himself.

But this:

These binary classifiers don't have constant predictions (i.e. predict the same class every time) like the ones proposed by the author.

That is exactly what a "binary classifier" does. I object to the terminology in this context ("unary" would be more appropriate), but this is the language that was settled on. That is precisely what this kind of classifier does: it predicts how well any input fits into its purposely single-domain knowledge.

Also this:

It's a 1-vs-all scheme with 0% recall and 0% precision for the "other" class.

Makes no sense. I would expect exactly that: a perfect variant of such a classifier would have 0% recall for the other classes, but in that case its precision would be 100%, not 0%.

All I am saying, as a neutral observer, is that if the mislabeled data is not learned from in either the train or validation sets and does not appear in the test set, then neither of the objections I've seen on twitter/reddit/here is validated. Also, based on these comments, the alleged bugs in the code could be people simply not understanding OVA. So either side of the debate is unproven, that's my point. I'm currently syncing the datasets and will try some code here shortly. What I'm not gonna do is jump on twitter and get into a bash fest, which is entirely unproductive, especially from a place of misunderstanding and with unfounded allegations.

Yes, N binary classifiers that predict: this belongs to my class or it doesn't. The author's models predict "yes" every single time. You feed it an image and you get back N times "yes". Great model indeed! As such, it NEVER predicts "other", so TP+FP = 0 for the "other" class (and FN+TN = 0 for its own class); as such, precision is actually undefined if we're being pedantic. There is no healthy skepticism here: the author's code contained a huge leak, which we all pointed out. It was achieving 100% acc on completely random images, or 100% when test labels were swapped. There is no better empirical evidence. The author then came back with a "one-vs-all" solution that boiled down to:

class CatModel:
    def predict(self):
        return 'cat'

class DogModel:
    def predict(self):
        return 'dog'

class WildModel:
    def predict(self):
        return 'wild'

This OF COURSE does not work.
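
On the precision/recall point above, a toy check with placeholder labels shows what a constant predictor gives you for the "other" class: zero recall, and a 0/0 precision that sklearn only reports as a number because you tell it how to handle the zero division:

from sklearn.metrics import precision_score, recall_score

#A "cat" one-vs-all model that answers "cat" for every input.
y_true = ['cat'] * 5 + ['other'] * 10
y_pred = ['cat'] * 15

print(recall_score(y_true, y_pred, pos_label='other'))                      #0.0
print(precision_score(y_true, y_pred, pos_label='other', zero_division=0))  #0/0, reported as 0.0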

decurtoydiaz commented 2 years ago

Guys, this is not reddit. I appreciate your sedulous interest though.

Again, code is under development. Please wait until acceptance. Will keep updating the repository.

De Curtò y DíAz.

TechnikEmpire commented 2 years ago

@decurtoydiaz Any ETA on the code? Cause I wrote my own implementation to sidestep several concerns raised by others about your current implementation, and I got similar results as nslay did in #1 after he corrected your code, except that I tested with your own copies of the dataset(s). Thanks, I look forward to seeing the final published code so I can see where I went wrong.

decurtoydiaz commented 2 years ago

@decurtoydiaz Any ETA on the code? Cause I wrote my own implementation to sidestep several concerns raised by others about your current implementation, and I got similar results as nslay did in #1 after he corrected your code, except that I tested with your own copies of the dataset(s). Thanks, I look forward to seeing the final published code so I can see where I went wrong.

Thanks for your interest. Code is not complete yet. We are trying to illustrate the idea, but not showing how to do it practically. The key is using one-vs-all. But that's not everything. It is not fully explained in the current manuscript either. Please wait until acceptance. We'll keep updating the repository.

Best regards, De Curtò y DíAz.

decurtoydiaz commented 2 years ago

The paper is not submitted yet (well, it has been submitted, but the deadline hasn't closed yet). Please wait. This is only a draft. We will keep updating the repository. Thanks for your interest.

De Curtò y DíAz.