Closed Promisery closed 2 years ago
Let's look at it.
If you unroll the loop you'll understand that we just compare one by one all representatives (and each representative has a multiplicative factor) against each test instance. And then choose among them as predicted class the comparison that gives minimum RMSE Signature. Here scale factors are computed optimally assuming you can resolve at test time the ambiguity, using for example the same criteria used to derive the weights (which is not explained in the code). De Curtò y DíAz.
I've figured it out. Thank you for your patience!
Either I'm still not getting this, or I think that the original concern raised by this issue is valid.
Think about it like this, how would we predict a new, unseen image, for which we don't know the label?
We could calculate the signature/features for this image and put it in a variable signature.
Then your test code could look like this:
rmse_c = np.empty(categories, dtype='object')
for c in range(0,categories):
    rmse_c[c] = mean_squared_error(
        globals()['supermeanl_' + str(c2)] * supermeanA[c],
        signature,
        squared=False
    )
min_rmse = np.argmin(rmse_c) # this is our predicted class
However, we don't have variable c2 here? Since we do not know the label.
We could loop over c2 as you do in your code, but that would just cause our prediction to be overwritten 10 times, with a different result each time. Maybe we are supposed to average over these predictions for all c2?
To clarify further, I think it would be useful if you could write a code snippet that would show how to make a prediction on an unlabeled image. Until then, it appears to me you are indeed using the test label in making your prediction.
I agree that the original issue should be re-opened. @Promisery could you please reason how you changed your mind? To me it seems that your concern is valid. @decurtoydiaz could you please detail why the multiplicative factors are constant (and, by extension, what do you mean by 'constant')?
I reimplemented this with CIFAR10 data directly downloaded from toronto.edu website. I got the same results. That's spooky! If there's data leakage, it's not obvious where. Still thinking about it!
Attached is my self-contained reimplementation (for CIFAR10 only).
I originally built signatures on the first 4 file batches and used data_batch_5.bin for validation. If you do it this different way, there's some class imbalance and then RMSE errors are really really bad across the board. So I changed it to be more like the author's by organizing the images into a class-indexed array of lists of images. My folder[c] is not an array of paths, it's an array of training images with class label c.
The author only considers 10 training images from each class for calculating supermeanA. And only 100 validation examples from each class. Really spooky stuff that it somehow generalizes to 10000 test images! And this is a separate implementation too. Please point out any errors in my code. I'm still scratching my head!
Aside from how you build supermeanA/supermeanl, changing n_signatures can also cause misclassification errors in some class labels. For example, n_signatures = 20 gives:
RMSE 0
# of errors: 0
Accuracy: 1.0
RMSE 1
# of errors: 0
Accuracy: 1.0
RMSE 2
# of errors: 0
Accuracy: 1.0
RMSE 3
# of errors: 0
Accuracy: 1.0
RMSE 4
# of errors: 0
Accuracy: 1.0
RMSE 5
# of errors: 0
Accuracy: 1.0
RMSE 6
# of errors: 0
Accuracy: 1.0
RMSE 7
# of errors: 0
Accuracy: 1.0
RMSE 8
# of errors: 13
Accuracy: 0.987
RMSE 9
# of errors: 0
Accuracy: 1.0
@nlml I came to the same conclusion going through the code. Test samples of class c2 are only ever compared to things multiplied by supermeanl_c2, implying that we already know the test sample is in class c2.
In the test loop, the author has lists of images organized by class label. Instead of iterating over (image, label) pairs, the author iterates over images with class label c2. That's completely fine.
So instead of calculating the misclassification error with something like the following
for image, label in imageLabelPairs:
    for c in range(categories): # Calculate score for each class label
        rmse_c[c] = mean_squared_error(...)
    rmse_min = np.argmin(rmse_c) # The label with minimum value is the predicted class
    if rmse_min != label: # Does it match ground truth?
        count[label] += 1
    ...
the author has it organized this way
for c2 in range(categories):
    for z in range(len(imagesWithLabel[c2])): # z is index to z'th image with ground truth label c2
        for c in range(categories): # Calculate score for each class label
            rmse_c[c] = mean_squared_error(...)
        rmse_min = np.argmin(rmse_c) # The label with minimum value is the predicted class
        if rmse_min != c2: # Does it match ground truth?
            count[c2] += 1
        ...
I don't see anything wrong with this.
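To make the point about loop organization concrete, here is a minimal sketch showing that iterating over (sample, label) pairs and iterating over samples grouped by class give the same per-class error counts. It only holds under the assumption that the predictor itself takes no label as input. The names errors_by_pairs, errors_by_class and the threshold predictor are hypothetical and are not taken from the notebook.

```python
def errors_by_pairs(pairs, predict, n_classes):
    """Count per-class errors while iterating over (sample, label) pairs."""
    count = [0] * n_classes
    for sample, label in pairs:
        if predict(sample) != label:
            count[label] += 1
    return count


def errors_by_class(samples_by_class, predict):
    """Count per-class errors while iterating over samples grouped by true label."""
    count = [0] * len(samples_by_class)
    for c2, samples in enumerate(samples_by_class):
        for sample in samples:
            # c2 is used only to score the prediction, never to make it.
            if predict(sample) != c2:
                count[c2] += 1
    return count


def predict(sample):
    """Hypothetical label-free predictor: thresholds a scalar feature."""
    return 0 if sample < 0.5 else 1


# Hypothetical toy data: scalar "features" grouped by ground-truth class.
samples_by_class = [[0.1, 0.7, 0.3], [0.9, 0.2]]
pairs = [(s, c) for c, group in enumerate(samples_by_class) for s in group]

counts_pairs = errors_by_pairs(pairs, predict, 2)
counts_grouped = errors_by_class(samples_by_class, predict)
print(counts_pairs, counts_grouped)  # [1, 1] [1, 1]
```

The grouping by class is harmless here because c2 appears only in the comparison against the prediction. The equivalence stops being informative as soon as c2 is also passed into whatever computes the prediction.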
@nslay Your reimplementation has the same data leakage on line 212.
I think if we were talking about choosing argmin(c) RMSE(supermeanA[c], featuresAA[c2][z]) that makes total sense. You're essentially doing K-means clustering where the K cluster centres are the average over 10 samples from each class's train set, and then you're assigning test samples to the closest cluster by RMSE distance.
As soon as you start comparing the test sample to anything multiplied by something specific to the class it is actually from (that you aren't meant to know yet) then it's dependent on leaked information.
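The label-free rule described above, argmin over c of RMSE(supermeanA[c], signature), can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the notebook's code: rmse and predict_nearest_representative are hypothetical helper names, and the representatives are made-up 2-D vectors standing in for supermeanA.

```python
import math


def rmse(u, v):
    """Root-mean-square error between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def predict_nearest_representative(representatives, signature):
    """Return the index of the class representative closest to `signature`.

    No label is consulted: the only inputs are the per-class representatives
    (the role played by supermeanA in the thread) and the test signature.
    """
    errors = [rmse(rep, signature) for rep in representatives]
    return min(range(len(errors)), key=errors.__getitem__)


# Hypothetical toy data: three class representatives in a 2-D feature space.
representatives = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
predicted = predict_nearest_representative(representatives, [0.9, 1.2])
print(predicted)  # 1
```

Because the function signature has no label argument, it can be called on a single unlabeled image once its signature has been computed.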
You're absolutely right.
Oh, I think I see your and @nlml's point
rmse_c[c] = mean_squared_error(supermeanl[c2] * supermeanA[c], featuresAA[c2][z], squared=False)
supermeanl[c2] and featuresAA[c2] is the leakage. If done with an image/label pair iteration, it becomes more obvious (to me anyway):
for image, label in imageLabelPairs:
    ...
    rmse_c[c] = mean_squared_error(supermeanl[label] * supermeanA[c], featuresAA[label][z], squared=False)
Weights, that is, optimal scale factors are computed according to Definition 4. What we show here is that we can achieve perfect score and no overfitting given that you can choose the right scale factors in validation and that you can resolve the ambiguity of which one to use in test. Code in the repository is very preliminary and paper still not accepted, will release more code soon. Thanks for the comments. De Curtò y DíAz.
This seems to be something like what was intended. Check all lambdas over categories in testing since you don't know the ground truth category.
for z in range(xtest.shape[0]): # Over all test examples!
    label = int(ytest[z])
    image = xtest[z, ...].astype(np.uint8)
    image = image.transpose(1,2,0)
    #image = image.transpose(2,1,0)
    image = np.reshape(image, (image.shape[0], image.shape[1] * image.shape[2]))
    image = iisignature.sig(image, N_truncated)
    rmse_c = np.empty((categories, categories), dtype='object')
    for c2 in range(categories): # Scan over supermeanl
        for c in range(categories): # Scan over categories
            rmse_c[c2,c] = mean_squared_error(supermeanl[c2] * supermeanA[c], image, squared=False)
    rmse_c = rmse_c.min(axis=0) # Consider the minimum rmse_c over c2
    min_rmse = np.argmin(rmse_c) # Then calculate the predicted class label
    if min_rmse != label:
        count[label] += 1
The performance is not good anymore. That's depressing... it would be really cool if author really had got perfect test performance!
RMSE 0
# of errors: 930
Accuracy: 0.06999999999999995
RMSE 1
# of errors: 832
Accuracy: 0.16800000000000004
RMSE 2
# of errors: 872
Accuracy: 0.128
RMSE 3
# of errors: 963
Accuracy: 0.03700000000000003
RMSE 4
# of errors: 926
Accuracy: 0.07399999999999995
RMSE 5
# of errors: 866
Accuracy: 0.134
RMSE 6
# of errors: 412
Accuracy: 0.5880000000000001
RMSE 7
# of errors: 921
Accuracy: 0.07899999999999996
RMSE 8
# of errors: 782
Accuracy: 0.21799999999999997
RMSE 9
# of errors: 781
Accuracy: 0.21899999999999997
I'm sorry author. Clever data organization and coding can get the best of us. Reddit is/was also confused.
Weights, that is, optimal scale factors are computed according to Definition 4. What we show here is that we can achieve perfect score and no overfitting given that you can choose the right scale factors in validation. You are able to determine adequate lambda in test if you use same criteria. Code in the repository is very preliminary and paper still not accepted, will release more code soon. Thanks for the comments. De Curtò.
It's not really validation if you are updating your weights based on it. It's just splitting your train set for each class into two groups to use for different parts of your fitting process.
What we show here is that we can achieve perfect score and no overfitting given that you can choose the right scale factors in validation.
The issue is exactly this. For test samples from class c2 you compare them to the supermeanA[c] scaled by only the scale factor supermeanl[c2] for class c2. You are already stating the test sample is in c2 when you do this. Information has been leaked.
If you pooled over supermeanl[c2] * supermeanA[c] for all combinations of c, c2 in cartessian_product(C, C) and reduced that down then that can be fair.
If you considered only supermeanA[c] in the comparison, that would be fair; it's close to K-means where the iterated-integral signature is a feature extractor on the raw images.
But to only compare to things polluted by knowledge of the true class is not fair.
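A minimal sketch of the pooled option mentioned above: try every scale factor with every representative, reduce over the scale factors, then pick the class. The reduction used here (minimum over c2, then argmin over c) mirrors the one in @nslay's snippet earlier in the thread; other reductions are possible. The function names and the toy vectors are hypothetical, and plain lists stand in for the numpy arrays supermeanl and supermeanA.

```python
import math
from itertools import product


def rmse(u, v):
    """Root-mean-square error between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def predict_pooled(scale_factors, representatives, signature):
    """Label-free prediction pooling over all (c2, c) combinations.

    For each candidate class c, keep the smallest RMSE over all scale
    factors c2, then return the class with the smallest pooled error.
    """
    n_classes = len(representatives)
    best = [math.inf] * n_classes
    for c2, c in product(range(len(scale_factors)), range(n_classes)):
        # Elementwise product, the counterpart of supermeanl[c2] * supermeanA[c].
        scaled = [l * a for l, a in zip(scale_factors[c2], representatives[c])]
        best[c] = min(best[c], rmse(scaled, signature))
    return min(range(n_classes), key=best.__getitem__)


# Hypothetical toy data: two classes, 2-D features.
scale_factors = [[1.0, 1.0], [2.0, 2.0]]
representatives = [[1.0, 0.0], [0.0, 1.0]]
predicted = predict_pooled(scale_factors, representatives, [0.0, 1.9])
print(predicted)  # 1
```

Nothing in the call depends on the true class of the sample, so the same function works for an unlabeled image. Whether it is accurate is a separate question; the thread reports poor CIFAR10 accuracy for this kind of reduction.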
Here is a unique hand drawn 28x28 pixel image of a character that I have just made.
Please provide a minimal code cell (that would work dropped directly into the end of your notebook after running it entirely as you have provided it for MNIST) that would perform inference on this single image and tell us which class it belongs to.
Thanks for the comments. In this particular example, we assume we can correctly resolve the ambiguity at test time of which probably good optimal lambda to use, for instance using the same criteria we used to derive the weights. And that if you do so, there is no overfitting and we get 100% accuracy on all tasks. That is, for example in AFHQ, we can find analytically the n-dimensional lambda that correctly classifies the samples at test time: the only ambiguity here is being able to resolve which one of the 3 n-dimensional scale factors to use (which is not explained in the code). De Curtò y DíAz.
Scale factors are computed optimally assuming you can resolve at test time the ambiguity, using for example the same criteria used to derive the weights. So, lambdas shouldn't be changed, you have found probably good optimal solutions in validation. The only thing that is not explained here is how you choose among those optimal lambdas at test time. De Curtò y DíAz.
Again, scale factors are computed optimally assuming you can resolve at test time the ambiguity, using for example the same criteria used to derive the weights. So, lambdas shouldn't be changed, you have found probably good optimal solutions in validation. The only thing that is not explained here is how you choose among those optimal lambdas at test time. De Curtò y DíAz.
Here scale factors are computed optimally assuming you can resolve at test time the ambiguity, using for example the same criteria used to derive the weights (which is not explained in the code). De Curtò y DíAz.
Please be considerate and respectful in your discourse.
The code is only a generalisation of our previous work: https://github.com/decurtoydiaz/signatures Please check it out.
Here we show that there exist some optimal weights tuned on validation that can be used to classify without overfitting in test if you are able to resolve the ambiguity of which one to use in test. The ambiguity is a geometric constraint, so it can be resolved (using several ways). But you can forget about that definition and try to find yourself the weights using some other method. They can also be constant factors, or be found using grid search, bayesian analysis or k-fold crossvalidation. The method is indeed general.
De Curtò y DíAz.
Please check the new updated code. The ambiguity in test is determined using one-vs-all fixing the proper lambda. You get n-classifiers one per class. Accuracy 100%.
De Curtò y DíAz.
You have added three new code cells for evaluating cat, dog, and wild from AFHQ. In each of the three cells you have unrolled the c2 loop and hard coded what was globals()['supermeanl_' + str(c2)] to be supermeanl_0, supermeanl_1, and supermeanl_2.
In each of the three cases they are 100% accurate for class c2 and 0% accurate for the other two classes, because your model always guesses whatever is hardcoded for supermeanl_ of c2.
I'm sorry but this does not work. You cannot perform inference on a sample you don't already have a label for.
To quote previous request:
Here is a unique hand drawn 28x28 pixel image of a character that I have just made.
Please provide a minimal code cell (that would work dropped directly into the end of your notebook after running it entirely as you have provided it for MNIST) that would perform inference on this single image and tell us which class it belongs to.
100% on all classes by leaking c2 (test label) into the predictions.
count = np.zeros(categories, dtype='object')
for c2 in range(0,categories):
    a = os.listdir(folder[c2])
    for z in range(0,len(a)):
        rmse_c = np.empty(categories, dtype='object')
        for c in range(0,categories):
            rmse_c[c] = mean_squared_error(globals()['supermeanl_' + str(c2)] * supermeanA[c], signature_cyz(folder[c2], a[z]), squared=False)
        min_rmse = np.argmin(rmse_c)
        if(min_rmse != c2):
            count[c2] += 1
---
RMSE cat
# of errors: 0
Accuracy: 1.0
RMSE dog
# of errors: 0
Accuracy: 1.0
RMSE wild
# of errors: 0
Accuracy: 1.0
100% on cat class by hardcoding the prediction to be cat, 0% acc on dog and wild because the model just says cat for everything.
count = np.zeros(categories, dtype='object')
for c2 in range(0,categories):
    a = os.listdir(folder[c2])
    for z in range(0,len(a)):
        rmse_c = np.empty(categories, dtype='object')
        for c in range(0,categories):
            rmse_c[c] = mean_squared_error(supermeanl_0 * supermeanA[c], signature_cyz(folder[c2], a[z]), squared=False)
        min_rmse = np.argmin(rmse_c)
        if(min_rmse != c2):
            count[c2] += 1
---
RMSE cat
# of errors: 0
Accuracy: 1.0
RMSE dog
# of errors: 500
Accuracy: 0.0
RMSE wild
# of errors: 500
Accuracy: 0.0
100% on dog class by hardcoding the prediction to be dog, 0% acc on cat and wild because the model just says dog for everything.
count = np.zeros(categories, dtype='object')
for c2 in range(0,categories):
    a = os.listdir(folder[c2])
    for z in range(0,len(a)):
        rmse_c = np.empty(categories, dtype='object')
        for c in range(0,categories):
            rmse_c[c] = mean_squared_error(supermeanl_1 * supermeanA[c], signature_cyz(folder[c2], a[z]), squared=False)
        min_rmse = np.argmin(rmse_c)
        if(min_rmse != c2):
            count[c2] += 1
---
RMSE cat
# of errors: 500
Accuracy: 0.0
RMSE dog
# of errors: 0
Accuracy: 1.0
RMSE wild
# of errors: 500
Accuracy: 0.0
100% on wild class by hardcoding the prediction to be wild, 0% acc on cat and dog because the model just says wild for everything.
count = np.zeros(categories, dtype='object')
for c2 in range(0,categories):
    a = os.listdir(folder[c2])
    for z in range(0,len(a)):
        rmse_c = np.empty(categories, dtype='object')
        for c in range(0,categories):
            rmse_c[c] = mean_squared_error(supermeanl_2 * supermeanA[c], signature_cyz(folder[c2], a[z]), squared=False)
        min_rmse = np.argmin(rmse_c)
        if(min_rmse != c2):
            count[c2] += 1
---
RMSE cat
# of errors: 500
Accuracy: 0.0
RMSE dog
# of errors: 500
Accuracy: 0.0
RMSE wild
# of errors: 0
Accuracy: 1.0
All of them. You should use all of them. One vs all. Look at wikipedia hell. Use all the classifiers. This was the de facto approach before Deep Learning. Please revise your notes on data science.
De Curtò y DíAz.
If you know which of the three classifiers to use, then you already know what the class label is.
If you don't know which of the three to use, then you have an ensemble of three models all disagreeing with one another equally.
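The tie described above can be illustrated with a small sketch. It assumes the behaviour the commenters report for the three AFHQ cells, namely that each fixed-lambda model answers its own class for every input; the author disputes this reading, so the constant classifiers below are a hypothetical stand-in and not the notebook's models.

```python
from collections import Counter


def make_constant_classifier(k):
    """Hypothetical stand-in for a model that answers class k for every input."""
    return lambda sample: k


def vote(classifiers, sample):
    """Collect one vote per classifier and return the tally per class."""
    return Counter(clf(sample) for clf in classifiers)


# One classifier per class, as in a one-vs-all setup with three classes.
classifiers = [make_constant_classifier(k) for k in range(3)]

for sample in ["cat image", "dog image", "wild image"]:
    # Every class receives exactly one vote, whatever the input is.
    print(sample, dict(vote(classifiers, sample)))
```

Under that assumption the tally is identical for every input, so combining the three models does not select a class. A one-vs-all ensemble needs each member to say "no" on some inputs, or to emit comparable confidence scores, before the votes carry information.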
Again, all of them. You should use all of them. One vs all. Look at wikipedia hell. Use all the classifiers. This was the de facto approach before Deep Learning. Please revise your notes on data science.
You have a perfect classifier of cats, another of dogs and another of wild. It's binary. Think it like that. When you try your wild on cat will say no, and no on dogs and yes on wild. This was traditional vision approach before Deep Learning and commonly used in robotics.
De Curtò y DíAz.
When you try your wild on cat will say no, and no on dogs and yes on wild.
If you try the wild model on cat your model will say wild because it is hardcoded to say wild.
If you try the wild model on dog your model will say wild because it is hardcoded to say wild.
If you try the wild model on wild your model will say wild because it is hardcoded to say wild.
Please try the code, for god's sake. You have three classifiers each one with a fixed lambda. If you look at the code, we go through ALL the test data, and it gets only the corresponding class classified.
De Curtò y DíAz.
Here is a unique hand drawn 28x28 pixel image of a character that I have just made.
Please provide a minimal code cell (that would work dropped directly into the end of your notebook after running it entirely as you have provided it for MNIST) that would perform inference on this single image and tell us which class it belongs to.
I have tried your code. That's why I and the others here are sure you have leaked information from the test labels.
If you don't need access to the test labels to make a prediction, then you will be able to perform inference on this image and classify it.
If you do need access to a test label for this image in order to be able to classify it, then you have to concede that you have leaked information from the test set labels when you computed your accuracy scores.
Please check the updated example. You DON'T need any information from the labels. You try all classifiers on the given input. There is no leakage. One vs all. Please revise your notes on data science.
De Curtò y DíAz.
You DON'T need any information from the labels. There is no leakage.
How do you explain the fact that if you swap two class labels in the test set (e.g., rename 1 to 6, and 6 to 1) you still get 100% test accuracy? Unless you swap them in the training set too, this should be impossible unless there is a dataset leakage.
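The label-swap check proposed above can be written as a short sanity test. The sketch below uses a hypothetical label-free nearest-representative predictor on made-up 2-D data; swap_labels and accuracy are illustrative helper names. For a predictor that never sees test labels, exchanging two classes in the test labels alone should lower the measured accuracy.

```python
import math


def rmse(u, v):
    """Root-mean-square error between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def predict_nearest(representatives, x):
    """Label-free prediction: index of the closest class representative."""
    errors = [rmse(rep, x) for rep in representatives]
    return min(range(len(errors)), key=errors.__getitem__)


def accuracy(predict, samples, labels):
    """Fraction of samples whose prediction matches the given label."""
    hits = sum(predict(x) == y for x, y in zip(samples, labels))
    return hits / len(labels)


def swap_labels(labels, a, b):
    """Return a copy of `labels` with classes a and b exchanged."""
    return [b if y == a else a if y == b else y for y in labels]


# Hypothetical toy data: two well-separated classes.
representatives = [[0.0, 0.0], [5.0, 5.0]]
samples = [[0.1, -0.2], [0.3, 0.1], [4.8, 5.1], [5.2, 4.9]]
labels = [0, 0, 1, 1]


def predict(x):
    return predict_nearest(representatives, x)


acc_original = accuracy(predict, samples, labels)
acc_swapped = accuracy(predict, samples, swap_labels(labels, 0, 1))
print(acc_original)  # 1.0
print(acc_swapped)   # 0.0
```

If an evaluation pipeline still reports the same accuracy after the test labels are swapped, the labels are reaching the prediction step somewhere.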
Hell, go back to your first course in programming. I'm not here to answer those questions. It's late in Hong Kong. Renaming one variable to another doesn't change anything. Computers don't understand variable names. Please, think before asking a question.
Please be considerate and respectful in your discourse.
Code is on the repository. You can try it yourself. When you do one vs all, you fix lambda and go through all the test data. And you get only correct the instances from the corresponding classes. For example, in AFHQ this means you have a perfect classifier of cats, dogs and wild. And as in traditional vision and robotics, when you do one vs all you have to use all classifiers if you are given an unlabeled test instance. There is no leakage. 100% on all tasks. Code is correct. Learning with Signatures rocks.
De Curtò y DíAz.
Oh wow I forgot to look at this thread, seems like we are actually just repeating ourselves. Linking #3 to keep it efficient.
Seems like all the bases have been really covered here. But, I feel like it's worth noting that there are well known errors in all of these test sets, so we don't expect 100% (https://arxiv.org/pdf/2103.14749.pdf). Unless a method is outlined for choosing the scale factor a priori per image, the model is not useful (as others have stated). At the very least, if the idea is that one, in principle, could determine which scale factor to use, without labels, the accuracy of the current best way of doing so should be reported, not the theoretical accuracy "if one could perfectly find the scale factor". That problem appears no easier than the original classification problem to me though.
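To illustrate why the two numbers should be reported separately, here is a sketch that computes an "oracle" accuracy (the scale factor is picked using the true label) next to a label-free accuracy (the scale factor is searched over without labels). Everything here is hypothetical: the helper names, the scale factors, and the toy data, which is deliberately contrived so that the two figures differ. It says nothing about the actual datasets or about how the paper's scale factors behave.

```python
import math


def rmse(u, v):
    """Root-mean-square error between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def scaled(lam, rep):
    """Elementwise product of a scale factor and a representative."""
    return [l * a for l, a in zip(lam, rep)]


def argmin(values):
    return min(range(len(values)), key=values.__getitem__)


def predict_oracle(scale_factors, reps, x, true_label):
    """Uses the scale factor of the TRUE class: needs the label."""
    errors = [rmse(scaled(scale_factors[true_label], rep), x) for rep in reps]
    return argmin(errors)


def predict_label_free(scale_factors, reps, x):
    """Searches over all scale factors: needs no label."""
    errors = [min(rmse(scaled(lam, rep), x) for lam in scale_factors)
              for rep in reps]
    return argmin(errors)


# Contrived toy data (assumption): two classes, 2-D features.
scale_factors = [[1.0, 0.0], [0.0, 1.0]]
reps = [[1.0, 0.0], [0.0, 1.0]]
samples = [[0.9, 0.1], [0.6, 2.0], [0.1, 0.9], [2.0, 0.6]]
labels = [0, 0, 1, 1]

oracle_acc = sum(predict_oracle(scale_factors, reps, x, y) == y
                 for x, y in zip(samples, labels)) / len(labels)
label_free_acc = sum(predict_label_free(scale_factors, reps, x) == y
                     for x, y in zip(samples, labels)) / len(labels)
print(oracle_acc, label_free_acc)  # 1.0 0.5
```

The oracle figure is an upper bound that depends on information unavailable at inference time; the label-free figure is the one a user of the method would actually observe.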
Code is correct. Issue is already solved and thoroughly discussed in https://github.com/decurtoydiaz/learning_with_signatures/issues/3 .
Again, One vs all. You should use all classifiers each with a fixed lambda. There is no leakage. Ambiguity is resolved in test time as explained in the notebook by doing one-vs-all, which indeed was the de facto way to do things in many domains such as Robotics before Deep Learning emerged. Please revise your notes on data science.
When you do one vs all, you fix lambda and go through ALL the test data. And you get only correct the instances from the corresponding classes. For example, in AFHQ this means you have a perfect classifier of cats, dogs and wild. And as in traditional vision and robotics, when you do one vs all you have to use all classifiers if you are given an unlabeled test instance. There is no leakage. 100% on all tasks.
What's more, weights (videlicet, optimal scale factors) are tuned on VALIDATION (indeed, with very few samples; check the code; it's the range between begin_validate and end_validate; 100 or 500, depending on the task) and then achieve perfect generalisation on the test set. The most beautiful example of this is with Four Shapes, where only using 10 train samples (4 classes, 40 in total) to compute the representatives, and 100 validation samples (4 classes, 400 in total) to tune optimal scale factors, we achieve perfect accuracy on around 14,000 samples. This dataset is also particularly interesting because it is a good test for the properties of the signature transform, that capture area and order of the input paths.
Please, no more active participation in this thread is allowed.
De Curtò y DíAz.
Thank you for sharing your work! I'm still learning from your papers, and there is a question that I am not sure about:
When calculating test accuracy, you used
rmse_c[c] = mean_squared_error(globals()['supermeanl_' + str(c2)] * supermeanA[c], globals()['featuresAA_' + str(c2)][z], squared=False)
where c2 is the true label of test data. This brings some concern because when testing, c2 should not be available and thus should not be used anywhere. Using globals()['featuresAA_' + str(c2)][z] is fine as it only loads the test data. However, using globals()['supermeanl_' + str(c2)] may cause a leakage. When using supermeanl_i, the true label, i.e. c2, should not be available. Therefore, I believe iterating through all supermeanl0~supermeanl(N-1) is the correct way to do so.