edgarschnfld / CADA-VAE-PyTorch

Official implementation of the paper "Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders" (CVPR 2019)

Data loader assumes access to all test samples! #16

Closed: sebastianbujwid closed this issue 4 years ago

sebastianbujwid commented 4 years ago

Hi,

In your data loader, when transforming the samples, you seem to assume access to all test samples: https://github.com/edgarschnfld/CADA-VAE-PyTorch/blob/master/model/data_loader.py#L108-L109

I believe you should instead use scaler.fit() on the train_features and only scaler.transform() on test_seen_feature and test_unseen_feature.

Assuming access to all test samples corresponds to a different experimental setup and makes comparisons to other methods unfair.
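
In code, the change I am suggesting would look roughly like this (a sketch only; the variable names follow the linked data_loader.py, and I am assuming the scaler is a scikit-learn MinMaxScaler):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the training features only...
train_feature = scaler.fit_transform(feature[trainval_loc])
# ...and reuse those training statistics for both test splits.
test_seen_feature = scaler.transform(feature[test_seen_loc])
test_unseen_feature = scaler.transform(feature[test_unseen_loc])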

edgarschnfld commented 4 years ago

From how I read the code, the test and training features are scaled independently of each other. So the training samples are transformed without any min-max knowledge of the test samples. To be more precise: the code below takes only the unseen class features as input and produces normalized unseen class features as output.

test_unseen_feature = scaler.fit_transform(feature[test_unseen_loc])

I think your confusion comes from this line: train_feature = scaler.fit_transform(feature[trainval_loc])

In particular, from the choice of the word "trainval". "Trainval" actually refers to "train". The reason for the "val" part is that, for hyperparameter tuning, you divide the training set into a train part and a val part. This way you never need to look at the final test set, which is how it should be.
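
A hypothetical illustration of that convention (the names here are made up; only the split itself matters):

import numpy as np
from sklearn.model_selection import train_test_split

trainval_features = np.random.rand(100, 2048)  # stand-in for feature[trainval_loc]
# For hyperparameter tuning, split "trainval" into a train part and a val part;
# the final test set is never consulted.
train_part, val_part = train_test_split(trainval_features, test_size=0.2, random_state=0)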

sebastianbujwid commented 4 years ago

From how I read the code, the test and training features are scaled independently of each other. So the training samples are transformed without any min-max knowledge of the test samples.

Yes, that's how I read the code too.

So the training samples are transformed without any min-max knowledge of the test samples.

Yes, I agree. The problem I see is that the test samples are normalized based on information from all the other test samples, which I believe should not be done in typical evaluation setups. Test samples are typically evaluated one at a time, without using any information from the other test samples. Assuming access to all test samples in advance at evaluation time corresponds to a different, non-typical evaluation setup. Sorry if I was not precise about that before.
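
To make the difference concrete, here is a hypothetical sketch (the arrays are made up): a scaler fitted on the training features can normalize each test sample independently, one at a time, whereas fit_transform on the test set requires all test samples up front:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 4))  # hypothetical training features
test = rng.normal(size=(10, 4))    # hypothetical test features

scaler = MinMaxScaler().fit(train)  # statistics come from training data only

# Typical (inductive) evaluation: each test sample can be normalized on its own.
one_at_a_time = np.vstack([scaler.transform(x.reshape(1, -1)) for x in test])

# What the current code does: the test set's own min/max is used,
# which requires seeing all test samples before making any prediction.
all_at_once = MinMaxScaler().fit_transform(test)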

I just Googled for a Stack Exchange answer to a similar issue: https://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-i I hope that clarifies my concern well. From what I can see, you did the normalization the 2nd way, but should have used the 3rd way instead.

Also, an example of how the normalization should be done, from the scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html#sphx-glr-auto-examples-linear-model-plot-sparse-logistic-regression-mnist-py Lines:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
edgarschnfld commented 4 years ago

Ok, I see your point. But the sequential observation assumption is not a strict one. As an analogy for zero-shot learning with birds (as in CUB): imagine you read about 50 new bird species in a lexicon (the attributes). Then you go to the zoo (get the image features). After observing all the new birds there, you normalize over your observations and classify them according to your knowledge from the lexicon. That perfectly counts as zero-shot learning, and it does not make the sequential assumption. Apart from this, I agree with you.

edgarschnfld commented 4 years ago

I looked into it. I adjusted the two lines with the scaler and pushed the change. I tested the code, and the experimental results are still the same.

sebastianbujwid commented 4 years ago

Ok, I see your point. But the sequential observation assumption is not a strict one. As an analogy for zero-shot learning with birds (as in CUB): imagine you read about 50 new bird species in a lexicon (the attributes). Then you go to the zoo (get the image features). After observing all the new birds there, you normalize over your observations and classify them according to your knowledge from the lexicon. That perfectly counts as zero-shot learning, and it does not make the sequential assumption. Apart from this, I agree with you.

You can of course evaluate your method in any way you want, but assuming that you are able to see all (or any) of the test samples (in this case, the input images) before making predictions is not a typical evaluation setup. That would be closer to transductive learning, which is in general an easier problem, and it did not appear to me that this is what you intended to solve.

This is not specific to zero-shot learning. My comment applies to both lines, including the evaluation on the test set of seen classes in the first of the two lines: test_seen_feature = scaler.fit_transform(feature[test_seen_loc])

hellowangqian commented 4 years ago

This is not specific to zero-shot learning. My comment applies to both lines, including the evaluation on the test set of seen classes in the first of the two lines: test_seen_feature = scaler.fit_transform(feature[test_seen_loc])

@sebastianbujwid Have you done any evaluation after modifying the code to the "typical" setting (i.e. normalising the test features using the normalisation parameters derived from the training data, in a purely inductive way)? Do the results differ significantly?

edgarschnfld commented 4 years ago

@hellowangqian The code has already been adjusted (see the last commit). I tested the code with the normalization parameters derived from the training data on 3 datasets, and the results were the same as before. Of course, if anyone notices a deviation, let us know :)