Heathcliff-saku / ViewFool_

This repository contains ViewFool and the ImageNet-V benchmark proposed in the paper “ViewFool: Evaluating the Robustness of Visual Recognition to Adversarial Viewpoints” (NeurIPS 2022).

What transformations suit ImageNet-V Evaluation? (Error rate 100%) #3

Open f-amerehi opened 3 months ago

f-amerehi commented 3 months ago

Hi @Heathcliff-saku,

Firstly, thank you for the dataset and code. I'd like to evaluate some fine-tuned models (e.g., ResNet, DenseNet) on the ImageNet-V data, but the accuracy comes out as 0. To debug, I simply loaded the default ImageNet-1K checkpoints and used the following code to check their accuracy. The [first page](https://github.com/Heathcliff-saku/ViewFool?tab=readme-ov-file#22-imagenet-v-benchmark) of the repository reports that DenseNet's accuracy on ImageNet-V is ~20%, yet the code below still shows an error rate of 100%. I looked into the evaluation code and believe the normalization is correct. Is there any specific normalization required for the test phase on ImageNet-V? Also, should I set the number of classes to 1000 (ImageNet) or 100 (ImageNet-V)? I would appreciate any thoughts.

Many Thanks.

```python
import torch
from torchvision.datasets import ImageFolder
from torchvision import transforms
from torchvision.models import densenet121, DenseNet121_Weights
from torch.utils.data import DataLoader
from torchmetrics import Accuracy
from tqdm import tqdm

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    normalize
])

imagenetV = 'C:/datasets/ImageNet-V'
test_dataset = ImageFolder(root=imagenetV, transform=test_transform)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

model = densenet121(weights=DenseNet121_Weights.IMAGENET1K_V1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

accuracy = Accuracy(task="multiclass", num_classes=1000).to(device)  # tested both 1000 and 100

with torch.no_grad():
    for images, labels in tqdm(test_loader, desc="Evaluation on imagenet-V"):
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        preds = torch.argmax(outputs, dim=1)
        accuracy.update(preds, labels)

error_rate = 1 - accuracy.compute().item()
print(f"Error rate on ImageNet-V dataset: {error_rate*100:.4f}")

# Output shows:
# Error rate on ImageNet-V dataset: 100.0000
```
Heathcliff-saku commented 2 months ago

@f-amerehi Hi! Sorry for the late reply. Regarding your question, you should use a classifier with 1000 classes (to match the number of categories used by the ImageNet-trained weights). During the testing phase, you need to set the ground truth for the images in the various subfolders of ImageNet-V according to the category labels we provide (i.e., their corresponding labels among the 1000 ImageNet classes). Your testing code may need further adjustment; you might refer to the file classifier/predict2.py. Our evaluation is done per category (this was originally intended to test accuracy on each ImageNet-V category separately; you can modify it to evaluate all ImageNet-V samples at once).
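For illustration, here is a minimal sketch of that label remapping (not code from this repository). It assumes a hypothetical `imagenet_v_labels.txt` file with one `subfolder_name imagenet_class_index` pair per line; the actual mapping file name and format shipped with ImageNet-V may differ, so adjust the parsing accordingly:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torchvision.models import densenet121, DenseNet121_Weights

# Hypothetical mapping file: "subfolder_name imagenet_class_index" per line.
folder_to_idx = {}
with open("imagenet_v_labels.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            folder_to_idx[parts[0]] = int(parts[1])

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    normalize,
])

dataset = ImageFolder(root="C:/datasets/ImageNet-V", transform=test_transform)
# ImageFolder numbers classes 0..K-1 by alphabetical folder order; remap each
# folder index to its ImageNet-1k class index before comparing with the logits.
idx_remap = {dataset.class_to_idx[c]: folder_to_idx[c] for c in dataset.classes}
dataset.target_transform = lambda t: idx_remap[t]

loader = DataLoader(dataset, batch_size=128, shuffle=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = densenet121(weights=DenseNet121_Weights.IMAGENET1K_V1).to(device).eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"Error rate on ImageNet-V: {100 * (1 - correct / total):.2f}%")
```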

I think if your fine-tuned model has normal test results on ImageNet, it's unlikely that it would result in an accuracy of zero on ImageNet-V. If you have any further progress, feel free to let us know anytime!

f-amerehi commented 2 months ago

Hi @Heathcliff-saku, many thanks for getting back to me, and thanks for pointing me to the correct .py file; I'll check it out. Yes, the fine-tuned accuracy on the ImageNet validation set is around 93%, and I'd like to see what it is on ImageNet-V. Have a nice day!

Heathcliff-saku commented 2 months ago

@f-amerehi Additionally, you can also try evaluating your model on ImageNet-V+, a larger and more challenging viewpoint-robustness benchmark recently released by our team that covers 100 ImageNet categories: https://github.com/Heathcliff-saku/VIAT. The corresponding ImageNet labels for these 100 categories are provided in the dataset files. When creating the dataset, we followed the convention of out-of-distribution (OOD) benchmarks such as ImageNet-A/-R/-O, providing only the image-folder layout and .txt files with the corresponding labels. If you have further questions about the testing process, you might also refer to the repositories of those benchmarks. Best of luck!
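If it helps, here is a rough sketch of a per-category breakdown, in the spirit of the per-category assessment mentioned above and equally applicable to ImageNet-V or ImageNet-V+ once the labels are remapped. It reuses `model`, `loader`, and `device` from the earlier sketch and is not code from either repository:

```python
from collections import defaultdict

import torch

# Assumes `model`, `loader`, and `device` from the earlier sketch, with the
# dataset's targets already remapped to ImageNet-1k class indices.
per_class_correct = defaultdict(int)
per_class_total = defaultdict(int)

with torch.no_grad():
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for pred, label in zip(preds.tolist(), labels.tolist()):
            per_class_correct[label] += int(pred == label)
            per_class_total[label] += 1

for cls in sorted(per_class_total):
    err = 100 * (1 - per_class_correct[cls] / per_class_total[cls])
    print(f"ImageNet class {cls}: error rate {err:.1f}%")
```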

f-amerehi commented 2 months ago

Very good, thank you @Heathcliff-saku so much!