Dataset mismatch issue from CASIAv2 images and groundtruths

flymao627 commented 4 months ago

Hello, author! After I corrected the size of 17 images in the CASIA2.0 dataset using the files you sent, running main_train.py raised an error. The error message is as follows:

Original Traceback (most recent call last):
  File "/home/wit/anaconda3/envs/IML-ViTb/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/wit/anaconda3/envs/IML-ViTb/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/wit/anaconda3/envs/IML-ViTb/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/wit/fly/IML-ViT-main/utils/datasets.py", line 88, in __getitem__
    assert tp_shape == gt_shape, "tp and gt image shape must be the same, but {}, {} got {} and {}".format(tp_path,gt_path,tp_shape, gt_shape)
AssertionError: tp and gt image shape must be the same, but /home/wit/fly/IML-ViT-main/train_dataset/CASIA2.0/Tp/Tp_D_NRD_S_B_ind00055_ind00055_01344.tif, /home/wit/fly/IML-ViT-main/train_dataset/CASIA2.0/Gt/Tp_D_NRD_S_N_art00070_art00092_11822_gt.png got (384, 256) and (256, 384)

I changed line 88 of datasets.py so that the assertion message also prints the two file paths:

assert tp_shape == gt_shape, "tp and gt image shape must be the same, but {}, {} got {} and {}".format(tp_path, gt_path, tp_shape, gt_shape)

I don't quite understand why the images corresponding to tp_path and gt_path are not the same. I hope you can help me and sincerely look forward to your reply. Thank you!

SunnyHaze commented 4 months ago

Hi, thanks for your attention. I see your point, and I have carefully checked my dataset and its sources. As described in this CASIA2.0-Corrected-Groundtruth repository, I obtained the GT of CASIAv2 from this repository. However, the GT they provide contains only 5123 images, while the common CASIAv2 dataset has 5124 manipulated images. The extra image without a GT, Tp_D_NRD_S_B_ind00055_ind00055_01344.tif, is the one that appears in your issue. I may have removed this image from Tp when I downloaded the GT from their repo. However, that was years ago, and I'm sorry that I cannot remember the exact situation.

Further, my dataset-loading code, i.e. the implementation of the mani_dataset class, simply uses os.listdir() and sort() to pair the images under Tp and Gt sequentially, without checking whether the file names match. For example, the first Tp image is paired with the first GT image in dictionary order of the file names. Thus, when the loader reaches this extra Tp image, it pairs it with the wrong GT, and every pair after it is misaligned as well.
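The failure mode described above can be reproduced with a few made-up file names. The sketch below is not the actual mani_dataset code: the helper names pair_by_sorted_index and pair_by_name, the toy file names, and the assumption that a GT file is named after its Tp file plus a "_gt" suffix (as in the paths from the traceback) are all illustrative.

```python
import os


def pair_by_sorted_index(tp_names, gt_names):
    """Pair the i-th sorted Tp name with the i-th sorted Gt name (position only)."""
    return list(zip(sorted(tp_names), sorted(gt_names)))


def pair_by_name(tp_names, gt_names, gt_suffix="_gt"):
    """Pair each Tp name with the Gt name that shares its stem.

    Assumes (hypothetically) that a Gt file is named
    <Tp stem> + gt_suffix + <any extension>.
    Returns (pairs, unmatched_tp).
    """
    gt_by_stem = {}
    for gt in gt_names:
        stem = os.path.splitext(gt)[0]
        if stem.endswith(gt_suffix):
            gt_by_stem[stem[: -len(gt_suffix)]] = gt

    pairs, unmatched = [], []
    for tp in sorted(tp_names):
        stem = os.path.splitext(tp)[0]
        if stem in gt_by_stem:
            pairs.append((tp, gt_by_stem[stem]))
        else:
            unmatched.append(tp)
    return pairs, unmatched


# Toy example: the second Tp image has no GT.
tp = ["Tp_a_001.jpg", "Tp_b_002.tif", "Tp_c_003.jpg"]
gt = ["Tp_a_001_gt.png", "Tp_c_003_gt.png"]

# Positional pairing silently pairs Tp_b_002.tif with the GT of Tp_c_003.
print(pair_by_sorted_index(tp, gt))

# Name-based pairing keeps correct pairs and reports the orphan instead.
print(pair_by_name(tp, gt))
```

Pairing by name turns a silent misalignment into an explicit list of unmatched files, which can then be removed or given a proper GT.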

Thus, I recommend removing this image from the Tp path. Finding a proper GT for this image would also be a good solution.

That is how this issue arises. I will mention it in a suitable place (maybe in one of my dataset-correction repositories), and I am sincerely sorry for the confusion.

If you have further questions, feel free to reach out! Thanks again for pointing out this issue.

SunnyHaze commented 4 months ago

I changed the title of this issue to match the problem you described, so that others with similar problems can find the solution quickly.

flymao627 commented 4 months ago

Thank you very much for your prompt reply. Regarding your statement "However, their provided GT only contains 5123 images, while the common CASIAv2 dataset has 5124 manipulated images": one of the extra files I found was "_list.txt". Also, some Tp and GT file names in the dataset conflict. For example, Tp_D_CMN_M_N_ind00091_ind00091_10647.jpg in Tp corresponds to "Tp_S_CMN_M_N_ind00091_ind00091_10647_gt.png" in GT, and for "Tp_D_NRD_S_B_ind00055_ind00055_01344.tif" mentioned above, the GT is "Tp_S_NRD_S_B_ind00055_ind00055_01344_gt.png". Will this problem affect training? I hope you can help me, thank you!
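Conflicts of this kind can be listed automatically. The sketch below is only an illustration: it assumes the two names differ solely in the second underscore-separated field (D versus S in the examples above) and that GT names end in "_gt"; the helper names normalized_key and find_renamed_gt are hypothetical and not part of IML-ViT.

```python
import os


def normalized_key(stem):
    """Drop the second underscore-separated field (the D/S flag in the examples)."""
    parts = stem.split("_")
    return "_".join(parts[:1] + parts[2:])


def find_renamed_gt(tp_names, gt_names, gt_suffix="_gt"):
    """Map each Tp file lacking an exactly named Gt to a Gt that
    differs from it only in the second name field."""
    gt_stems = {}
    for gt in gt_names:
        stem = os.path.splitext(gt)[0]
        if stem.endswith(gt_suffix):
            gt_stems[stem[: -len(gt_suffix)]] = gt
    by_key = {normalized_key(stem): gt for stem, gt in gt_stems.items()}

    renamed = {}
    for tp in tp_names:
        stem = os.path.splitext(tp)[0]
        if stem not in gt_stems and normalized_key(stem) in by_key:
            renamed[tp] = by_key[normalized_key(stem)]
    return renamed


# The two conflicting pairs quoted in this thread.
tp = [
    "Tp_D_CMN_M_N_ind00091_ind00091_10647.jpg",
    "Tp_D_NRD_S_B_ind00055_ind00055_01344.tif",
]
gt = [
    "Tp_S_CMN_M_N_ind00091_ind00091_10647_gt.png",
    "Tp_S_NRD_S_B_ind00055_ind00055_01344_gt.png",
]
for tp_name, gt_name in find_renamed_gt(tp, gt).items():
    print(tp_name, "->", gt_name)
```

On a real dataset the two lists would come from os.listdir() on the Tp and Gt folders; any file reported here would land in a different sort position than its GT.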

flymao627 commented 4 months ago

If the images obtained by tp_path and gt_path do not correspond, will the training be affected?

SunnyHaze commented 4 months ago

One of the extra files I found was "_list.txt"

Hmm, sorry, I may have explained the issue incorrectly. You are right, there is no extra image; my extra file is also an index file.

After checking carefully, I found that the main issue is that the namtpham repository has already corrected some of the naming problems. Thus, some images may be named differently if your Tp images are not from namtpham's repo. As shown in this section:

Thus, if you did not download the Tp images from their repository, a naming mismatch may result. A better solution is to download both the GT and Tp images from namtpham's repo and replace the 17 problematic images with those from my IML-Dataset-Corrections repo. Or, more simply, download the fully revised dataset (exactly what I am using) from my CASIA2.0-Corrected-Groundtruth repo.

The issue with the dataset is indeed a tricky matter, but it's also worth addressing seriously. I'm sorry for any inconvenience it may have caused you. If you have any further questions, please feel free to discuss them.

SunnyHaze commented 4 months ago

If the images obtained by tp_path and gt_path do not correspond, will the training be affected?

Training on pairs that do not correspond is probably not feasible. You can visualize a few images to see whether they correspond. Moreover, if the names do not match, it would be difficult to pair each image with its GT even without my sort-then-pair approach.
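Before inspecting images by eye, a size comparison over all pairs can narrow down where to look. The sketch below mirrors the tp_shape == gt_shape assertion from the traceback but collects every mismatch instead of stopping at the first. The function names are hypothetical; the default size reader assumes Pillow is installed, and the demo uses made-up sizes so that it runs without any image files.

```python
def pillow_size(path):
    """Return (width, height) of an image file. Requires Pillow (imported lazily)."""
    from PIL import Image

    with Image.open(path) as img:
        return img.size


def find_shape_mismatches(pairs, get_size=pillow_size):
    """Return (tp_path, gt_path, tp_size, gt_size) for every pair whose sizes differ."""
    mismatches = []
    for tp_path, gt_path in pairs:
        tp_size, gt_size = get_size(tp_path), get_size(gt_path)
        if tp_size != gt_size:
            mismatches.append((tp_path, gt_path, tp_size, gt_size))
    return mismatches


# Demo with made-up sizes standing in for real files.
fake_sizes = {
    "a.jpg": (384, 256),
    "a_gt.png": (384, 256),
    "b.tif": (384, 256),
    "c_gt.png": (256, 384),
}
pairs = [("a.jpg", "a_gt.png"), ("b.tif", "c_gt.png")]
print(find_shape_mismatches(pairs, get_size=fake_sizes.get))
```

Equal sizes are necessary but not sufficient: two unrelated images can share a size, so looking at a few Tp/GT pairs side by side is still worthwhile.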

flymao627 commented 4 months ago

Thank you very much. Now the code is ready to run.

SunnyHaze commented 4 months ago

You are welcome! If you have any further questions, please feel free to discuss.

SunnyHaze / IML-ViT

Dataset mismatch issue from CASIAv2 images and groundtruths #16