Open lyyi599 opened 3 days ago
Hi @lyyi599,
Thank you for reaching out and using DALI.
Under the same settings, the DALI pipeline achieves only 67.6% accuracy, while using PyTorch transforms achieves the paper's reported result of 79.1%. In the experiment logs using PyTorch transforms, the model initially learns samples from head classes (categories with more samples that are easier to collect), as observed in the first 50 batches.
DALI file reader does the following:
- initially mix the whole data set to make sure that it doesn't sample only from the first class first
- read N samples into shuffling buffer, where N is 1024 by default
- randomly sample the buffer
So it should not favor less-represented samples. Can you gather statistics of the classes DALI reader returns? Do they match the sample distribution in the dataset?
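For example, the statistics could be gathered along these lines (an illustrative sketch, not part of DALI; `label_batches` stands for any iterable of label arrays drained from the loader for one epoch):

```python
from collections import Counter

import numpy as np

def class_histogram(label_batches):
    """Count how many times each class label appears across batches."""
    counts = Counter()
    for labels in label_batches:
        counts.update(np.asarray(labels).ravel().tolist())
    return counts

def compare_to_dataset(loader_counts, dataset_counts):
    """Return {class: (seen, expected)} for classes whose counts differ."""
    mismatched = {}
    for cls, expected in dataset_counts.items():
        seen = loader_counts.get(cls, 0)
        if seen != expected:
            mismatched[cls] = (seen, expected)
    return mismatched
```

With DALI this could be fed with the label outputs of the iterator for one full epoch; if `compare_to_dataset` returns an empty dict against the per-class counts of the index file, the reader is returning the dataset distribution faithfully.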
Hi @JanuszL ,
Thank you for your reply. After verification, each epoch does indeed sample every item in the train set exactly once, and the classes returned match the sample distribution in the dataset. However, the code still produces the results mentioned earlier. For example, in the first epoch, the accuracy obtained on the val set using DALI is as follows: many: 20.0%, med: 21.0%, few: 22.2%, average: 25.4%, while the results using PyTorch transforms are: many: 3.4%, med: 9.3%, few: 16.5%, average: 11.5%.
This is very confusing to me. The main training code I'm currently using is attached (including the pipeline and transform): trainer.py.txt. Simply switching the dali_dataset flag to include iNaturalist2018 (which corresponds to using the pipeline and transform for data loading) results in the significant accuracy drop mentioned above. Could you help me check whether I am using DALI correctly?
It is worth mentioning that the iNaturalist2018 dataset uses a .txt file for indexing, so the create_dali_pipeline function includes the corresponding data_list_dir parameter for reading it.
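For reference, DALI's file-list format is one relative path and one numeric label per line, which is what these index files contain. A small, hypothetical helper to sanity-check such a file before handing it to the reader:

```python
def parse_file_list(path):
    """Parse a DALI-style file list: one 'relative/path label' per line."""
    samples = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            # rsplit so paths containing spaces still parse correctly
            parts = line.rsplit(maxsplit=1)
            if len(parts) != 2 or not parts[1].lstrip("-").isdigit():
                raise ValueError(f"malformed line {lineno}: {line!r}")
            samples.append((parts[0], int(parts[1])))
    return samples
```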
Thank you for any suggestions you may have.
Hello @lyyi599 ,
Where do the iNaturalist18_train.txt and iNaturalist18_val.txt files come from? They're not part of the original dataset. Perhaps they are simply generated incorrectly and the samples are mislabeled in training?
Hi @mzient ,
Thank you for your reply, and I apologize for not explaining the issue more clearly. Let me provide some context for the code: this is about long-tail recognition. Many real-world problems exhibit a long-tail distribution, meaning that in the training process, the number of samples per class varies. The iNaturalist2018 dataset is an example of this, and it has been widely studied. You can find existing examples here: https://github.com/shijxcs/LIFT/tree/661ead9b78368f05ba79abe4672d63154467f823/datasets/iNaturalist18. Therefore, the issue is likely not related to the .txt file used for indexing.
In the trainer.py code above, the only difference is how iNaturalist18 is loaded—specifically, the use of the pipeline and dataloader. This difference is causing the accuracy drop, so I suspect that there might be an issue with how I’m using DALI, or there could be some bugs in DALI that I haven’t identified. For comparison, the Imagenet_LT dataset is also indexed using a .txt file, and it uses a similar pipeline and dataloader approach, but I am able to obtain comparable results.
We are not saying it is the cause, but we want to make sure we are looking at the same things. If the index file misses some samples, DALI will not return them, underrepresenting some classes and overrepresenting others. If you could also tell us how you generated these files, or share the files themselves, that would be great.
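For example, a quick check that no class is missing from the index file (illustrative helpers, not part of DALI; they assume one `path label` pair per line):

```python
from collections import Counter

def per_class_counts(lines):
    """Count samples per class from 'path label' index lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[int(line.rsplit(maxsplit=1)[1])] += 1
    return counts

def missing_classes(counts, num_classes):
    """Classes expected in [0, num_classes) but absent from the index."""
    return sorted(set(range(num_classes)) - set(counts))
```

Running `missing_classes(per_class_counts(open("iNaturalist18_train.txt")), 8142)` should return an empty list if the train index covers all iNaturalist2018 classes.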
I get you. I downloaded the files from this repo: https://github.com/shijxcs/LIFT/tree/661ead9b78368f05ba79abe4672d63154467f823/datasets/iNaturalist18 , and I ensured that the files properly index the corresponding data.
Thanks for this awesome tool.
Describe the question.
For my experiments, I need to modify a piece of open-source code. Because the transforms are time-consuming and the iNaturalist2018 dataset is large (over 400,000 images), I plan to switch to DALI for data loading. Below is the original open-source code snippet (from: https://github.com/shijxcs/LIFT/blob/661ead9b78368f05ba79abe4672d63154467f823/trainer.py#L102).
Referring to the DALI example (https://github.com/NVIDIA/DALI/blob/4562f157a203bd17a1dbc9a0b07f05ba3a41c1fb/docs/examples/use_cases/pytorch/resnet50/main.py#L275), I have successfully adapted the code for ImageNet_LT and achieved results comparable to the original paper (77.0 with LA loss). The modified code is as follows:
As mentioned earlier, the code has been validated on the ImageNet_LT dataset. However, when applied to the iNaturalist2018 dataset, it results in significant differences, specifically:
In contrast, using the DALI pipeline prioritizes learning from tail classes (categories with fewer samples that are harder to collect). The corresponding log is as follows:
It is evident that the learning processes of the two methods differ significantly. However, such a large discrepancy is not observed on the ImageNet_LT dataset.
Of course, after a certain number of epochs, the logs for both PyTorch transforms and the DALI pipeline eventually show lower accuracy for head classes and higher accuracy for tail classes. However, the overall accuracy drops by over 10%, which is an unacceptable difference.
Possible Explanations
The sampling process of the pipeline and the dataloader is different. However, I don't think it would cause such a significant discrepancy.
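If it would help, this hypothesis could be quantified by measuring the fraction of head-class samples seen in the first batches of each loader and comparing the two numbers (a rough sketch; `head_classes` and the label batches are placeholders for whatever split and loader is used):

```python
import numpy as np

def head_fraction(label_batches, head_classes, max_batches=50):
    """Fraction of samples in the first max_batches that belong to head classes."""
    head = set(head_classes)
    seen = hits = 0
    for i, labels in enumerate(label_batches):
        if i >= max_batches:
            break
        labels = np.asarray(labels).ravel()
        seen += labels.size
        hits += sum(int(l) in head for l in labels)
    return hits / seen if seen else 0.0
```

If the DALI loader and the PyTorch loader report similar head fractions over the first 50 batches, the sampling order is unlikely to explain the gap.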
Heeeelp
Thank you for any suggestions!