VILA-Lab / SRe2L

(NeurIPS 2023 spotlight) Large-scale dataset distillation/condensation: with 50 IPC (images per class), it achieves the highest accuracy of 60.8% on the original ImageNet-1K validation set.

The img2batch_idx_list doesn't have the corresponding img_idx #1

Closed: shenjiyuan123 closed this issue 1 year ago

shenjiyuan123 commented 1 year ago

Hi, I really think this is great work!

However, I ran into a problem while trying to reproduce your method.

I have successfully run the recover and relabel processes and generated the syn_data and the soft labels (i.e., many files like batch_0.tar...). But when I run train.sh (I have already changed the PyTorch source code following your instructions), it fails with "Caught KeyError in DataLoader worker process 0": the corresponding img_idx cannot be found in img2batch_idx_list (relabel/utils_fkd.py, line 143).

The error is as follows:

Epoch: 0
Traceback (most recent call last):
  File "/export/home2/jiyuan/SRe2L/train/train_FKD.py", line 360, in <module>
    main()
  File "/export/home2/jiyuan/SRe2L/train/train_FKD.py", line 179, in main
    train(model, args, epoch)
  File "/export/home2/jiyuan/SRe2L/train/train_FKD.py", line 219, in train
    for batch_idx, batch_data in enumerate(args.train_loader):
  File "/export/home2/jiyuan/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/export/home2/jiyuan/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/export/home2/jiyuan/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/export/home2/jiyuan/anaconda3/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/export/home2/jiyuan/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/export/home2/jiyuan/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 62, in fetch
    mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config(possibly_batched_index[0])
  File "/export/home2/jiyuan/SRe2L/train/../relabel/utils_fkd.py", line 143, in load_batch_config
    batch_idx = self.img2batch_idx_list[self.epoch][img_idx]
KeyError: 7542

Could you help me figure it out? Looking forward to your feedback!

Thanks.

zeyuanyin commented 1 year ago

Hi @shenjiyuan123. Thanks for your interest in this project!

It may result from a data index mismatch between the saved configuration files and the running dataloader sampler.

Please check whether --fkd-seed is set to the same value in the relabel and train scripts.
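
For context, a minimal sketch of why the seeds must match, assuming the sampler shuffles dataset indices with a seeded torch.Generator (an illustration only, not the repository's actual code): the batch configurations saved during relabeling only line up with the training batches if both phases reproduce the same shuffle.

```python
import torch

# Illustration: both phases shuffle the dataset indices with a seeded generator.
# If the seeds differ, the saved batch configs no longer match the training batches.
num_images = 10000  # e.g. 1000 classes x 10 images per class

gen_relabel = torch.Generator().manual_seed(42)
gen_train = torch.Generator().manual_seed(42)

order_relabel = torch.randperm(num_images, generator=gen_relabel)
order_train = torch.randperm(num_images, generator=gen_train)

# Same seed -> identical permutation; a different seed breaks the saved mapping.
assert torch.equal(order_relabel, order_train)
```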

shenjiyuan123 commented 1 year ago

Thanks for your answer. But I have kept --fkd-seed at 42 the whole time, so I don't think that is the reason.

zeyuanyin commented 1 year ago

self.img2batch_idx_list consists of [dict(), dict(), ...] and is generated at https://github.com/VILA-Lab/SRe2L/blob/549988a9a7062eec56d5e8aa12187a60b1a798fb/relabel/utils_fkd.py#L216

The error KeyError: 7542 you provided means that image index 7542 (the first index of a training batch) does not appear in the batch mapping saved during the relabel phase, which is the index mismatch I mentioned above. Also, --batch-size should be set to the same value in both phases to avoid this mismatch.
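
As an illustration of that structure (a hypothetical helper, not the repository's get_img2batch_idx_list), such a per-epoch mapping could be built from the shuffled index order and the batch size; a KeyError on lookup then means the queried image index was never recorded in the saved order, e.g. because num_img, the batch size, or the seed differ between the relabel and train phases.

```python
import torch


def build_img2batch_idx(shuffled_indices, batch_size):
    """Hypothetical helper: map each image index to the batch it falls into."""
    return {img_idx: pos // batch_size for pos, img_idx in enumerate(shuffled_indices)}


# One dict per epoch, mirroring the [dict(), dict(), ...] structure described above.
epochs, num_images, batch_size = 3, 10000, 1024
img2batch_idx_list = [
    build_img2batch_idx(torch.randperm(num_images).tolist(), batch_size)
    for _ in range(epochs)
]

# If the training dataloader queries an index the saved mapping never recorded
# (e.g. because num_img or batch_size differ between phases), the lookup
# img2batch_idx_list[epoch][img_idx] raises a KeyError like the one above.
```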

Do you keep the other settings the same as in the example bash scripts in README.md?

shenjiyuan123 commented 1 year ago

Thanks for your patience. I have found my problem: I forgot to change num_img to match my setting, since I used ipc=10 during the recover process. So sorry for the disturbance~

But maybe a small suggestion: you could add an args.ipc argument to control the get_img2batch_idx_list function, rather than having users change the values directly inside the function.
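
A rough sketch of this suggestion (the flag name and derived value are assumptions, not the current code): expose the images-per-class count on the command line and derive num_img from it, instead of editing the constant inside get_img2batch_idx_list.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; the script's existing arguments are omitted here.
parser.add_argument('--ipc', type=int, default=50,
                    help='images per class used in the recover phase')
args = parser.parse_args()

# Derive the synthetic dataset size from args.ipc (ImageNet-1K has 1000 classes)
# and pass it into get_img2batch_idx_list instead of a value hard-coded inside it.
num_img = 1000 * args.ipc
```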

Btw, does the recover code support multiple GPUs? I see the implementation uses DataParallel; however, when I try to use two GPUs to synthesize the data, it complains that tensors are not on the same device, as follows:

Traceback (most recent call last):
  File "/export/home2/jiyuan/SRe2L/recover/data_synthesis.py", line 219, in <module>
    main_syn(ipc_id)
  File "/export/home2/jiyuan/SRe2L/recover/data_synthesis.py", line 213, in main_syn
    get_images(args, model_teacher, hook_for_display, ipc_id)
  File "/export/home2/jiyuan/SRe2L/recover/data_synthesis.py", line 85, in get_images
    loss_aux = args.tv_l2 * loss_var_l2 + \
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Thank you again!

zeyuanyin commented 1 year ago

Thanks for your suggestions. We have updated the code with the new features.

For the recover phase, the code currently works well on a single GPU; code supporting multiple GPUs will be released soon. If you want to make the most of your two GPUs, you can assign one task to each GPU with a different ipc_id range, so that images with different IDs are generated at the same time (see the sketch after the link below).

https://github.com/VILA-Lab/SRe2L/blob/25591ea49da866d04222eef271c8886f02839a71/recover/data_synthesis.py#L217
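
A hedged sketch of that workaround, launching one recover process per GPU (the --ipc-start/--ipc-end flag names are assumptions; adapt them to the actual arguments of data_synthesis.py):

```python
import os
import subprocess

# Hypothetical launcher: split the ipc_id range across two GPUs by running
# two independent recover processes, each pinned to one device.
jobs = [
    {'gpu': '0', 'ipc_start': 0, 'ipc_end': 25},
    {'gpu': '1', 'ipc_start': 25, 'ipc_end': 50},
]

procs = []
for job in jobs:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=job['gpu'])
    cmd = ['python', 'recover/data_synthesis.py',
           '--ipc-start', str(job['ipc_start']),
           '--ipc-end', str(job['ipc_end'])]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()  # wait for both GPUs to finish their ipc_id ranges
```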

shenjiyuan123 commented 1 year ago

Thanks for your patience! You have really helped me a lot~ Hope everything goes well with your research.