Error when loading state_dict

jakubLangr commented 4 years ago

Hello @Luodian ,

Hope you enjoyed the winter holidays! Thank you so much for this code release, it was like a second Christmas for me!

Anyway, I tried running the model and I got reasonably far; however, I get the following issue when I try to replicate the model.

-------------- End ----------------
CustomDatasetDataLoader
dataset [GTA5_Cityscapes] was created
initialize network with normal
initialize network with normal
initialize network with normal
Traceback (most recent call last):
  File "train.py", line 20, in <module>
    model = create_model(opt)
  File "/efs/spot/MADAN/cyclegan/models/__init__.py", line 20, in create_model
    model.initialize(opt)
  File "/efs/spot/MADAN/cyclegan/models/multi_cycle_gan_semantic_model.py", line 92, in initialize
    self.netPixelCLS_SYN = get_model(opt.weights_model_type, num_cls=opt.num_cls, pretrained=True, weights_init=opt.weights_init)
  File "/efs/spot/MADAN/cycada/models/models.py", line 12, in get_model
    net = models[name](num_cls=num_cls, **args)
  File "/efs/spot/MADAN/cycada/models/drn.py", line 256, in drn26
    out_map=out_map, finetune=finetune, **kwargs)
  File "/efs/spot/MADAN/cycada/models/drn.py", line 174, in __init__
    self.load_state_dict(state_dict)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 845, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DRN:
    size mismatch for fc.weight: copying a param with shape torch.Size([1000, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([19, 512, 1, 1]).
    size mismatch for fc.bias: copying a param with shape torch.Size([1000]) from checkpoint, the shape in current model is torch.Size([19]).

I think there's a semantic label space (19 classes, Cityscapes-style) and then there's the 1000-dimensional output, and I am not 100% sure where that comes from.

Let me know if you have any ideas, thanks!
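
Editor's note: a mismatch like the one in the RuntimeError above can be located before calling load_state_dict by comparing parameter shapes by name. The following is a minimal sketch, not the MADAN code. `find_shape_mismatches` is a hypothetical helper, and the shape dictionaries are typed in by hand from the error message ("layer0.weight" is a made-up placeholder). With PyTorch, such dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    model_shapes: Dict[str, Shape], ckpt_shapes: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for shared parameters whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Shapes for fc.* are copied from the error message; "layer0.weight" is a
# hypothetical parameter that matches in both places.
ckpt_shapes = {
    "fc.weight": (1000, 512, 1, 1),
    "fc.bias": (1000,),
    "layer0.weight": (16, 3, 7, 7),
}
model_shapes = {
    "fc.weight": (19, 512, 1, 1),
    "fc.bias": (19,),
    "layer0.weight": (16, 3, 7, 7),
}

for name, ckpt_shape, model_shape in find_shape_mismatches(model_shapes, ckpt_shapes):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

A 1000-way final layer is what an ImageNet-pretrained classifier has, so a mismatch confined to fc.weight and fc.bias suggests that the file being loaded is a backbone checkpoint rather than a 19-class segmentation checkpoint.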

jakubLangr commented 4 years ago

But when I run with opt.num_cls = 19, I get the following error:

  File "train.py", line 20, in <module>
    model = create_model(opt)
  File "/efs/spot/MADAN/cyclegan/data/__init__.py", line 59, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/efs/spot/MADAN/cyclegan/data/gta5_cityscapes.py", line 92, in __getitem__
    B_label_path = self.B_labels[index_B]
IndexError: list index out of range

Overall, I am somewhat confused about why the checkpoint has 1000 classes and the model 19; I assumed the models were fairly standard. Or it could be that the DRN checkpoint has changed. Any chance you could upload yours?

jakubLangr commented 4 years ago

I am still working on this @Luodian and I think that --num_cls 1000 is meant to be part of the command; however, the last IndexError makes me think that something is missing (specifically, the trainB folder), but I am unsure what your trainB was. Do you think you could tell us about the dataset folder structure? That would be greatly appreciated!

But my dataroot has been built as I think it should be:

├── cityscapes
│   ├── gtFine
│   └── leftImg8bit
├── cyclegta5
│   ├── images
│   └── labels

Or am I missing something?

Luodian commented 4 years ago

Hi Sir, sorry for my late reply. Yes, I organize my dataset exactly as you do. I ran my script and didn't find any errors. It seems that you did not correctly load the pretrained model "drn26-cycada-xxx". You can download the model here

Luodian commented 4 years ago

I am not sure why the index would be out of range. Can you set a breakpoint at the failing line (B_label_path = self.B_labels[index_B] in gta5_cityscapes.py) and check the values of 'index_B' and 'len(self.B_labels)'? Don't worry, I will keep track of and respond to any mistakes. Also, I will make a big update to MADAN before February.

jakubLangr commented 4 years ago

Hi, thanks for your reply.

I redownloaded the Cycada model. So it is a modification that was used by jhoffman rather than the original Fisher Yu DRN?

As to your second comment, I tried doing that before posting; however, by this point the code has reached the parallelized part (the DataLoader worker processes), so using standard debuggers is not possible.
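
Editor's note: one way around debuggers not attaching to worker processes is to index the dataset object directly in the main process, where a normal breakpoint or try/except works. The sketch below assumes only that the dataset supports len() and integer indexing; `first_failing_index` and `ToyPairedDataset` are hypothetical names for illustration and are not part of MADAN. In PyTorch, creating the DataLoader with num_workers=0 is another common way to keep loading in the main process.

```python
from typing import Any, Optional, Tuple


def first_failing_index(dataset: Any, limit: Optional[int] = None) -> Optional[Tuple[int, Exception]]:
    """Index the dataset in the current process; return the first (index, exception) or None."""
    n = len(dataset) if limit is None else min(limit, len(dataset))
    for i in range(n):
        try:
            dataset[i]
        except Exception as exc:  # broad on purpose: report whatever __getitem__ raises
            return i, exc
    return None


class ToyPairedDataset:
    """Hypothetical stand-in for a paired dataset with fewer labels than images."""

    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        return self.paths[index], self.labels[index]


toy = ToyPairedDataset(paths=list("abcde"), labels=[0, 1, 2])
print(first_failing_index(toy))
```

Once the failing index is known, calling dataset[index] under a debugger in the main process gives access to the local variables at the point of failure.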

Furthermore, the command:

sudo /home/ubuntu/anaconda3/envs/pytorch_p36/bin/python train.py --name cyclegan_gta2cityscapes --resize_or_crop scale_width_and_crop --loadSize 600 --fineSize 500 --which_model_netD n_layers --n_layers_D 3 --no_flip --batchSize 16 --nThreads 16 --dataset_mode gta5_cityscapes --dataroot ./data/ --semantic_loss --gpu 0,1,2,3,4 --model multi_cycle_gan_semantic --num_cls 19 --weights_init ./pretrained_models/drn26-cyclegta5-iter115000.pth

fails with the same IndexError. I have tried setting a breakpoint on the initialize function of the CustomDatasetDataLoader in data/__init__.py, but I get the following issue when I do that:

Traceback (most recent call last):
  File "train.py", line 30, in <module>
    for i, data in enumerate(dataset):
  File "/efs/spot/MADAN/cyclegan/data/__init__.py", line 59, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/efs/spot/MADAN/cyclegan/data/gta5_cityscapes.py", line 92, in __getitem__
    B_label_path = self.B_labels[index_B]
IndexError: list index out of range

If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

Exception ignored in: <async_generator object _ag at 0x7fd9f1a8d118>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/types.py", line 27, in _ag
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/bdb.py", line 53, in trace_dispatch
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/bdb.py", line 79, in dispatch_call
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/bdb.py", line 176, in break_anywhere
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/bdb.py", line 36, in canonic
AttributeError: 'NoneType' object has no attribute 'abspath'

Will investigate further.

jakubLangr commented 4 years ago

Hi @Luodian, I have tried a somewhat different approach to debugging and I got this error instead. So at least one of the datasets loads correctly.

dataset [GTA5_Cityscapes] was created
sel> /efs/spot/MADAN/cyclegan/data/__init__.py(47)initialize()
     46                 self.dataset = CreateDataset(opt)
---> 47         self.dataloader = torch.utils.data.DataLoader(
     48                         self.dataset,

ipdb> len(self.dataset)
24966
ipdb> c
initialize network with normal
initialize network with normal
initialize network with normal
initialize network with normal
/efs/spot/MADAN/pretrained_models/drn26-cyclegta5-iter115000.pth
Using state dict from /efs/spot/MADAN/pretrained_models/drn26-cyclegta5-iter115000.pth
Loading full model
/efs/spot/MADAN/pretrained_models/drn26-cyclegta5-iter115000.pth
Using state dict from /efs/spot/MADAN/pretrained_models/drn26-cyclegta5-iter115000.pth
Loading full model
initialize network with normal
initialize network with normal
initialize network with normal
---------- Networks initialized -------------
[Network G_A_1] Total number of parameters : 11.378 M
[Network G_B_1] Total number of parameters : 11.378 M
[Network D_A] Total number of parameters : 2.765 M
[Network D_B_1] Total number of parameters : 2.765 M
[Network D_B_2] Total number of parameters : 2.765 M
[Network G_A_2] Total number of parameters : 11.378 M
[Network G_B_2] Total number of parameters : 11.378 M
-----------------------------------------------
create web directory ./checkpoints/cyclegan_gta2cityscapes/web...
Traceback (most recent call last):
  File "train.py", line 15, in <module>
    data_loader = CreateDataLoader(opt)
  File "/efs/spot/MADAN/cyclegan/data/__init__.py", line 60, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/efs/spot/MADAN/cyclegan/data/gta5_cityscapes.py", line 92, in __getitem__
    B_label_path = self.B_labels[index_B]
IndexError: list index out of range

Luodian commented 4 years ago

Hi Sir,

I have been updating this repo recently and could not reproduce your errors. Maybe you can check the length of the 'self.B_labels' variable and the value of 'index_B'. I guess you didn't load the target dataset (cityscapes) correctly.

jakubLangr commented 4 years ago

I see a discrepancy, which I guess is the source:

ipdb> len(self.B_labels)
5000
ipdb> len(self.A_labels)
24966
ipdb> len(self.A_paths)
24966
ipdb> len(self.B_paths)
22569

Is it that I do not have enough B labels?

But when I run /gtFine/train$ tree . | wc -l I get 11921, which is already more than 5000.

Will continue to investigate.

Thanks for all your help so far!

Luodian commented 4 years ago

Also, you need to check 'self.B_paths', because 'index_B' is computed modulo 'len(self.B_paths)'.
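
Editor's note: the failure mode described here can be reproduced in a few lines. This is a sketch under the assumption that the B index wraps around the number of B images and is then used to index the label list as well; the actual code in gta5_cityscapes.py is not shown in this thread. `failing_indices` is a hypothetical helper and the file names are made up.

```python
from typing import List, Sequence


def failing_indices(n_requests: int, b_paths: Sequence[str], b_labels: Sequence[str]) -> List[int]:
    """Return the wrapped B indices that have an image but no matching label."""
    bad = []
    for index in range(n_requests):
        index_b = index % len(b_paths)  # assumed wrap-around over the B images
        if index_b >= len(b_labels):    # b_labels[index_b] would raise IndexError
            bad.append(index_b)
    return bad


# Toy sizes: 8 images but only 5 labels.
b_paths = [f"img_{i}.png" for i in range(8)]
b_labels = [f"lbl_{i}.png" for i in range(5)]

print(failing_indices(10, b_paths, b_labels))
```

With the lengths reported earlier in the thread (22569 B paths and 5000 B labels), any index_B of 5000 or more would hit the same IndexError under this assumption.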

jakubLangr commented 4 years ago

Well, checking anything once it is being loaded for computation is rather difficult, because it is heavily parallel and debuggers do not work.

So I think I now understand where (roughly) this issue comes from:

find . -iname '*_gtFine_labelIds.png' | wc -l
5000

But I have re-unzipped all cityscapes files I have, so I must have missed some.

jakubLangr commented 4 years ago

Oh wait! You've included the coarse images, didn't you?

Luodian commented 4 years ago

We didn't include the coarse images. In my setup, the lengths of "self.B_labels" and "self.B_paths" are both 5000.

jakubLangr commented 4 years ago

Ah okay, meanwhile I had both train and train_extra in the cityscapes folder! That's my bad.
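
Editor's note: a per-split file count makes this kind of mismatch visible before training. The sketch below assumes the usual Cityscapes layout (leftImg8bit/<split>/<city>/..._leftImg8bit.png and gtFine/<split>/<city>/..._gtFine_labelIds.png). `count_by_split` is a hypothetical helper, and the demo builds a tiny fake tree in a temporary directory rather than touching real data.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_by_split(root: Path, suffix: str) -> Counter:
    """Count files under root whose name ends with suffix, grouped by first-level folder (the split)."""
    counts = Counter()
    for path in root.rglob(f"*{suffix}"):
        if path.is_file():
            counts[path.relative_to(root).parts[0]] += 1
    return counts


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Fake tree: train has images and labels, train_extra has images only.
    for split, n_img, n_lbl in [("train", 3, 3), ("train_extra", 4, 0)]:
        for i in range(n_img):
            p = root / "leftImg8bit" / split / "city" / f"city_{i:06d}_leftImg8bit.png"
            p.parent.mkdir(parents=True, exist_ok=True)
            p.touch()
        for i in range(n_lbl):
            p = root / "gtFine" / split / "city" / f"city_{i:06d}_gtFine_labelIds.png"
            p.parent.mkdir(parents=True, exist_ok=True)
            p.touch()

    images = count_by_split(root / "leftImg8bit", "_leftImg8bit.png")
    labels = count_by_split(root / "gtFine", "_gtFine_labelIds.png")
    for split in sorted(images):
        print(split, "images:", images[split], "labels:", labels[split])
```

A split that reports images but zero labels, like train_extra in this toy tree, is a candidate for removal from the data root or for exclusion in the dataset class.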

I am now investigating the next one down the line:

Traceback (most recent call last):
  File "train.py", line 40, in <module>
    model.set_input(data)
  File "/efs/spot/MADAN/cyclegan/models/multi_cycle_gan_semantic_model.py", line 195, in set_input
    self.real_A_1 = input['A_1'].to(self.device)
KeyError: 'A_1'

Because somehow:

data.keys()
dict_keys(['A', 'B', 'A_paths', 'B_paths', 'A_label', 'B_label'])

So these come from enumerate(dataset), which gets them from __getitem__ in gta5_cityscapes.py, which does return:

return {'A': A, 'B': B,
                'A_paths': A_path, 'B_paths': B_path, 'A_label': A_label, 'B_label': B_label}
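
Editor's note: a small pre-flight check can report which keys a model expects but a dataset mode does not provide, before set_input raises a KeyError. This is a sketch only. `missing_keys` is a hypothetical helper, and the required list contains just 'A_1' because that is the only key confirmed by the traceback above; the full set of keys the model needs is not shown in this thread.

```python
from typing import Iterable, List, Mapping


def missing_keys(batch: Mapping[str, object], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent from the batch, in the order given."""
    return [key for key in required if key not in batch]


# Keys as printed by data.keys() above; the values are irrelevant for the check.
batch = dict.fromkeys(["A", "B", "A_paths", "B_paths", "A_label", "B_label"])

# Only 'A_1' is known from the traceback; extend this list after reading set_input.
required = ["A_1"]

print(missing_keys(batch, required))
```

A non-empty result indicates that the chosen --dataset_mode and --model do not agree on the batch layout.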

So I have swapped to --dataset_mode gta_synthia_cityscapes, but that is not exactly what I want to do & I will have to download Synthia. I am guessing you used the CVPR16 version, correct?

Thank you for all your help so far!

jakubLangr commented 4 years ago

Hi @Luodian, I just downloaded the CVPR Synthia dataset and got it into the right format, but I came across another issue:

create web directory ./checkpoints/cyclegan_gta2cityscapes/web...
Traceback (most recent call last):
  File "train.py", line 29, in <module>
    for i, data in enumerate(dataset):
  File "/efs/spot/MADAN/cyclegan/data/__init__.py", line 60, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/efs/spot/MADAN/cyclegan/data/gta_synthia_cityscapes.py", line 126, in __getitem__
    A_label_1 = Image.fromarray(A_label_1, 'L')
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/PIL/Image.py", line 2657, in fromarray
    raise ValueError("Too many dimensions: %d > %d." % (ndim, ndmax))
ValueError: Too many dimensions: 3 > 2.

Admittedly, it looks like this is super close to it running, but any ideas what this might be?
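
Editor's note: the ValueError above says the array passed to Image.fromarray with mode 'L' has three dimensions, while an 8-bit grayscale image needs a two-dimensional array. One possible workaround is to select a single channel before the conversion. The sketch below is not the MADAN code. It assumes the class ID is stored in channel 0 of the label image, which should be verified against your copy of the dataset, and `to_single_channel` is a hypothetical helper.

```python
import numpy as np


def to_single_channel(label: np.ndarray, channel: int = 0) -> np.ndarray:
    """Return a 2-D uint8 array that an 8-bit grayscale ('L') image can be built from."""
    if label.ndim == 2:
        out = label
    elif label.ndim == 3:
        out = label[:, :, channel]  # assumption: class IDs live in this channel
    else:
        raise ValueError(f"expected 2 or 3 dimensions, got {label.ndim}")
    if out.size and out.max() > 255:
        raise ValueError("label values do not fit in 8 bits")
    return np.ascontiguousarray(out, dtype=np.uint8)


# Toy H x W x C label with class ID 7 in channel 0.
label = np.zeros((4, 6, 3), dtype=np.uint16)
label[..., 0] = 7

out = to_single_channel(label)
print(out.shape, out.dtype)
```

The resulting array has two dimensions, so it satisfies the dimension limit that the traceback reports for mode 'L'.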

xiaoachen98 commented 4 years ago

After downloading the code, I can't find train_cycada_gta_cityscapes_A2B_SEM_KL.sh in the CycleGAN folder. Did you forget to upload it?

Luodian commented 4 years ago

I'm sorry for the late reply. It's a naming problem; you can directly run "cyclegan_gta2cityscapes.sh" instead of train_cycada_gta_cityscapes_A2B_SEM_KL.sh.
