clovaai / tunit

Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Error while training afhq_wild; RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location (assert_no_internal_overlap at /pytorch/aten/src/ATen/MemoryOverlap.cpp:36) #3

Closed sizhky closed 4 years ago

sizhky commented 4 years ago

Please find the stacktrace below. Can you let me know what I am doing wrong?

>>> python main.py --dataset afhq_wild --output_k 10 --data_path '/home/yyr/data/' --p_semi 0.0 --img_size 64 --batch_size 32   
PYTORCH VERSION 1.5.0
main.py:146: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
False
False
MULTIPROCESSING DISTRIBUTED :  False
Use GPU: 0 for training
Init Generator
GENERATOR NF :  64
Init ContentEncoder
Init Decoder
Init Generator
GENERATOR NF :  64
Init ContentEncoder
Init Decoder
USE CLASSES [2]
LABEL MAP: {2: 0}
USE AFHQ dataset [FOR IIC]
LABEL MAP: {2: 0}
500
dataset                            afhq_wild
data_path                          /home/yyr/data/
workers                            4
model_name                         GAN_20200617-194923
epochs                             200
iters                              1000
batch_size                         32
val_batch                          10
log_step                           100
sty_dim                            128
output_k                           10
img_size                           64
dims                               2048
p_semi                             0.0
load_model                         None
validation                         False
world_size                         1
rank                               0
gpu                                0
ddp                                False
port                               8989
iid_mode                           iid+
w_gp                               10.0
w_rec                              0.1
w_adv                              1.0
w_vec                              0.01
data_dir                           /home/yyr/data/
start_epoch                        0
train_mode                         GAN_UNSUP
unsup_start                        0
separated                          65
ema_start                          66
fid_start                          66
multiprocessing_distributed        False
distributed                        False
ngpus_per_node                     1
log_dir                            ./logs/GAN_20200617-194923
event_dir                          ./logs/GAN_20200617-194923/events
res_dir                            ./results/GAN_20200617-194923
num_cls                            10
att_to_use                         [2]
epoch_acc                          []
epoch_avg_subhead_acc              []
epoch_stats                        []
to_train                           CDGI
min_data                           4738
max_data                           4738

START EPOCH[1]
  0%|                                                  | 0/1000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "main.py", line 524, in <module>
    main()
  File "main.py", line 201, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "main.py", line 305, in main_worker
    trainFunc(train_loader, networks, opts, epoch, args, {'logger': logger, 'queue': queue})
  File "/home/yyr/Documents/github/tunit/train/train_unsupervised.py", line 103, in trainGAN_UNSUP
    c_loss.backward()
  File "/home/yyr/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/yyr/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation. (assert_no_internal_overlap at /pytorch/aten/src/ATen/MemoryOverlap.cpp:36)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fc7ed454536 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::assert_no_internal_overlap(c10::TensorImpl*) + 0xc5 (0x7fc82a771d55 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::check_mem_overlaps() + 0x71 (0x7fc82ab6e8a1 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIterator::build() + 0x2c (0x7fc82ab77b4c in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xbb3718 (0x7fc82a8ed718 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x44 (0x7fc82a8ef224 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x316ec4d (0x7fc82cea8c4d in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::CopySlices::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0xb35 (0x7fc82caced65 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d89c05 (0x7fc82cac3c05 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fc82cac0f03 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fc82cac1ce2 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fc82caba359 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fc8391f9998 in /home/yyr/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xd6cb4 (0x7fc83a0e7cb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #14: <unknown function> + 0x9609 (0x7fc83c549609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #15: clone + 0x43 (0x7fc83c470103 in /lib/x86_64-linux-gnu/libc.so.6)

  0%|                                                  | 0/1000 [00:00<?, ?it/s]
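For reference, this message comes from PyTorch's internal-overlap check: it fires whenever an in-place write targets a tensor whose elements alias the same memory, for example a view created by expand(). A minimal illustrative sketch (not the tunit code) that raises the same error on recent PyTorch versions:

import torch

x = torch.zeros(3)
y = x.expand(2, 3)  # both rows of y alias the same three underlying elements
y.add_(1)           # RuntimeError: unsupported operation: more than one element
                    # of the written-to tensor refers to a single memory location

In tunit the failing write surfaces during c_loss.backward(), which points at the loss code in tools/ops.py (see the fix suggested further down in this thread).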
FriedRonaldo commented 4 years ago

Hmm... Actually, I've never seen this kind of error before. It might be a version compatibility issue: your PyTorch version seems to be 1.5.0, but this repository is mainly tested with PyTorch 1.1 or 1.2. I read the release notes for PyTorch 1.5.0 and found that there are some changes to the backward and clone functions. Please refer to https://github.com/pytorch/pytorch/releases, in particular the "Backwards Incompatible Changes" section.

I tested the code again with 1.2 and it works well (also with img_size=64). This problem might be solved by using a lower version of PyTorch. The code will probably work with PyTorch 1.4 as well, but I haven't verified that.


Update

I found another, similar issue at https://github.com/pytorch/pytorch/issues/33812 (see https://github.com/pytorch/pytorch/issues/33812#issuecomment-593407581). The code mentioned there works well with PyTorch 1.2 but does not work with 1.5.

So it does look like a version issue: PyTorch 1.2 should resolve it. Please try this code with PyTorch 1.2.
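(A small, hypothetical guard, not part of the repository, could also be added near the PYTORCH VERSION print in main.py to warn people running newer versions:)

import torch

# Hypothetical check: this repository is mainly tested with PyTorch 1.1/1.2.
major, minor = (int(v) for v in torch.__version__.split('.')[:2])
if (major, minor) > (1, 2):
    print('Warning: tested with PyTorch 1.1/1.2; version {} may break '
          'backward() on the in-place ops in the IIC loss.'.format(torch.__version__))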

sizhky commented 4 years ago

It's working; I'm not sure what changed, but the command did not error out after a machine restart...

Edit: just realized it errored out with --img_size 64 but it's working with --img_size 128

FriedRonaldo commented 4 years ago

That is weird... I also tested with img_size=64 and it worked well.

DanRuta commented 4 years ago

For future reference, I fixed it on my end by changing these lines: https://github.com/clovaai/tunit/blob/master/tools/ops.py#L69

p_i_j[(p_i_j < EPS).data] = EPS
p_j[(p_j < EPS).data] = EPS
p_i[(p_i < EPS).data] = EPS

to the following:

p_i_j = torch.clamp(p_i_j, min=EPS)
p_j = torch.clamp(p_j, min=EPS)
p_i = torch.clamp(p_i, min=EPS)
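For a bit more context on why this helps: in the IIC-style loss the marginals are typically built with expand(), so their elements alias the same memory, and the masked in-place assignment is exactly the kind of write that newer PyTorch rejects (in tunit's case, during backward()). torch.clamp returns a fresh tensor instead of writing into the aliased view. A rough, self-contained sketch, assuming expand()-based marginals as in the IIC reference code and an illustrative EPS value:

import torch

EPS = 1e-8  # illustrative; the actual value is defined in tools/ops.py
k = 10

logits = torch.randn(k, k, requires_grad=True)
p_i_j = logits.softmax(dim=-1)

# Marginals built via expand() share memory across rows/columns.
p_i = p_i_j.sum(dim=1).view(k, 1).expand(k, k)
p_j = p_i_j.sum(dim=0).view(1, k).expand(k, k)

# Old, in-place version -- triggers the internal-overlap error on PyTorch >= 1.5:
# p_i[(p_i < EPS).data] = EPS

# Out-of-place version -- clamp allocates new tensors, nothing is written
# into the expanded views, and backward() goes through:
p_i_j = torch.clamp(p_i_j, min=EPS)
p_i = torch.clamp(p_i, min=EPS)
p_j = torch.clamp(p_j, min=EPS)

loss = -(p_i_j * (torch.log(p_i_j) - torch.log(p_i) - torch.log(p_j))).sum()
loss.backward()

Only the three clamp lines differ from the original snippet; the rest of the loss computation stays the same.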
FriedRonaldo commented 4 years ago

@DanRuta Thanks. I will test with the modified code and update it.

rggs commented 4 years ago

Has anyone had any luck solving this on PyTorch 1.5? I'm having the same issue: calling backward() results in the error. The call to backward is on line 381: https://github.com/rsbball11/IIC/blob/master/code/scripts/cluster/cluster_greyscale_twohead.py

FriedRonaldo commented 4 years ago

For future reference, I fixed it on my end by changing these lines: https://github.com/clovaai/tunit/blob/master/tools/ops.py#L69

p_i_j[(p_i_j < EPS).data] = EPS
p_j[(p_j < EPS).data] = EPS
p_i[(p_i < EPS).data] = EPS

to the following:

p_i_j = torch.clamp(p_i_j, min=EPS)
p_j = torch.clamp(p_j, min=EPS)
p_i = torch.clamp(p_i, min=EPS)

@rsbball11 Doesn't this comment work? There were some changes to the optimizer in PyTorch 1.5.

I also use the same operation as IIC, so the comment might help you.

rggs commented 4 years ago

Oh, yes, that does work. I'm new to this, so I didn't fully understand where I was supposed to be looking, but I did some more digging and found it. Thanks!