bahjat-kawar / ddrm

[NeurIPS 2022] Denoising Diffusion Restoration Models -- Official Code Repository
MIT License

imagenet_256_cc.yml runtime error #9

Open · mateibejan1 opened this issue 1 year ago

mateibejan1 commented 1 year ago

I'm trying to test the 256 ImageNet model on the deblurring task, using the OOD data you provide in your adjacent repository. I'm getting this error:

ERROR - main.py - 2022-07-25 10:25:13,026 - Traceback (most recent call last):
  File "/Users/mbejan/Documents/diffusion/ddrm/main.py", line 164, in main
    runner.sample()
  File "/Users/mbejan/Documents/diffusion/ddrm/runners/diffusion.py", line 161, in sample
    self.sample_sequence(model, cls_fn)
  File "/Users/mbejan/Documents/diffusion/ddrm/runners/diffusion.py", line 249, in sample_sequence
    for x_orig, classes in pbar:
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
    w.start()
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/mbejan/opt/anaconda3/envs/ddrm/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'Diffusion.sample_sequence.<locals>.seed_worker'

This is the command that produces the behaviour above:

python main.py --ni \
  --config imagenet_256_cc.yml \
  --doc ood \
  --timesteps 20 \
  --eta 0.85 \
  --etaB 1 \
  --deg deblur_uni \
  --sigma_0 0.05

My imagenet_256_cc.yml is the same as the one you provide, apart from the out_of_distribution argument, which is set to true.
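
The failure itself is easy to reproduce outside DDRM: under the spawn start method (the default on macOS and Windows), a DataLoader must pickle its worker_init_fn into each worker process, and locally defined functions cannot be pickled. A minimal, hypothetical sketch (not from this repo):

import torch
from torch.utils import data

class ToyDataset(data.Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return torch.tensor(idx)

def run():
    def seed_worker(worker_id):  # local function -> not picklable
        pass
    loader = data.DataLoader(
        ToyDataset(),
        num_workers=2,
        worker_init_fn=seed_worker,
        multiprocessing_context="spawn",  # forces the macOS/Windows behaviour
    )
    for batch in loader:  # raises AttributeError: Can't pickle local object
        print(batch)      # 'run.<locals>.seed_worker'

if __name__ == "__main__":
    run()

Moving seed_worker to module level makes the same script run cleanly.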

lshaw8317 commented 1 year ago

#18 is related. I also had the same error. Adding a global seed_worker declaration to Diffusion.sample_sequence in diffusion.py fails to resolve the issue:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\shaw\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\shaw\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'seed_worker' on <module 'runners.diffusion' from 'C:\\Users\\shaw\\Documents\\Year 2\\Diffusion Models\\ddrm\\runners\\diffusion.py'>

The reason (in my case) is that when running on Windows the multiprocessing module uses spawn, so one must (according to the PyTorch docs):

Wrap most of your main script's code within an if __name__ == '__main__': block, to make sure it doesn't run again (most likely generating an error) when each worker process is launched. You can place your dataset and DataLoader instance creation logic here, as it doesn't need to be re-executed in workers.

Make sure that any custom collate_fn, worker_init_fn or dataset code is declared as a top-level definition, outside of the __main__ check. This ensures they are available in worker processes. (This is needed since functions are pickled as references only, not bytecode.)

It is difficult to implement this advice since the seed_worker function needs access to the input args coming from the config file. The simplest "solution" was to just set the worker_init_fn argument to None, as below (within Diffusion.sample_sequence):

val_loader = data.DataLoader(
    test_dataset,
    batch_size=config.sampling.batch_size,
    shuffle=True,
    num_workers=config.data.num_workers,
    worker_init_fn=None,  # avoid pickling a local function under spawn
    generator=g,
)
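
If one wants to keep per-worker seeding rather than dropping it, a picklable alternative is to move seed_worker to module level in runners/diffusion.py and bind the config-dependent seed with functools.partial; partial objects pickle fine as long as the wrapped function is a top-level definition. A sketch under two assumptions: the seed_worker body follows the usual PyTorch seeding recipe (the repo's local version may differ), and the runner exposes the parsed command-line seed as self.args.seed:

import functools
import random

import numpy as np
import torch
from torch.utils import data

# Top-level definition, so spawned workers can look it up by reference.
def seed_worker(base_seed, worker_id):
    worker_seed = (base_seed + worker_id) % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
    torch.manual_seed(worker_seed)

# Inside Diffusion.sample_sequence (test_dataset, config, g as above):
val_loader = data.DataLoader(
    test_dataset,
    batch_size=config.sampling.batch_size,
    shuffle=True,
    num_workers=config.data.num_workers,
    worker_init_fn=functools.partial(seed_worker, self.args.seed),
    generator=g,
)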
LinWeiJeff commented 2 weeks ago

@lshaw8317 Hello, I have the same problem, which produces this error:

ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'Diffusion.sample_sequence.<locals>.seed_worker'

I tried your solution of setting the worker_init_fn argument to None. However, after making that change, the tqdm bar (indicating sampling progress) freezes at 0% for a while (about 20 seconds), and eventually a new error appears, shown in the screenshot below:

[Screenshot 2024-06-17 155512: traceback ending in MemoryError]

I don't know why it raises a MemoryError. Did you encounter this new error? Do you know how to solve it? If you need more information about how I implemented the change, I am very willing to provide it. Thanks a lot!
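
Not a confirmed fix, but a common workaround when spawn-based workers fail (including with out-of-memory errors, since every spawned worker rebuilds its own copy of the dataset state) is to disable worker processes and load data in the main process. A sketch, untested against this particular MemoryError:

# Single-process loading: no spawn, no pickling, no worker_init_fn needed,
# at the cost of data-loading parallelism.
val_loader = data.DataLoader(
    test_dataset,
    batch_size=config.sampling.batch_size,
    shuffle=True,
    num_workers=0,
    generator=g,
)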