XiangLi1999 / Diffusion-LM

Diffusion-LM
Apache License 2.0

How to train a new diffusion model & classifier with different diff_steps or embedding dimension? #37

Open ChorlingLau opened 2 years ago

ChorlingLau commented 2 years ago

Hi again! I would like to train a new diffusion model and a matched classifier with different diff_steps or embedding dimension, but I am confused about the parameters that need to be changed.

  1. For a different diff_steps, for example 3000, I changed --diff_steps to 3000 when running improved-diffusion/scripts/run_train.py, and changed the variables named diffusion_steps in transformers\examples\pytorch\language-modeling\run_clm.py and improved-diffusion/scripts/infill.py to 300. However, when running infill.py, the following errors appear:

    /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [173,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [173,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    ......
    /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    scripts/infill.py:656: UserWarning: Use of masked_fill_ on expanded tensors is deprecated. Please clone() the tensor before performing this operation. This also applies to advanced indexing e.g. tensor[mask] = scalar (Triggered internally at  /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/TensorAdvancedIndexing.cpp:1280.)
    encoded_seq.masked_fill_(encoded_seq == todo_pad_token, 3)
    ddim_sample_loop_progressive device: cuda:0
    ddim_sample_loop_progressive noise: None
    ddim_sample_loop_progressive progress: False
    Traceback (most recent call last):
    File "scripts/infill.py", line 1131, in <module>
    args = main()
    File "scripts/infill.py", line 698, in main
    eta=args.eta,
    File "/Diffusion-LM/improved-diffusion/improved_diffusion/gaussian_diffusion.py", line 1163, in ddim_sample_loop_progressive
    langevin_fn=langevin_fn,
    File "/Diffusion-LM/improved-diffusion/improved_diffusion/gaussian_diffusion.py", line 1039, in ddim_sample
    sample=langevin_fn(sample, mean_pred, sigma, self.alphas_cumprod_prev[t[0]], t, x)
    File "/Diffusion-LM/improved-diffusion/scripts/infill_util.py", line 162, in langevin_fn_tone_length
    model_kwargs={},
    File "/Diffusion-LM/improved-diffusion/improved_diffusion/respace.py", line 93, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
    File "/Diffusion-LM/improved-diffusion/improved_diffusion/gaussian_diffusion.py", line 479, in p_mean_variance
    model_output = model(x, self._scale_timesteps(t), **model_kwargs)
    File "/Diffusion-LM/improved-diffusion/improved_diffusion/respace.py", line 122, in __call__
    map_tensor = th.tensor(self.timestep_map, device=ts.device, dtype=ts.dtype)
    RuntimeError: CUDA error: device-side assert triggered

    I could not resolve this error. Could you show me how to change diff_steps correctly?

  2. For a different embedding dimension, for example 32 (originally 16), are the following changes sufficient (other parameters omitted)? ① run_train.py --inchannel 32 ② run_clm.py --n_embd 32

XiangLi1999 commented 2 years ago

Hi,

Thanks for the questions!

re 1: diffusion_steps is a parameter that I use to down-sample the schedule and run fewer diffusion steps, to speed up controllable generation. To debug, you could set this "diffusion_steps" to 3k, but looking at the error message, I don't think this is the source of the problem. The error message suggests that an index larger than the dimension allows is being passed in... Does this only happen for infill? What happens if you just decode unconditionally?
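
A rough sketch of the down-sampling idea and of the out-of-range failure mode, in plain Python. This is not the repository's code: `evenly_spaced_steps` and `check_indices` are hypothetical names made up for illustration, and the even spacing is only an assumption about how the sub-sampling might be done.

```python
from __future__ import annotations


def evenly_spaced_steps(trained_steps: int, sampling_steps: int) -> list[int]:
    """Pick `sampling_steps` indices spread evenly over range(trained_steps).

    Hypothetical helper: illustrates running fewer steps at sampling time
    than the model was trained with.
    """
    if not 1 <= sampling_steps <= trained_steps:
        raise ValueError("sampling_steps must be in [1, trained_steps]")
    if sampling_steps == 1:
        return [0]
    stride = (trained_steps - 1) / (sampling_steps - 1)
    return [round(i * stride) for i in range(sampling_steps)]


def check_indices(indices: list[int], table_size: int) -> None:
    """Raise a readable IndexError if any index falls outside the table.

    On the GPU the same mistake only surfaces as an opaque
    `srcIndex < srcSelectDimSize` assertion.
    """
    bad = [i for i in indices if not 0 <= i < table_size]
    if bad:
        raise IndexError(
            f"{len(bad)} indices out of range for table of size {table_size}; "
            f"largest is {max(bad)}"
        )


if __name__ == "__main__":
    steps = evenly_spaced_steps(trained_steps=3000, sampling_steps=300)
    check_indices(steps, table_size=3000)  # consistent settings: passes
    try:
        # Assumed mismatch: the lookup table is smaller than the sampler expects.
        check_indices(steps, table_size=2000)
    except IndexError as err:
        print("mismatch detected:", err)
```

Running a bounds check like this on the CPU before indexing gives a readable message instead of the device-side assert shown in the log above.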

re 2: might be a typo, but '--in_channel 64'; otherwise, yes, I think it's sufficient. Just don't forget to pass in the updated --init_emb for run_clm.py.
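
A small sanity check along these lines can catch a dimension mismatch before training starts. It is only a sketch: `check_embedding_dims` is a hypothetical helper, and it assumes (as in the question) that --in_channel and --n_embd should both equal the width of the embedding table passed via --init_emb.

```python
from __future__ import annotations


def check_embedding_dims(
    in_channel: int, n_embd: int, init_emb_shape: tuple[int, int]
) -> None:
    """Raise ValueError if either setting disagrees with the embedding width.

    `init_emb_shape` is (vocab_size, embedding_dim) of the embedding table.
    Hypothetical helper, not part of the repository.
    """
    _, emb_dim = init_emb_shape
    settings = {"in_channel": in_channel, "n_embd": n_embd}
    mismatched = {name: value for name, value in settings.items() if value != emb_dim}
    if mismatched:
        raise ValueError(
            f"embedding width is {emb_dim}, but got mismatched settings: {mismatched}"
        )


if __name__ == "__main__":
    # The vocabulary size of 1000 is arbitrary; only the second dimension matters.
    check_embedding_dims(in_channel=32, n_embd=32, init_emb_shape=(1000, 32))
    try:
        # Stale embedding table still at the original width of 16.
        check_embedding_dims(in_channel=32, n_embd=32, init_emb_shape=(1000, 16))
    except ValueError as err:
        print("mismatch detected:", err)
```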

smiles724 commented 2 years ago

Hi, does "diff_steps" have no effect on training a pure diffusion model without controllable generation?