PingchuanMa / NCLaw

[ICML 2023] Learning Neural Constitutive Laws from Motion Observations for Generalizable PDE Dynamics
https://arxiv.org/abs/2304.14369

Gradient of the training code diverges #4

Closed SangHunHan92 closed 1 month ago

SangHunHan92 commented 1 month ago

Hi, I installed NCLaw with help from the other issues. I changed the installed versions to warp==0.15.1 and hydra-core==1.2.0, and the installation itself works fine.

However, the gradients in the training code seem to diverge. I ran `python experiments/scripts/train/invariant_full_meta-invariant_full_meta.py`, and below is what I see:

(nclaw) root@server:/workspace/physics/NCLaw# python experiments/scripts/train/invariant_full_meta-invariant_full_meta.py
env:
  blob:
    bsdf_pcd:
      type: diffuse
      reflectance:
        type: rgb
        value:
        - 0.92941176
        - 0.32941176
        - 0.23137255
    material:
      elasticity:
        cls: InvariantFullMetaElasticity
        layer_widths:
        - 64
        - 64
        norm: null
        nonlinearity: gelu
        no_bias: true
        normalize_input: true
        requires_grad: true
      plasticity:
        cls: InvariantFullMetaPlasticity
        layer_widths:
        - 64
        - 64
        norm: null
        alpha: 0.001
        nonlinearity: gelu
        no_bias: true
        normalize_input: true
        requires_grad: true
      name: jelly
      ckpt: null
    shape:
      type: cube
      name: dataset
      center:
      - 0.5
      - 0.5
      - 0.5
      size:
      - 0.5
      - 0.5
      - 0.5
      resolution: 10
      mode: uniform
      sort: null
    vel:
      random: false
      lin_vel:
      - 1.0
      - -1.5
      - -2.0
      ang_vel:
      - 4.0
      - 4.0
      - 4.0
    name: jelly
    rho: 1000.0
    span:
    - 0
    - 1000
    clip_bound: 0.5
render:
  spp: 32
  width: 512
  height: 512
  skip_frame: 25
  bound: 1.75
  mpm_mul: 6
  sph_version: cuda_ad_rgb
  pcd_version: cuda_ad_rgb
  has_sphere_emitter: true
  fps: 10
sim:
  quality: low
  num_steps: 1000
  gravity:
  - 0.0
  - -9.8
  - 0.0
  bc: freeslip
  num_grids: 20
  dt: 0.0005
  bound: 3
  eps: 1.0e-07
  skip_frame: 1
train:
  teacher:
    strategy: cosine
    start_lambda: 25
    end_lambda: 200
  num_epochs: 300
  batch_size: 128
  elasticity_lr: 1.0
  plasticity_lr: 0.1
  elasticity_wd: 0.0
  plasticity_wd: 0.0
  elasticity_grad_max_norm: 0.1
  plasticity_grad_max_norm: 0.1
name: jelly/train/invariant_full_meta-invariant_full_meta
seed: 0
cpu: 0
num_cpus: 128
gpu: 0
overwrite: false
resume: false

Warp 0.15.1 initialized:
   CUDA Toolkit 11.8, Driver 12.2
   Devices:
     "cpu"      : "x86_64"
     "cuda:0"   : "NVIDIA GeForce RTX 4090" (24 GiB, sm_89, mempool enabled)
     "cuda:1"   : "NVIDIA GeForce RTX 4090" (24 GiB, sm_89, mempool enabled)
   CUDA peer access:
     Not supported
   Kernel cache:
     /root/.cache/warp/0.15.1
target directory (/workspace/physics/NCLaw/experiments/log/jelly/train/invariant_full_meta-invariant_full_meta) already exists, overwrite? [Y/r/n] y
overwriting directory (/workspace/physics/NCLaw/experiments/log/jelly/train/invariant_full_meta-invariant_full_meta)
  0%|                                                                                                                                                | 0/1000 [00:00<?, ?it/s]/root/anaconda3/envs/nclaw/lib/python3.10/site-packages/warp/torch.py:160: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1682343995026/work/build/aten/src/ATen/core/TensorBody.h:486.)
  if t.grad is None:
[jelly/train/invariant_full_meta-invariant_full_meta,20:02:01,epoch   1/300,teachers 40,e-lr 1.00e+00,e-|grad| 0.0000,p-lr 1.00e-01,p-|grad| 0.0000,acc 0.3818]               
[jelly/train/invariant_full_meta-invariant_full_meta,20:02:05,epoch   2/300,teachers 40,e-lr 1.00e+00,e-|grad| 50.2531,p-lr 1.00e-01,p-|grad| 1.4653,acc 0.3818]              
[jelly/train/invariant_full_meta-invariant_full_meta,20:02:08,epoch   3/300,teachers 40,e-lr 1.00e+00,e-|grad| 5447.3193,p-lr 1.00e-01,p-|grad| 321.5664,acc 0.3835]          
[jelly/train/invariant_full_meta-invariant_full_meta,20:02:12,epoch   4/300,teachers 40,e-lr 1.00e+00,e-|grad| 192745.8438,p-lr 1.00e-01,p-|grad| 41211.6562,acc 0.6156]      
[jelly/train/invariant_full_meta-invariant_full_meta,20:02:15,epoch   5/300,teachers 40,e-lr 1.00e+00,e-|grad| 3680833.0000,p-lr 1.00e-01,p-|grad| 2505818.7500,acc 6.2072]   
[jelly/train/invariant_full_meta-invariant_full_meta,20:02:19,epoch   6/300,teachers 40,e-lr 9.99e-01,e-|grad| 50689764.0000,p-lr 9.99e-02,p-|grad| 275593824.0000,acc 79.6979]                                                                                                                                                                             
  2%|██▋                                                                                                                                      | 6/300 [00:25<20:41,  4.22s/it]
Error executing job with overrides: ['overwrite=False', 'resume=False', 'gpu=0', 'cpu=0', 'env=jelly', 'env/blob/material/elasticity=invariant_full_meta', 'env/blob/material/plasticity=invariant_full_meta', 'env.blob.material.elasticity.requires_grad=True', 'env.blob.material.plasticity.requires_grad=True', 'render=debug', 'sim=low', 'name=jelly/train/invariant_full_meta-invariant_full_meta']
Traceback (most recent call last):
  File "/workspace/physics/NCLaw/experiments/train.py", line 134, in main
    elasticity_grad_norm = clip_grad_norm_(
  File "/root/anaconda3/envs/nclaw/lib/python3.10/site-packages/torch/nn/utils/clip_grad.py", line 64, in clip_grad_norm_
    raise RuntimeError(
RuntimeError: The total norm of order 2.0 for gradients from `parameters` is non-finite, so it cannot be clipped. To disable this error and scale the gradients by the non-finite norm anyway, set `error_if_nonfinite=False`

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I didn't modify anything except the installation process. What could be the problem?
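
For context, here is a minimal, self-contained sketch (not NCLaw's code) of why the error above is raised: by epoch 6 the gradient norms have already blown up to inf/nan, so `torch.nn.utils.clip_grad_norm_` refuses to clip them. The tensor values below are made up purely for illustration.

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Illustration only (not NCLaw code): a parameter whose gradient is
# deliberately non-finite, mimicking the diverged simulation gradients.
p = torch.nn.Parameter(torch.zeros(3))
p.grad = torch.tensor([1.0, float("inf"), 0.0])

try:
    # With error_if_nonfinite=True this raises the RuntimeError shown above.
    clip_grad_norm_([p], max_norm=0.1, error_if_nonfinite=True)
except RuntimeError as err:
    print(err)

# error_if_nonfinite=False only silences the check; the gradients stay
# non-finite, so the real fix has to happen upstream of the clipping.
clip_grad_norm_([p], max_norm=0.1, error_if_nonfinite=False)
```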

SangHunHan92 commented 1 month ago

The problem has been solved.

I reinstalled warp==0.6.1 and stopped using the tape.py from https://github.com/PingchuanMa/NCLaw/issues/1.

The training code now works well.
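
For anyone hitting the same problem, here is a minimal sketch of a version guard that could be run before training to catch the mismatch early. It assumes Warp was installed from PyPI under the distribution name `warp-lang`; adjust if your install differs.

```python
from importlib.metadata import version

# Sketch: fail fast if the installed Warp build is not the one this
# codebase was written against (0.6.1, per the fix above).
expected = "0.6.1"
installed = version("warp-lang")
if installed != expected:
    raise RuntimeError(
        f"Found warp-lang {installed}, but NCLaw expects {expected}; "
        "newer Warp releases are not backward compatible with it."
    )
```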

PingchuanMa commented 3 weeks ago

Sorry for the late reply. Yes, this is a backward-incompatibility problem with newer Warp versions. Happy to help with any other questions!