3DTopia / OpenLRM

An open-source impl. of Large Reconstruction Models
Apache License 2.0

ValueError: math domain error #40

Open hayoung-jeremy opened 4 months ago

hayoung-jeremy commented 4 months ago

summary

reproduction of the error

  1. installation of OpenLRM was successful

  2. data preparation using blender_script.py was successful; it generated 100 sets of data, each containing rgba, pose, and intrinsics.npy.

  3. configuration of training_sample.yaml and accelerate_training.yaml as follows:

    
    # training_sample.yaml
    experiment:
        type: lrm
        seed: 42
        parent: lrm-objaverse
        child: small-dummyrun
    
    model:
        camera_embed_dim: 1024
        rendering_samples_per_ray: 96
        transformer_dim: 512
        transformer_layers: 12
        transformer_heads: 8
        triplane_low_res: 32
        triplane_high_res: 64
        triplane_dim: 32
        encoder_type: dinov2
        encoder_model_name: dinov2_vits14_reg
        encoder_feat_dim: 384
        encoder_freeze: false
    
    dataset:
        subsets:
            -   name: objaverse
                root_dirs:
                    - "/root/OpenLRM/views" # modified this value
                meta_path:
                    train: "/root/OpenLRM/train_uids.json" # modified this value
                    val: "/root/OpenLRM/val_uids.json" # modified this value
                sample_rate: 1.0
        sample_side_views: 3
        source_image_res: 224
        render_image:
            low: 64
            high: 192
            region: 64
        normalize_camera: true
        normed_dist_to_center: auto
        num_train_workers: 4
        num_val_workers: 2
        pin_mem: true
    
    train:
        mixed_precision: bf16  # REPLACE THIS BASED ON GPU TYPE
        find_unused_parameters: false
        loss:
            pixel_weight: 1.0
            perceptual_weight: 1.0
            tv_weight: 5e-4
        optim:
            lr: 4e-4
            weight_decay: 0.05
            beta1: 0.9
            beta2: 0.95
            clip_grad_norm: 1.0
        scheduler:
            type: cosine
            warmup_real_iters: 3000
        batch_size: 16  # REPLACE THIS (PER GPU)
        accum_steps: 1  # REPLACE THIS
        epochs: 60  # REPLACE THIS
        debug_global_steps: null
    
    val:
        batch_size: 4
        global_step_period: 1000
        debug_batches: null
    
    saver:
        auto_resume: true
        load_model: null
        checkpoint_root: ./exps/checkpoints
        checkpoint_global_steps: 1000
        checkpoint_keep_level: 5
    
    logger:
        stream_level: WARNING
        log_level: INFO
        log_root: ./exps/logs
        tracker_root: ./exps/trackers
        enable_profiler: false
        trackers:
            - tensorboard
        image_monitor:
            train_global_steps: 100
            samples_per_log: 4
    
    compile:
        suppress_errors: true
        print_specializations: true
        disable: true

    # accelerate_training.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: MULTI_GPU
    downcast_bf16: 'no'
    gpu_ids: all
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 4 # only modified this value
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
  4. the error message (see the note after the log):

    [TRAIN STEP]loss=0.624, loss_pixel=0.0577, loss_perceptual=0.566, loss_tv=0.698, lr=8.13e-6: 100%|███████████████████████████████████████████████| 60/60 [04:55<00:00,  4.92s/it]
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/root/OpenLRM/openlrm/launch.py", line 36, in <module>
        main()
      File "/root/OpenLRM/openlrm/launch.py", line 32, in main
        runner.run()
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 338, in run
        self.train()
      File "/root/OpenLRM/openlrm/runners/train/lrm.py", line 343, in train
        self.save_checkpoint()
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 118, in wrapper
        result = accelerated_func(self, *args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 669, in _inner
        return PartialState().on_main_process(function)(*args, **kwargs)
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 246, in save_checkpoint
        cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base))
    ValueError: math domain error
    [2024-04-17 08:24:09,179] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65932 closing signal SIGTERM
    [2024-04-17 08:24:09,183] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65933 closing signal SIGTERM
    [2024-04-17 08:24:09,186] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65934 closing signal SIGTERM
    [2024-04-17 08:24:09,301] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 65931) of binary: /usr/bin/python
    Traceback (most recent call last):
      File "/usr/local/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
        args.func(args)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
        multi_gpu_launcher(args)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
        distrib_run.run(args)
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
        elastic_launch(
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    openlrm.launch FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2024-04-17_08:24:09
      host      : dcf76dfb9908
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 65931)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
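
For context, the failing line quoted in the traceback is cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base)). math.log raises "math domain error" whenever its argument is zero or negative, and the integer division max_ckpt // ckpt_period becomes 0 as soon as the run has produced fewer global steps than one full period. Below is a minimal sketch of that arithmetic, using the 60 steps from the dummy run above and a period of 1000 (both checkpoint_global_steps and val.global_step_period are 1000 in the config; which of them, if either, maps to ckpt_period here is an assumption, as is the value of ckpt_base):

    import math

    # Values assumed from the run above: training stopped after 60 global steps,
    # while the periods configured above (checkpoint/validation) are 1000.
    ckpt_base = 5       # hypothetical base, for illustration only
    ckpt_period = 1000
    max_ckpt = 60       # last global step reached by the short dummy run

    try:
        cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base))
    except ValueError as e:
        # 60 // 1000 == 0, and math.log(0, base) is undefined
        print(e)  # -> math domain error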
kunalkathare commented 4 months ago

Hey @hayoung-jeremy, try reducing the value of global_step_period under val: in the training sample yaml file until it stops giving the error. That worked for me when I was trying to train with 350 objects.
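
For anyone hitting the same error: instead of lowering the period by trial and error, you can roughly estimate how many global steps the run will produce and keep the period at or below that number. The sketch below is only a back-of-envelope estimate under assumptions (effective batch = per-GPU batch_size × num_processes × accum_steps, incomplete batches dropped); with those assumptions it reproduces the 60 steps seen in the progress bar above.

    # Back-of-envelope estimate of total global steps for a planned run.
    # Assumptions: effective batch = batch_size * num_processes * accum_steps,
    # and the last incomplete batch of each epoch is dropped.
    num_train_samples = 100   # ~100 objects rendered with blender_script.py
    batch_size = 16           # per GPU (train.batch_size)
    num_processes = 4         # from accelerate_training.yaml
    accum_steps = 1
    epochs = 60

    steps_per_epoch = max(1, num_train_samples // (batch_size * num_processes * accum_steps))
    total_global_steps = epochs * steps_per_epoch
    print(total_global_steps)  # 60 -> well below the 1000-step periods in the config above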

hayoung-jeremy commented 4 months ago

Wow, you're my savior, thank you so much! I'll try it!

hayoung-jeremy commented 4 months ago

Thank you @kunalkathare, I've tried the following config, with epochs and global_step_period modified:

...

train:
    mixed_precision: bf16
    find_unused_parameters: false
    loss:
        pixel_weight: 1.0
        perceptual_weight: 1.0
        tv_weight: 5e-4
    optim:
        lr: 4e-4
        weight_decay: 0.05
        beta1: 0.9
        beta2: 0.95
        clip_grad_norm: 1.0
    scheduler:
        type: cosine
        warmup_real_iters: 3000
    batch_size: 16 
    accum_steps: 1
    epochs: 100  # MODIFIED : 60 -> 100
    debug_global_steps: null

val:
    batch_size: 4
    global_step_period: 100 # MODIFIED : 1000 -> 100
    debug_batches: null

...

and successfully generated a checkpoint as follows:

[TRAIN STEP]loss=0.642, loss_pixel=0.0695, loss_perceptual=0.572, loss_tv=0.7, lr=1.35e-5: 100%|███████████████████████████████████████████████| 100/100 [03:24<00:00,  5.10s/it]

But it seems the loss value is too high. What should I modify to decrease it? Should I increase the epochs to 1000? And what are the ideal loss values for a successfully trained checkpoint? Could you share your case? Thank you so much for your help.

kunalkathare commented 4 months ago

The loss value comes down when the dataset is larger, and I guess you can increase the epochs and see if that helps.
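
As a side note on the numbers themselves: assuming the logged total is just the weighted sum of the three terms with the weights from the config (pixel_weight 1.0, perceptual_weight 1.0, tv_weight 5e-4), almost all of it comes from the perceptual term, so the total will stay well above the pixel loss even when training is behaving. The arithmetic below matches the progress bar from the 100-epoch run:

    # Recomputing the logged total from its parts, using the weights in the config.
    # Assumption: loss = pixel_weight*pixel + perceptual_weight*perceptual + tv_weight*tv
    pixel_weight, perceptual_weight, tv_weight = 1.0, 1.0, 5e-4
    loss_pixel, loss_perceptual, loss_tv = 0.0695, 0.572, 0.7   # from the 100-epoch run

    total = pixel_weight * loss_pixel + perceptual_weight * loss_perceptual + tv_weight * loss_tv
    print(round(total, 3))  # 0.642, matching the logged loss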

hayoung-jeremy commented 4 months ago

Thank you for the kind reply, @kunalkathare!

Really great help; many thanks for your assistance.