3DTopia / OpenLRM

An open-source impl. of Large Reconstruction Models
Apache License 2.0

ValueError: math domain error #40

Open hayoung-jeremy opened 4 months ago

hayoung-jeremy commented 4 months ago

summary

reproduction of the error

  1. installation of OpenLRM was successful

  2. data preparation using blender_script.py was successful; it generated 100 sets of data, each containing rgba, pose, and intrinsics.npy.

  3. configuration of training_sample.yaml and accelerate_training.yaml as follows:

    
    # training_sample.yaml
    experiment:
        type: lrm
        seed: 42
        parent: lrm-objaverse
        child: small-dummyrun
    
    model:
        camera_embed_dim: 1024
        rendering_samples_per_ray: 96
        transformer_dim: 512
        transformer_layers: 12
        transformer_heads: 8
        triplane_low_res: 32
        triplane_high_res: 64
        triplane_dim: 32
        encoder_type: dinov2
        encoder_model_name: dinov2_vits14_reg
        encoder_feat_dim: 384
        encoder_freeze: false
    
    dataset:
        subsets:
            -   name: objaverse
                root_dirs:
                    - "/root/OpenLRM/views" # modified this value
                meta_path:
                    train: "/root/OpenLRM/train_uids.json" # modified this value
                    val: "/root/OpenLRM/val_uids.json" # modified this value
                sample_rate: 1.0
        sample_side_views: 3
        source_image_res: 224
        render_image:
            low: 64
            high: 192
            region: 64
        normalize_camera: true
        normed_dist_to_center: auto
        num_train_workers: 4
        num_val_workers: 2
        pin_mem: true
    
    train:
        mixed_precision: bf16  # REPLACE THIS BASED ON GPU TYPE
        find_unused_parameters: false
        loss:
            pixel_weight: 1.0
            perceptual_weight: 1.0
            tv_weight: 5e-4
        optim:
            lr: 4e-4
            weight_decay: 0.05
            beta1: 0.9
            beta2: 0.95
            clip_grad_norm: 1.0
        scheduler:
            type: cosine
            warmup_real_iters: 3000
        batch_size: 16  # REPLACE THIS (PER GPU)
        accum_steps: 1  # REPLACE THIS
        epochs: 60  # REPLACE THIS
        debug_global_steps: null
    
    val:
        batch_size: 4
        global_step_period: 1000
        debug_batches: null
    
    saver:
        auto_resume: true
        load_model: null
        checkpoint_root: ./exps/checkpoints
        checkpoint_global_steps: 1000
        checkpoint_keep_level: 5
    
    logger:
        stream_level: WARNING
        log_level: INFO
        log_root: ./exps/logs
        tracker_root: ./exps/trackers
        enable_profiler: false
        trackers:
            - tensorboard
        image_monitor:
            train_global_steps: 100
            samples_per_log: 4
    
    compile:
        suppress_errors: true
        print_specializations: true
        disable: true

    # accelerate_training.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: MULTI_GPU
    downcast_bf16: 'no'
    gpu_ids: all
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 4 # only modified this value
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
  4. the error message (see the note after the log):

    [TRAIN STEP]loss=0.624, loss_pixel=0.0577, loss_perceptual=0.566, loss_tv=0.698, lr=8.13e-6: 100%|███████████████████████████████████████████████| 60/60 [04:55<00:00,  4.92s/it]
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/root/OpenLRM/openlrm/launch.py", line 36, in <module>
        main()
      File "/root/OpenLRM/openlrm/launch.py", line 32, in main
        runner.run()
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 338, in run
        self.train()
      File "/root/OpenLRM/openlrm/runners/train/lrm.py", line 343, in train
        self.save_checkpoint()
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 118, in wrapper
        result = accelerated_func(self, *args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 669, in _inner
        return PartialState().on_main_process(function)(*args, **kwargs)
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 246, in save_checkpoint
        cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base))
    ValueError: math domain error
    [2024-04-17 08:24:09,179] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65932 closing signal SIGTERM
    [2024-04-17 08:24:09,183] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65933 closing signal SIGTERM
    [2024-04-17 08:24:09,186] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65934 closing signal SIGTERM
    [2024-04-17 08:24:09,301] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 65931) of binary: /usr/bin/python
    Traceback (most recent call last):
      File "/usr/local/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
        args.func(args)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
        multi_gpu_launcher(args)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
        distrib_run.run(args)
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
        elastic_launch(
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    openlrm.launch FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2024-04-17_08:24:09
      host      : dcf76dfb9908
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 65931)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
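
For context, the failing line quoted in the traceback is cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base)). math.log raises "math domain error" whenever its argument is zero or negative, and the integer division max_ckpt // ckpt_period becomes 0 as soon as the run has produced fewer global steps than one full period. Below is a minimal sketch of that arithmetic, using the 60 steps from the dummy run above and a period of 1000 (both checkpoint_global_steps and val.global_step_period are 1000 in the config; which of them, if either, maps to ckpt_period here is an assumption, as is the value of ckpt_base):

    import math

    # Values assumed from the run above: training stopped after 60 global steps,
    # while the periods configured above (checkpoint/validation) are 1000.
    ckpt_base = 5       # hypothetical base, for illustration only
    ckpt_period = 1000
    max_ckpt = 60       # last global step reached by the short dummy run

    try:
        cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base))
    except ValueError as e:
        # 60 // 1000 == 0, and math.log(0, base) is undefined
        print(e)  # -> math domain error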
kunalkathare commented 4 months ago

Hey @hayoung-jeremy, try reducing the value of global_step_period under val: in the training sample yaml file until it stops giving the error. That worked for me when I was trying to train with 350 objects.
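
For anyone hitting the same error: instead of lowering the period by trial and error, you can roughly estimate how many global steps the run will produce and keep the period at or below that number. The sketch below is only a back-of-envelope estimate under assumptions (effective batch = per-GPU batch_size × num_processes × accum_steps, incomplete batches dropped); with those assumptions it reproduces the 60 steps seen in the progress bar above.

    # Back-of-envelope estimate of total global steps for a planned run.
    # Assumptions: effective batch = batch_size * num_processes * accum_steps,
    # and the last incomplete batch of each epoch is dropped.
    num_train_samples = 100   # ~100 objects rendered with blender_script.py
    batch_size = 16           # per GPU (train.batch_size)
    num_processes = 4         # from accelerate_training.yaml
    accum_steps = 1
    epochs = 60

    steps_per_epoch = max(1, num_train_samples // (batch_size * num_processes * accum_steps))
    total_global_steps = epochs * steps_per_epoch
    print(total_global_steps)  # 60 -> well below the 1000-step periods in the config above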

hayoung-jeremy commented 4 months ago

Wow, you're my savior, thank you so much! I'll try it!

hayoung-jeremy commented 4 months ago

Thank you @kunalkathare, I've tried the following config, with epochs and global_step_period modified:

...

train:
    mixed_precision: bf16
    find_unused_parameters: false
    loss:
        pixel_weight: 1.0
        perceptual_weight: 1.0
        tv_weight: 5e-4
    optim:
        lr: 4e-4
        weight_decay: 0.05
        beta1: 0.9
        beta2: 0.95
        clip_grad_norm: 1.0
    scheduler:
        type: cosine
        warmup_real_iters: 3000
    batch_size: 16 
    accum_steps: 1
    epochs: 100  # MODIFIED : 60 -> 100
    debug_global_steps: null

val:
    batch_size: 4
    global_step_period: 100 # MODIFIED : 1000 -> 100
    debug_batches: null

...

and successfully generated a checkpoint as follows:

[TRAIN STEP]loss=0.642, loss_pixel=0.0695, loss_perceptual=0.572, loss_tv=0.7, lr=1.35e-5: 100%|███████████████████████████████████████████████| 100/100 [03:24<00:00,  5.10s/it]

But it seems the loss value is too high. What should I modify to decrease it? Should I increase the epochs to 1000? And what are the ideal loss values for a successfully trained checkpoint? Could you share your case? Thank you so much for your help.

kunalkathare commented 4 months ago

The loss value comes down when the dataset is larger, and I guess you can increase the epochs and see if that helps.
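
As a side note on the numbers themselves: assuming the logged total is just the weighted sum of the three terms with the weights from the config (pixel_weight 1.0, perceptual_weight 1.0, tv_weight 5e-4), almost all of it comes from the perceptual term, so the total will stay well above the pixel loss even when training is behaving. The arithmetic below matches the progress bar from the 100-epoch run:

    # Recomputing the logged total from its parts, using the weights in the config.
    # Assumption: loss = pixel_weight*pixel + perceptual_weight*perceptual + tv_weight*tv
    pixel_weight, perceptual_weight, tv_weight = 1.0, 1.0, 5e-4
    loss_pixel, loss_perceptual, loss_tv = 0.0695, 0.572, 0.7   # from the 100-epoch run

    total = pixel_weight * loss_pixel + perceptual_weight * loss_perceptual + tv_weight * loss_tv
    print(round(total, 3))  # 0.642, matching the logged loss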

hayoung-jeremy commented 4 months ago

Thank you for the kind reply, @kunalkathare!

Really great help; many thanks for your assistance.