enhuiz / vall-e

An unofficial PyTorch implementation of the audio LM VALL-E
MIT License

Out of Memory error once the model has been loaded. Fails at New Epoch Starts... #76

Open divineSix opened 1 year ago

divineSix commented 1 year ago

I'm trying to train the model on a subset of LibriTTS data. After completing the quantization steps (step 3 in the README), training crashes because it runs out of GPU memory. I've attached the logs below.

I'm using the same training command given in the README. If there is a different command that runs the training script across multiple GPUs with distributed processing, please let me know so I can try it out.

10349it [00:00, 47122.51it/s]
2023-03-23 23:06:45 - vall_e.data - INFO - GR=0;LR=0 - 
{'.': 1, '..': 2, '...': 3, '</s>': 4, '<s>': 5, 'AA0': 6, 'AA1': 7, 'AA2': 8, 'AE0': 9, 'AE1': 10, 'AE2': 11, 'AH0': 12, 'AH1': 13, 'AH2': 14, 'AO0': 15, 'AO1': 16, 'AO2': 17, 'AW0': 18, 'AW1': 19, 'AW2': 20, 'AY0': 21, 'AY1': 22, 'AY2': 23, 'B': 24, 'CH': 25, 'D': 26, 'DH': 27, 'EH0': 28, 'EH1': 29, 'EH2': 30, 'ER0': 31, 'ER1': 32, 'ER2': 33, 'EY0': 34, 'EY1': 35, 'EY2': 36, 'F': 37, 'G': 38, 'HH': 39, 'IH0': 40, 'IH1': 41, 'IH2': 42, 'IY0': 43, 'IY1': 44, 'IY2': 45, 'JH': 46, 'K': 47, 'L': 48, 'M': 49, 'N': 50, 'NG': 51, 'OW0': 52, 'OW1': 53, 'OW2': 54, 'OY1': 55, 'OY2': 56, 'P': 57, 'R': 58, 'S': 59, 'SH': 60, 'T': 61, 'TH': 62, 'UH0': 63, 'UH1': 64, 'UH2': 65, 'UW0': 66, 'UW1': 67, 'UW2': 68, 'V': 69, 'W': 70, 'Y': 71, 'Z': 72, 'ZH': 73, '_': 74}
2023-03-23 23:06:45 - vall_e.data - INFO - GR=0;LR=0 - 
{'116': 0, '1255': 1, '1272': 2, '1462': 3, '1585': 4, '1630': 5, '1650': 6, '1651': 7, '1673': 8, '1686': 9, '1701': 10, '174': 11, '1919': 12, '1988': 13, '1993': 14, '2035': 15, '2078': 16, '2086': 17, '2277': 18, '2412': 19, '2428': 20, '2506': 21, '251': 22, '2803': 23, '2902': 24, '3000': 25, '3081': 26, '3170': 27, '3536': 28, '3576': 29, '3660': 30, '3663': 31, '3752': 32, '3853': 33, '3915': 34, '4153': 35, '422': 36, '4323': 37, '4515': 38, '4570': 39, '4572': 40, '4831': 41, '5338': 42, '5536': 43, '5543': 44, '5694': 45, '5849': 46, '5895': 47, '6123': 48, '6241': 49, '6267': 50, '6295': 51, '6313': 52, '6319': 53, '6345': 54, '6455': 55, '6467': 56, '652': 57, '6599': 58, '6841': 59, '700': 60, '7601': 61, '7641': 62, '7697': 63, '777': 64, '7850': 65, '7976': 66, '8173': 67, '8254': 68, '8288': 69, '8297': 70, '84': 71, '8842': 72}
2023-03-23 23:06:45 - vall_e.data - INFO - GR=0;LR=0 - 
#samples (train): 3581.
2023-03-23 23:06:45 - vall_e.data - INFO - GR=0;LR=0 - 
#samples (val): 20.
[2023-03-23 23:06:47,244] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2023-03-23 23:06:48 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Added key: store_based_barrier_key:1 to store for rank: 0
2023-03-23 23:06:48 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023-03-23 23:06:50 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Added key: store_based_barrier_key:2 to store for rank: 0
2023-03-23 23:06:50 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-03-23 23:06:50,948] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home2/skosgi242/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home2/skosgi242/.cache/torch_extensions/py310_cu102/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.19734907150268555 seconds
[2023-03-23 23:06:51,957] [INFO] [logging.py:93:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-03-23 23:06:51,965] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-03-23 23:06:51,965] [INFO] [logging.py:93:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2023-03-23 23:06:51,979] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2023-03-23 23:06:51,980] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2023-03-23 23:06:51,980] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7f563d761c90>
[2023-03-23 23:06:51,980] [INFO] [logging.py:93:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
[2023-03-23 23:06:51,981] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-03-23 23:06:51,981] [INFO] [config.py:1022:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-03-23 23:06:51,981] [INFO] [config.py:1022:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-23 23:06:51,981] [INFO] [config.py:1022:print]   amp_enabled .................. False
[2023-03-23 23:06:51,981] [INFO] [config.py:1022:print]   amp_params ................... False
[2023-03-23 23:06:51,981] [INFO] [config.py:1022:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-03-23 23:06:51,981] [INFO] [config.py:1022:print]   bfloat16_enabled ............. False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   checkpoint_parallel_write_pipeline  False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   checkpoint_tag_validation_enabled  True
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   checkpoint_tag_validation_fail  False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f563d761960>
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   communication_data_type ...... None
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   curriculum_enabled_legacy .... False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   curriculum_params_legacy ..... False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   data_efficiency_enabled ...... False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   dataloader_drop_last ......... False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   disable_allgather ............ False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   dump_state ................... False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   dynamic_loss_scale_args ...... None
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   eigenvalue_enabled ........... False
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   eigenvalue_gas_boundary_resolution  1
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   eigenvalue_layer_num ......... 0
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   eigenvalue_max_iter .......... 100
[2023-03-23 23:06:51,982] [INFO] [config.py:1022:print]   eigenvalue_stability ......... 1e-06
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   eigenvalue_tol ............... 0.01
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   eigenvalue_verbose ........... False
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   elasticity_enabled ........... False
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   fp16_auto_cast ............... False
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   fp16_enabled ................. True
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   fp16_master_weights_and_gradients  False
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   global_rank .................. 0
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   grad_accum_dtype ............. None
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   gradient_accumulation_steps .. 1
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   gradient_clipping ............ 100.0
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   gradient_predivide_factor .... 1.0
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   initial_dynamic_scale ........ 65536
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   load_universal_checkpoint .... False
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   loss_scale ................... 0
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   memory_breakdown ............. False
[2023-03-23 23:06:51,983] [INFO] [config.py:1022:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   optimizer_legacy_fusion ...... False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   optimizer_name ............... adam
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   optimizer_params ............. None
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   pld_enabled .................. False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   pld_params ................... False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   prescale_gradients ........... False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   scheduler_name ............... WarmupDecayLR
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   scheduler_params ............. {'warmup_min_lr': 1e-06, 'warmup_max_lr': 0.0002, 'warmup_num_steps': 1000, 'total_num_steps': 1000000, 'warmup_type': 'linear'}
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   sparse_attention ............. None
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   sparse_gradients_enabled ..... False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   steps_per_print .............. 10
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   train_batch_size ............. 24
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   train_micro_batch_size_per_gpu  24
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   use_node_local_storage ....... False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   wall_clock_breakdown ......... False
[2023-03-23 23:06:51,984] [INFO] [config.py:1022:print]   world_size ................... 1
[2023-03-23 23:06:51,985] [INFO] [config.py:1022:print]   zero_allow_untested_optimizer  False
[2023-03-23 23:06:51,985] [INFO] [config.py:1022:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-23 23:06:51,985] [INFO] [config.py:1022:print]   zero_enabled ................. False
[2023-03-23 23:06:51,985] [INFO] [config.py:1022:print]   zero_force_ds_cpu_optimizer .. True
[2023-03-23 23:06:51,985] [INFO] [config.py:1022:print]   zero_optimization_stage ...... 0
[2023-03-23 23:06:51,986] [INFO] [config.py:1007:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 24, 
    "gradient_accumulation_steps": 1, 
    "optimizer": {
        "type": "Adam", 
        "lr": 1e-06
    }, 
    "scheduler": {
        "type": "WarmupDecayLR", 
        "params": {
            "warmup_min_lr": 1e-06, 
            "warmup_max_lr": 0.0002, 
            "warmup_num_steps": 1000, 
            "total_num_steps": 1.000000e+06, 
            "warmup_type": "linear"
        }
    }, 
    "gradient_clipping": 100.0, 
    "fp16": {
        "enabled": true
    }
}
Using /home2/skosgi242/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Emitting ninja build file /home2/skosgi242/.cache/torch_extensions/py310_cu102/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.19100594520568848 seconds
[2023-03-23 23:06:52,180] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loading checkpoint from ckpts/LibriTTS/nar/model/default/mp_rank_00_model_states.pt...
[2023-03-23 23:06:53,479] [INFO] [torch_checkpoint_engine.py:25:load] [Torch] Loaded checkpoint from ckpts/LibriTTS/nar/model/default/mp_rank_00_model_states.pt.
[2023-03-23 23:06:53,534] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loading checkpoint from ckpts/LibriTTS/nar/model/default/mp_rank_00_model_states.pt...
[2023-03-23 23:06:54,872] [INFO] [torch_checkpoint_engine.py:25:load] [Torch] Loaded checkpoint from ckpts/LibriTTS/nar/model/default/mp_rank_00_model_states.pt.
2023-03-23 23:06:55 - vall_e.utils.trainer - INFO - GR=0;LR=0 - 
{
  "batch_size": 24,
  "cache_dataloader": false,
  "cache_dir": ".cache/LibriTTS/nar",
  "cfg_name": "LibriTTS/nar",
  "cfg_relpath": null,
  "ckpt_dir": "ckpts/LibriTTS/nar",
  "ckpt_root": "ckpts",
  "data_dirs": "[PosixPath('/ssd_scratch/cvit/skosgi242/vall_e_data/libriTTS')]",
  "data_root": "data",
  "device": "cuda",
  "dis_warmup_max_lr": 0.0004,
  "ds_cfg": {
    "train_micro_batch_size_per_gpu": 24,
    "gradient_accumulation_steps": 1,
    "optimizer": {
      "type": "Adam",
      "lr": 1e-06
    },
    "scheduler": {
      "type": "WarmupDecayLR",
      "params": {
        "warmup_min_lr": 1e-06,
        "warmup_max_lr": 0.0002,
        "warmup_num_steps": 1000,
        "total_num_steps": 1000000,
        "warmup_type": "linear"
      }
    },
    "gradient_clipping": 100.0,
    "fp16": {
      "enabled": true
    }
  },
  "eval_batch_size": 24,
  "eval_every": 1000,
  "fp16_cfg": {
    "enabled": true
  },
  "git_commit": "3476d393d2133fa9b50d5ad999ca13b95fc22060",
  "git_status": "On branch main\nYour branch is up to date with 'origin/main'.\n\nChanges not staged for commit:\n  (use \"git add <file>...\" to update what will be committed)\n  (use \"git checkout -- <file>...\" to discard changes in working directory)\n\n\tmodified:   config/LibriTTS/nar.yml\n\tmodified:   vall_e/config.py\n\nno changes added to commit (use \"git add\" and/or \"git commit -a\")",
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 100.0,
  "log_dir": "logs/LibriTTS/nar/1679593004",
  "log_root": "logs",
  "max_grad_norm": null,
  "max_iter": 1000000,
  "max_num_val": 20,
  "max_phones": 50,
  "max_prompts": 3,
  "max_val_ar_steps": 300,
  "min_phones": 10,
  "model": "nar",
  "nj": 8,
  "num_tokens": 1024,
  "p_additional_prompt": 0.8,
  "relpath": "LibriTTS/nar",
  "sample_rate": 24000,
  "sampling_temperature": 0.2,
  "save_artifacts_every": 100,
  "save_ckpt_every": 10000,
  "save_on_oom": true,
  "save_on_quit": true,
  "spkr_name_getter": "lambda p: p.parts[-3]",
  "start_time": 1679593004,
  "token_dim": 256,
  "use_fp16": true,
  "warmup_max_lr": 0.0002,
  "warmup_min_lr": 1e-06,
  "warmup_num_steps": 1000
}
2023-03-23 23:06:55 - vall_e.utils.trainer - INFO - GR=0;LR=0 - 
New epoch starts.
[2023-03-23 23:06:57,811] [INFO] [logging.py:93:log_dist] [Rank 0] [Torch] Checkpoint default is about to be saved!
/home2/skosgi242/speech_resynthesis_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-03-23 23:06:57,817] [INFO] [logging.py:93:log_dist] [Rank 0] Saving model checkpoint: ckpts/LibriTTS/nar/model/default/mp_rank_00_model_states.pt
[2023-03-23 23:06:57,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saving ckpts/LibriTTS/nar/model/default/mp_rank_00_model_states.pt...
[2023-03-23 23:07:23,579] [INFO] [torch_checkpoint_engine.py:19:save] [Torch] Saved ckpts/LibriTTS/nar/model/default/mp_rank_00_model_states.pt.
[2023-03-23 23:07:23,580] [INFO] [torch_checkpoint_engine.py:29:commit] [Torch] Checkpoint default is ready now!
Traceback (most recent call last):
  File "/usr/local/apps/python-3.10.2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/apps/python-3.10.2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home2/skosgi242/MLTTS/vall-e/vall_e/train.py", line 128, in <module>
    main()
  File "/home2/skosgi242/MLTTS/vall-e/vall_e/train.py", line 119, in main
    trainer.train(
  File "/home2/skosgi242/MLTTS/vall-e/vall_e/utils/trainer.py", line 155, in train
    stats = engines.step(feeder=train_feeder, batch=batch)
  File "/home2/skosgi242/MLTTS/vall-e/vall_e/utils/engines.py", line 178, in step
    raise RuntimeError("Out of memory!")
RuntimeError: Out of memory!
acsweet commented 1 year ago

Not sure if you solved this yet (and I'm no expert in this!), but maybe you can try something like this?

python -m torch.distributed.launch --nproc_per_node 4 -m vall_e.train yaml=config/your_data/ar_or_nar.yml
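On newer PyTorch versions, torch.distributed.launch prints a deprecation warning and points to torchrun, so an equivalent invocation might be worth trying too. This is only a sketch: it assumes the vall_e trainer picks up the standard RANK/LOCAL_RANK/WORLD_SIZE environment variables that the launcher sets, which I haven't verified for this repo.

# rough torchrun equivalent of the command above (untested here)
torchrun --nproc_per_node 4 -m vall_e.train yaml=config/your_data/ar_or_nar.yml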

ilanshib commented 1 year ago

Your batch size is 24. Did you try using a machine with more GPU memory (e.g. 48 GB or more)?
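Alternatively, you could shrink the batch size until the NAR model fits on the GPU you already have. The trainer config dumped above shows batch_size: 24; a minimal sketch of the change, assuming batch_size and eval_batch_size are the keys exposed in config/LibriTTS/nar.yml (pick a value that avoids the OOM):

# config/LibriTTS/nar.yml -- hypothetical override, reducing the per-GPU batch
batch_size: 8
eval_batch_size: 8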