Training stuck at "new epoch starts"

cantabile-kwok commented 1 year ago

Hi, and thanks for the great work! I have finished all the preliminary steps and used python -m vall_e.train yaml=config/test/ar.yml to train. It outputs something like this:

{'data_dirs': ['data/test'], 'model': 'ar-quarter', 'batch_size': 1, 'eval_batch_size': 1, 'save_ckpt_every': 500, 'eval_every': 500, 'max_iter': 1000, 'cfg_name': PosixPath('test/ar')} {}
2it [00:00, 1906.94it/s]
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 - 
{'</s>': 1, '<s>': 2, 'AH0': 3, 'D': 4, 'ER1': 5, 'HH': 6, 'L': 7, 'OW1': 8, 'W': 9, '_': 10}
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 - 
{'test': 0}
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 - 
#samples (train): 2.
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 - 
#samples (val): 0.
[2023-02-28 00:43:47,269] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2023-02-28 00:43:47 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Added key: store_based_barrier_key:1 to store for rank: 0
2023-02-28 00:43:47 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023-02-28 00:43:51 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Added key: store_based_barrier_key:2 to store for rank: 0
2023-02-28 00:43:51 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 - 
Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-02-28 00:43:51,787] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.10433101654052734 seconds
[2023-02-28 00:43:52,152] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-02-28 00:43:52,155] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-02-28 00:43:52,155] [INFO] [logging.py:75:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2023-02-28 00:43:52,165] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2023-02-28 00:43:52,166] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2023-02-28 00:43:52,166] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fa2e7319ed0>
[2023-02-28 00:43:52,166] [INFO] [logging.py:75:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
[2023-02-28 00:43:52,166] [INFO] [config.py:1009:print] DeepSpeedEngine configuration:
[2023-02-28 00:43:52,166] [INFO] [config.py:1013:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-02-28 00:43:52,166] [INFO] [config.py:1013:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   amp_enabled .................. False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   amp_params ................... False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   bfloat16_enabled ............. False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   checkpoint_parallel_write_pipeline  False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   checkpoint_tag_validation_enabled  True
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   checkpoint_tag_validation_fail  False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa2e7319ae0>
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   communication_data_type ...... None
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   curriculum_enabled_legacy .... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   curriculum_params_legacy ..... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   data_efficiency_enabled ...... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   dataloader_drop_last ......... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   disable_allgather ............ False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   dump_state ................... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   dynamic_loss_scale_args ...... None
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_enabled ........... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_gas_boundary_resolution  1
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_layer_num ......... 0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_max_iter .......... 100
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_stability ......... 1e-06
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_tol ............... 0.01
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   eigenvalue_verbose ........... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   elasticity_enabled ........... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   fp16_auto_cast ............... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   fp16_enabled ................. True
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   fp16_master_weights_and_gradients  False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   global_rank .................. 0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   grad_accum_dtype ............. None
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   gradient_accumulation_steps .. 1
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   gradient_clipping ............ 100.0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   gradient_predivide_factor .... 1.0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   initial_dynamic_scale ........ 65536
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   load_universal_checkpoint .... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   loss_scale ................... 0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   memory_breakdown ............. False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   optimizer_legacy_fusion ...... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   optimizer_name ............... adam
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   optimizer_params ............. None
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   pld_enabled .................. False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   pld_params ................... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   prescale_gradients ........... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   scheduler_name ............... WarmupDecayLR
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   scheduler_params ............. {'warmup_min_lr': 1e-06, 'warmup_max_lr': 0.0002, 'warmup_num_steps': 1000, 'total_num_steps': 1000, 'warmup_type': 'linear'}
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   sparse_attention ............. None
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   sparse_gradients_enabled ..... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   steps_per_print .............. 10
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   train_batch_size ............. 1
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   train_micro_batch_size_per_gpu  1
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   use_node_local_storage ....... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   wall_clock_breakdown ......... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   world_size ................... 1
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   zero_allow_untested_optimizer  False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   zero_enabled ................. False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print]   zero_optimization_stage ...... 0
[2023-02-28 00:43:52,169] [INFO] [config.py:998:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 1, 
    "optimizer": {
        "type": "Adam", 
        "lr": 1e-06
    }, 
    "scheduler": {
        "type": "WarmupDecayLR", 
        "params": {
            "warmup_min_lr": 1e-06, 
            "warmup_max_lr": 0.0002, 
            "warmup_num_steps": 1000, 
            "total_num_steps": 1000, 
            "warmup_type": "linear"
        }
    }, 
    "gradient_clipping": 100.0, 
    "fp16": {
        "enabled": true
    }
}
Using /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Emitting ninja build file /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.12616920471191406 seconds
[2023-02-28 00:43:52,296] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt...
[2023-02-28 00:43:52,350] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt.
[2023-02-28 00:43:52,351] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt...
[2023-02-28 00:43:52,400] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt.
fatal: Not a git repository (or any parent up to mount point /mnt/lustre)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /mnt/lustre)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2023-02-28 00:43:52 - vall_e.utils.trainer - INFO - GR=0;LR=0 - 
{
  "batch_size": 1,
  "cache_dataloader": false,
  "cache_dir": ".cache/test/ar",
  "cfg_name": "test/ar",
  "cfg_relpath": null,
  "ckpt_dir": "ckpts/test/ar",
  "ckpt_root": "ckpts",
  "data_dirs": "[PosixPath('data/test')]",
  "data_root": "data",
  "device": "cuda",
  "dis_warmup_max_lr": 0.0004,
  "ds_cfg": {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {
      "type": "Adam",
      "lr": 1e-06
    },
    "scheduler": {
      "type": "WarmupDecayLR",
      "params": {
        "warmup_min_lr": 1e-06,
        "warmup_max_lr": 0.0002,
        "warmup_num_steps": 1000,
        "total_num_steps": 1000,
        "warmup_type": "linear"
      }
    },
    "gradient_clipping": 100.0,
    "fp16": {
      "enabled": true
    }
  },
  "eval_batch_size": 1,
  "eval_every": 500,
  "fp16_cfg": {
    "enabled": true
  },
  "git_commit": "",
  "git_status": "",
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 100.0,
  "log_dir": "logs/test/ar/1677516227",
  "log_root": "logs",
  "max_grad_norm": null,
  "max_iter": 1000,
  "max_num_val": 20,
  "max_phones": 50,
  "max_prompts": 3,
  "max_val_ar_steps": 300,
  "min_phones": 10,
  "model": "ar-quarter",
  "nj": 8,
  "num_tokens": 1024,
  "p_additional_prompt": 0.8,
  "relpath": "test/ar",
  "sample_rate": 24000,
  "sampling_temperature": 1.0,
  "save_artifacts_every": 100,
  "save_ckpt_every": 500,
  "save_on_oom": true,
  "save_on_quit": true,
  "spkr_name_getter": "lambda p: p.parts[-2]",
  "start_time": 1677516227,
  "token_dim": 256,
  "use_fp16": true,
  "warmup_max_lr": 0.0002,
  "warmup_min_lr": 1e-06,
  "warmup_num_steps": 1000
}
2023-02-28 00:43:52 - vall_e.utils.trainer - INFO - GR=0;LR=0 - 
New epoch starts.

Then it stays stuck there forever, no matter what I press. If I press Ctrl-C, the program just quits with no error message. This is strange, because I cannot tell where the program halted or how long I would have to wait.

cantabile-kwok commented 1 year ago

By the way, I'm using a server with the Slurm job scheduler. If I submit a job to run the training program on a remote node, it gives me this error message:

Traceback (most recent call last):
  File "/mnt/lustre/sjtu/home/ywg12/.conda/envs/valle/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/lustre/sjtu/home/ywg12/.conda/envs/valle/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/train.py", line 130, in <module>
    main()
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/train.py", line 121, in main
    trainer.train(
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/utils/trainer.py", line 143, in train
    command = _non_blocking_input()
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/utils/trainer.py", line 91, in _non_blocking_input
    selector = _get_stdin_selector()
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/utils/trainer.py", line 82, in _get_stdin_selector
    selector.register(fileobj=sys.stdin, events=selectors.EVENT_READ)
  File "/mnt/lustre/sjtu/home/ywg12/.conda/envs/valle/lib/python3.10/selectors.py", line 360, in register
    self._selector.register(key.fd, poller_events)
PermissionError: [Errno 1] Operation not permitted

I suppose this is because the non-blocking stdin read does not have permission on the remote node. This feature is fancy, but how can I turn it off?

cantabile-kwok commented 1 year ago

Alright, I guess the initial problem, where the program gets stuck, is simply because the model had already reached 1000 steps, which is the maximum. But I still have the problem with the non-blocking input. Looking forward to any help!

yiwei0730 commented 1 year ago

You should type 'quit' if you want to exit the process. Maybe you can also check the config for your maximum step!

cantabile-kwok commented 1 year ago

Update: The initial problem occurred because the training process had already reached its maximum step. I then deleted the non-blocking input code so that it can run on remote servers.
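
For readers hitting the same silent hang: the log above shows a checkpoint being loaded from ckpts/test/ar while max_iter is 1000, so a resumed run that is already at step 1000 has nothing left to do. A minimal sketch of a check that would make this visible is shown below. The function names are hypothetical and are not part of vall-e; the step counter and max_iter are assumed to be available as plain integers.

```python
def remaining_steps(global_step: int, max_iter: int) -> int:
    """Number of training steps still allowed; never negative."""
    return max(max_iter - global_step, 0)


def describe_training_budget(global_step: int, max_iter: int) -> str:
    """Return a human-readable message about how much training is left.

    Hypothetical helper: call it right after the checkpoint is loaded so a
    run that is already finished says so instead of appearing to hang.
    """
    left = remaining_steps(global_step, max_iter)
    if left == 0:
        return (
            f"Checkpoint is already at step {global_step}, which reaches "
            f"max_iter={max_iter}. Nothing to train: raise max_iter or "
            f"start from a fresh checkpoint directory."
        )
    return f"Resuming at step {global_step}; {left} steps left until max_iter={max_iter}."


if __name__ == "__main__":
    # Values taken from the report: the model was at 1000 steps, max_iter was 1000.
    print(describe_training_budget(global_step=1000, max_iter=1000))
```

Printing or logging such a message before the training loop starts would have pointed directly at the max_iter setting in the config.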

MajoRoth commented 1 year ago

> By the way, I'm using a server with the Slurm job scheduler. If I submit a job to run the training program on a remote node, it gives me this error message:
>
> [...]
> PermissionError: [Errno 1] Operation not permitted
>
> I suppose this is because the non-blocking stdin read does not have permission on the remote node. This feature is fancy, but how can I turn it off?

have you figured out a solution for this?

cantabile-kwok commented 1 year ago

I made some modifications to the code. Specifically, I deleted everything related to the non-blocking stdin. As I recall, this requires changing one file. @MajoRoth
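
An alternative to deleting the feature is to let it degrade gracefully when stdin cannot be polled, which is likely the situation in a Slurm batch job where stdin is not an interactive terminal. The sketch below is an assumption about how that could look, not the project's actual code: the names make_stdin_selector and non_blocking_input are hypothetical stand-ins for the _get_stdin_selector and _non_blocking_input functions seen in the traceback.

```python
import selectors
import sys


def make_stdin_selector(stream=None):
    """Try to register `stream` (default: sys.stdin) for read events.

    Returns the selector on success, or None when the stream cannot be
    polled, for example when registration raises PermissionError as in
    the traceback above.
    """
    stream = sys.stdin if stream is None else stream
    selector = selectors.DefaultSelector()
    try:
        selector.register(fileobj=stream, events=selectors.EVENT_READ)
    except (PermissionError, ValueError, OSError):
        # stdin is not pollable here; disable interactive commands.
        selector.close()
        return None
    return selector


def non_blocking_input(selector):
    """Return one pending line of input, or "" if none is available.

    With selector=None (polling unavailable) this always returns "",
    so the training loop simply never sees a command.
    """
    if selector is None:
        return ""
    line = ""
    for key, _events in selector.select(timeout=0):
        line = key.fileobj.readline().strip()
    return line
```

With this shape, the training loop can keep calling non_blocking_input every step: on an interactive terminal, commands such as 'quit' still work, and on a remote node the call returns an empty string instead of raising.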

enhuiz / vall-e

Training stuck at "new epoch starts" #58