Regarding Multinode Training

nonmetal commented 6 months ago

Hello, thank you very much for providing such great training code. Our team is currently trying to train a new contentvec model to reduce language dependency, so we are attempting to train on data other than LibriSpeech.

However, I have a question about an error in the multinode environment. Although we set PROC_PER_NODE in run_pretrain_multi.sh to match the number of GPUs currently available, only one GPU is detected during actual training. The same thing happens when we follow the example (LibriSpeech) exactly. I have attached the relevant script settings and the log below.

# set up environment variables for Torch DistributedDataParallel
WORLD_SIZE_JOB=\$SLURM_NTASKS
RANK_NODE=\$SLURM_NODEID
PROC_PER_NODE=4
MASTER_ADDR_JOB=\$SLURM_SUBMIT_HOST
MASTER_PORT_JOB="12234"
DDP_BACKEND=c10d

[2023-12-22 08:59:33,032][fairseq_cli.train][INFO] - {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 200, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': 'tblog', 'wandb_project': None, 'azureml_logging': False, 'seed': 1337, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': 29671, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'no_c10d', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': True, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 
'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 10, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': 500000, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 5, 'validate_interval_updates': 10000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 500000, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 100000, 'stop_time_hours': 0.0, 'clip_norm': 10.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 25000, 'keep_interval_updates': 1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': 10, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 
'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'contentvec', 'label_rate': 50, 'extractor_mode': 'default', 'encoder_layers': 12, 'encoder_layers_1': 3, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': 'gelu', 'ctr_layers': [-6], 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'logit_temp_ctr': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 
'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'num_negatives': 100, 'cross_sample_negatives': 0, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False}, 'task': {'_name': 'contentvec_pretraining', 'data': './contentvec/metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': './contentvec/label', 'label_rate': 50, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'crop': True, 'pad_audio': False, 'spk2info': './contentvec/metadata/output_all.dict'}, 'criterion': {'_name': 'contentvec', 'pred_masked_weight': 1.0, 'pred_nomask_weight': 0.0, 'loss_weights': [10.0, 1e-05], 'log_keys': []}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 8000, 'force_anneal': None, 'end_learning_rate': 0.0, 'power': 1.0, 'total_num_update': 100000, 'lr': [0.0005]}, 'scoring': None, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}, 'job_logging_cfg': {'version': 1, 'formatters': {'simple': {'format': '[%(asctime)s][%(name)s][%(levelname)s] - %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'stream': 'ext://sys.stdout'}, 'file': {'class': 'logging.FileHandler', 'formatter': 'simple', 'filename': 'hydra_train.log'}}, 'root': {'level': 'INFO', 'handlers': ['console', 'file']}, 'disable_existing_loggers': False}}
[2023-12-22 08:59:34,900][fairseq_cli.train][INFO] - task: ContentvecPretrainingTask
[2023-12-22 08:59:34,900][fairseq_cli.train][INFO] - model: ContentvecModel
[2023-12-22 08:59:34,900][fairseq_cli.train][INFO] - criterion: ContentvecCriterion
[2023-12-22 08:59:34,902][fairseq_cli.train][INFO] - num. shared model params: 118,901,376 (num. trained: 118,901,376)
[2023-12-22 08:59:34,903][fairseq_cli.train][INFO] - num. expert model params: 0 (num. trained: 0)
[2023-12-22 08:59:35,054][fairseq.data.audio.contentvec_dataset][INFO] - max_keep=None, min_keep=32000, loaded 14783, skipped 1216 short and 0 long, longest-loaded=376963, shortest-loaded=32000
[2023-12-22 08:59:35,470][fairseq.data.audio.contentvec_dataset][INFO] - pad_audio=False, random_crop=True, crop=True, normalize=False, max_sample_size=250000
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.1.0.bias
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.2.0.bias
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.3.0.bias
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.4.0.bias
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.5.0.bias
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.6.0.bias
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.12.self_attn_layer_norm.weight_ln.bias
[2023-12-22 08:59:39,611][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.12.self_attn_layer_norm.bias_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.12.final_layer_norm.weight_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.12.final_layer_norm.bias_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.13.self_attn_layer_norm.weight_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.13.self_attn_layer_norm.bias_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.13.final_layer_norm.weight_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.13.final_layer_norm.bias_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.14.self_attn_layer_norm.weight_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.14.self_attn_layer_norm.bias_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.14.final_layer_norm.weight_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.layers.14.final_layer_norm.bias_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.cond_layer_norm.weight_ln.bias
[2023-12-22 08:59:39,612][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.cond_layer_norm.bias_ln.bias
[2023-12-22 08:59:39,613][fairseq.utils][INFO] - ***********************CUDA enviroments for all 1 workers***********************
[2023-12-22 08:59:39,613][fairseq.utils][INFO] - rank   0: capabilities =  7.0  ; total memory = 15.772 GB ; name = Tesla V100-SXM2-16GB                    
[2023-12-22 08:59:39,613][fairseq.utils][INFO] - ***********************CUDA enviroments for all 1 workers***********************
[2023-12-22 08:59:39,613][fairseq_cli.train][INFO] - training on 1 devices (GPUs/TPUs)
[2023-12-22 08:59:39,613][fairseq_cli.train][INFO] - max tokens per device = 500000 and max sentences per device = None
[2023-12-22 08:59:39,614][fairseq.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
[2023-12-22 08:59:39,615][fairseq.trainer][INFO] - No existing checkpoint found checkpoints/checkpoint_last.pt
[2023-12-22 08:59:39,615][fairseq.trainer][INFO] - loading train data for epoch 1
[2023-12-22 08:59:40,755][fairseq.data.audio.contentvec_dataset][INFO] - max_keep=None, min_keep=32000, loaded 1161315, skipped 44395 short and 0 long, longest-loaded=1589248, shortest-loaded=32000
[2023-12-22 08:59:50,937][fairseq.data.audio.contentvec_dataset][INFO] - pad_audio=False, random_crop=True, crop=True, normalize=False, max_sample_size=250000
[2023-12-22 08:59:55,910][fairseq.trainer][INFO] - begin training epoch 1
[2023-12-22 08:59:55,911][fairseq_cli.train][INFO] - Start iterating over samples
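
For anyone comparing a log like the one above against the script settings, the relevant fields can be pulled out of the config dump with a few lines of Python. This is only a minimal sketch: the helper name `distributed_settings` and the `sample` string are hypothetical, and the key names are the ones that appear in the `distributed_training` section of the dump. In the log above they read 1, 1 and no_c10d, whereas the script sets PROC_PER_NODE=4 and DDP_BACKEND=c10d, which suggests the script values did not reach fairseq.

```python
import re

# Keys copied from the 'distributed_training' section of the config dump.
KEYS = ("distributed_world_size", "nprocs_per_node", "ddp_backend")


def distributed_settings(log_text):
    """Return the first logged value of each key as a string, or None if absent.

    Hypothetical helper for illustration; it is not part of fairseq.
    """
    found = {}
    for key in KEYS:
        # Match either a quoted string value or a bare token such as 1 or True.
        pattern = r"'%s':\s*('[^']*'|[^,}\s]+)" % re.escape(key)
        match = re.search(pattern, log_text)
        found[key] = match.group(1).strip("'") if match else None
    return found


# Shortened excerpt of the config dump, used here as sample input.
sample = "'distributed_world_size': 1, 'ddp_backend': 'no_c10d', 'nprocs_per_node': 1, "
print(distributed_settings(sample))
```

Passing the contents of hydra_train.log instead of `sample` should work the same way, since only the first occurrence of each key is read.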

I would like to ask whether there is a mistake on our side that causes this, and whether I should refer to the fairseq code. I would also like to ask whether you plan to provide a fine-tuning implementation in the future.

auspicious3000 commented 6 months ago

This is essentially a question of how to run DDP in PyTorch under Slurm, because fairseq wraps PyTorch's DDP; it is a general problem, not one specific to our repo. On the Slurm side, as far as I know, in newer versions of Slurm the variables on lines 6 to 11 of run_pretrain_multi.sh are set by Slurm itself, so those lines can be safely commented out. On the PyTorch side, the fairseq config has a distributed_training section where you can experiment with different settings.
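
As a rough illustration of the fairseq side, the sketch below builds Hydra-style overrides for the distributed_training section from Slurm's environment. It is not the repo's script: the function name `fairseq_distributed_overrides` is hypothetical, and it assumes one Slurm task per node that spawns `proc_per_node` workers. `SLURM_NNODES` is the node count exported by Slurm, and the two config keys are the ones that appear in the log above.

```python
def fairseq_distributed_overrides(env, proc_per_node):
    """Build overrides for fairseq's distributed_training config section.

    Hypothetical helper. Assumes one Slurm task per node, each spawning
    proc_per_node workers; adjust if the job runs one task per GPU.
    """
    # SLURM_NNODES is set by Slurm inside a job; default to a single node.
    num_nodes = int(env.get("SLURM_NNODES", "1"))
    world_size = num_nodes * proc_per_node
    return [
        f"distributed_training.distributed_world_size={world_size}",
        f"distributed_training.nprocs_per_node={proc_per_node}",
    ]


# Example with an explicit environment: 2 nodes with 4 GPUs each.
overrides = fairseq_distributed_overrides({"SLURM_NNODES": "2"}, proc_per_node=4)
print(" ".join(overrides))
```

Inside a real job, `os.environ` can be passed in place of the example dictionary, and the resulting strings appended to the training command as command-line overrides.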

As for the fine-tuning code, it is available in fairseq's repo.

nonmetal commented 6 months ago

Thank you for your comments! I will try again based on them 🙏

nonmetal commented 5 months ago

I was able to solve the distributed training problem based on your information! Huge thanks 🙏

li-henan commented 2 months ago

Dear friend, I am trying to fine-tune contentvec. Did you fine-tune using the training code directly, or did you write your own code referring to fairseq? Could you tell me how to fine-tune contentvec using this code? I would sincerely appreciate an answer. Thank you very much.

auspicious3000 / contentvec

Regarding Multinode Training #20