facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

"OOM during optimization" when fine-tuning NLLB #4930

Open zgerrard opened 1 year ago

zgerrard commented 1 year ago

❓ Questions and Help

What is your question?

Hi, I am getting "OOM during optimization, irrecoverable" when trying to fine-tune the 3.3B parameter NLLB model.

Stack trace:
Traceback (most recent call last):
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/trainer.py", line 1147, in train_step
    raise e
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/trainer.py", line 1099, in train_step
    self.task.optimizer_step(
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/tasks/fairseq_task.py", line 550, in optimizer_step
    optimizer.step()
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/optim/fp16_optimizer.py", line 440, in step
    self.wrapped_optimizer.step(
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/optim/fairseq_optimizer.py", line 120, in step
    self.optimizer.step(closure, scale=scale)
  File "/home/x/.local/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/optim/fused_adam.py", line 209, in step
    exp_avg = exp_avg.float() * state["exp_avg_scale"]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.11 GiB (GPU 0; 23.70 GiB total capacity; 20.43 GiB already allocated; 2.13 GiB free; 20.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
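
For what it's worth, the allocator hint at the end of that message can be tried by exporting the variable before launching training (a sketch; the 128 MB split size is only an example value, and this addresses fragmentation, not a genuine shortage of memory):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py ...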

Any ideas? Any help will be greatly appreciated.

What have you tried?

I tried fine-tuning the smaller models as well; only the 600M-parameter (smallest) model trained without hitting the error above.

What's your environment?

FayZ676 commented 1 year ago

What were your hyperparameter settings?

zgerrard commented 1 year ago

@FayZ676 I used the default parameters from nllb200_dense3.3B_finetune_on_fbseed.yaml and only changed the dataset path. I also tried lowering max_tokens, but it didn't fix the error.
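
(A possible reason max_tokens doesn't help: it only shrinks activation memory, while this OOM happens inside the optimizer step, whose footprint scales with parameter count. As a rough sketch, assuming fp16 weights, fp16 gradients, and the fp16 Adam statistics (fp16_adam_stats=True) shown in the config below:

  weights:       3.3e9 params x 2 bytes     ≈  6.6 GB
  gradients:     3.3e9 params x 2 bytes     ≈  6.6 GB
  Adam moments:  2 x 3.3e9 params x 2 bytes ≈ 13.2 GB

That is roughly 26 GB before any activations, against 23.7 GiB of capacity, and the failing line in the trace additionally upcasts a moment buffer to fp32, which needs transient memory on top of that.)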

zgerrard commented 1 year ago

All hyperparameters:

{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': 'out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', 'wandb_project': None, 'azureml_logging': False, 'seed': 2, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False, 'moe_generation': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'fully_sharded', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False, 'freeze_up_to_layer': None}, 'dataset': {'_name': None, 'num_workers': 1, 'num_workers_valid': 0, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': 100, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1000, 'validate_interval_updates': 10, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 100, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [5e-05], 'stop_min_lr': 1e-09, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': 
'out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': '/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', 'ignore_suffix': False, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1000, 'save_interval_updates': 50, 'keep_interval_updates': 1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': 1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_best_checkpoints': False, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'synchronize_checkpoints_before_copy': False, 'symlink_best_and_last_checkpoints': False, 'best_checkpoint_metric': 'nll_loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 's3_upload_path': None, 'replication_count': 1, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'stats_path': None, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(no_progress_bar=False, log_interval=100, log_format='json', log_file=None, tensorboard_logdir='out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', wandb_project=None, azureml_logging=False, seed=2, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=True, memory_efficient_fp16=True, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', 
log_nvidia_smi=False, use_tutel_moe=False, tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='inverse_sqrt', scoring='bleu', criterion='label_smoothed_cross_entropy', task='translation_multi_simple_epoch', num_workers=1, num_workers_valid=0, skip_invalid_size_inputs_valid_test=True, max_tokens=100, batch_size=None, required_batch_size_multiple=8, required_seq_len_multiple=1, dataset_impl=None, data_buffer_size=10, train_subset='train', valid_subset='valid', combine_valid_subsets=None, ignore_unused_valid_subsets=False, validate_interval=1000, validate_interval_updates=10, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, max_tokens_valid=100, batch_size_valid=None, max_valid_steps=None, curriculum=0, gen_subset='test', num_shards=1, shard_id=0, grouped_shuffling=False, update_epoch_batch_itr=False, update_ordered_indices_seed=False, distributed_world_size=1, distributed_num_procs=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, device_id=0, distributed_no_spawn=False, ddp_backend='fully_sharded', ddp_comm_hook='none', bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=False, gradient_as_bucket_view=False, fast_stat_sync=False, heartbeat_timeout=-1, broadcast_buffers=False, slowmo_momentum=None, slowmo_base_algorithm='localsgd', localsgd_frequency=3, nprocs_per_node=1, pipeline_model_parallel=False, pipeline_balance=None, pipeline_devices=None, pipeline_chunks=0, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_checkpoint='never', zero_sharding='none', no_reshard_after_forward=False, fp32_reduce_scatter=False, cpu_offload=False, use_sharded_state=False, not_fsdp_flatten_parameters=False, freeze_up_to_layer=None, arch='transformer', max_epoch=0, max_update=50, stop_time_hours=0, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[5e-05], stop_min_lr=1e-09, use_bmuf=False, train_with_epoch_remainder_batch=False, save_dir='out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', restore_file='checkpoint_last.pt', continue_once=None, finetune_from_model='/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', ignore_suffix=False, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1000, save_interval_updates=50, keep_interval_updates=1, keep_interval_updates_pattern=-1, keep_last_epochs=1, keep_best_checkpoints=-1, no_save=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_best_checkpoints=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, synchronize_checkpoints_before_copy=False, symlink_best_and_last_checkpoints=False, best_checkpoint_metric='nll_loss', maximize_best_checkpoint_metric=False, patience=-1, checkpoint_suffix='', checkpoint_shard_count=1, load_checkpoint_on_all_dp_ranks=False, write_checkpoints_asynchronously=False, s3_upload_path=None, replication_count=1, store_ema=False, ema_decay=0.9999, ema_start_update=0, ema_seed_model=None, ema_update_freq=1, ema_fp32=False, source_lang=None, target_lang=None, lang_pairs='eng_Latn-fra_Latn', keep_inference_langtok=False, one_dataset_per_batch=False, sampling_method='temperature', sampling_temperature=1.0, 
data='/home/x/projects/stopes/stopes/pipelines/prepare_data/outputs/2022-12-27/22-02-43/prepped_data_new_valid_skip_blank/data_bin/shard000', langs=['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn'], lang_dict=None, source_dict=None, target_dict=None, lang_tok_style='multilingual', load_alignments=False, left_pad_source='True', left_pad_target='False', upsample_primary=1, truncate_source=False, encoder_langtok='src', decoder_langtok=True, lang_tok_replacing_bos_eos=False, enable_lang_ids=False, enable_reservsed_directions_shared_datasets=False, extra_data=None, extra_lang_pairs=None, fixed_dictionary=None, langtoks_specs=['main'], langtoks=None, sampling_weights_from_file=None, sampling_weights=None, virtual_epoch_size=None, virtual_data_size=None, pad_to_fixed_length=False, use_local_shard_size=True, enable_m2m_validation=True, add_data_source_prefix_tags=True, add_ssl_task_tokens=False, tokens_per_sample=512, sample_break_mode='eos', mask=0.1, mask_random=0.0, insert=0.0, permute=0.0, rotate=0.0, poisson_lambda=3.0, permute_sentences=0.0, mask_length='subword', replace_length=1, ignore_mmt_main_data=False, mixed_multitask_denoising_prob=0.5, 
eval_lang_pairs=None, finetune_dict_specs=None, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.0, use_old_adam=False, fp16_adam_stats=True, block_wise=False, warmup_updates=10, warmup_init_lr=1e-07, pad=1, eos=2, unk=3, label_smoothing=0.1, report_accuracy=False, ignore_prefix_size=0, dropout=0.1, max_source_positions=512, max_target_positions=512, share_all_embeddings=True, decoder_normalize_before=True, encoder_normalize_before=True, min_params_to_wrap=100000000, encoder_layers=24, decoder_layers=24, encoder_ffn_embed_dim=8192, decoder_ffn_embed_dim=8192, encoder_embed_dim=2048, decoder_embed_dim=2048, encoder_attention_heads=16, decoder_attention_heads=16, attention_dropout=0.1, relu_dropout=0.0, no_seed_provided=False, encoder_embed_path=None, encoder_learned_pos=False, decoder_embed_path=None, decoder_learned_pos=False, activation_dropout=0.0, activation_fn='relu', adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, share_decoder_input_output_embed=False, no_token_positional_embeddings=False, adaptive_input=False, no_cross_attention=False, cross_self_attention=False, decoder_output_dim=2048, decoder_input_dim=2048, no_scale_embedding=False, layernorm_embedding=False, tie_adaptive_weights=False, checkpoint_activations=False, offload_activations=False, encoder_layers_to_keep=None, decoder_layers_to_keep=None, encoder_layerdrop=0, decoder_layerdrop=0, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, _name='transformer'), 'task': Namespace(no_progress_bar=False, log_interval=100, log_format='json', log_file=None, tensorboard_logdir='out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', wandb_project=None, azureml_logging=False, seed=2, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=True, memory_efficient_fp16=True, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', log_nvidia_smi=False, use_tutel_moe=False, tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='inverse_sqrt', scoring='bleu', criterion='label_smoothed_cross_entropy', task='translation_multi_simple_epoch', num_workers=1, num_workers_valid=0, skip_invalid_size_inputs_valid_test=True, max_tokens=100, batch_size=None, required_batch_size_multiple=8, required_seq_len_multiple=1, dataset_impl=None, data_buffer_size=10, train_subset='train', valid_subset='valid', combine_valid_subsets=None, ignore_unused_valid_subsets=False, validate_interval=1000, validate_interval_updates=10, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, max_tokens_valid=100, batch_size_valid=None, max_valid_steps=None, curriculum=0, gen_subset='test', num_shards=1, shard_id=0, grouped_shuffling=False, update_epoch_batch_itr=False, update_ordered_indices_seed=False, distributed_world_size=1, distributed_num_procs=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, device_id=0, distributed_no_spawn=False, ddp_backend='fully_sharded', ddp_comm_hook='none', 
bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=False, gradient_as_bucket_view=False, fast_stat_sync=False, heartbeat_timeout=-1, broadcast_buffers=False, slowmo_momentum=None, slowmo_base_algorithm='localsgd', localsgd_frequency=3, nprocs_per_node=1, pipeline_model_parallel=False, pipeline_balance=None, pipeline_devices=None, pipeline_chunks=0, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_checkpoint='never', zero_sharding='none', no_reshard_after_forward=False, fp32_reduce_scatter=False, cpu_offload=False, use_sharded_state=False, not_fsdp_flatten_parameters=False, freeze_up_to_layer=None, arch='transformer', max_epoch=0, max_update=50, stop_time_hours=0, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[5e-05], stop_min_lr=1e-09, use_bmuf=False, train_with_epoch_remainder_batch=False, save_dir='out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', restore_file='checkpoint_last.pt', continue_once=None, finetune_from_model='/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', ignore_suffix=False, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1000, save_interval_updates=50, keep_interval_updates=1, keep_interval_updates_pattern=-1, keep_last_epochs=1, keep_best_checkpoints=-1, no_save=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_best_checkpoints=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, synchronize_checkpoints_before_copy=False, symlink_best_and_last_checkpoints=False, best_checkpoint_metric='nll_loss', maximize_best_checkpoint_metric=False, patience=-1, checkpoint_suffix='', checkpoint_shard_count=1, load_checkpoint_on_all_dp_ranks=False, write_checkpoints_asynchronously=False, s3_upload_path=None, replication_count=1, store_ema=False, ema_decay=0.9999, ema_start_update=0, ema_seed_model=None, ema_update_freq=1, ema_fp32=False, source_lang=None, target_lang=None, lang_pairs='eng_Latn-fra_Latn', keep_inference_langtok=False, one_dataset_per_batch=False, sampling_method='temperature', sampling_temperature=1.0, data='/home/x/projects/stopes/stopes/pipelines/prepare_data/outputs/2022-12-27/22-02-43/prepped_data_new_valid_skip_blank/data_bin/shard000', langs=['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 
'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn'], lang_dict=None, source_dict=None, target_dict=None, lang_tok_style='multilingual', load_alignments=False, left_pad_source='True', left_pad_target='False', upsample_primary=1, truncate_source=False, encoder_langtok='src', decoder_langtok=True, lang_tok_replacing_bos_eos=False, enable_lang_ids=False, enable_reservsed_directions_shared_datasets=False, extra_data=None, extra_lang_pairs=None, fixed_dictionary=None, langtoks_specs=['main'], langtoks=None, sampling_weights_from_file=None, sampling_weights=None, virtual_epoch_size=None, virtual_data_size=None, pad_to_fixed_length=False, use_local_shard_size=True, enable_m2m_validation=True, add_data_source_prefix_tags=True, add_ssl_task_tokens=False, tokens_per_sample=512, sample_break_mode='eos', mask=0.1, mask_random=0.0, insert=0.0, permute=0.0, rotate=0.0, poisson_lambda=3.0, permute_sentences=0.0, mask_length='subword', replace_length=1, ignore_mmt_main_data=False, mixed_multitask_denoising_prob=0.5, eval_lang_pairs=None, finetune_dict_specs=None, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.0, use_old_adam=False, fp16_adam_stats=True, block_wise=False, warmup_updates=10, warmup_init_lr=1e-07, pad=1, eos=2, unk=3, label_smoothing=0.1, report_accuracy=False, ignore_prefix_size=0, dropout=0.1, max_source_positions=512, max_target_positions=512, share_all_embeddings=True, decoder_normalize_before=True, encoder_normalize_before=True, min_params_to_wrap=100000000, encoder_layers=24, decoder_layers=24, encoder_ffn_embed_dim=8192, decoder_ffn_embed_dim=8192, encoder_embed_dim=2048, decoder_embed_dim=2048, encoder_attention_heads=16, decoder_attention_heads=16, attention_dropout=0.1, relu_dropout=0.0, no_seed_provided=False, encoder_embed_path=None, encoder_learned_pos=False, decoder_embed_path=None, decoder_learned_pos=False, activation_dropout=0.0, activation_fn='relu', adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, share_decoder_input_output_embed=False, no_token_positional_embeddings=False, adaptive_input=False, no_cross_attention=False, cross_self_attention=False, decoder_output_dim=2048, decoder_input_dim=2048, no_scale_embedding=False, layernorm_embedding=False, tie_adaptive_weights=False, 
checkpoint_activations=False, offload_activations=False, encoder_layers_to_keep=None, decoder_layers_to_keep=None, encoder_layerdrop=0, decoder_layerdrop=0, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, _name='translation_multi_simple_epoch'), 'criterion': {'_name': 'label_smoothed_cross_entropy', 'label_smoothing': 0.1, 'report_accuracy': False, 'ignore_prefix_size': 0, 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.0, 'use_old_adam': False, 'fp16_adam_stats': True, 'tpu': False, 'lr': [5e-05], 'block_wise': False}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 10, 'warmup_init_lr': 1e-07, 'lr': [5e-05]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}}
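
Note that the dump above has checkpoint_activations=False, offload_activations=False, and cpu_offload=False. If memory is the bottleneck, enabling some of these is a plausible thing to try (a sketch; the flags mirror the config keys shown above, and cpu_offload requires the fully_sharded ddp backend already in use here):

python train.py ... --checkpoint-activations --offload-activations

or, to move parameters (and with them much of the optimizer step) off the GPU under FSDP:

python train.py ... --cpu-offload

Activation checkpointing mainly cuts activation memory, though, so it may not be enough on its own when the OOM happens inside the optimizer step itself.
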
FayZ676 commented 1 year ago

For reference, I tried fine-tuning GPT-NeoX-20B on my setup (4x 3090s) and was told by the devs that I needed at least 13 bytes of memory per parameter. The largest model I could successfully fine-tune was the 2B-parameter one.

It looks like you're using the config for a 3.3B-parameter model on a single 3090, so you may simply not have enough memory to fine-tune models larger than 600M.

I don't know for sure, so if someone can confirm the memory requirements for fairseq, that would be great.
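
Using that 13-bytes-per-parameter rule of thumb (a rough estimate carried over from the GPT-NeoX numbers above, not a confirmed fairseq figure): 3.3e9 x 13 bytes ≈ 43 GB, far above the 23.7 GiB of a single 3090, while 600e6 x 13 bytes ≈ 7.8 GB fits comfortably — consistent with what zgerrard observed.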

edvardasast commented 1 year ago

@zgerrard Hi, do you have a step-by-step tutorial on how to fine-tune the 600M model? It would be really helpful for me. Could you share your fine-tuning project in a git repository?

yugaljain1999 commented 1 year ago

@edvardasast Did you find any git repository for finetuning?

edvardasast commented 1 year ago

@edvardasast Did you find any git repository for finetuning?

unfortunately not :( I successfully preprocessed the data with this command:

python preprocess.py -s eng_Latn -t deu_Latn --task multilingual_translation --trainpref my_dataset/train --destdir processed_data --validpref my_dataset/train --testpref my_dataset/train

But when I try to fine-tune with:

python train.py processed_data --task multilingual_translation --arch multilingual_transformer --save-dir fine_tuned_model --finetune-from-model model_checkpoints/checkpoint.pt --lang-pairs eng_Latn-deu_Latn --max-tokens 4096

I get this error: Exception: Cannot load model parameters from checkpoint model_checkpoints/checkpoint.pt; please ensure that the architectures match.
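
One thing stands out here (an observation, not a verified fix): the config dump earlier in this thread shows the NLLB checkpoints were trained with task='translation_multi_simple_epoch' and arch='transformer', not multilingual_transformer, so loading them with --task multilingual_translation --arch multilingual_transformer builds a different module structure and would trigger exactly this "architectures match" error. A command closer to the following shape may be needed (a sketch; the trailing "..." stands for the remaining NLLB-specific options, such as the full --langs list visible in the dump above):

python train.py processed_data --task translation_multi_simple_epoch --arch transformer --finetune-from-model model_checkpoints/checkpoint.pt --lang-pairs eng_Latn-deu_Latn ...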

robotsp commented 1 year ago

@edvardasast Would you please share your full steps for fine-tuning NLLB? Thanks!

martinbombin commented 1 year ago

I am getting the same error. It seems that training uses the vocabulary built from my data instead of the vocabulary of the pretrained NLLB model, which gives the model a different number of parameters than the checkpoint.
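
If the mismatch really does come from the vocabulary, one way to keep the pretrained vocab is to hand the NLLB dictionary to preprocessing instead of letting preprocess.py build a new one (a sketch; dictionary.txt is a placeholder for whatever dictionary file ships with the NLLB release you downloaded):

python preprocess.py -s eng_Latn -t deu_Latn --trainpref my_dataset/train --validpref my_dataset/valid --destdir processed_data --srcdict model_checkpoints/dictionary.txt --tgtdict model_checkpoints/dictionary.txt

--srcdict and --tgtdict are standard fairseq preprocessing options; since NLLB uses a single shared vocabulary (share_all_embeddings=True in the config dump above), the same file goes to both.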

zhanbaohang commented 5 months ago

Where is the code for fine-tuning the NLLB model? Thanks.