Open zgerrard opened 1 year ago
What were your hyperparameter settings?
@FayZ676 I used default parameters from nllb200_dense3.3B_finetune_on_fbseed.yaml
, just changed the dataset path. Also, tried changing the max_tokens
to a smaller number, but it didn’t fix the error.
All hyperparameters:
{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': 'out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', 'wandb_project': None, 'azureml_logging': False, 'seed': 2, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False, 'moe_generation': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'fully_sharded', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False, 'freeze_up_to_layer': None}, 'dataset': {'_name': None, 'num_workers': 1, 'num_workers_valid': 0, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': 100, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1000, 'validate_interval_updates': 10, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 100, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [5e-05], 'stop_min_lr': 1e-09, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': 'out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': '/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', 'ignore_suffix': False, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1000, 'save_interval_updates': 50, 'keep_interval_updates': 1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': 1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_best_checkpoints': False, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'synchronize_checkpoints_before_copy': False, 'symlink_best_and_last_checkpoints': False, 'best_checkpoint_metric': 'nll_loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 's3_upload_path': None, 'replication_count': 1, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'stats_path': None, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(no_progress_bar=False, log_interval=100, log_format='json', log_file=None, tensorboard_logdir='out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', wandb_project=None, azureml_logging=False, seed=2, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=True, memory_efficient_fp16=True, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', log_nvidia_smi=False, use_tutel_moe=False, tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='inverse_sqrt', scoring='bleu', criterion='label_smoothed_cross_entropy', task='translation_multi_simple_epoch', num_workers=1, num_workers_valid=0, skip_invalid_size_inputs_valid_test=True, max_tokens=100, batch_size=None, required_batch_size_multiple=8, required_seq_len_multiple=1, dataset_impl=None, data_buffer_size=10, train_subset='train', valid_subset='valid', combine_valid_subsets=None, ignore_unused_valid_subsets=False, validate_interval=1000, validate_interval_updates=10, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, max_tokens_valid=100, batch_size_valid=None, max_valid_steps=None, curriculum=0, gen_subset='test', num_shards=1, shard_id=0, grouped_shuffling=False, update_epoch_batch_itr=False, update_ordered_indices_seed=False, distributed_world_size=1, distributed_num_procs=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, device_id=0, distributed_no_spawn=False, ddp_backend='fully_sharded', ddp_comm_hook='none', bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=False, gradient_as_bucket_view=False, fast_stat_sync=False, heartbeat_timeout=-1, broadcast_buffers=False, slowmo_momentum=None, slowmo_base_algorithm='localsgd', localsgd_frequency=3, nprocs_per_node=1, pipeline_model_parallel=False, pipeline_balance=None, pipeline_devices=None, pipeline_chunks=0, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_checkpoint='never', zero_sharding='none', no_reshard_after_forward=False, fp32_reduce_scatter=False, cpu_offload=False, use_sharded_state=False, not_fsdp_flatten_parameters=False, freeze_up_to_layer=None, arch='transformer', max_epoch=0, max_update=50, stop_time_hours=0, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[5e-05], stop_min_lr=1e-09, use_bmuf=False, train_with_epoch_remainder_batch=False, save_dir='out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', restore_file='checkpoint_last.pt', continue_once=None, finetune_from_model='/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', ignore_suffix=False, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1000, save_interval_updates=50, keep_interval_updates=1, keep_interval_updates_pattern=-1, keep_last_epochs=1, keep_best_checkpoints=-1, no_save=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_best_checkpoints=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, synchronize_checkpoints_before_copy=False, symlink_best_and_last_checkpoints=False, best_checkpoint_metric='nll_loss', maximize_best_checkpoint_metric=False, patience=-1, checkpoint_suffix='', checkpoint_shard_count=1, load_checkpoint_on_all_dp_ranks=False, write_checkpoints_asynchronously=False, s3_upload_path=None, replication_count=1, store_ema=False, ema_decay=0.9999, ema_start_update=0, ema_seed_model=None, ema_update_freq=1, ema_fp32=False, source_lang=None, target_lang=None, lang_pairs='eng_Latn-fra_Latn', keep_inference_langtok=False, one_dataset_per_batch=False, sampling_method='temperature', sampling_temperature=1.0, data='/home/x/projects/stopes/stopes/pipelines/prepare_data/outputs/2022-12-27/22-02-43/prepped_data_new_valid_skip_blank/data_bin/shard000', langs=['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn'], lang_dict=None, source_dict=None, target_dict=None, lang_tok_style='multilingual', load_alignments=False, left_pad_source='True', left_pad_target='False', upsample_primary=1, truncate_source=False, encoder_langtok='src', decoder_langtok=True, lang_tok_replacing_bos_eos=False, enable_lang_ids=False, enable_reservsed_directions_shared_datasets=False, extra_data=None, extra_lang_pairs=None, fixed_dictionary=None, langtoks_specs=['main'], langtoks=None, sampling_weights_from_file=None, sampling_weights=None, virtual_epoch_size=None, virtual_data_size=None, pad_to_fixed_length=False, use_local_shard_size=True, enable_m2m_validation=True, add_data_source_prefix_tags=True, add_ssl_task_tokens=False, tokens_per_sample=512, sample_break_mode='eos', mask=0.1, mask_random=0.0, insert=0.0, permute=0.0, rotate=0.0, poisson_lambda=3.0, permute_sentences=0.0, mask_length='subword', replace_length=1, ignore_mmt_main_data=False, mixed_multitask_denoising_prob=0.5, eval_lang_pairs=None, finetune_dict_specs=None, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.0, use_old_adam=False, fp16_adam_stats=True, block_wise=False, warmup_updates=10, warmup_init_lr=1e-07, pad=1, eos=2, unk=3, label_smoothing=0.1, report_accuracy=False, ignore_prefix_size=0, dropout=0.1, max_source_positions=512, max_target_positions=512, share_all_embeddings=True, decoder_normalize_before=True, encoder_normalize_before=True, min_params_to_wrap=100000000, encoder_layers=24, decoder_layers=24, encoder_ffn_embed_dim=8192, decoder_ffn_embed_dim=8192, encoder_embed_dim=2048, decoder_embed_dim=2048, encoder_attention_heads=16, decoder_attention_heads=16, attention_dropout=0.1, relu_dropout=0.0, no_seed_provided=False, encoder_embed_path=None, encoder_learned_pos=False, decoder_embed_path=None, decoder_learned_pos=False, activation_dropout=0.0, activation_fn='relu', adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, share_decoder_input_output_embed=False, no_token_positional_embeddings=False, adaptive_input=False, no_cross_attention=False, cross_self_attention=False, decoder_output_dim=2048, decoder_input_dim=2048, no_scale_embedding=False, layernorm_embedding=False, tie_adaptive_weights=False, checkpoint_activations=False, offload_activations=False, encoder_layers_to_keep=None, decoder_layers_to_keep=None, encoder_layerdrop=0, decoder_layerdrop=0, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, _name='transformer'), 'task': Namespace(no_progress_bar=False, log_interval=100, log_format='json', log_file=None, tensorboard_logdir='out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', wandb_project=None, azureml_logging=False, seed=2, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=True, memory_efficient_fp16=True, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', log_nvidia_smi=False, use_tutel_moe=False, tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='inverse_sqrt', scoring='bleu', criterion='label_smoothed_cross_entropy', task='translation_multi_simple_epoch', num_workers=1, num_workers_valid=0, skip_invalid_size_inputs_valid_test=True, max_tokens=100, batch_size=None, required_batch_size_multiple=8, required_seq_len_multiple=1, dataset_impl=None, data_buffer_size=10, train_subset='train', valid_subset='valid', combine_valid_subsets=None, ignore_unused_valid_subsets=False, validate_interval=1000, validate_interval_updates=10, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, max_tokens_valid=100, batch_size_valid=None, max_valid_steps=None, curriculum=0, gen_subset='test', num_shards=1, shard_id=0, grouped_shuffling=False, update_epoch_batch_itr=False, update_ordered_indices_seed=False, distributed_world_size=1, distributed_num_procs=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, device_id=0, distributed_no_spawn=False, ddp_backend='fully_sharded', ddp_comm_hook='none', bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=False, gradient_as_bucket_view=False, fast_stat_sync=False, heartbeat_timeout=-1, broadcast_buffers=False, slowmo_momentum=None, slowmo_base_algorithm='localsgd', localsgd_frequency=3, nprocs_per_node=1, pipeline_model_parallel=False, pipeline_balance=None, pipeline_devices=None, pipeline_chunks=0, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_checkpoint='never', zero_sharding='none', no_reshard_after_forward=False, fp32_reduce_scatter=False, cpu_offload=False, use_sharded_state=False, not_fsdp_flatten_parameters=False, freeze_up_to_layer=None, arch='transformer', max_epoch=0, max_update=50, stop_time_hours=0, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[5e-05], stop_min_lr=1e-09, use_bmuf=False, train_with_epoch_remainder_batch=False, save_dir='out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', restore_file='checkpoint_last.pt', continue_once=None, finetune_from_model='/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', ignore_suffix=False, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1000, save_interval_updates=50, keep_interval_updates=1, keep_interval_updates_pattern=-1, keep_last_epochs=1, keep_best_checkpoints=-1, no_save=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_best_checkpoints=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, synchronize_checkpoints_before_copy=False, symlink_best_and_last_checkpoints=False, best_checkpoint_metric='nll_loss', maximize_best_checkpoint_metric=False, patience=-1, checkpoint_suffix='', checkpoint_shard_count=1, load_checkpoint_on_all_dp_ranks=False, write_checkpoints_asynchronously=False, s3_upload_path=None, replication_count=1, store_ema=False, ema_decay=0.9999, ema_start_update=0, ema_seed_model=None, ema_update_freq=1, ema_fp32=False, source_lang=None, target_lang=None, lang_pairs='eng_Latn-fra_Latn', keep_inference_langtok=False, one_dataset_per_batch=False, sampling_method='temperature', sampling_temperature=1.0, data='/home/x/projects/stopes/stopes/pipelines/prepare_data/outputs/2022-12-27/22-02-43/prepped_data_new_valid_skip_blank/data_bin/shard000', langs=['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn'], lang_dict=None, source_dict=None, target_dict=None, lang_tok_style='multilingual', load_alignments=False, left_pad_source='True', left_pad_target='False', upsample_primary=1, truncate_source=False, encoder_langtok='src', decoder_langtok=True, lang_tok_replacing_bos_eos=False, enable_lang_ids=False, enable_reservsed_directions_shared_datasets=False, extra_data=None, extra_lang_pairs=None, fixed_dictionary=None, langtoks_specs=['main'], langtoks=None, sampling_weights_from_file=None, sampling_weights=None, virtual_epoch_size=None, virtual_data_size=None, pad_to_fixed_length=False, use_local_shard_size=True, enable_m2m_validation=True, add_data_source_prefix_tags=True, add_ssl_task_tokens=False, tokens_per_sample=512, sample_break_mode='eos', mask=0.1, mask_random=0.0, insert=0.0, permute=0.0, rotate=0.0, poisson_lambda=3.0, permute_sentences=0.0, mask_length='subword', replace_length=1, ignore_mmt_main_data=False, mixed_multitask_denoising_prob=0.5, eval_lang_pairs=None, finetune_dict_specs=None, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.0, use_old_adam=False, fp16_adam_stats=True, block_wise=False, warmup_updates=10, warmup_init_lr=1e-07, pad=1, eos=2, unk=3, label_smoothing=0.1, report_accuracy=False, ignore_prefix_size=0, dropout=0.1, max_source_positions=512, max_target_positions=512, share_all_embeddings=True, decoder_normalize_before=True, encoder_normalize_before=True, min_params_to_wrap=100000000, encoder_layers=24, decoder_layers=24, encoder_ffn_embed_dim=8192, decoder_ffn_embed_dim=8192, encoder_embed_dim=2048, decoder_embed_dim=2048, encoder_attention_heads=16, decoder_attention_heads=16, attention_dropout=0.1, relu_dropout=0.0, no_seed_provided=False, encoder_embed_path=None, encoder_learned_pos=False, decoder_embed_path=None, decoder_learned_pos=False, activation_dropout=0.0, activation_fn='relu', adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, share_decoder_input_output_embed=False, no_token_positional_embeddings=False, adaptive_input=False, no_cross_attention=False, cross_self_attention=False, decoder_output_dim=2048, decoder_input_dim=2048, no_scale_embedding=False, layernorm_embedding=False, tie_adaptive_weights=False, checkpoint_activations=False, offload_activations=False, encoder_layers_to_keep=None, decoder_layers_to_keep=None, encoder_layerdrop=0, decoder_layerdrop=0, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, _name='translation_multi_simple_epoch'), 'criterion': {'_name': 'label_smoothed_cross_entropy', 'label_smoothing': 0.1, 'report_accuracy': False, 'ignore_prefix_size': 0, 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.0, 'use_old_adam': False, 'fp16_adam_stats': True, 'tpu': False, 'lr': [5e-05], 'block_wise': False}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 10, 'warmup_init_lr': 1e-07, 'lr': [5e-05]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}}
For reference, I tried finetuning GPT-NeoX-20B on my setup (4x 3090's) and was told by the devs that I needed at least 13 Bytes of memory per parameter. The largest model I could successfully fine tune was up to the 2B parameter model.
It looks like youre using the config for a 3.3B param model on one 3090 so you may just not have enough memory to fine tune model's larger than 600M????
I don't know for sure, so if someone can confirm the memory requirements for Fairseq that would be great actually.
@zgerrard Hi, maybe you have step by step tutorial how to finetune 600M data model it will be really helpful for me? Could you share your finetune project via your git repository?
@edvardasast Did you find any git repository for finetuning?
@edvardasast Did you find any git repository for finetuning?
unfortunately not :( I have successfuly preprocessed data by using this command: python preprocess.py -s eng_Latn -t deu_Latn --task multilingual_translation --trainpref my_dataset/train --destdir processed_data --validpref my_dataset/train --testpref my_dataset/train But when I try to finetune with command: python train.py processed_data --task multilingual_translation --arch multilingual_transformer --save-dir fine_tuned_model --finetune-from-model model_checkpoints/checkpoint.pt --lang-pairs eng_Latn-deu_Latn --max-tokens 4096 I am getting this error: Exception: Cannot load model parameters from checkpoint model_checkpoints/checkpoint.pt; please ensure that the architectures match.
@edvardasast
@edvardasast Did you find any git repository for finetuning?
unfortunately not :( I have successfuly preprocessed data by using this command: python preprocess.py -s eng_Latn -t deu_Latn --task multilingual_translation --trainpref my_dataset/train --destdir processed_data --validpref my_dataset/train --testpref my_dataset/train But when I try to finetune with command: python train.py processed_data --task multilingual_translation --arch multilingual_transformer --save-dir fine_tuned_model --finetune-from-model model_checkpoints/checkpoint.pt --lang-pairs eng_Latn-deu_Latn --max-tokens 4096 I am getting this error: Exception: Cannot load model parameters from checkpoint model_checkpoints/checkpoint.pt; please ensure that the architectures match.
@edvardasast Would you please share your whole steps on finetuning nllb? Thanks!
@edvardasast Did you find any git repository for finetuning?
unfortunately not :( I have successfuly preprocessed data by using this command: python preprocess.py -s eng_Latn -t deu_Latn --task multilingual_translation --trainpref my_dataset/train --destdir processed_data --validpref my_dataset/train --testpref my_dataset/train But when I try to finetune with command: python train.py processed_data --task multilingual_translation --arch multilingual_transformer --save-dir fine_tuned_model --finetune-from-model model_checkpoints/checkpoint.pt --lang-pairs eng_Latn-deu_Latn --max-tokens 4096 I am getting this error: Exception: Cannot load model parameters from checkpoint model_checkpoints/checkpoint.pt; please ensure that the architectures match.
I am getting the same error, it seems that it is using the vocab of my data instead of the vocab of the NLLB trained model. That makes model have a different number of parameters.
Where is the code for fine-tuning the nllb model? ,thanks
❓ Questions and Help
What is your question?
Hi, I am getting "OOM during optimization, irrecoverable" when trying to fine-tune the 3.3B parameter NLLB model.
Stack trace:
Any ideas? Any help will be greatly appreciated.
What have you tried?
Tried fine-tuning smaller models and only the 600M param. (smallest) model didn't cause the error above.
What's your environment?