NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Unable to process dataset for NeMo-Megatron-GPT 20B model #5542

Closed · mayank-nference closed 1 year ago

mayank-nference commented 1 year ago

Describe the bug

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/mayanksharma/nemo_gpt_fintuning.py:261 in <module>                                         │
│                                                                                                  │
│   258                                                                                            │
│   259                                                                                            │
│   260 if __name__ == '__main__':                                                                 │
│ ❱ 261 │   main()                                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py:104 in      │
│ wrapper                                                                                          │
│                                                                                                  │
│   101 │   │   │   │                                                                              │
│   102 │   │   │   │   # no return value from run_hydra() as it may sometime actually run the t   │
│   103 │   │   │   │   # multiple times (--multirun)                                              │
│ ❱ 104 │   │   │   │   _run_hydra(                                                                │
│   105 │   │   │   │   │   args_parser=_argparse_wrapper(args),                                   │
│   106 │   │   │   │   │   task_function=task_function,                                           │
│   107 │   │   │   │   │   config_path=config_path,                                               │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:377 in _run_hydra   │
│                                                                                                  │
│   374 │   │   if num_commands == 0:                                                              │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│ ❱ 377 │   │   │   run_and_report(                                                                │
│   378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:214 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│ ❱ 214 │   │   │   raise ex                                                                       │
│   215 │   │   else:                                                                              │
│   216 │   │   │   try:                                                                           │
│   217 │   │   │   │   if isinstance(ex, CompactHydraException):                                  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:211 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   208                                                                                            │
│   209 def run_and_report(func: Any) -> Any:                                                      │
│   210 │   try:                                                                                   │
│ ❱ 211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│   214 │   │   │   raise ex                                                                       │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:378 in <lambda>     │
│                                                                                                  │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│   377 │   │   │   run_and_report(                                                                │
│ ❱ 378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│   381 │   │   │   │   │   overrides=args.overrides,                                              │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/hydra.py:111 in run          │
│                                                                                                  │
│   108 │   │   callbacks.on_run_end(config=cfg, config_name=config_name, job_return=ret)          │
│   109 │   │                                                                                      │
│   110 │   │   # access the result to trigger an exception in case the job failed.                │
│ ❱ 111 │   │   _ = ret.return_value                                                               │
│   112 │   │                                                                                      │
│   113 │   │   return ret                                                                         │
│   114                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:233 in return_value      │
│                                                                                                  │
│   230 │   │   │   sys.stderr.write(                                                              │
│   231 │   │   │   │   f"Error executing job with overrides: {self.overrides}" + os.linesep       │
│   232 │   │   │   )                                                                              │
│ ❱ 233 │   │   │   raise self._return_value                                                       │
│   234 │                                                                                          │
│   235 │   @return_value.setter                                                                   │
│   236 │   def return_value(self, value: Any) -> None:                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:160 in run_job           │
│                                                                                                  │
│   157 │   │   with env_override(hydra_cfg.hydra.job.env_set):                                    │
│   158 │   │   │   callbacks.on_job_start(config=config)                                          │
│   159 │   │   │   try:                                                                           │
│ ❱ 160 │   │   │   │   ret.return_value = task_function(task_cfg)                                 │
│   161 │   │   │   │   ret.status = JobStatus.COMPLETED                                           │
│   162 │   │   │   except Exception as e:                                                         │
│   163 │   │   │   │   ret.return_value = e                                                       │
│                                                                                                  │
│ /home/mayanksharma/nemo_gpt_fintuning.py:256 in main                                             │
│                                                                                                  │
│   253 │                                                                                          │
│   254 │   model = MegatronGPTModel(cfg.model, trainer)                                           │
│   255 │                                                                                          │
│ ❱ 256 │   trainer.fit(model)                                                                     │
│   257 │   # trainer.fit(model, train_dataloaders=train_dataloader, val_dataloaders=valid_datal   │
│   258                                                                                            │
│   259                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:696 in  │
│ fit                                                                                              │
│                                                                                                  │
│    693 │   │   │   datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.Lightn  │
│    694 │   │   """                                                                               │
│    695 │   │   self.strategy.model = model                                                       │
│ ❱  696 │   │   self._call_and_handle_interrupt(                                                  │
│    697 │   │   │   self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_  │
│    698 │   │   )                                                                                 │
│    699                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:648 in  │
│ _call_and_handle_interrupt                                                                       │
│                                                                                                  │
│    645 │   │   """                                                                               │
│    646 │   │   try:                                                                              │
│    647 │   │   │   if self.strategy.launcher is not None:                                        │
│ ❱  648 │   │   │   │   return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **  │
│    649 │   │   │   else:                                                                         │
│    650 │   │   │   │   return trainer_fn(*args, **kwargs)                                        │
│    651 │   │   # TODO(awaelchli): Unify both exceptions below, where `KeyboardError` doesn't re  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subpr │
│ ocess_script.py:93 in launch                                                                     │
│                                                                                                  │
│    90 │   │   """                                                                                │
│    91 │   │   if not self.cluster_environment.creates_processes_externally:                      │
│    92 │   │   │   self._call_children_scripts()                                                  │
│ ❱  93 │   │   return function(*args, **kwargs)                                                   │
│    94 │                                                                                          │
│    95 │   def _call_children_scripts(self) -> None:                                              │
│    96 │   │   # bookkeeping of spawned processes                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:735 in  │
│ _fit_impl                                                                                        │
│                                                                                                  │
│    732 │   │   self._ckpt_path = self.__set_ckpt_path(                                           │
│    733 │   │   │   ckpt_path, model_provided=True, model_connected=self.lightning_module is not  │
│    734 │   │   )                                                                                 │
│ ❱  735 │   │   results = self._run(model, ckpt_path=self.ckpt_path)                              │
│    736 │   │                                                                                     │
│    737 │   │   assert self.state.stopped                                                         │
│    738 │   │   self.training = False                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1105 in │
│ _run                                                                                             │
│                                                                                                  │
│   1102 │   │   self.strategy.setup_environment()                                                 │
│   1103 │   │   self.__setup_profiler()                                                           │
│   1104 │   │                                                                                     │
│ ❱ 1105 │   │   self._call_setup_hook()  # allow user to setup lightning_module in accelerator e  │
│   1106 │   │                                                                                     │
│   1107 │   │   # check if we should delay restoring checkpoint till later                        │
│   1108 │   │   if not self.strategy.restore_checkpoint_after_setup:                              │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1449 in │
│ _call_setup_hook                                                                                 │
│                                                                                                  │
│   1446 │   │   if self.datamodule is not None:                                                   │
│   1447 │   │   │   self._call_lightning_datamodule_hook("setup", stage=fn)                       │
│   1448 │   │   self._call_callback_hooks("setup", stage=fn)                                      │
│ ❱ 1449 │   │   self._call_lightning_module_hook("setup", stage=fn)                               │
│   1450 │   │                                                                                     │
│   1451 │   │   self.strategy.barrier("post_setup")                                               │
│   1452                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1550 in │
│ _call_lightning_module_hook                                                                      │
│                                                                                                  │
│   1547 │   │   pl_module._current_fx_name = hook_name                                            │
│   1548 │   │                                                                                     │
│   1549 │   │   with self.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{ho  │
│ ❱ 1550 │   │   │   output = fn(*args, **kwargs)                                                  │
│   1551 │   │                                                                                     │
│   1552 │   │   # restore current_fx when nested context                                          │
│   1553 │   │   pl_module._current_fx_name = prev_fx_name                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modelin │
│ g/megatron_gpt_model.py:604 in setup                                                             │
│                                                                                                  │
│   601 │   │   else:                                                                              │
│   602 │   │   │   # TODO: consider adding a ModelPT guard to check if model is being restored.   │
│   603 │   │   │   # allowing restored models to optionally setup datasets                        │
│ ❱ 604 │   │   │   self.build_train_valid_test_datasets()                                         │
│   605 │   │   │   self.setup_training_data(self.cfg.data)                                        │
│   606 │   │   │   self.setup_validation_data(self.cfg.data)                                      │
│   607 │   │   │   self.setup_test_data(self.cfg.data)                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modelin │
│ g/megatron_gpt_model.py:519 in build_train_valid_test_datasets                                   │
│                                                                                                  │
│   516 │   │   │   eval_iters * global_batch_size,                                                │
│   517 │   │   │   test_iters * global_batch_size,                                                │
│   518 │   │   ]                                                                                  │
│ ❱ 519 │   │   self._train_ds, self._validation_ds, self._test_ds = build_train_valid_test_data   │
│   520 │   │   │   cfg=self.cfg,                                                                  │
│   521 │   │   │   trainer=self.trainer,                                                          │
│   522 │   │   │   data_prefix=self.cfg.data.data_prefix,                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/ │
│ megatron/gpt_dataset.py:81 in build_train_valid_test_datasets                                    │
│                                                                                                  │
│    78 │   valid_datasets = []                                                                    │
│    79 │   test_datasets = []                                                                     │
│    80 │   for i in range(len(prefixes)):                                                         │
│ ❱  81 │   │   train_ds, valid_ds, test_ds = _build_train_valid_test_datasets(                    │
│    82 │   │   │   cfg,                                                                           │
│    83 │   │   │   trainer,                                                                       │
│    84 │   │   │   prefixes[i],                                                                   │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/ │
│ megatron/gpt_dataset.py:131 in _build_train_valid_test_datasets                                  │
│                                                                                                  │
│   128 │   """Build train, valid, and test datasets."""                                           │
│   129 │                                                                                          │
│   130 │   # Indexed dataset.                                                                     │
│ ❱ 131 │   indexed_dataset = get_indexed_dataset_(data_prefix, data_impl, skip_warmup)            │
│   132 │                                                                                          │
│   133 │   total_num_of_documents = indexed_dataset.sizes.shape[0]                                │
│   134 │   splits = get_train_valid_test_split_(splits_string, total_num_of_documents)            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/ │
│ megatron/gpt_dataset.py:183 in get_indexed_dataset_                                              │
│                                                                                                  │
│   180 │   indexed_dataset = make_indexed_dataset(data_prefix, data_impl, skip_warmup)            │
│   181 │   logging.info(f"indexed_dataset: {indexed_dataset.__dict__}")                           │
│   182 │   logging.info(' > finished creating indexed dataset in {:4f} ' 'seconds'.format(time.   │
│ ❱ 183 │   logging.info('    number of documents: {}'.format(indexed_dataset.sizes.shape[0]))     │
│   184 │                                                                                          │
│   185 │   return indexed_dataset                                                                 │
│   186                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'CSVMemMapDataset' object has no attribute 'sizes'

Steps/Code to reproduce bug

I am using a script borrowed from https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/megatron_gpt_pretraining.py

# imports as in the v1.12 example script this is based on (trimmed to what is used here)
from omegaconf.omegaconf import OmegaConf, open_dict
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.timer import Timer
from pytorch_lightning.plugins.environments import TorchElasticEnvironment
from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector

from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.collections.nlp.parts.nlp_overrides import (
    GradScaler,
    MegatronHalfPrecisionPlugin,
    NLPDDPStrategy,
    PipelineMixedPrecisionPlugin,
)
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import StatelessTimer, exp_manager


@hydra_runner(config_path="conf", config_name="megatron_gpt_config.yaml")
def main(cfg) -> None:
    logging.info("\n\n************** Experiment configuration ***********")
    logging.info(f'\n{OmegaConf.to_yaml(cfg)}')

    cfg.model.data.data_prefix=[0.5,"/home/mayanksharma/nemo_data/small_train", 0.5, "/home/mayanksharma/nemo_data/small_valid", 0.5, "/home/mayanksharma/nemo_data/small_test"]
    cfg.model.data.seq_length=128
    cfg.model.data.data_impl="csv_mmap"
    cfg.trainer.devices=1

    cfg.model.sequence_parallel=True
    cfg.model.global_batch_size=16

    megatron_amp_o2 = cfg.model.get('megatron_amp_O2', False)
    with_distributed_adam = cfg.model.optim.get('name') == 'distributed_fused_adam'

    plugins = []
    strategy = NLPDDPStrategy(
        no_ddp_communication_hook=True,  # we don't use DDP for async grad allreduce
        gradient_as_bucket_view=cfg.model.gradient_as_bucket_view,
        find_unused_parameters=False,
    )
    if cfg.trainer.precision in [16, 'bf16']:
        scaler = None
        if cfg.trainer.precision == 16:
            scaler = GradScaler(
                init_scale=cfg.model.get('native_amp_init_scale', 2 ** 32),
                growth_interval=cfg.model.get('native_amp_growth_interval', 1000),
                hysteresis=cfg.model.get('hysteresis', 2),
            )
        if megatron_amp_o2 and not with_distributed_adam:
            plugins.append(MegatronHalfPrecisionPlugin(precision=cfg.trainer.precision, device='cuda', scaler=scaler))
        else:
            plugins.append(PipelineMixedPrecisionPlugin(precision=cfg.trainer.precision, device='cuda', scaler=scaler))

    if cfg.get('cluster_type', None) == 'BCP':
        plugins.append(TorchElasticEnvironment())

    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer)

    exp_manager(trainer, cfg.exp_manager)

    # update resume from checkpoint found by exp_manager
    if cfg.model.resume_from_checkpoint is not None:
        resume_from_checkpoint = cfg.model.resume_from_checkpoint
    else:
        resume_from_checkpoint = trainer._checkpoint_connector.resume_from_checkpoint_fit_path

    logging.info(f'Resuming training from checkpoint: {resume_from_checkpoint}')

    trainer._checkpoint_connector = CheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
    # Override timer callback to a stateless one
    for idx, callback in enumerate(trainer.callbacks):
        if isinstance(callback, Timer):
            trainer.callbacks[idx] = StatelessTimer(cfg.trainer.max_time,)

    # hydra interpolation does not work here as the interpolation key is lost when PTL saves hparams
    with open_dict(cfg):
        cfg.model.precision = cfg.trainer.precision

    model = MegatronGPTModel(cfg.model, trainer)

    trainer.fit(model)

if __name__ == '__main__':
    main()

Config file that I am using

I have borrowed the config from https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml

name: megatron_gpt_20B
restore_from_path: "nemo_model/nemo-megatron-gpt-20B/nemo_gpt20B_bf16_tp4.nemo" # used when starting from a .nemo file

trainer:
  devices: 2
  num_nodes: 1
  accelerator: gpu
  precision: 16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  replace_sampler_ddp: False
  max_epochs: -1 # PTL default. In practice, max_steps will be reached first. 
  max_steps: 100000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10
  val_check_interval: 100
  limit_val_batches: 50
  limit_test_batches: 500
  accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models
  gradient_clip_val: 1.0
  benchmark: False

exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: megatron_gpt
  create_wandb_logger: False
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits
    filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}

model:
  # specify micro_batch_size, global_batch_size, and model parallelism
  # gradient accumulation will be done automatically based on data_parallel_size
  micro_batch_size: 4 # limited by GPU memory
  global_batch_size: 8 # will use more micro batches to reach global batch size
  tensor_model_parallel_size: 1 # intra-layer model parallelism
  pipeline_model_parallel_size: 1 # inter-layer model parallelism
  resume_from_checkpoint: null # manually set the checkpoint file to load from

  # model architecture
  encoder_seq_length: 512
  max_position_embeddings: 1024 #${.encoder_seq_length} mayank changed
  num_layers: 12
  hidden_size: 768
  ffn_hidden_size: 3072 # Transformer FFN hidden size. Usually 4 * hidden_size.
  num_attention_heads: 12
  init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.')
  use_scaled_init_method: True # use scaled residuals initialization
  hidden_dropout: 0.1 # Dropout probability for hidden state transformer.
  kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null
  apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.
  normalization: layernorm # Type of normalization layers
  layernorm_epsilon: 1e-5
  do_layer_norm_weight_decay: False # True means weight decay on all params
  make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency.
  pre_process: True # add embedding
  post_process: True # add pooler
  persist_layer_norm: True # Use of persistent fused layer norm kernel.
  bert_binary_head: 2 #used for classification

  tokenizer:
    library: 'huggingface'
    type: 'gpt2'
    model: null
    vocab_file: null
    merge_file: null 
    delimiter: null # only used for tabular tokenizer
    sentencepiece_legacy: True # Legacy=True allows you to add special tokens to sentencepiece tokenizers.

  # precision
  native_amp_init_scale: 4294967296 # 2 ** 32
  native_amp_growth_interval: 1000
  hysteresis: 2 # Gradient scale hysteresis
  fp32_residual_connection: False # Move residual connections to fp32
  fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16

  # Megatron O2-style half-precision
  megatron_amp_O2: False # Enable O2-level automatic mixed precision using main parameters
  grad_allreduce_chunk_size_mb: 125
  grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce

  # miscellaneous
  seed: 1234
  use_cpu_initialization: False # Init weights on the CPU (slow for large models)
  onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
  apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this
  gradient_as_bucket_view: True # PyTorch DDP argument. Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory)
  gradient_accumulation_fusion: False # Fuse weight gradient accumulation to GEMMs. Only used with pipeline parallelism.

  ## Activation Checkpointing
  # NeMo Megatron supports 'selective' activation checkpointing where only the memory intensive part of attention is checkpointed.
  # These memory intensive activations are also less compute intensive which makes activation checkpointing more efficient for LLMs (20B+).
  # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
  # 'full' will checkpoint the entire transformer layer.
  activations_checkpoint_granularity: null # 'selective' or 'full' 
  activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
  # 'uniform' divides the total number of transformer layers and checkpoints the input activation
  # of each chunk at the specified granularity
  # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
  activations_checkpoint_num_layers: null # not used with 'selective'
  # when using 'uniform' this creates groups of transformer layers to checkpoint. Usually set to 1. Increase to save more memory.
  # when using 'block' this this will checkpoint the first activations_checkpoint_num_layers per pipeline stage.

  ## Sequence Parallelism
  # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
  # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
  sequence_parallel: True

  data:
    # Path to data must be specified by the user.
    # can override from the CLI: "model.data.data_prefix=[.5,/raid/data/pile/my-gpt3_00_text_document,.5,/raid/data/pile/my-gpt3_01_text_document]",
    # Or see example below: 
    # data_prefix: 
    #   - .5
    #   - /raid/data/pile/my-gpt3_00_text_document
    #   - .5
    #   - /raid/data/pile/my-gpt3_01_text_document
    data_prefix: ???
    index_mapping_dir: null # path to save index mapping .npy files, by default will save in the same location as data_prefix
    data_impl: csv_map
    splits_string: 900,50,50
    seq_length: ${model.encoder_seq_length}
    skip_warmup: True
    num_workers: 2
    dataloader_type: single # cyclic
    reset_position_ids: False # Reset position ids after end-of-document token
    reset_attention_mask: False # Reset attention mask after end-of-document token
    eod_mask_loss: False # Mask loss for the end of document tokens
    # masked_lm_prob: 0.15
    # short_seq_prob: 0.1

  # Nsys profiling options
  nsys_profile:
    enabled: False
    start_step: 10  # Global batch to start profiling
    end_step: 10 # Global batch to end profiling
    ranks: [0] # Global rank IDs to profile
    gen_shape: False # Generate model and kernel details including input shapes

  optim:
    name: fused_adam
    lr: 2e-4
    weight_decay: 0.01 
    betas: 
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 500
      constant_steps: 50000
      min_lr: 2e-5

Input train, valid, and test files I am using

The small_train, small_valid, and small_test files look like:

0,|<startoftext>|sentence[LABEL]:Target|<endoftext>|
1,|<startoftext>|sentence[LABEL]:Target|<endoftext>|
2,|<startoftext>|sentence[LABEL]:Target|<endoftext>|

Expected behaviour

The train, valid, and test files for the GPT model should be processed correctly; a similar, properly formatted file was processed correctly for Megatron-BERT. For the Megatron-BERT model (a binary classification problem), the input train file looked something like this:

0,sentence,label1
1,sentence,label2
2,sentence,label1

Additional context

Please correct me if my input is formatted incorrectly for a GPT model. I could not find any suitable example for fine-tuning/pretraining the NeMo-Megatron 20B model, nor any sample input files showing how they should be formatted so that GPT processes them correctly.

MaximumEntropy commented 1 year ago

Can you try binarizing your data (i.e., formatting it as a JSONL file and running it through this script)? https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py

File format:

{"text": "Example 1"}
{"text": "Example 2"}
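
A minimal conversion sketch (assuming the index,text layout from the samples above; file names are placeholders):

import json

def csv_to_jsonl(csv_path: str, jsonl_path: str) -> None:
    """Rewrite index,text rows as JSONL lines with a "text" field."""
    with open(csv_path) as src, open(jsonl_path, "w") as dst:
        for line in src:
            # split only on the first comma: running index, then example text
            _, text = line.rstrip("\n").split(",", 1)
            dst.write(json.dumps({"text": text}) + "\n")

csv_to_jsonl("small_train", "small_train.jsonl")

The resulting JSONL file is what preprocess_data_for_megatron.py takes as --input to produce the binary .bin/.idx files.
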
mayank-nference commented 1 year ago

Thanks for the reply @MaximumEntropy. I will convert the data to the format you shared and will ping here in case it still doesn't work.

MaximumEntropy commented 1 year ago

Also, I noticed that you are trying to use csv_map. I don't think we support that for GPT-based models yet; please use data_impl: mmap.
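
Applied to the reproduction script above, the change would look something like this (a sketch; the exact _text_document suffix depends on the --output-prefix and JSON key given to preprocess_data_for_megatron.py, so treat the path as a placeholder):

# use the binary indexed dataset format instead of csv_mmap/csv_map
cfg.model.data.data_impl = "mmap"
# point at the preprocessed output prefix, i.e. the path of the
# generated .bin/.idx pair without its extension (placeholder path)
cfg.model.data.data_prefix = [1.0, "/home/mayanksharma/nemo_data/small_train_text_document"]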

mayank-nference commented 1 year ago

@MaximumEntropy I need some small help understanding the following fields in megatron_gpt_config.yaml:

  1. model.data.data_prefix: what is the correct way to pass this parameter? Currently I am passing the value as [1.0,"/home/mayanksharma/nemo_data/train_binary_files/_text_document",2,"/home/mayanksharma/nemo_data/valid_binary_files/_text_document",3, "/home/mayanksharma/nemo_data/test_binary_files/_text_document"]
  2. model.data.splits_string: what is splits_string? Currently I am using only the default values, 900,50,50; can I change them? My current reading of these values is sketched below; please correct me if it is wrong.
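
How I currently read the splits_string values (an assumption from skimming gpt_dataset.py, not verified):

# splits_string appears to give relative weights that are normalized over
# the corpus to carve out train/validation/test document ranges
splits = [900, 50, 50]
total = sum(splits)                        # 1000
fractions = [s / total for s in splits]    # [0.90, 0.05, 0.05]
# i.e. roughly 90% of documents for train, 5% for validation, 5% for test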

I also need to understand how to load my fine-tuned GPT 20B model for inference, as I want to validate the trained model on a golden dataset.

I would really appreciate it if you could help me with the above queries.

mayank-nference commented 1 year ago

@MaximumEntropy I want to fine-tune the pre-trained NeMo-Megatron-GPT 20B model on prompt classification, and I got the following error. The script above was for pre-training, and it worked successfully after the changes you mentioned. For the downstream prompt-classification task I am now referring to the following adapter tuning script - https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/tuning/megatron_gpt_adapter_tuning.py - and config file - https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/tuning/conf/megatron_gpt_adapter_tuning_config.yaml - and I am fine-tuning this model: https://huggingface.co/nvidia/nemo-megatron-gpt-20B/tree/main

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/mayanksharma/nemo_gpt_fintuning.py:109 in <module>                                         │
│                                                                                                  │
│   106                                                                                            │
│   107                                                                                            │
│   108 if __name__ == '__main__':                                                                 │
│ ❱ 109 │   main()                                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py:104 in      │
│ wrapper                                                                                          │
│                                                                                                  │
│   101 │   │   │   │                                                                              │
│   102 │   │   │   │   # no return value from run_hydra() as it may sometime actually run the t   │
│   103 │   │   │   │   # multiple times (--multirun)                                              │
│ ❱ 104 │   │   │   │   _run_hydra(                                                                │
│   105 │   │   │   │   │   args_parser=_argparse_wrapper(args),                                   │
│   106 │   │   │   │   │   task_function=task_function,                                           │
│   107 │   │   │   │   │   config_path=config_path,                                               │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:377 in _run_hydra   │
│                                                                                                  │
│   374 │   │   if num_commands == 0:                                                              │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│ ❱ 377 │   │   │   run_and_report(                                                                │
│   378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:214 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│ ❱ 214 │   │   │   raise ex                                                                       │
│   215 │   │   else:                                                                              │
│   216 │   │   │   try:                                                                           │
│   217 │   │   │   │   if isinstance(ex, CompactHydraException):                                  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:211 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   208                                                                                            │
│   209 def run_and_report(func: Any) -> Any:                                                      │
│   210 │   try:                                                                                   │
│ ❱ 211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│   214 │   │   │   raise ex                                                                       │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:378 in <lambda>     │
│                                                                                                  │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│   377 │   │   │   run_and_report(                                                                │
│ ❱ 378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│   381 │   │   │   │   │   overrides=args.overrides,                                              │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/hydra.py:111 in run          │
│                                                                                                  │
│   108 │   │   callbacks.on_run_end(config=cfg, config_name=config_name, job_return=ret)          │
│   109 │   │                                                                                      │
│   110 │   │   # access the result to trigger an exception in case the job failed.                │
│ ❱ 111 │   │   _ = ret.return_value                                                               │
│   112 │   │                                                                                      │
│   113 │   │   return ret                                                                         │
│   114                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:233 in return_value      │
│                                                                                                  │
│   230 │   │   │   sys.stderr.write(                                                              │
│   231 │   │   │   │   f"Error executing job with overrides: {self.overrides}" + os.linesep       │
│   232 │   │   │   )                                                                              │
│ ❱ 233 │   │   │   raise self._return_value                                                       │
│   234 │                                                                                          │
│   235 │   @return_value.setter                                                                   │
│   236 │   def return_value(self, value: Any) -> None:                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:160 in run_job           │
│                                                                                                  │
│   157 │   │   with env_override(hydra_cfg.hydra.job.env_set):                                    │
│   158 │   │   │   callbacks.on_job_start(config=config)                                          │
│   159 │   │   │   try:                                                                           │
│ ❱ 160 │   │   │   │   ret.return_value = task_function(task_cfg)                                 │
│   161 │   │   │   │   ret.status = JobStatus.COMPLETED                                           │
│   162 │   │   │   except Exception as e:                                                         │
│   163 │   │   │   │   ret.return_value = e                                                       │
│                                                                                                  │
│ /home/mayanksharma/nemo_gpt_fintuning.py:103 in main                                             │
│                                                                                                  │
│   100 │   │   │   cfg.model.restore_path, cfg.model, trainer=trainer, save_restore_connector=N   │
│   101 │   │   )                                                                                  │
│   102 │   else:                                                                                  │
│ ❱ 103 │   │   model = MegatronGPTAdapterLearningModel(cfg.model, trainer=trainer)                │
│   104 │                                                                                          │
│   105 │   trainer.fit(model)                                                                     │
│   106                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modelin │
│ g/megatron_gpt_adapter_model.py:94 in __init__                                                   │
│                                                                                                  │
│    91 │   │   for _, module in self.frozen_model.named_modules():                                │
│    92 │   │   │   if isinstance(module, adapter_mixins.AdapterModuleMixin):                      │
│    93 │   │   │   │   for adapter_key in self.adapter_name_keys:                                 │
│ ❱  94 │   │   │   │   │   module.add_adapter(                                                    │
│    95 │   │   │   │   │   │   name=adapter_key, cfg=adapter_cfg,                                 │
│    96 │   │   │   │   │   )                                                                      │
│    97                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/core/classes/mixins/adapter_mixins.py:1 │
│ 64 in add_adapter                                                                                │
│                                                                                                  │
│   161 │   │   """                                                                                │
│   162 │   │   # Convert to DictConfig from dict or Dataclass                                     │
│   163 │   │   if is_dataclass(cfg):                                                              │
│ ❱ 164 │   │   │   cfg = OmegaConf.structured(cfg)                                                │
│   165 │   │                                                                                      │
│   166 │   │   if not isinstance(cfg, DictConfig):                                                │
│   167 │   │   │   cfg = DictConfig(cfg)                                                          │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:123 in structured     │
│                                                                                                  │
│    120 │   │   parent: Optional[BaseContainer] = None,                                           │
│    121 │   │   flags: Optional[Dict[str, bool]] = None,                                          │
│    122 │   ) -> Any:                                                                             │
│ ❱  123 │   │   return OmegaConf.create(obj, parent, flags)                                       │
│    124 │                                                                                         │
│    125 │   @staticmethod                                                                         │
│    126 │   @overload                                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:176 in create         │
│                                                                                                  │
│    173 │   │   parent: Optional[BaseContainer] = None,                                           │
│    174 │   │   flags: Optional[Dict[str, bool]] = None,                                          │
│    175 │   ) -> Union[DictConfig, ListConfig]:                                                   │
│ ❱  176 │   │   return OmegaConf._create_impl(                                                    │
│    177 │   │   │   obj=obj,                                                                      │
│    178 │   │   │   parent=parent,                                                                │
│    179 │   │   │   flags=flags,                                                                  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:885 in _create_impl   │
│                                                                                                  │
│    882 │   │   │   │   │   │   │   f"Object of unsupported type: '{type(obj).__name__}'"         │
│    883 │   │   │   │   │   │   )                                                                 │
│    884 │   │   except OmegaConfBaseException as e:                                               │
│ ❱  885 │   │   │   format_and_raise(node=None, key=None, value=None, msg=str(e), cause=e)        │
│    886 │   │   │   assert False                                                                  │
│    887 │                                                                                         │
│    888 │   @staticmethod                                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:820 in format_and_raise  │
│                                                                                                  │
│    817 │   │   if type_override is not None:                                                     │
│    818 │   │   │   ex = type_override(str(cause))                                                │
│    819 │   │   │   ex.__dict__ = copy.deepcopy(cause.__dict__)                                   │
│ ❱  820 │   │   _raise(ex, cause)                                                                 │
│    821 │                                                                                         │
│    822 │   object_type: Optional[Type[Any]]                                                      │
│    823 │   object_type_str: Optional[str] = None                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:846 in _create_impl   │
│                                                                                                  │
│    843 │   │   │   │   │   else:                                                                 │
│    844 │   │   │   │   │   │   obj_type = OmegaConf.get_type(obj)                                │
│    845 │   │   │   │   │   │   key_type, element_type = get_dict_key_value_types(obj_type)       │
│ ❱  846 │   │   │   │   │   │   return DictConfig(                                                │
│    847 │   │   │   │   │   │   │   content=obj,                                                  │
│    848 │   │   │   │   │   │   │   parent=parent,                                                │
│    849 │   │   │   │   │   │   │   key_type=key_type,                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:111 in __init__      │
│                                                                                                  │
│   108 │   │   │   │   │   self.__dict__["_metadata"] = metadata                                  │
│   109 │   │   │   │   self._set_value(content, flags=flags)                                      │
│   110 │   │   except Exception as ex:                                                            │
│ ❱ 111 │   │   │   format_and_raise(node=None, key=key, value=None, cause=ex, msg=str(ex))        │
│   112 │                                                                                          │
│   113 │   def __deepcopy__(self, memo: Dict[int, Any]) -> "DictConfig":                          │
│   114 │   │   res = DictConfig(None)                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:820 in format_and_raise  │
│                                                                                                  │
│    817 │   │   if type_override is not None:                                                     │
│    818 │   │   │   ex = type_override(str(cause))                                                │
│    819 │   │   │   ex.__dict__ = copy.deepcopy(cause.__dict__)                                   │
│ ❱  820 │   │   _raise(ex, cause)                                                                 │
│    821 │                                                                                         │
│    822 │   object_type: Optional[Type[Any]]                                                      │
│    823 │   object_type_str: Optional[str] = None                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:94 in __init__       │
│                                                                                                  │
│    91 │   │   │   │   raise KeyValidationError(f"Unsupported key type {key_type}")               │
│    92 │   │   │                                                                                  │
│    93 │   │   │   if is_structured_config(content) or is_structured_config(ref_type):            │
│ ❱  94 │   │   │   │   self._set_value(content, flags=flags)                                      │
│    95 │   │   │   │   if is_structured_config_frozen(content) or is_structured_config_frozen(    │
│    96 │   │   │   │   │   ref_type                                                               │
│    97 │   │   │   │   ):                                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:647 in _set_value    │
│                                                                                                  │
│   644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│ ❱ 647 │   │   │   raise e                                                                        │
│   648 │                                                                                          │
│   649 │   def _set_value_impl(                                                                   │
│   650 │   │   self, value: Any, flags: Optional[Dict[str, bool]] = None                          │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:644 in _set_value    │
│                                                                                                  │
│   641 │   def _set_value(self, value: Any, flags: Optional[Dict[str, bool]] = None) -> None:     │
│   642 │   │   try:                                                                               │
│   643 │   │   │   previous_content = self.__dict__["_content"]                                   │
│ ❱ 644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│   647 │   │   │   raise e                                                                        │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:674 in               │
│ _set_value_impl                                                                                  │
│                                                                                                  │
│   671 │   │   │   if is_structured_config(value):                                                │
│   672 │   │   │   │   self._metadata.object_type = None                                          │
│   673 │   │   │   │   ao = self._get_flag("allow_objects")                                       │
│ ❱ 674 │   │   │   │   data = get_structured_config_data(value, allow_objects=ao)                 │
│   675 │   │   │   │   with flag_override(self, ["struct", "readonly"], False):                   │
│   676 │   │   │   │   │   for k, v in data.items():                                              │
│   677 │   │   │   │   │   │   self.__setitem__(k, v)                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:494 in                   │
│ get_structured_config_data                                                                       │
│                                                                                                  │
│    491 │   obj: Any, allow_objects: Optional[bool] = None                                        │
│    492 ) -> Dict[str, Any]:                                                                      │
│    493 │   if is_dataclass(obj):                                                                 │
│ ❱  494 │   │   return get_dataclass_data(obj, allow_objects=allow_objects)                       │
│    495 │   elif is_attr_class(obj):                                                              │
│    496 │   │   return get_attr_data(obj, allow_objects=allow_objects)                            │
│    497 │   else:                                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:422 in                   │
│ get_dataclass_data                                                                               │
│                                                                                                  │
│    419 │   │   │   │   parent=dummy_parent,                                                      │
│    420 │   │   │   )                                                                             │
│    421 │   │   except (ValidationError, GrammarParseError) as ex:                                │
│ ❱  422 │   │   │   format_and_raise(                                                             │
│    423 │   │   │   │   node=dummy_parent, key=name, value=value, cause=ex, msg=str(ex)           │
│    424 │   │   │   )                                                                             │
│    425 │   │   d[name]._set_parent(None)                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:820 in format_and_raise  │
│                                                                                                  │
│    817 │   │   if type_override is not None:                                                     │
│    818 │   │   │   ex = type_override(str(cause))                                                │
│    819 │   │   │   ex.__dict__ = copy.deepcopy(cause.__dict__)                                   │
│ ❱  820 │   │   _raise(ex, cause)                                                                 │
│    821 │                                                                                         │
│    822 │   object_type: Optional[Type[Any]]                                                      │
│    823 │   object_type_str: Optional[str] = None                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:414 in                   │
│ get_dataclass_data                                                                               │
│                                                                                                  │
│    411 │   │   │   )                                                                             │
│    412 │   │   │   format_and_raise(node=None, key=None, value=value, cause=e, msg=str(e))       │
│    413 │   │   try:                                                                              │
│ ❱  414 │   │   │   d[name] = _maybe_wrap(                                                        │
│    415 │   │   │   │   ref_type=type_,                                                           │
│    416 │   │   │   │   is_optional=is_optional,                                                  │
│    417 │   │   │   │   key=name,                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:1090 in _maybe_wrap   │
│                                                                                                  │
│   1087 │   │   value._set_parent(parent)                                                         │
│   1088 │   │   return value                                                                      │
│   1089 │   else:                                                                                 │
│ ❱ 1090 │   │   return _node_wrap(                                                                │
│   1091 │   │   │   ref_type=ref_type,                                                            │
│   1092 │   │   │   parent=parent,                                                                │
│   1093 │   │   │   is_optional=is_optional,                                                      │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:989 in _node_wrap     │
│                                                                                                  │
│    986 │   node: Node                                                                            │
│    987 │   if is_dict_annotation(ref_type) or (is_primitive_dict(value) and ref_type is Any):    │
│    988 │   │   key_type, element_type = get_dict_key_value_types(ref_type)                       │
│ ❱  989 │   │   node = DictConfig(                                                                │
│    990 │   │   │   content=value,                                                                │
│    991 │   │   │   key=key,                                                                      │
│    992 │   │   │   parent=parent,                                                                │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:111 in __init__      │
│                                                                                                  │
│   108 │   │   │   │   │   self.__dict__["_metadata"] = metadata                                  │
│   109 │   │   │   │   self._set_value(content, flags=flags)                                      │
│   110 │   │   except Exception as ex:                                                            │
│ ❱ 111 │   │   │   format_and_raise(node=None, key=key, value=None, cause=ex, msg=str(ex))        │
│   112 │                                                                                          │
│   113 │   def __deepcopy__(self, memo: Dict[int, Any]) -> "DictConfig":                          │
│   114 │   │   res = DictConfig(None)                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:900 in format_and_raise  │
│                                                                                                  │
│    897 │   │   ex.ref_type = ref_type                                                            │
│    898 │   │   ex.ref_type_str = ref_type_str                                                    │
│    899 │                                                                                         │
│ ❱  900 │   _raise(ex, cause)                                                                     │
│    901                                                                                           │
│    902                                                                                           │
│    903 def type_str(t: Any, include_module_name: bool = False) -> str:                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:94 in __init__       │
│                                                                                                  │
│    91 │   │   │   │   raise KeyValidationError(f"Unsupported key type {key_type}")               │
│    92 │   │   │                                                                                  │
│    93 │   │   │   if is_structured_config(content) or is_structured_config(ref_type):            │
│ ❱  94 │   │   │   │   self._set_value(content, flags=flags)                                      │
│    95 │   │   │   │   if is_structured_config_frozen(content) or is_structured_config_frozen(    │
│    96 │   │   │   │   │   ref_type                                                               │
│    97 │   │   │   │   ):                                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:647 in _set_value    │
│                                                                                                  │
│   644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│ ❱ 647 │   │   │   raise e                                                                        │
│   648 │                                                                                          │
│   649 │   def _set_value_impl(                                                                   │
│   650 │   │   self, value: Any, flags: Optional[Dict[str, bool]] = None                          │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:644 in _set_value    │
│                                                                                                  │
│   641 │   def _set_value(self, value: Any, flags: Optional[Dict[str, bool]] = None) -> None:     │
│   642 │   │   try:                                                                               │
│   643 │   │   │   previous_content = self.__dict__["_content"]                                   │
│ ❱ 644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│   647 │   │   │   raise e                                                                        │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:658 in               │
│ _set_value_impl                                                                                  │
│                                                                                                  │
│   655 │   │   │   flags = {}                                                                     │
│   656 │   │                                                                                      │
│   657 │   │   assert not isinstance(value, ValueNode)                                            │
│ ❱ 658 │   │   self._validate_set(key=None, value=value)                                          │
│   659 │   │                                                                                      │
│   660 │   │   if _is_none(value, resolve=True):                                                  │
│   661 │   │   │   self.__dict__["_content"] = None                                               │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:200 in _validate_set │
│                                                                                                  │
│   197 │   │   if is_container_annotation(target_type) and not is_container_annotation(           │
│   198 │   │   │   value_type                                                                     │
│   199 │   │   ):                                                                                 │
│ ❱ 200 │   │   │   raise ValidationError(                                                         │
│   201 │   │   │   │   f"Cannot assign {type_str(value_type)} to {type_str(target_type)}"         │
│   202 │   │   │   )                                                                              │
│   203                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: Cannot assign ResidualAddAdapterStrategyConfig to Dict[Any, Any]
    full_key: adapter_strategy
    object_type=None
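For anyone landing on the same trace: this is the generic omegaconf complaint raised when a dataclass field annotated as a plain `Dict` is handed a structured-config (dataclass) value. Below is a minimal, NeMo-independent sketch of the same failure mode; all class names are hypothetical stand-ins, not the actual NeMo adapter classes, and the behavior is as observed with omegaconf 2.x:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

from omegaconf import OmegaConf
from omegaconf.errors import ValidationError


@dataclass
class StrategyConfig:
    """Hypothetical stand-in for ResidualAddAdapterStrategyConfig."""
    scale: float = 1.0


@dataclass
class AdapterConfig:
    # The annotation promises a plain dict, but the default is a dataclass
    # instance, and omegaconf refuses to coerce one into the other.
    adapter_strategy: Dict[Any, Any] = field(default_factory=StrategyConfig)


try:
    OmegaConf.structured(AdapterConfig)
except ValidationError as e:
    print(e)  # Cannot assign StrategyConfig to Dict[Any, Any]
```

If this sketch matches your stack's behavior, the mismatch lives in the Python-side config dataclasses and their annotations, not in the YAML values themselves.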

I am using the following fine-tuning config:

name: adapter_tuning_${model.new_tasks[0]}_max_epochs${trainer.max_epochs}_adapter_dim${model.adapter_tuning.adapter_dim}

trainer:
  devices: 4
  accelerator: gpu
  num_nodes: 1
  precision: 16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  replace_sampler_ddp: False
  max_epochs: 10
  max_steps: -1 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10
  val_check_interval: 0.2
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  benchmark: False

exp_manager:
  explicit_log_dir: null
  exp_dir: "nemo_experiment"
  name: ${name}
  create_wandb_logger: null
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 1
    mode: min
    save_nemo_on_train_end: True # Should be False; the correct prompt learning model file is saved at model.nemo_path set below
    filename: 'megatron_gpt_adapter_tuning--{val_loss:.3f}-{step}'
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: True

model:
  seed: 1234
  nemo_path: ${exp_manager.exp_dir}/${name}.nemo # .nemo filename/absolute path to where the virtual prompt model parameters will be saved
  virtual_prompt_style: 'no-prompts' # adapter tuning requires no virtual prompts
  encoder_seq_length: 2048 
  gradient_as_bucket_view: false
  tensor_model_parallel_size: 1 # intra-layer model parallelism
  pipeline_model_parallel_size: 4 # inter-layer model parallelism
  global_batch_size: 4
  micro_batch_size: 1

  restore_path: null # Path to an existing adapter .nemo model you wish to add new tasks to or run inference with
  language_model_path: "/home/mayanksharma/nemo_model/nemo-megatron-gpt-20B/nemo_gpt20B_bf16_tp4.nemo" # Path to the GPT language model .nemo file, always required
  existing_tasks: [] # List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given
  new_tasks: ["drug_to_target_cls"] # List of new tasknames to be prompt-tuned

  task_templates: # Add more/replace tasks as needed, these are just examples

  - taskname: "drug_to_target_cls" # Drug-to-target classification
    prompt_template: "<|VIRTUAL_PROMPT_0|> Sentence: {sentence} \nLabel: {label}" 
    total_virtual_tokens: 2048
    virtual_token_splits: []
    truncate_field: null
    answer_only_loss: True
    answer_field: "label"

#   - taskname: "boolq" # The task name
#     prompt_template: "Passage: {passage} \nQuestion: {question} \nAnswer: {answer}" # Prompt template for task, specify virtual prompt positions with <|VIRTUAL_PROMPT_#|>
#     total_virtual_tokens: 0 # Sum of tokens in virtual_token_splits must add to this number. Can differ between new and existing tasks, but must match across all new tasks being tuned at the same time.
#     virtual_token_splits: [] # number of virtual tokens to be inserted at each VIRTUAL PROMPT location, must add to total_virtual_tokens
#     truncate_field: "passage" # The {field} in the prompt template whose text will be truncated if the input is too long, if null, inputs that are too long will just be skipped.
#     answer_only_loss: True 
#     answer_field: "answer"

#   - taskname: "intent_and_slot" # Intent Detection and Slot Filling
#     prompt_template: "intent options: {intent_options} slot options: {slot_options} {utterance} \nintent: {intent} \nslot: {slot}"
#     total_virtual_tokens: 0 
#     answer_only_loss: False 
#     virtual_token_splits: []
#     truncate_field: null

#   - taskname: "rte" # Recognizing Textual Entailment
#     prompt_template: "sentence1: {premise} sentence2: {hypothesis} Answer: {answer}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "answer"

#   - taskname: "squad" # Standford Question-Answering
#     prompt_template: "context: {context} question: {question} answer: {answer}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "answer"

#   - taskname: "arc-challenge" # Abstraction and Reasoning Challenge
#     prompt_template: "question: {question} choices: {choices} answer: {answer}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "answer"

#   - taskname: "xsum" # Extreme Summarization
#     prompt_template: "{source} Summary: {target}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "target"

  adapter_tuning:
    type: 'parallel_adapter' # this should be either 'parallel_adapter' or 'linear_adapter'
    adapter_dim: 50
    adapter_dropout: 0.1
    norm_position: 'pre' # This can be set to 'pre' or 'post', 'pre' is normally what is used.
    column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal
    row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal
    norm_type: 'mixedfusedlayernorm' # IGNORED if linear_adapter is used; options are ['layernorm', 'mixedfusedlayernorm']

  data:
    train_ds: "nemo_data/json_files/train.jsonl" # expects a list of paths to training data files
    validation_ds: "nemo_data/json_files/valid.jsonl"  # expects a path to validation data files
    add_eos: True
    shuffle: True
    num_workers: 24
    pin_memory: True

  optim:
    name: fused_adam
    lr: 1e-4
    weight_decay: 0.01 
    betas: 
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 50
      constant_steps: 0 # Constant steps should also be 0 when min_lr=0
      min_lr: 0.0 # min_lr must be 0.0 for prompt learning
      monitor: val_loss
      reduce_on_plateau: false

Please let me know if I am using the wrong config or pretraining script.

MaximumEntropy commented 1 year ago

Regarding your first two questions

  1. Regarding how to provide data_prefix: that is the correct format, but it is not meant to take train/val/test separately. It is meant to take a dataset (or several) and internally create the train/val/test splits from it. So if you specify something like `[0.3, "/path/to/data_1", 0.4, "/path/to/data_2", 0.3, "/path/to/data_3"]`, it will sample from these three datasets with ratios 0.3, 0.4, and 0.3 respectively.
  2. "What is split_string? Currently I am using the default values 900,50,50; can I change these values?" The split string specifies the train/val/test ratio, so you can use something like `80,10,10` for 80% train and 10% each for validation and test (see the YAML sketch after this list).
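A hedged YAML sketch of how the two options could look together in the pretraining config (key names as in the stock megatron_gpt_config.yaml; the paths, weights, and ratios below are placeholders):

```yaml
model:
  data:
    # Weighted blend: sample roughly 30% / 40% / 30% from three datasets.
    data_prefix:
      - 0.3
      - /path/to/data_1_text_document
      - 0.4
      - /path/to/data_2_text_document
      - 0.3
      - /path/to/data_3_text_document
    # Train/validation/test ratio carved out of the blended data.
    splits_string: "80,10,10"
```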

@arendu do you have any idea about the adapter error?

mayank-nference commented 1 year ago

Thanks for the clear explanation, @MaximumEntropy.

mayank-nference commented 1 year ago

@MaximumEntropy @arendu, it's been 5 days; can someone please help me with the adapter issue?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.