NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Unable to process dataset for NeMo-Megatron-GPT 20B model #5542

Closed · mayank-nference closed 1 year ago

mayank-nference commented 1 year ago

Describe the bug

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/mayanksharma/nemo_gpt_fintuning.py:261 in <module>                                         │
│                                                                                                  │
│   258                                                                                            │
│   259                                                                                            │
│   260 if __name__ == '__main__':                                                                 │
│ ❱ 261 │   main()                                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py:104 in      │
│ wrapper                                                                                          │
│                                                                                                  │
│   101 │   │   │   │                                                                              │
│   102 │   │   │   │   # no return value from run_hydra() as it may sometime actually run the t   │
│   103 │   │   │   │   # multiple times (--multirun)                                              │
│ ❱ 104 │   │   │   │   _run_hydra(                                                                │
│   105 │   │   │   │   │   args_parser=_argparse_wrapper(args),                                   │
│   106 │   │   │   │   │   task_function=task_function,                                           │
│   107 │   │   │   │   │   config_path=config_path,                                               │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:377 in _run_hydra   │
│                                                                                                  │
│   374 │   │   if num_commands == 0:                                                              │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│ ❱ 377 │   │   │   run_and_report(                                                                │
│   378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:214 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│ ❱ 214 │   │   │   raise ex                                                                       │
│   215 │   │   else:                                                                              │
│   216 │   │   │   try:                                                                           │
│   217 │   │   │   │   if isinstance(ex, CompactHydraException):                                  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:211 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   208                                                                                            │
│   209 def run_and_report(func: Any) -> Any:                                                      │
│   210 │   try:                                                                                   │
│ ❱ 211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│   214 │   │   │   raise ex                                                                       │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:378 in <lambda>     │
│                                                                                                  │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│   377 │   │   │   run_and_report(                                                                │
│ ❱ 378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│   381 │   │   │   │   │   overrides=args.overrides,                                              │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/hydra.py:111 in run          │
│                                                                                                  │
│   108 │   │   callbacks.on_run_end(config=cfg, config_name=config_name, job_return=ret)          │
│   109 │   │                                                                                      │
│   110 │   │   # access the result to trigger an exception in case the job failed.                │
│ ❱ 111 │   │   _ = ret.return_value                                                               │
│   112 │   │                                                                                      │
│   113 │   │   return ret                                                                         │
│   114                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:233 in return_value      │
│                                                                                                  │
│   230 │   │   │   sys.stderr.write(                                                              │
│   231 │   │   │   │   f"Error executing job with overrides: {self.overrides}" + os.linesep       │
│   232 │   │   │   )                                                                              │
│ ❱ 233 │   │   │   raise self._return_value                                                       │
│   234 │                                                                                          │
│   235 │   @return_value.setter                                                                   │
│   236 │   def return_value(self, value: Any) -> None:                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:160 in run_job           │
│                                                                                                  │
│   157 │   │   with env_override(hydra_cfg.hydra.job.env_set):                                    │
│   158 │   │   │   callbacks.on_job_start(config=config)                                          │
│   159 │   │   │   try:                                                                           │
│ ❱ 160 │   │   │   │   ret.return_value = task_function(task_cfg)                                 │
│   161 │   │   │   │   ret.status = JobStatus.COMPLETED                                           │
│   162 │   │   │   except Exception as e:                                                         │
│   163 │   │   │   │   ret.return_value = e                                                       │
│                                                                                                  │
│ /home/mayanksharma/nemo_gpt_fintuning.py:256 in main                                             │
│                                                                                                  │
│   253 │                                                                                          │
│   254 │   model = MegatronGPTModel(cfg.model, trainer)                                           │
│   255 │                                                                                          │
│ ❱ 256 │   trainer.fit(model)                                                                     │
│   257 │   # trainer.fit(model, train_dataloaders=train_dataloader, val_dataloaders=valid_datal   │
│   258                                                                                            │
│   259                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:696 in  │
│ fit                                                                                              │
│                                                                                                  │
│    693 │   │   │   datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.Lightn  │
│    694 │   │   """                                                                               │
│    695 │   │   self.strategy.model = model                                                       │
│ ❱  696 │   │   self._call_and_handle_interrupt(                                                  │
│    697 │   │   │   self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_  │
│    698 │   │   )                                                                                 │
│    699                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:648 in  │
│ _call_and_handle_interrupt                                                                       │
│                                                                                                  │
│    645 │   │   """                                                                               │
│    646 │   │   try:                                                                              │
│    647 │   │   │   if self.strategy.launcher is not None:                                        │
│ ❱  648 │   │   │   │   return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **  │
│    649 │   │   │   else:                                                                         │
│    650 │   │   │   │   return trainer_fn(*args, **kwargs)                                        │
│    651 │   │   # TODO(awaelchli): Unify both exceptions below, where `KeyboardError` doesn't re  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subpr │
│ ocess_script.py:93 in launch                                                                     │
│                                                                                                  │
│    90 │   │   """                                                                                │
│    91 │   │   if not self.cluster_environment.creates_processes_externally:                      │
│    92 │   │   │   self._call_children_scripts()                                                  │
│ ❱  93 │   │   return function(*args, **kwargs)                                                   │
│    94 │                                                                                          │
│    95 │   def _call_children_scripts(self) -> None:                                              │
│    96 │   │   # bookkeeping of spawned processes                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:735 in  │
│ _fit_impl                                                                                        │
│                                                                                                  │
│    732 │   │   self._ckpt_path = self.__set_ckpt_path(                                           │
│    733 │   │   │   ckpt_path, model_provided=True, model_connected=self.lightning_module is not  │
│    734 │   │   )                                                                                 │
│ ❱  735 │   │   results = self._run(model, ckpt_path=self.ckpt_path)                              │
│    736 │   │                                                                                     │
│    737 │   │   assert self.state.stopped                                                         │
│    738 │   │   self.training = False                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1105 in │
│ _run                                                                                             │
│                                                                                                  │
│   1102 │   │   self.strategy.setup_environment()                                                 │
│   1103 │   │   self.__setup_profiler()                                                           │
│   1104 │   │                                                                                     │
│ ❱ 1105 │   │   self._call_setup_hook()  # allow user to setup lightning_module in accelerator e  │
│   1106 │   │                                                                                     │
│   1107 │   │   # check if we should delay restoring checkpoint till later                        │
│   1108 │   │   if not self.strategy.restore_checkpoint_after_setup:                              │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1449 in │
│ _call_setup_hook                                                                                 │
│                                                                                                  │
│   1446 │   │   if self.datamodule is not None:                                                   │
│   1447 │   │   │   self._call_lightning_datamodule_hook("setup", stage=fn)                       │
│   1448 │   │   self._call_callback_hooks("setup", stage=fn)                                      │
│ ❱ 1449 │   │   self._call_lightning_module_hook("setup", stage=fn)                               │
│   1450 │   │                                                                                     │
│   1451 │   │   self.strategy.barrier("post_setup")                                               │
│   1452                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1550 in │
│ _call_lightning_module_hook                                                                      │
│                                                                                                  │
│   1547 │   │   pl_module._current_fx_name = hook_name                                            │
│   1548 │   │                                                                                     │
│   1549 │   │   with self.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{ho  │
│ ❱ 1550 │   │   │   output = fn(*args, **kwargs)                                                  │
│   1551 │   │                                                                                     │
│   1552 │   │   # restore current_fx when nested context                                          │
│   1553 │   │   pl_module._current_fx_name = prev_fx_name                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modelin │
│ g/megatron_gpt_model.py:604 in setup                                                             │
│                                                                                                  │
│   601 │   │   else:                                                                              │
│   602 │   │   │   # TODO: consider adding a ModelPT guard to check if model is being restored.   │
│   603 │   │   │   # allowing restored models to optionally setup datasets                        │
│ ❱ 604 │   │   │   self.build_train_valid_test_datasets()                                         │
│   605 │   │   │   self.setup_training_data(self.cfg.data)                                        │
│   606 │   │   │   self.setup_validation_data(self.cfg.data)                                      │
│   607 │   │   │   self.setup_test_data(self.cfg.data)                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modelin │
│ g/megatron_gpt_model.py:519 in build_train_valid_test_datasets                                   │
│                                                                                                  │
│   516 │   │   │   eval_iters * global_batch_size,                                                │
│   517 │   │   │   test_iters * global_batch_size,                                                │
│   518 │   │   ]                                                                                  │
│ ❱ 519 │   │   self._train_ds, self._validation_ds, self._test_ds = build_train_valid_test_data   │
│   520 │   │   │   cfg=self.cfg,                                                                  │
│   521 │   │   │   trainer=self.trainer,                                                          │
│   522 │   │   │   data_prefix=self.cfg.data.data_prefix,                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/ │
│ megatron/gpt_dataset.py:81 in build_train_valid_test_datasets                                    │
│                                                                                                  │
│    78 │   valid_datasets = []                                                                    │
│    79 │   test_datasets = []                                                                     │
│    80 │   for i in range(len(prefixes)):                                                         │
│ ❱  81 │   │   train_ds, valid_ds, test_ds = _build_train_valid_test_datasets(                    │
│    82 │   │   │   cfg,                                                                           │
│    83 │   │   │   trainer,                                                                       │
│    84 │   │   │   prefixes[i],                                                                   │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/ │
│ megatron/gpt_dataset.py:131 in _build_train_valid_test_datasets                                  │
│                                                                                                  │
│   128 │   """Build train, valid, and test datasets."""                                           │
│   129 │                                                                                          │
│   130 │   # Indexed dataset.                                                                     │
│ ❱ 131 │   indexed_dataset = get_indexed_dataset_(data_prefix, data_impl, skip_warmup)            │
│   132 │                                                                                          │
│   133 │   total_num_of_documents = indexed_dataset.sizes.shape[0]                                │
│   134 │   splits = get_train_valid_test_split_(splits_string, total_num_of_documents)            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/ │
│ megatron/gpt_dataset.py:183 in get_indexed_dataset_                                              │
│                                                                                                  │
│   180 │   indexed_dataset = make_indexed_dataset(data_prefix, data_impl, skip_warmup)            │
│   181 │   logging.info(f"indexed_dataset: {indexed_dataset.__dict__}")                           │
│   182 │   logging.info(' > finished creating indexed dataset in {:4f} ' 'seconds'.format(time.   │
│ ❱ 183 │   logging.info('    number of documents: {}'.format(indexed_dataset.sizes.shape[0]))     │
│   184 │                                                                                          │
│   185 │   return indexed_dataset                                                                 │
│   186                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'CSVMemMapDataset' object has no attribute 'sizes'

Steps/Code to reproduce bug

I am using a script borrowed from https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/megatron_gpt_pretraining.py

# imports as in the v1.12 example script this is based on (trimmed to what is used here)
from omegaconf.omegaconf import OmegaConf, open_dict
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.timer import Timer
from pytorch_lightning.plugins.environments import TorchElasticEnvironment
from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector

from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.collections.nlp.parts.nlp_overrides import (
    GradScaler,
    MegatronHalfPrecisionPlugin,
    NLPDDPStrategy,
    PipelineMixedPrecisionPlugin,
)
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import StatelessTimer, exp_manager


@hydra_runner(config_path="conf", config_name="megatron_gpt_config.yaml")
def main(cfg) -> None:
    logging.info("\n\n************** Experiment configuration ***********")
    logging.info(f'\n{OmegaConf.to_yaml(cfg)}')

    cfg.model.data.data_prefix=[0.5,"/home/mayanksharma/nemo_data/small_train", 0.5, "/home/mayanksharma/nemo_data/small_valid", 0.5, "/home/mayanksharma/nemo_data/small_test"]
    cfg.model.data.seq_length=128
    cfg.model.data.data_impl="csv_mmap"
    cfg.trainer.devices=1

    cfg.model.sequence_parallel=True
    cfg.model.global_batch_size=16

    megatron_amp_o2 = cfg.model.get('megatron_amp_O2', False)
    with_distributed_adam = cfg.model.optim.get('name') == 'distributed_fused_adam'

    plugins = []
    strategy = NLPDDPStrategy(
        no_ddp_communication_hook=True,  # we don't use DDP for async grad allreduce
        gradient_as_bucket_view=cfg.model.gradient_as_bucket_view,
        find_unused_parameters=False,
    )
    if cfg.trainer.precision in [16, 'bf16']:
        scaler = None
        if cfg.trainer.precision == 16:
            scaler = GradScaler(
                init_scale=cfg.model.get('native_amp_init_scale', 2 ** 32),
                growth_interval=cfg.model.get('native_amp_growth_interval', 1000),
                hysteresis=cfg.model.get('hysteresis', 2),
            )
        if megatron_amp_o2 and not with_distributed_adam:
            plugins.append(MegatronHalfPrecisionPlugin(precision=cfg.trainer.precision, device='cuda', scaler=scaler))
        else:
            plugins.append(PipelineMixedPrecisionPlugin(precision=cfg.trainer.precision, device='cuda', scaler=scaler))

    if cfg.get('cluster_type', None) == 'BCP':
        plugins.append(TorchElasticEnvironment())

    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer)

    exp_manager(trainer, cfg.exp_manager)

    # update resume from checkpoint found by exp_manager
    if cfg.model.resume_from_checkpoint is not None:
        resume_from_checkpoint = cfg.model.resume_from_checkpoint
    else:
        resume_from_checkpoint = trainer._checkpoint_connector.resume_from_checkpoint_fit_path

    logging.info(f'Resuming training from checkpoint: {resume_from_checkpoint}')

    trainer._checkpoint_connector = CheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
    # Override timer callback to a stateless one
    for idx, callback in enumerate(trainer.callbacks):
        if isinstance(callback, Timer):
            trainer.callbacks[idx] = StatelessTimer(cfg.trainer.max_time,)

    # hydra interpolation does not work here as the interpolation key is lost when PTL saves hparams
    with open_dict(cfg):
        cfg.model.precision = cfg.trainer.precision

    model = MegatronGPTModel(cfg.model, trainer)

    trainer.fit(model)

if __name__ == '__main__':
    main()

Config file that I am using

I have borrowed the config from https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml

name: megatron_gpt_20B
restore_from_path: "nemo_model/nemo-megatron-gpt-20B/nemo_gpt20B_bf16_tp4.nemo" # used when starting from a .nemo file

trainer:
  devices: 2
  num_nodes: 1
  accelerator: gpu
  precision: 16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  replace_sampler_ddp: False
  max_epochs: -1 # PTL default. In practice, max_steps will be reached first. 
  max_steps: 100000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10
  val_check_interval: 100
  limit_val_batches: 50
  limit_test_batches: 500
  accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models
  gradient_clip_val: 1.0
  benchmark: False

exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: megatron_gpt
  create_wandb_logger: False
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits
    filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}

model:
  # specify micro_batch_size, global_batch_size, and model parallelism
  # gradient accumulation will be done automatically based on data_parallel_size
  micro_batch_size: 4 # limited by GPU memory
  global_batch_size: 8 # will use more micro batches to reach global batch size
  tensor_model_parallel_size: 1 # intra-layer model parallelism
  pipeline_model_parallel_size: 1 # inter-layer model parallelism
  resume_from_checkpoint: null # manually set the checkpoint file to load from

  # model architecture
  encoder_seq_length: 512
  max_position_embeddings: 1024 #${.encoder_seq_length} mayank changed
  num_layers: 12
  hidden_size: 768
  ffn_hidden_size: 3072 # Transformer FFN hidden size. Usually 4 * hidden_size.
  num_attention_heads: 12
  init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.')
  use_scaled_init_method: True # use scaled residuals initialization
  hidden_dropout: 0.1 # Dropout probability for hidden state transformer.
  kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null
  apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.
  normalization: layernorm # Type of normalization layers
  layernorm_epsilon: 1e-5
  do_layer_norm_weight_decay: False # True means weight decay on all params
  make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency.
  pre_process: True # add embedding
  post_process: True # add pooler
  persist_layer_norm: True # Use of persistent fused layer norm kernel.
  bert_binary_head: 2 #used for classification

  tokenizer:
    library: 'huggingface'
    type: 'gpt2'
    model: null
    vocab_file: null
    merge_file: null 
    delimiter: null # only used for tabular tokenizer
    sentencepiece_legacy: True # Legacy=True allows you to add special tokens to sentencepiece tokenizers.

  # precision
  native_amp_init_scale: 4294967296 # 2 ** 32
  native_amp_growth_interval: 1000
  hysteresis: 2 # Gradient scale hysteresis
  fp32_residual_connection: False # Move residual connections to fp32
  fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16

  # Megatron O2-style half-precision
  megatron_amp_O2: False # Enable O2-level automatic mixed precision using main parameters
  grad_allreduce_chunk_size_mb: 125
  grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce

  # miscellaneous
  seed: 1234
  use_cpu_initialization: False # Init weights on the CPU (slow for large models)
  onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
  apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this
  gradient_as_bucket_view: True # PyTorch DDP argument. Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory)
  gradient_accumulation_fusion: False # Fuse weight gradient accumulation to GEMMs. Only used with pipeline parallelism.

  ## Activation Checkpointing
  # NeMo Megatron supports 'selective' activation checkpointing where only the memory intensive part of attention is checkpointed.
  # These memory intensive activations are also less compute intensive which makes activation checkpointing more efficient for LLMs (20B+).
  # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
  # 'full' will checkpoint the entire transformer layer.
  activations_checkpoint_granularity: null # 'selective' or 'full' 
  activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
  # 'uniform' divides the total number of transformer layers and checkpoints the input activation
  # of each chunk at the specified granularity
  # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
  activations_checkpoint_num_layers: null # not used with 'selective'
  # when using 'uniform' this creates groups of transformer layers to checkpoint. Usually set to 1. Increase to save more memory.
  # when using 'block' this this will checkpoint the first activations_checkpoint_num_layers per pipeline stage.

  ## Sequence Parallelism
  # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
  # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
  sequence_parallel: True

  data:
    # Path to data must be specified by the user.
    # can override from the CLI: "model.data.data_prefix=[.5,/raid/data/pile/my-gpt3_00_text_document,.5,/raid/data/pile/my-gpt3_01_text_document]",
    # Or see example below: 
    # data_prefix: 
    #   - .5
    #   - /raid/data/pile/my-gpt3_00_text_document
    #   - .5
    #   - /raid/data/pile/my-gpt3_01_text_document
    data_prefix: ???
    index_mapping_dir: null # path to save index mapping .npy files, by default will save in the same location as data_prefix
    data_impl: csv_map
    splits_string: 900,50,50
    seq_length: ${model.encoder_seq_length}
    skip_warmup: True
    num_workers: 2
    dataloader_type: single # cyclic
    reset_position_ids: False # Reset position ids after end-of-document token
    reset_attention_mask: False # Reset attention mask after end-of-document token
    eod_mask_loss: False # Mask loss for the end of document tokens
    # masked_lm_prob: 0.15
    # short_seq_prob: 0.1

  # Nsys profiling options
  nsys_profile:
    enabled: False
    start_step: 10  # Global batch to start profiling
    end_step: 10 # Global batch to end profiling
    ranks: [0] # Global rank IDs to profile
    gen_shape: False # Generate model and kernel details including input shapes

  optim:
    name: fused_adam
    lr: 2e-4
    weight_decay: 0.01 
    betas: 
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 500
      constant_steps: 50000
      min_lr: 2e-5

Input train, valid, and test files I am using

The small_train, small_valid, and small_test files look like:

0,|<startoftext>|sentence[LABEL]:Target|<endoftext>|
1,|<startoftext>|sentence[LABEL]:Target|<endoftext>|
2,|<startoftext>|sentence[LABEL]:Target|<endoftext>|

Expected behaviour

The train, valid, and test files for the GPT model should be processed correctly; a similar, properly formatted file was processed correctly for Megatron-BERT. For the Megatron-BERT model (a binary classification problem), the input train file looked something like this:

0,sentence,label1
1,sentence,label2
2,sentence,label1

Additional context

Please correct me if my input is formatted incorrectly for a GPT model. I could not find any suitable example for fine-tuning/pretraining the NeMo-Megatron 20B model, nor any sample input files showing how they should be formatted so that GPT processes them correctly.

MaximumEntropy commented 1 year ago

Can you try binarizing your data (i.e., formatting it as a JSONL file and running it through this script)? https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py

File format:

{"text": "Example 1"}
{"text": "Example 2"}
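
A minimal conversion sketch (assuming the index,text layout from the samples above; file names are placeholders):

import json

def csv_to_jsonl(csv_path: str, jsonl_path: str) -> None:
    """Rewrite index,text rows as JSONL lines with a "text" field."""
    with open(csv_path) as src, open(jsonl_path, "w") as dst:
        for line in src:
            # split only on the first comma: running index, then example text
            _, text = line.rstrip("\n").split(",", 1)
            dst.write(json.dumps({"text": text}) + "\n")

csv_to_jsonl("small_train", "small_train.jsonl")

The resulting JSONL file is what preprocess_data_for_megatron.py takes as --input to produce the binary .bin/.idx files.
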
mayank-nference commented 1 year ago

Thanks for the reply @MaximumEntropy. I will convert the data to the format you shared and will ping here in case it still doesn't work.

MaximumEntropy commented 1 year ago

Also, I noticed that you are trying to use csv_map. I don't think we support that for GPT-based models yet; please use data_impl: mmap.
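
Applied to the reproduction script above, the change would look something like this (a sketch; the exact _text_document suffix depends on the --output-prefix and JSON key given to preprocess_data_for_megatron.py, so treat the path as a placeholder):

# use the binary indexed dataset format instead of csv_mmap/csv_map
cfg.model.data.data_impl = "mmap"
# point at the preprocessed output prefix, i.e. the path of the
# generated .bin/.idx pair without its extension (placeholder path)
cfg.model.data.data_prefix = [1.0, "/home/mayanksharma/nemo_data/small_train_text_document"]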

mayank-nference commented 1 year ago

@MaximumEntropy I need some small help understanding the following fields in megatron_gpt_config.yaml:

  1. model.data.data_prefix: what is the correct way to pass this parameter? Currently I am passing the value as [1.0,"/home/mayanksharma/nemo_data/train_binary_files/_text_document",2,"/home/mayanksharma/nemo_data/valid_binary_files/_text_document",3, "/home/mayanksharma/nemo_data/test_binary_files/_text_document"]
  2. model.data.splits_string: what is splits_string? Currently I am using only the default values, 900,50,50; can I change them? My current reading of these values is sketched below; please correct me if it is wrong.
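
How I currently read the splits_string values (an assumption from skimming gpt_dataset.py, not verified):

# splits_string appears to give relative weights that are normalized over
# the corpus to carve out train/validation/test document ranges
splits = [900, 50, 50]
total = sum(splits)                        # 1000
fractions = [s / total for s in splits]    # [0.90, 0.05, 0.05]
# i.e. roughly 90% of documents for train, 5% for validation, 5% for test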

I also need to understand how to load my fine-tuned GPT 20B model for inference, as I want to validate the trained model on a golden dataset.

I would really appreciate it if you could help me with the above queries.

mayank-nference commented 1 year ago

@MaximumEntropy I want to fine-tune the pre-trained NeMo-Megatron-GPT 20B model on prompt classification, and I got the following error. The script above was for pre-training, and it worked successfully after the changes you mentioned. For the downstream prompt-classification task I am now referring to the following adapter tuning script - https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/tuning/megatron_gpt_adapter_tuning.py - and config file - https://github.com/NVIDIA/NeMo/blob/v1.12.0/examples/nlp/language_modeling/tuning/conf/megatron_gpt_adapter_tuning_config.yaml - and I am fine-tuning this model: https://huggingface.co/nvidia/nemo-megatron-gpt-20B/tree/main

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/mayanksharma/nemo_gpt_fintuning.py:109 in <module>                                         │
│                                                                                                  │
│   106                                                                                            │
│   107                                                                                            │
│   108 if __name__ == '__main__':                                                                 │
│ ❱ 109 │   main()                                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py:104 in      │
│ wrapper                                                                                          │
│                                                                                                  │
│   101 │   │   │   │                                                                              │
│   102 │   │   │   │   # no return value from run_hydra() as it may sometime actually run the t   │
│   103 │   │   │   │   # multiple times (--multirun)                                              │
│ ❱ 104 │   │   │   │   _run_hydra(                                                                │
│   105 │   │   │   │   │   args_parser=_argparse_wrapper(args),                                   │
│   106 │   │   │   │   │   task_function=task_function,                                           │
│   107 │   │   │   │   │   config_path=config_path,                                               │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:377 in _run_hydra   │
│                                                                                                  │
│   374 │   │   if num_commands == 0:                                                              │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│ ❱ 377 │   │   │   run_and_report(                                                                │
│   378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:214 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│ ❱ 214 │   │   │   raise ex                                                                       │
│   215 │   │   else:                                                                              │
│   216 │   │   │   try:                                                                           │
│   217 │   │   │   │   if isinstance(ex, CompactHydraException):                                  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:211 in              │
│ run_and_report                                                                                   │
│                                                                                                  │
│   208                                                                                            │
│   209 def run_and_report(func: Any) -> Any:                                                      │
│   210 │   try:                                                                                   │
│ ❱ 211 │   │   return func()                                                                      │
│   212 │   except Exception as ex:                                                                │
│   213 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│   214 │   │   │   raise ex                                                                       │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/utils.py:378 in <lambda>     │
│                                                                                                  │
│   375 │   │   │   args.run = True                                                                │
│   376 │   │   if args.run:                                                                       │
│   377 │   │   │   run_and_report(                                                                │
│ ❱ 378 │   │   │   │   lambda: hydra.run(                                                         │
│   379 │   │   │   │   │   config_name=config_name,                                               │
│   380 │   │   │   │   │   task_function=task_function,                                           │
│   381 │   │   │   │   │   overrides=args.overrides,                                              │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/_internal/hydra.py:111 in run          │
│                                                                                                  │
│   108 │   │   callbacks.on_run_end(config=cfg, config_name=config_name, job_return=ret)          │
│   109 │   │                                                                                      │
│   110 │   │   # access the result to trigger an exception in case the job failed.                │
│ ❱ 111 │   │   _ = ret.return_value                                                               │
│   112 │   │                                                                                      │
│   113 │   │   return ret                                                                         │
│   114                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:233 in return_value      │
│                                                                                                  │
│   230 │   │   │   sys.stderr.write(                                                              │
│   231 │   │   │   │   f"Error executing job with overrides: {self.overrides}" + os.linesep       │
│   232 │   │   │   )                                                                              │
│ ❱ 233 │   │   │   raise self._return_value                                                       │
│   234 │                                                                                          │
│   235 │   @return_value.setter                                                                   │
│   236 │   def return_value(self, value: Any) -> None:                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/hydra/core/utils.py:160 in run_job           │
│                                                                                                  │
│   157 │   │   with env_override(hydra_cfg.hydra.job.env_set):                                    │
│   158 │   │   │   callbacks.on_job_start(config=config)                                          │
│   159 │   │   │   try:                                                                           │
│ ❱ 160 │   │   │   │   ret.return_value = task_function(task_cfg)                                 │
│   161 │   │   │   │   ret.status = JobStatus.COMPLETED                                           │
│   162 │   │   │   except Exception as e:                                                         │
│   163 │   │   │   │   ret.return_value = e                                                       │
│                                                                                                  │
│ /home/mayanksharma/nemo_gpt_fintuning.py:103 in main                                             │
│                                                                                                  │
│   100 │   │   │   cfg.model.restore_path, cfg.model, trainer=trainer, save_restore_connector=N   │
│   101 │   │   )                                                                                  │
│   102 │   else:                                                                                  │
│ ❱ 103 │   │   model = MegatronGPTAdapterLearningModel(cfg.model, trainer=trainer)                │
│   104 │                                                                                          │
│   105 │   trainer.fit(model)                                                                     │
│   106                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modelin │
│ g/megatron_gpt_adapter_model.py:94 in __init__                                                   │
│                                                                                                  │
│    91 │   │   for _, module in self.frozen_model.named_modules():                                │
│    92 │   │   │   if isinstance(module, adapter_mixins.AdapterModuleMixin):                      │
│    93 │   │   │   │   for adapter_key in self.adapter_name_keys:                                 │
│ ❱  94 │   │   │   │   │   module.add_adapter(                                                    │
│    95 │   │   │   │   │   │   name=adapter_key, cfg=adapter_cfg,                                 │
│    96 │   │   │   │   │   )                                                                      │
│    97                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/nemo/core/classes/mixins/adapter_mixins.py:1 │
│ 64 in add_adapter                                                                                │
│                                                                                                  │
│   161 │   │   """                                                                                │
│   162 │   │   # Convert to DictConfig from dict or Dataclass                                     │
│   163 │   │   if is_dataclass(cfg):                                                              │
│ ❱ 164 │   │   │   cfg = OmegaConf.structured(cfg)                                                │
│   165 │   │                                                                                      │
│   166 │   │   if not isinstance(cfg, DictConfig):                                                │
│   167 │   │   │   cfg = DictConfig(cfg)                                                          │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:123 in structured     │
│                                                                                                  │
│    120 │   │   parent: Optional[BaseContainer] = None,                                           │
│    121 │   │   flags: Optional[Dict[str, bool]] = None,                                          │
│    122 │   ) -> Any:                                                                             │
│ ❱  123 │   │   return OmegaConf.create(obj, parent, flags)                                       │
│    124 │                                                                                         │
│    125 │   @staticmethod                                                                         │
│    126 │   @overload                                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:176 in create         │
│                                                                                                  │
│    173 │   │   parent: Optional[BaseContainer] = None,                                           │
│    174 │   │   flags: Optional[Dict[str, bool]] = None,                                          │
│    175 │   ) -> Union[DictConfig, ListConfig]:                                                   │
│ ❱  176 │   │   return OmegaConf._create_impl(                                                    │
│    177 │   │   │   obj=obj,                                                                      │
│    178 │   │   │   parent=parent,                                                                │
│    179 │   │   │   flags=flags,                                                                  │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:885 in _create_impl   │
│                                                                                                  │
│    882 │   │   │   │   │   │   │   f"Object of unsupported type: '{type(obj).__name__}'"         │
│    883 │   │   │   │   │   │   )                                                                 │
│    884 │   │   except OmegaConfBaseException as e:                                               │
│ ❱  885 │   │   │   format_and_raise(node=None, key=None, value=None, msg=str(e), cause=e)        │
│    886 │   │   │   assert False                                                                  │
│    887 │                                                                                         │
│    888 │   @staticmethod                                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:820 in format_and_raise  │
│                                                                                                  │
│    817 │   │   if type_override is not None:                                                     │
│    818 │   │   │   ex = type_override(str(cause))                                                │
│    819 │   │   │   ex.__dict__ = copy.deepcopy(cause.__dict__)                                   │
│ ❱  820 │   │   _raise(ex, cause)                                                                 │
│    821 │                                                                                         │
│    822 │   object_type: Optional[Type[Any]]                                                      │
│    823 │   object_type_str: Optional[str] = None                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:846 in _create_impl   │
│                                                                                                  │
│    843 │   │   │   │   │   else:                                                                 │
│    844 │   │   │   │   │   │   obj_type = OmegaConf.get_type(obj)                                │
│    845 │   │   │   │   │   │   key_type, element_type = get_dict_key_value_types(obj_type)       │
│ ❱  846 │   │   │   │   │   │   return DictConfig(                                                │
│    847 │   │   │   │   │   │   │   content=obj,                                                  │
│    848 │   │   │   │   │   │   │   parent=parent,                                                │
│    849 │   │   │   │   │   │   │   key_type=key_type,                                            │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:111 in __init__      │
│                                                                                                  │
│   108 │   │   │   │   │   self.__dict__["_metadata"] = metadata                                  │
│   109 │   │   │   │   self._set_value(content, flags=flags)                                      │
│   110 │   │   except Exception as ex:                                                            │
│ ❱ 111 │   │   │   format_and_raise(node=None, key=key, value=None, cause=ex, msg=str(ex))        │
│   112 │                                                                                          │
│   113 │   def __deepcopy__(self, memo: Dict[int, Any]) -> "DictConfig":                          │
│   114 │   │   res = DictConfig(None)                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:820 in format_and_raise  │
│                                                                                                  │
│    817 │   │   if type_override is not None:                                                     │
│    818 │   │   │   ex = type_override(str(cause))                                                │
│    819 │   │   │   ex.__dict__ = copy.deepcopy(cause.__dict__)                                   │
│ ❱  820 │   │   _raise(ex, cause)                                                                 │
│    821 │                                                                                         │
│    822 │   object_type: Optional[Type[Any]]                                                      │
│    823 │   object_type_str: Optional[str] = None                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:94 in __init__       │
│                                                                                                  │
│    91 │   │   │   │   raise KeyValidationError(f"Unsupported key type {key_type}")               │
│    92 │   │   │                                                                                  │
│    93 │   │   │   if is_structured_config(content) or is_structured_config(ref_type):            │
│ ❱  94 │   │   │   │   self._set_value(content, flags=flags)                                      │
│    95 │   │   │   │   if is_structured_config_frozen(content) or is_structured_config_frozen(    │
│    96 │   │   │   │   │   ref_type                                                               │
│    97 │   │   │   │   ):                                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:647 in _set_value    │
│                                                                                                  │
│   644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│ ❱ 647 │   │   │   raise e                                                                        │
│   648 │                                                                                          │
│   649 │   def _set_value_impl(                                                                   │
│   650 │   │   self, value: Any, flags: Optional[Dict[str, bool]] = None                          │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:644 in _set_value    │
│                                                                                                  │
│   641 │   def _set_value(self, value: Any, flags: Optional[Dict[str, bool]] = None) -> None:     │
│   642 │   │   try:                                                                               │
│   643 │   │   │   previous_content = self.__dict__["_content"]                                   │
│ ❱ 644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│   647 │   │   │   raise e                                                                        │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:674 in               │
│ _set_value_impl                                                                                  │
│                                                                                                  │
│   671 │   │   │   if is_structured_config(value):                                                │
│   672 │   │   │   │   self._metadata.object_type = None                                          │
│   673 │   │   │   │   ao = self._get_flag("allow_objects")                                       │
│ ❱ 674 │   │   │   │   data = get_structured_config_data(value, allow_objects=ao)                 │
│   675 │   │   │   │   with flag_override(self, ["struct", "readonly"], False):                   │
│   676 │   │   │   │   │   for k, v in data.items():                                              │
│   677 │   │   │   │   │   │   self.__setitem__(k, v)                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:494 in                   │
│ get_structured_config_data                                                                       │
│                                                                                                  │
│    491 │   obj: Any, allow_objects: Optional[bool] = None                                        │
│    492 ) -> Dict[str, Any]:                                                                      │
│    493 │   if is_dataclass(obj):                                                                 │
│ ❱  494 │   │   return get_dataclass_data(obj, allow_objects=allow_objects)                       │
│    495 │   elif is_attr_class(obj):                                                              │
│    496 │   │   return get_attr_data(obj, allow_objects=allow_objects)                            │
│    497 │   else:                                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:422 in                   │
│ get_dataclass_data                                                                               │
│                                                                                                  │
│    419 │   │   │   │   parent=dummy_parent,                                                      │
│    420 │   │   │   )                                                                             │
│    421 │   │   except (ValidationError, GrammarParseError) as ex:                                │
│ ❱  422 │   │   │   format_and_raise(                                                             │
│    423 │   │   │   │   node=dummy_parent, key=name, value=value, cause=ex, msg=str(ex)           │
│    424 │   │   │   )                                                                             │
│    425 │   │   d[name]._set_parent(None)                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:820 in format_and_raise  │
│                                                                                                  │
│    817 │   │   if type_override is not None:                                                     │
│    818 │   │   │   ex = type_override(str(cause))                                                │
│    819 │   │   │   ex.__dict__ = copy.deepcopy(cause.__dict__)                                   │
│ ❱  820 │   │   _raise(ex, cause)                                                                 │
│    821 │                                                                                         │
│    822 │   object_type: Optional[Type[Any]]                                                      │
│    823 │   object_type_str: Optional[str] = None                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:414 in                   │
│ get_dataclass_data                                                                               │
│                                                                                                  │
│    411 │   │   │   )                                                                             │
│    412 │   │   │   format_and_raise(node=None, key=None, value=value, cause=e, msg=str(e))       │
│    413 │   │   try:                                                                              │
│ ❱  414 │   │   │   d[name] = _maybe_wrap(                                                        │
│    415 │   │   │   │   ref_type=type_,                                                           │
│    416 │   │   │   │   is_optional=is_optional,                                                  │
│    417 │   │   │   │   key=name,                                                                 │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:1090 in _maybe_wrap   │
│                                                                                                  │
│   1087 │   │   value._set_parent(parent)                                                         │
│   1088 │   │   return value                                                                      │
│   1089 │   else:                                                                                 │
│ ❱ 1090 │   │   return _node_wrap(                                                                │
│   1091 │   │   │   ref_type=ref_type,                                                            │
│   1092 │   │   │   parent=parent,                                                                │
│   1093 │   │   │   is_optional=is_optional,                                                      │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/omegaconf.py:989 in _node_wrap     │
│                                                                                                  │
│    986 │   node: Node                                                                            │
│    987 │   if is_dict_annotation(ref_type) or (is_primitive_dict(value) and ref_type is Any):    │
│    988 │   │   key_type, element_type = get_dict_key_value_types(ref_type)                       │
│ ❱  989 │   │   node = DictConfig(                                                                │
│    990 │   │   │   content=value,                                                                │
│    991 │   │   │   key=key,                                                                      │
│    992 │   │   │   parent=parent,                                                                │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:111 in __init__      │
│                                                                                                  │
│   108 │   │   │   │   │   self.__dict__["_metadata"] = metadata                                  │
│   109 │   │   │   │   self._set_value(content, flags=flags)                                      │
│   110 │   │   except Exception as ex:                                                            │
│ ❱ 111 │   │   │   format_and_raise(node=None, key=key, value=None, cause=ex, msg=str(ex))        │
│   112 │                                                                                          │
│   113 │   def __deepcopy__(self, memo: Dict[int, Any]) -> "DictConfig":                          │
│   114 │   │   res = DictConfig(None)                                                             │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:900 in format_and_raise  │
│                                                                                                  │
│    897 │   │   ex.ref_type = ref_type                                                            │
│    898 │   │   ex.ref_type_str = ref_type_str                                                    │
│    899 │                                                                                         │
│ ❱  900 │   _raise(ex, cause)                                                                     │
│    901                                                                                           │
│    902                                                                                           │
│    903 def type_str(t: Any, include_module_name: bool = False) -> str:                           │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/_utils.py:798 in _raise            │
│                                                                                                  │
│    795 │   │   ex.__cause__ = cause                                                              │
│    796 │   else:                                                                                 │
│    797 │   │   ex.__cause__ = None                                                               │
│ ❱  798 │   raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace   │
│    799                                                                                           │
│    800                                                                                           │
│    801 def format_and_raise(                                                                     │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:94 in __init__       │
│                                                                                                  │
│    91 │   │   │   │   raise KeyValidationError(f"Unsupported key type {key_type}")               │
│    92 │   │   │                                                                                  │
│    93 │   │   │   if is_structured_config(content) or is_structured_config(ref_type):            │
│ ❱  94 │   │   │   │   self._set_value(content, flags=flags)                                      │
│    95 │   │   │   │   if is_structured_config_frozen(content) or is_structured_config_frozen(    │
│    96 │   │   │   │   │   ref_type                                                               │
│    97 │   │   │   │   ):                                                                         │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:647 in _set_value    │
│                                                                                                  │
│   644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│ ❱ 647 │   │   │   raise e                                                                        │
│   648 │                                                                                          │
│   649 │   def _set_value_impl(                                                                   │
│   650 │   │   self, value: Any, flags: Optional[Dict[str, bool]] = None                          │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:644 in _set_value    │
│                                                                                                  │
│   641 │   def _set_value(self, value: Any, flags: Optional[Dict[str, bool]] = None) -> None:     │
│   642 │   │   try:                                                                               │
│   643 │   │   │   previous_content = self.__dict__["_content"]                                   │
│ ❱ 644 │   │   │   self._set_value_impl(value, flags)                                             │
│   645 │   │   except Exception as e:                                                             │
│   646 │   │   │   self.__dict__["_content"] = previous_content                                   │
│   647 │   │   │   raise e                                                                        │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:658 in               │
│ _set_value_impl                                                                                  │
│                                                                                                  │
│   655 │   │   │   flags = {}                                                                     │
│   656 │   │                                                                                      │
│   657 │   │   assert not isinstance(value, ValueNode)                                            │
│ ❱ 658 │   │   self._validate_set(key=None, value=value)                                          │
│   659 │   │                                                                                      │
│   660 │   │   if _is_none(value, resolve=True):                                                  │
│   661 │   │   │   self.__dict__["_content"] = None                                               │
│                                                                                                  │
│ /opt/conda/envs/llm_env/lib/python3.8/site-packages/omegaconf/dictconfig.py:200 in _validate_set │
│                                                                                                  │
│   197 │   │   if is_container_annotation(target_type) and not is_container_annotation(           │
│   198 │   │   │   value_type                                                                     │
│   199 │   │   ):                                                                                 │
│ ❱ 200 │   │   │   raise ValidationError(                                                         │
│   201 │   │   │   │   f"Cannot assign {type_str(value_type)} to {type_str(target_type)}"         │
│   202 │   │   │   )                                                                              │
│   203                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: Cannot assign ResidualAddAdapterStrategyConfig to Dict[Any, Any]
    full_key: adapter_strategy
    object_type=None
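For anyone landing on the same trace: this is the generic omegaconf complaint raised when a dataclass field annotated as a plain `Dict` is handed a structured-config (dataclass) value. Below is a minimal, NeMo-independent sketch of the same failure mode; all class names are hypothetical stand-ins, not the actual NeMo adapter classes, and the behavior is as observed with omegaconf 2.x:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

from omegaconf import OmegaConf
from omegaconf.errors import ValidationError


@dataclass
class StrategyConfig:
    """Hypothetical stand-in for ResidualAddAdapterStrategyConfig."""
    scale: float = 1.0


@dataclass
class AdapterConfig:
    # The annotation promises a plain dict, but the default is a dataclass
    # instance, and omegaconf refuses to coerce one into the other.
    adapter_strategy: Dict[Any, Any] = field(default_factory=StrategyConfig)


try:
    OmegaConf.structured(AdapterConfig)
except ValidationError as e:
    print(e)  # Cannot assign StrategyConfig to Dict[Any, Any]
```

If this sketch matches your stack's behavior, the mismatch lives in the Python-side config dataclasses and their annotations, not in the YAML values themselves.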

I am using the following fine-tuning config:

name: adapter_tuning_${model.new_tasks[0]}_max_epochs${trainer.max_epochs}_adapter_dim${model.adapter_tuning.adapter_dim}

trainer:
  devices: 4
  accelerator: gpu
  num_nodes: 1
  precision: 16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  replace_sampler_ddp: False
  max_epochs: 10
  max_steps: -1 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10
  val_check_interval: 0.2
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  benchmark: False

exp_manager:
  explicit_log_dir: null
  exp_dir: "nemo_experiment"
  name: ${name}
  create_wandb_logger: null
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 1
    mode: min
    save_nemo_on_train_end: True # Should be False; the correct prompt learning model file is saved at model.nemo_path set below
    filename: 'megatron_gpt_adapter_tuning--{val_loss:.3f}-{step}'
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: True

model:
  seed: 1234
  nemo_path: ${exp_manager.exp_dir}/${name}.nemo # .nemo filename/absolute path to where the virtual prompt model parameters will be saved
  virtual_prompt_style: 'no-prompts' # adapter tuning requires no virtual prompts
  encoder_seq_length: 2048 
  gradient_as_bucket_view: false
  tensor_model_parallel_size: 1 # intra-layer model parallelism
  pipeline_model_parallel_size: 4 # inter-layer model parallelism
  global_batch_size: 4
  micro_batch_size: 1

  restore_path: null # Path to an existing adapter .nemo model you wish to add new tasks to or run inference with
  language_model_path: "/home/mayanksharma/nemo_model/nemo-megatron-gpt-20B/nemo_gpt20B_bf16_tp4.nemo" # Path to the GPT language model .nemo file, always required
  existing_tasks: [] # List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given
  new_tasks: ["drug_to_target_cls"] # List of new tasknames to be prompt-tuned

  task_templates: # Add more/replace tasks as needed, these are just examples

  - taskname: "drug_to_target_cls" # Drug-to-target classification
    prompt_template: "<|VIRTUAL_PROMPT_0|> Sentence: {sentence} \nLabel: {label}" 
    total_virtual_tokens: 2048
    virtual_token_splits: []
    truncate_field: null
    answer_only_loss: True
    answer_field: "label"

#   - taskname: "boolq" # The task name
#     prompt_template: "Passage: {passage} \nQuestion: {question} \nAnswer: {answer}" # Prompt template for task, specify virtual prompt positions with <|VIRTUAL_PROMPT_#|>
#     total_virtual_tokens: 0 # Sum of tokens in virtual_token_splits must add to this number. Can differ between new and existing tasks, but must match across all new tasks being tuned at the same time.
#     virtual_token_splits: [] # number of virtual tokens to be inserted at each VIRTUAL PROMPT location, must add to total_virtual_tokens
#     truncate_field: "passage" # The {field} in the prompt template whose text will be truncated if the input is too long, if null, inputs that are too long will just be skipped.
#     answer_only_loss: True 
#     answer_field: "answer"

#   - taskname: "intent_and_slot" # Intent Detection and Slot Filling
#     prompt_template: "intent options: {intent_options} slot options: {slot_options} {utterance} \nintent: {intent} \nslot: {slot}"
#     total_virtual_tokens: 0 
#     answer_only_loss: False 
#     virtual_token_splits: []
#     truncate_field: null

#   - taskname: "rte" # Recognizing Textual Entailment
#     prompt_template: "sentence1: {premise} sentence2: {hypothesis} Answer: {answer}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "answer"

#   - taskname: "squad" # Standford Question-Answering
#     prompt_template: "context: {context} question: {question} answer: {answer}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "answer"

#   - taskname: "arc-challenge" # Abstraction and Reasoning Challenge
#     prompt_template: "question: {question} choices: {choices} answer: {answer}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "answer"

#   - taskname: "xsum" # Extreme Summarization
#     prompt_template: "{source} Summary: {target}" 
#     total_virtual_tokens: 0
#     virtual_token_splits: []
#     truncate_field: null
#     answer_only_loss: True
#     answer_field: "target"

  adapter_tuning:
    type: 'parallel_adapter' # this should be either 'parallel_adapter' or 'linear_adapter'
    adapter_dim: 50
    adapter_dropout: 0.1
    norm_position: 'pre' # This can be set to 'pre' or 'post', 'pre' is normally what is used.
    column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal
    row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal
    norm_type: 'mixedfusedlayernorm' # IGNORED if linear_adapter is used; options are ['layernorm', 'mixedfusedlayernorm']

  data:
    train_ds: "nemo_data/json_files/train.jsonl" # expects a list of paths to training data files
    validation_ds: "nemo_data/json_files/valid.jsonl"  # expects a path to validation data files
    add_eos: True
    shuffle: True
    num_workers: 24
    pin_memory: True

  optim:
    name: fused_adam
    lr: 1e-4
    weight_decay: 0.01 
    betas: 
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 50
      constant_steps: 0 # Constant steps should also be 0 when min_lr=0
      min_lr: 0.0 # min_lr must be 0.0 for prompt learning
      monitor: val_loss
      reduce_on_plateau: false

Please let me know if I am using the wrong config or pretraining script.

MaximumEntropy commented 1 year ago

Regarding your first two questions

  1. Regarding how to provide data_prefix: that is the correct format, but it is not meant to take train/val/test separately. It is meant to take a dataset (or several) and internally create the train/val/test splits from it. So if you specify something like `[0.3, "/path/to/data_1", 0.4, "/path/to/data_2", 0.3, "/path/to/data_3"]`, it will sample from these three datasets with ratios 0.3, 0.4, and 0.3 respectively.
  2. "What is split_string? Currently I am using the default values 900,50,50; can I change these values?" The split string specifies the train/val/test ratio, so you can use something like `80,10,10` for 80% train and 10% each for validation and test (see the YAML sketch after this list).
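A hedged YAML sketch of how the two options could look together in the pretraining config (key names as in the stock megatron_gpt_config.yaml; the paths, weights, and ratios below are placeholders):

```yaml
model:
  data:
    # Weighted blend: sample roughly 30% / 40% / 30% from three datasets.
    data_prefix:
      - 0.3
      - /path/to/data_1_text_document
      - 0.4
      - /path/to/data_2_text_document
      - 0.3
      - /path/to/data_3_text_document
    # Train/validation/test ratio carved out of the blended data.
    splits_string: "80,10,10"
```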

@arendu do you have any idea about the adapter error?

mayank-nference commented 1 year ago

Thanks for the clear explanation, @MaximumEntropy.

mayank-nference commented 1 year ago

@MaximumEntropy @arendu, it's been 5 days; can someone please help me with the adapter issue?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.