Lightning-Universe / lightning-transformers

Flexible components pairing 🤗 Transformers with ⚡ PyTorch Lightning
https://lightning-transformers.readthedocs.io
Apache License 2.0

training language model on custom data - Missing key block_size #159

Closed enpassanty closed 3 years ago

enpassanty commented 3 years ago

I'm trying to train an MLM on custom data. The sequences in the CSV are long; when training with Hugging Face's run_mlm.py I truncate at 512 tokens. How do I access the max_length argument here? Why am I hitting a block_size key error, and is that key required for custom data?


! python train.py \
  task=nlp/language_modeling \
  dataset.cfg.train_file="/content/gdrive/MyDrive/nlp-chart/train charts.csv" \
  dataset.cfg.validation_file="/content/gdrive/MyDrive/nlp-chart/test charts.csv" \
  backbone.pretrained_model_name_or_path=roberta-base \
  training.batch_size=8
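
For context, the truncation I'm used to from run_mlm.py is roughly this (a minimal sketch on my side, assuming the same roberta-base tokenizer; the text is a placeholder, my real rows run to ~2,600 tokens):

from transformers import AutoTokenizer

# What I do today with run_mlm.py: hard-truncate every sequence at 512 tokens.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
encoding = tokenizer(
    "one very long chart note ...",  # placeholder text
    truncation=True,
    max_length=512,
)
print(len(encoding["input_ids"]))  # <= 512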

Full output (resolved config followed by the traceback):

2021-04-24 13:40:50.319301: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
dataset:
  _target_: lightning_transformers.task.nlp.language_modeling.LanguageModelingDataModule
  cfg:
    batch_size: ${training.batch_size}
    num_workers: ${training.num_workers}
    dataset_name: null
    dataset_config_name: null
    train_file: /content/gdrive/MyDrive/nlp-chart/train charts.csv
    validation_file: /content/gdrive/MyDrive/nlp-chart/test charts.csv
    test_file: null
    train_val_split: null
    max_samples: null
    cache_dir: null
    padding: max_length
    truncation: only_first
    preprocessing_num_workers: 1
    load_from_cache_file: true
    max_length: 128
    limit_train_samples: null
    limit_val_samples: null
    limit_test_samples: null
task:
  _recursive_: false
  _target_: lightning_transformers.task.nlp.language_modeling.LanguageModelingTransformer
  optimizer: ${optimizer}
  scheduler: ${scheduler}
  backbone: ${backbone}
  downstream_model_type: transformers.AutoModelForCausalLM
tokenizer:
  _target_: transformers.AutoTokenizer.from_pretrained
  pretrained_model_name_or_path: ${backbone.pretrained_model_name_or_path}
  use_fast: true
backbone:
  pretrained_model_name_or_path: roberta-base
optimizer:
  _target_: torch.optim.AdamW
  lr: ${training.lr}
  weight_decay: 0.001
scheduler:
  _target_: transformers.get_linear_schedule_with_warmup
  num_training_steps: -1
  num_warmup_steps: 0.1
training:
  run_test_after_fit: true
  lr: 5.0e-05
  output_dir: .
  batch_size: 8
  num_workers: 16
trainer:
  _target_: pytorch_lightning.Trainer
  logger: true
  checkpoint_callback: true
  callbacks: null
  default_root_dir: null
  gradient_clip_val: 0.0
  process_position: 0
  num_nodes: 1
  num_processes: 1
  gpus: null
  auto_select_gpus: false
  tpu_cores: null
  log_gpu_memory: null
  progress_bar_refresh_rate: 1
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: 1
  max_steps: null
  min_steps: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  val_check_interval: 1.0
  flush_logs_every_n_steps: 100
  log_every_n_steps: 50
  accelerator: null
  sync_batchnorm: false
  precision: 32
  weights_summary: top
  weights_save_path: null
  num_sanity_val_steps: 2
  truncated_bptt_steps: null
  resume_from_checkpoint: null
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_epoch: false
  auto_lr_find: false
  replace_sampler_ddp: true
  terminate_on_nan: false
  auto_scale_batch_size: false
  prepare_data_per_node: true
  plugins: null
  amp_backend: native
  amp_level: O2
  move_metrics_to_cpu: false
experiment_name: ${now:%Y-%m-%d}_${now:%H-%M-%S}
log: false
ignore_warnings: true

[2021-04-24 13:40:53,946][datasets.builder][WARNING] - Using custom data configuration default-a4347468916cb6de
[2021-04-24 13:40:53,948][datasets.builder][WARNING] - Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-a4347468916cb6de/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
  0% 0/10 [00:00<?, ?ba/s]Token indices sequence length is longer than the specified maximum sequence length for this model (2655 > 512). Running this sequence through the model will result in indexing errors
100% 10/10 [00:33<00:00,  3.34s/ba]
100% 1/1 [00:01<00:00,  1.78s/ba]
Traceback (most recent call last):
  File "train.py", line 88, in <module>
    hydra_entry()
  File "/usr/local/lib/python3.7/dist-packages/hydra/main.py", line 33, in decorated_main
    config_name=config_name,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 370, in _run_hydra
    lambda: hydra.run(
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 373, in <lambda>
    overrides=args.overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 98, in run
    configure_logging=with_log_configuration,
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 129, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 84, in hydra_entry
    main(cfg)
  File "train.py", line 78, in main
    logger=logger,
  File "train.py", line 53, in run
    data_module.setup("fit")
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/core/nlp/data.py", line 33, in setup
    dataset = self.process_data(dataset, stage=stage)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/task/nlp/language_modeling/data.py", line 54, in process_data
    convert_to_features = partial(self.convert_to_features, block_size=self.effective_block_size)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/task/nlp/language_modeling/data.py", line 67, in effective_block_size
    if self.cfg.block_size is None:
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 352, in __getattr__
    key=key, value=None, cause=e, type_override=ConfigAttributeError
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/base.py", line 195, in _format_and_raise
    type_override=type_override,
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py", line 701, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py", line 599, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 349, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 416, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 448, in _get_node
    raise ConfigKeyError(f"Missing key {key}")
omegaconf.errors.ConfigAttributeError: Missing key block_size
    full_key: block_size
    object_type=dict
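
For what it's worth, the failure looks like plain struct-mode OmegaConf behaviour when a key isn't part of the config schema. A minimal sketch (not the actual lightning-transformers config class) that reproduces the same message:

from omegaconf import OmegaConf
from omegaconf.errors import ConfigAttributeError

# Mimic dataset.cfg above, but without a block_size entry.
cfg = OmegaConf.create({"max_length": 128, "padding": "max_length"})
OmegaConf.set_struct(cfg, True)  # struct mode: unknown attributes raise instead of returning None

try:
    _ = cfg.block_size  # same access pattern as self.cfg.block_size in data.py
except ConfigAttributeError as err:
    print(err)  # "Missing key block_size"
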
SeanNaren commented 3 years ago

Could you try the branch in https://github.com/PyTorchLightning/lightning-transformers/pull/160?

With that branch you can set the block size from the command line, e.g. dataset.cfg.block_size=512.
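
Concretely, your original command would just need one extra override on that branch (untested on my side, so treat it as a sketch):

python train.py \
  task=nlp/language_modeling \
  dataset.cfg.train_file="/content/gdrive/MyDrive/nlp-chart/train charts.csv" \
  dataset.cfg.validation_file="/content/gdrive/MyDrive/nlp-chart/test charts.csv" \
  dataset.cfg.block_size=512 \
  backbone.pretrained_model_name_or_path=roberta-base \
  training.batch_size=8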

enpassanty commented 3 years ago

This solved the problem for me. Thanks!