axolotl-ai-cloud / axolotl
https://axolotl-ai-cloud.github.io/axolotl/

Debug switch doesn't work with pretrained datasets #1526

Open · its5Q opened this issue 6 months ago

its5Q commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

I expect to be able to debug pretraining dataset preprocessing.

Current behaviour

Axolotl tries to call .select() on an IterableDataset object, which does not support it:

[2024-04-16 14:43:19,625] [INFO] [axolotl.scripts.load_datasets:402] [PID:5393] [RANK:0] check_dataset_labels...
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/its5q/axolotl/src/axolotl/cli/preprocess.py", line 70, in <module>
    fire.Fire(do_cli)
  File "/home/its5q/gpt/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/its5q/gpt/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/its5q/gpt/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/its5q/axolotl/src/axolotl/cli/preprocess.py", line 60, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/home/its5q/axolotl/src/axolotl/cli/__init__.py", line 404, in load_datasets
    train_dataset.select(
AttributeError: 'IterableDataset' object has no attribute 'select'
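
For reference, the failure is reproducible with the Hugging Face datasets library alone (a minimal sketch; c4/en mirrors the config below, and newer datasets versions may want the allenai/c4 path):

from datasets import load_dataset

# Streaming loads return an IterableDataset, which supports only sequential
# access; .select() (random access by index) exists only on map-style Dataset.
ds = load_dataset("c4", "en", split="train", streaming=True)
ds.select(range(5))  # AttributeError: 'IterableDataset' object has no attribute 'select'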

Steps to reproduce

Run python -m axolotl.cli.preprocess <config.yaml> --debug with any config that has a pretraining_dataset entry.

Config yaml

base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

max_steps: 200
pretraining_dataset:
  path: c4
  name: en
  type: pretrain
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./model-out

sequence_len: 2048
sample_packing: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch:
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main/132eb74


NanoCode012 commented 6 months ago

Does this mean that, if you don't use --debug, it would not error?

Also, I believe the purpose of pretraining_dataset was to perform streaming (which is the opposite of preprocessing in advance).
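
To illustrate the streaming point (a sketch with the datasets library, not axolotl code): a streamed dataset is lazy, with no length and no random access, so there is nothing to preprocess ahead of time.

from datasets import load_dataset

stream = load_dataset("c4", "en", split="train", streaming=True)
# Rows are fetched over the network on demand; len(stream) raises TypeError
# because an IterableDataset has no predetermined size to preprocess.
print(next(iter(stream))["text"][:80])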

its5Q commented 6 months ago

> Does this mean that, if you don't use --debug, it would not error?
>
> Also, I believe the purpose of pretraining_dataset was to perform streaming (which is the opposite of preprocessing in advance).

Yeah, it wouldn't, at least in that specific function, because the debugging code tries to select samples from an iterable dataset, which is unsupported. I couldn't get it to work either way. I also ran into some other training issues that I couldn't resolve due to missing or outdated documentation/examples, so I switched to using the HF Trainer instead.
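
For anyone else hitting this, a minimal sketch of the kind of guard the debug path would need (a hypothetical helper, not the actual axolotl code; assumes the datasets library):

from datasets import Dataset, IterableDataset

def debug_sample(train_dataset, num_examples=5):
    # IterableDataset has no .select(); .take() lazily yields the first
    # num_examples rows, which are then materialized into a map-style Dataset.
    if isinstance(train_dataset, IterableDataset):
        return Dataset.from_list(list(train_dataset.take(num_examples)))
    return train_dataset.select(range(num_examples))

check_dataset_labels could then run on the materialized sample in both the map-style and streaming cases.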