axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Preprocess with debug flag fails. #1544

Closed · amitagh closed this issue 6 months ago

amitagh commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Preprocess with debug flag should work. python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug

Current behaviour

It gives an error. I have a JSON file where each example has a "text" field ({"text": }). I am doing pretraining with LoRA for a non-English language.

[2024-04-19 09:05:02,918] [DEBUG] [axolotl.log:61] [PID:2346] [RANK:0] max_input_len: 600
Dropping Long Sequences (num_proc=2): 100% 17/17 [00:00<00:00, 99.19 examples/s]
Add position_id column (Sample Packing) (num_proc=2): 100% 17/17 [00:00<00:00, 70.88 examples/s]
[2024-04-19 09:05:03,502] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:2346] [RANK:0] Saving merged prepared dataset to disk... /content/d538aae6e42c7df428d20d3ff2685ad0
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 70, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 60, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/content/src/axolotl/src/axolotl/cli/__init__.py", line 397, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 424, in load_tokenized_prepared_datasets
    dataset.save_to_disk(prepared_ds_path)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
    fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 363, in url_to_fs
    chain = _un_chain(url, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 316, in _un_chain
    if "::" in path
TypeError: argument of type 'PosixPath' is not iterable
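The root cause is visible in the last frame of the traceback: fsspec runs a `"::" in path` membership test, and the `in` operator fails on a `pathlib.PosixPath` because a `Path` is neither a container nor iterable. A minimal, stdlib-only sketch of the failing operation (the path below is hypothetical):

```python
from pathlib import Path

# Hypothetical prepared-dataset path; any Path object triggers the same behavior.
prepared_ds_path = Path("/content/prepared_dataset")

try:
    # fsspec's _un_chain performs exactly this membership test on its argument.
    "::" in prepared_ds_path
except TypeError as exc:
    print(exc)  # argument of type 'PosixPath' is not iterable
```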

Steps to reproduce

Use a JSON file where each example has a "text" field ({"text": }).

Preprocess with the debug flag: python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug. This produces the error above.
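For reference, a minimal sketch of a dataset file matching the layout described above. The file name and sample texts are hypothetical; this assumes a JSON-lines layout (one object per line with a "text" key), which matches the config's `type: completion` with `field: text`:

```python
import json

# Hypothetical examples; the completion format only needs a "text" field per record.
examples = [
    {"text": "First pretraining document."},
    {"text": "Second pretraining document."},
]

# Write one JSON object per line (JSON-lines).
with open("test_txt_data.json", "w", encoding="utf-8") as fh:
    for example in examples:
        fh.write(json.dumps(example, ensure_ascii=False) + "\n")
```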

Config yaml

base_model: google/gemma-7b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ./test_txt_data.json
    type: completion
    field: text
dataset_prepared_path: data/last_run_prepared
dataset_processes: 16
val_set_size: 0
output_dir: ./lora-out

adapter: lora
lora_model_dir:

gpu_memory_limit: 76

sequence_len: 1100
sample_packing: true
pad_to_sequence_len: true

lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules: 
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_target_linear: true
lora_fan_in_fan_out:

save_safetensors: True

gradient_accumulation_steps: 2
micro_batch_size: 10
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
warmup_steps: 20
save_steps: 5000

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 500
xformers_attention:
flash_attention: True

evals_per_epoch: 1
eval_table_size:
eval_max_new_tokens: 128
eval_sample_packing: False
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

Latest

Acknowledgements

Napuh commented 6 months ago

+1

Napuh commented 6 months ago

Update:

downgrading datasets to 2.15.0 seems to work for me.

jorge-tromero commented 6 months ago

+1

FrankRuis commented 6 months ago

See my change in #1548: wrap prepared_ds_path with str() in the dataset.save_to_disk(prepared_ds_path) call in src/axolotl/utils/data/sft.py; then you don't need to downgrade any packages.
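A stdlib-only sketch of why the str() wrapper avoids the crash (the path below is hypothetical): the membership test that raises TypeError on a PosixPath works fine on its string form, returning False for an ordinary local path.

```python
from pathlib import Path

# Hypothetical prepared-dataset path, as built from dataset_prepared_path.
prepared_ds_path = Path("data/last_run_prepared/abc123")

# Converting the Path to a plain string before passing it along means fsspec's
# '"::" in path' check operates on a str, which supports the 'in' operator.
as_str = str(prepared_ds_path)
print("::" in as_str)  # False, so no chained-filesystem handling is triggered
```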

monk1337 commented 6 months ago

2.15.0

It worked for me as well!

qiyuangong commented 6 months ago

Update:

downgrading datasets to 2.15.0 seems to work for me.

Works for me. :)