Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Continual pretraining on custom data is not working: TextFiles is not recognized as a valid data argument. #1339

Closed by karkeranikitha 3 weeks ago

karkeranikitha commented 3 weeks ago

Hi,

When I try to run the litgpt pretrain command for continual pretraining, I get the error below. For custom data training, the data parameter should be TextFiles and data.train_data_path should point to a folder containing the text files, as described in the README.

command:

litgpt pretrain \
  --model_name Llama-2-7b-hf \
  --initial_checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf \
  --data TextFiles \
  --data.train_data_path custom_texts \
  --out_dir out/custom-model

Error:

usage: litgpt [options] pretrain [-h] [-c CONFIG] [--print_config[=flags]]
              [--model_name MODEL_NAME] [--model_config MODEL_CONFIG]
              [--out_dir OUT_DIR] [--initial_checkpoint_dir INITIAL_CHECKPOINT_DIR]
              [--resume RESUME] [--data.help CLASS_PATH_OR_NAME] [--data DATA]
              [--train CONFIG] [--train.save_interval SAVE_INTERVAL]
              [--train.log_interval LOG_INTERVAL]
              [--train.global_batch_size GLOBAL_BATCH_SIZE]
              [--train.micro_batch_size MICRO_BATCH_SIZE]
              [--train.lr_warmup_steps LR_WARMUP_STEPS] [--train.epochs EPOCHS]
              [--train.max_tokens MAX_TOKENS] [--train.max_steps MAX_STEPS]
              [--train.max_seq_length MAX_SEQ_LENGTH]
              [--train.tie_embeddings {true,false,null}]
              [--train.learning_rate LEARNING_RATE]
              [--train.weight_decay WEIGHT_DECAY] [--train.beta1 BETA1]
              [--train.beta2 BETA2] [--train.max_norm MAX_NORM]
              [--train.min_lr MIN_LR] [--eval CONFIG] [--eval.interval INTERVAL]
              [--eval.max_new_tokens MAX_NEW_TOKENS] [--eval.max_iters MAX_ITERS]
              [--devices DEVICES] [--tokenizer_dir TOKENIZER_DIR]
              [--logger_name {wandb,tensorboard,csv}] [--seed SEED]
error: Parser key "data": Does not validate against any of the Union subtypes
  Subtypes: (<class 'litgpt.data.base.DataModule'>, <class 'NoneType'>)
  Errors:

Reference:
https://github.com/Lightning-AI/litgpt?tab=readme-ov-file#continue-pretraining-an-llm
https://lightning.ai/lightning-ai/studios/litgpt-continue-pretraining?tab=files&layout=column&path=cloudspaces%2F01hvpn545vfd8615mxjf3zsbgh&y=4&x=0

Can someone please help with this issue?

Thanks in advance

rasbt commented 3 weeks ago

Hi there,

The only issue I am seeing in your command is that you are missing the --tokenizer_dir argument. You can set it to

--tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf

assuming that your checkpoints/meta-llama/Llama-2-7b-hf folder has a tokenizer (it should be downloaded by default).
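
For reference, here is the complete command assembled from yours with the tokenizer directory added (the paths assume the default download location for the checkpoint):

litgpt pretrain \
  --model_name Llama-2-7b-hf \
  --initial_checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf \
  --tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf \
  --data TextFiles \
  --data.train_data_path custom_texts \
  --out_dir out/custom-model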

However, a missing --tokenizer_dir produces a separate error, not the one you are seeing:

[Screenshot (2024-04-23, 6:10 PM): the error shown when --tokenizer_dir is missing]

Otherwise, the code you have there looks fine. I just tested it and it works without a problem:

[Screenshot (2024-04-23, 6:07 PM): the same command running successfully]

My best guess is that you have an older version of LitGPT installed that doesn't support TextFiles yet, so the argument parser cannot resolve --data TextFiles to a DataModule subclass. I recommend installing litgpt directly from GitHub:

pip install -U git+https://github.com/Lightning-AI/litgpt.git
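
To double-check which version you ended up with and whether it ships the TextFiles data module, a quick sanity check (assuming TextFiles is exported from litgpt.data, which the litgpt.data.base.DataModule path in your error message suggests) is:

pip show litgpt
python -c "from litgpt.data import TextFiles; print('TextFiles is available')"

If the import fails, the installed version predates TextFiles support and the upgrade above should fix it.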
karkeranikitha commented 3 weeks ago

@rasbt Thanks a lot! It's working now. Installing directly from GitHub resolved the issue.

rasbt commented 3 weeks ago

Awesome, that's great to hear!