Closed karkeranikitha closed 3 weeks ago
Hi there,
the only issue I am seeing in your code is that you are missing the `--tokenizer_dir` argument. You can set it to

```
--tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf
```

assuming that your `checkpoints/meta-llama/Llama-2-7b-hf` folder contains a tokenizer (it should be downloaded there by default).
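As a quick sanity check, you can verify that the folder actually contains a tokenizer before passing it. This is a minimal sketch; the exact file names are an assumption (Hugging Face-style checkpoints usually ship `tokenizer.json` or `tokenizer.model`):

```python
from pathlib import Path

# Assumed tokenizer file names; HF-style checkpoints usually ship one of these.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer.model")

def has_tokenizer(checkpoint_dir: str) -> bool:
    """Return True if the checkpoint folder contains a tokenizer file."""
    d = Path(checkpoint_dir)
    return any((d / name).is_file() for name in TOKENIZER_FILES)

if __name__ == "__main__":
    # Prints False if the folder is missing or holds no tokenizer file.
    print(has_tokenizer("checkpoints/meta-llama/Llama-2-7b-hf"))
```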
However, if you are missing `--tokenizer_dir`, that raises a separate error of its own, different from the one you are seeing.
Otherwise, the code you have there looks fine. I just tested it and it works without a problem:
My best explanation is that you perhaps have an older version of LitGPT installed that doesn't support `TextFiles` yet. I recommend installing LitGPT directly from GitHub:

```
pip install -U git+https://github.com/Lightning-AI/litgpt.git
```
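To check whether the version you have installed already ships the class, you can probe for it by import. A generic sketch, assuming `TextFiles` lives in `litgpt.data` (consistent with the `litgpt.data.base.DataModule` path shown in the error above):

```python
import importlib

def module_has(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(mod, attr)

if __name__ == "__main__":
    # False means your LitGPT is too old (or not installed at all).
    print(module_has("litgpt.data", "TextFiles"))
```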
@rasbt Thanks a lot! It's working now; installing directly from GitHub resolved the issue.
Awesome, that's great to hear!
Hi
When I try to run the `litgpt pretrain` command for continued pretraining, I get the error below. For training on custom data, the README says the `--data` parameter should be `TextFiles` and `--data.train_data_path` should be a folder containing all the text files.
Command:

```
litgpt pretrain --model_name Llama-2-7b-hf --initial_checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --data TextFiles --data.train_data_path custom_texts --out_dir out/custom-model
```
Error:

```
usage: litgpt [options] pretrain [-h] [-c CONFIG] [--print_config[=flags]] [--model_name MODEL_NAME] [--model_config MODEL_CONFIG] [--out_dir OUT_DIR] [--initial_checkpoint_dir INITIAL_CHECKPOINT_DIR] [--resume RESUME] [--data.help CLASS_PATH_OR_NAME] [--data DATA] [--train CONFIG] [--train.save_interval SAVE_INTERVAL] [--train.log_interval LOG_INTERVAL] [--train.global_batch_size GLOBAL_BATCH_SIZE] [--train.micro_batch_size MICRO_BATCH_SIZE] [--train.lr_warmup_steps LR_WARMUP_STEPS] [--train.epochs EPOCHS] [--train.max_tokens MAX_TOKENS] [--train.max_steps MAX_STEPS] [--train.max_seq_length MAX_SEQ_LENGTH] [--train.tie_embeddings {true,false,null}] [--train.learning_rate LEARNING_RATE] [--train.weight_decay WEIGHT_DECAY] [--train.beta1 BETA1] [--train.beta2 BETA2] [--train.max_norm MAX_NORM] [--train.min_lr MIN_LR] [--eval CONFIG] [--eval.interval INTERVAL] [--eval.max_new_tokens MAX_NEW_TOKENS] [--eval.max_iters MAX_ITERS] [--devices DEVICES] [--tokenizer_dir TOKENIZER_DIR] [--logger_name {wandb,tensorboard,csv}] [--seed SEED]
error: Parser key "data": Does not validate against any of the Union subtypes
  Subtypes: (<class 'litgpt.data.base.DataModule'>, <class 'NoneType'>)
  Errors:
```
References:
https://github.com/Lightning-AI/litgpt?tab=readme-ov-file#continue-pretraining-an-llm
https://lightning.ai/lightning-ai/studios/litgpt-continue-pretraining?tab=files&layout=column&path=cloudspaces%2F01hvpn545vfd8615mxjf3zsbgh&y=4&x=0
Can someone please help with this issue?
Thanks in advance!