Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Pretraining example from readme fails in Colab #1402

Open AndisDraguns opened 3 weeks ago

AndisDraguns commented 3 weeks ago

Running the pretraining example from the README fails in Google Colab.

!pip install 'litgpt[all]'

!mkdir -p custom_texts
!curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
!curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
!litgpt download \
  --repo_id EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
!litgpt pretrain \
  --model_name pythia-160m \
  --tokenizer_dir checkpoints/EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

# 3) Chat with the model
!litgpt chat \
  --checkpoint_dir out/custom-model/final

@awaelchli

rasbt commented 3 weeks ago

Hi there, I just ran this code on an A10G in a Lightning Studio, and it worked fine. Could you share the error message you got, or more details on how or why it failed? One hypothesis is that the memory in Colab may not be sufficient, depending on the GPU. However, based on my test, it should only require 8.62 GB on a GPU that supports bfloat16.
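
For reference, a quick sanity check (a minimal sketch, assuming a PyTorch CUDA runtime as Colab provides) to see how much memory the runtime's GPU has and whether it supports bfloat16:

import torch

# Minimal sketch: report the Colab GPU's total memory and bfloat16 support.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, total memory: {total_gb:.2f} GB")
    print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA GPU detected in this runtime.")

If the reported memory is below roughly 9 GB, or bfloat16 is unsupported, that could explain the failure.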

awaelchli commented 3 weeks ago

@rasbt We created the issue together at ICLR. I will look into it. This is in Colab, not Studios.

awaelchli commented 3 weeks ago

There is a cache path resolution issue in LitData; it needs to be fixed there. Thanks for the repro. https://github.com/Lightning-AI/litdata/issues/126