Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin'] #367

Open rasbt opened 1 week ago

rasbt commented 1 week ago

🐛 Bug

When using LitData on non-Studio machines, I am getting a RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin'] error.

To Reproduce

This error occurs when running the following example from the LitGPT README:

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

# 3) Test the model
litgpt chat out/custom-model/final

I made a simpler example to reproduce the issue with a standalone code snippet.

1) Download sample data

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

2) Run the following code

import glob
import random
from pathlib import Path
from litdata import optimize

def tokenize(filename: str):
    with open(filename, "r", encoding="utf-8") as file:
        text = file.read()
    text = text.strip().split(" ")
    word_to_int = {word: random.randint(1, 1000) for word in set(text)}
    tokenized = [word_to_int[word] for word in text]

    yield tokenized

train_files = sorted(glob.glob(str(Path("custom_texts") / "*.txt")))

if __name__ == "__main__":
    optimize(
        fn=tokenize,
        inputs=train_files,
        output_dir="temp",
        num_workers=1,
        chunk_bytes="50MB",
    )

This results in the following output on a Studio machine:

Setting multiprocessing start_method to fork. Tip: Libraries relying on lock can hang with `fork`. To use `spawn` in notebooks, move your code to files and import it within the notebook.
Storing the files under /teamspace/studios/this_studio/temp
Setup started with fast_dev_run=False.
Worker 0 gets 1.2 MB (2 files)
Setup finished in 0.002 seconds. Found 2 items to process.
Starting 1 workers with 2 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['int', 'int', 'int',...
int', 'int', 'int', 'int', 'int', 'int', 'int', 'int']` data format.
Worker 0 is terminating.
Worker 0 is done.
Workers are finished.
Finished data processing!
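
As a quick sanity check after a successful run, the local output dir can be listed; it should contain the optimized chunk file(s) plus an index.json. A minimal sketch, assuming the local `temp` output dir from the snippet above:

import os

# Inspect what optimize() wrote to the output directory; a successful run
# should leave one or more chunk-*.bin files and an index.json here.
out_dir = "temp"
for name in sorted(os.listdir(out_dir)):
    print(name, os.path.getsize(os.path.join(out_dir, name)), "bytes")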

However, on a non-Studio machine, I am getting:

...
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.47it/s]
Workers are finished.
Traceback (most recent call last):
  File "/home/sebastian/miniforge3/envs/litgp2/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/sebastian/test-litgpt/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 154, in setup
    main(
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 214, in main
    train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 409, in get_dataloaders
    data.prepare_data()
  File "/home/sebastian/test-litgpt/litgpt/litgpt/data/text_files.py", line 72, in prepare_data
    optimize(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/functions.py", line 375, in optimize
    data_processor.run(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 1016, in run
    result = data_recipe._done(len(user_items), self.delete_cached_files, self.output_dir)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 736, in _done
    raise RuntimeError(f"All the chunks should have been deleted. Found {chunks}")
RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin']
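
Judging from the traceback, the failing check is essentially "no chunk-*.bin files should remain in the local worker cache once uploading and removal are done." A simplified sketch of that kind of check (an illustration only, not litdata's actual implementation):

import os

def assert_cache_empty(cache_dir: str) -> None:
    # Once every chunk has been copied to output_dir and removed from the
    # local cache, no chunk-*.bin files should be left behind.
    chunks = [f for f in os.listdir(cache_dir) if f.startswith("chunk-") and f.endswith(".bin")]
    if chunks:
        raise RuntimeError(f"All the chunks should have been deleted. Found {chunks}")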

Environment

github-actions[bot] commented 1 week ago

Hi! Thanks for your contribution, great first issue!

bhimrazy commented 1 week ago

Thank you, @rasbt, for bringing up this issue.

Interestingly, I was able to reproduce a similar problem to the one reported by ByteBrigand (in a GitHub Codespace), but I couldn't reproduce the deleted-chunks issue itself. I'll test on other devices to see if I can replicate it.

(screenshot attached)

rasbt commented 1 week ago

Thanks for sharing. May I ask which version you are using? Is this with the latest main branch or the latest stable release?

bhimrazy commented 1 week ago

Sure, @rasbt. I tested with both the latest main branch and v0.2.26; both lead to the same error.

rasbt commented 1 week ago

Thanks, and that's so weird.

deependujha commented 1 week ago

@rasbt, consider moving the `optimize` call into the main block:

if __name__ == "__main__":
  optimize(
      fn=tokenize,
      inputs=train_files,
      output_dir="temp",
      num_workers=1,
      chunk_bytes="50MB",
  )

And, for me, it worked perfectly fine.
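
For context, a minimal, self-contained illustration (plain Python multiprocessing, nothing litdata-specific) of why the guard matters: with the `spawn` start method, the default on macOS, each worker process re-imports the script, so any top-level call such as `optimize(...)` would run again in every worker.

import multiprocessing as mp

def double(x):
    return x * 2

if __name__ == "__main__":
    # With "spawn" (default on macOS/Windows), every worker re-imports this file.
    # Top-level calls would re-execute in each worker; code under this guard runs
    # only in the parent process.
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        print(pool.map(double, range(4)))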

But, we are aware of the issue:

RuntimeError: All the chunks should have been deleted. Found ['chunk-x-x.bin'] error

AFAIK, this is a non-deterministic bug; it has occurred in multiple CI tests. Usually, simply re-running the code does the trick, but we will permanently fix this weird bug very soon.

Probably, there is some issue in the way the uploader queue passes chunk files to the remover queue.
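
To make that hypothesis concrete, here is a toy sketch of that kind of handoff (hypothetical names, not litdata's actual internals): the uploader copies each finished chunk to the output location and then forwards its path to the remover, which deletes it from the local cache. If a path is ever dropped between the two queues, a leftover chunk-*.bin remains for the final check to trip over.

import os
import shutil
from queue import Queue

def uploader(upload_q: Queue, remove_q: Queue, output_dir: str) -> None:
    # Copy each finished chunk to the output dir, then hand it to the remover.
    while True:
        chunk_path = upload_q.get()
        if chunk_path is None:  # sentinel: no more chunks
            remove_q.put(None)
            break
        shutil.copy(chunk_path, output_dir)
        remove_q.put(chunk_path)  # if this handoff is lost, the chunk is never deleted

def remover(remove_q: Queue) -> None:
    # Delete each uploaded chunk from the local cache.
    while True:
        chunk_path = remove_q.get()
        if chunk_path is None:
            break
        os.remove(chunk_path)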

Also, if you're on a Mac, consider upgrading litdata to the latest version.

rasbt commented 6 days ago

Thanks for the response, @deependujha. After updating the code as you suggested, the example now works on my MacBook. On the Linux machine, however, I am still hitting the same bug, even on the latest LitData version (I also tried the latest main branch).

> AFAIK, this is a non-deterministic bug; it has occurred in multiple CI tests. Usually, simply re-running the code does the trick, but we will permanently fix this weird bug very soon.

Actually, on that Linux machine, it happens every single time. So weird.

srikhetramohanty commented 1 day ago

Hi, any resolution to this on Linux systems?

rasbt commented 1 day ago

Unfortunately, no. It seems that the issue still persists on some Linux machines. Maybe the best solution for now is to use an older litdata version (assuming this error doesn't exist in older versions) on those machines.

deependujha commented 1 day ago

> Unfortunately, no. It seems that the issue still persists on some Linux machines. Maybe the best solution for now is to use an older litdata version (assuming this error doesn't exist in older versions) on those machines.

Sorry for the delay, I've been tied up with other things. I'll start working on this immediately.

rasbt commented 1 day ago

No worries, @deependujha, I totally understand that there are other priorities and commitments at the moment, so please don't get yourself into trouble over this. But if you do have some time to look into it, that'd be super appreciated. Also, let us know if you have any ideas we could test out; since our machines seem to reproduce the error deterministically, we could help test potential solutions.