Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Error in "_merge_no_wait": The config isn't consistent between chunks. This shouldn't have happened. #1117

Open eljanmahammadli opened 2 months ago

eljanmahammadli commented 2 months ago

Hello,

I am pretraining TinyLlama on a Lightning AI Studio on my custom dataset. I am using prepare_starcoder.py to convert the parquet files, since my data is a single folder of parquet files. After it writes the .bin files, it raises the error shown below, in the section I have commented out in the code that follows.

Error:

raise Exception("The config isn't consistent between chunks. This shouldn't have happened."

File location:

/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/data/streaming/writer.py

I could not resolve the problem, so I commented out the raise and trained the model anyway. However, I want to make sure this does not have any negative effect. I would appreciate it if you could look into the issue.

    def _merge_no_wait(self, node_rank: Optional[int] = None) -> None:
        """Once all the workers have written their own index, the merge function is responsible to read and merge them
        into a single index."""
        files = os.listdir(self._cache_dir)
        index_files = [f for f in files if f.endswith(_INDEX_FILENAME)]

        chunks_info = []
        config = None
        for index_filename in sorted(index_files):
            chunk_path = os.path.join(self._cache_dir, index_filename)
            with open(chunk_path) as f:
                data = json.load(f)

                if config is None:
                    config = data["config"]

                # elif config != data["config"]:
                #     print(config)
                #     print("\n\n\n")
                #     print(data["config"])
                #     breakpoint()
                #     raise Exception("The config isn't consistent between chunks. This shouldn't have happened.")

                chunks_info.extend(data["chunks"])

            os.remove(chunk_path)

        if node_rank is None:
            with open(os.path.join(self._cache_dir, _INDEX_FILENAME), "w") as f:
                json.dump({"chunks": chunks_info, "config": config}, f, sort_keys=True)
        else:
            with open(os.path.join(self._cache_dir, f"{node_rank}-{_INDEX_FILENAME}"), "w") as f:
                json.dump({"chunks": chunks_info, "config": config}, f, sort_keys=True)
carmocca commented 2 months ago

cc @tchaton or @awaelchli

awaelchli commented 2 months ago

@eljanmahammadli Are you using one of our Studio templates for this? Would you mind sharing your prepare_starcoder.py implementation?

tchaton commented 2 months ago

Hey @eljanmahammadli, did you use the LitData App to prepare your data? This happens when the type of your data isn't deterministic across workers.
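
To make that concrete, here is a minimal standalone illustration (not litdata internals, just a sketch of the failure mode): if a recipe returns different types for different items, workers end up inferring different serializers and therefore record different configs.

import numpy as np
import torch

def prepare_item(tokens):
    # Contrived: the same recipe returns a torch.Tensor for some items and a
    # numpy array for others, so the inferred serializer (and the recorded
    # "config") can differ from worker to worker.
    if len(tokens) % 2 == 0:
        return torch.tensor(tokens, dtype=torch.int32)
    return np.asarray(tokens, dtype=np.int64)

print(type(prepare_item([1, 2, 3, 4])))  # <class 'torch.Tensor'>
print(type(prepare_item([1, 2, 3])))     # <class 'numpy.ndarray'>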

eljanmahammadli commented 2 months ago

@awaelchli I am using the "Pretrain LLMs - TinyLlama 1.1B" template from the Studio. Below is the code with minimal changes: I have only changed the column name, and my data is just one .parquet file.

import os
import sys
import time
import traceback
from pathlib import Path

import pyarrow.parquet as pq
from lightning.data.streaming import DataChunkRecipe, DataProcessor

# support running without installing as a package
wd = Path(__file__).parent.parent.resolve()
sys.path.append(str(wd))

from lit_gpt import Tokenizer

class StarcoderDataRecipe(DataChunkRecipe):
    def __init__(self, tokenizer: Tokenizer, chunk_size: int):
        super().__init__(chunk_size)
        self.tokenizer = tokenizer

    def prepare_structure(self, input_dir):
        files = Path(input_dir).rglob("*.parquet")
        print(files)
        return [str(file) for file in files]

    def prepare_item(self, item_metadata):
        filepath = item_metadata
        start = time.time()

        try:
            parquet_file = pq.ParquetFile(filepath)
            # reduce RAM usage
            for batch in parquet_file.iter_batches(batch_size=8192, columns=["text"]):
                for text in batch.to_pandas()["text"]:
                    yield self.tokenizer.encode(text, bos=False, eos=True)

        except Exception:
            print(traceback.format_exc())
            print(f"Error reading {filepath}")
            return

        parquet_file.close()
        end = time.time()
        print(f"Took {end - start:.2f} seconds total", filepath)

def prepare(
    input_dir: Path = Path("data/starcoderdata"),
    output_dir: Path = Path("data/starcoder"),
    tokenizer_path: Path = Path("checkpoints/Llama-2-7b-hf/"),
    chunk_size: int = (2049 * 8192),
    fast_dev_run: bool = False,
) -> None:
    tokenizer = Tokenizer(tokenizer_path)
    data_recipe = StarcoderDataRecipe(tokenizer=tokenizer, chunk_size=chunk_size)
    data_processor = DataProcessor(
        input_dir=str(input_dir),
        output_dir=str(output_dir),
        fast_dev_run=fast_dev_run,
        num_workers=os.cpu_count(),
        num_downloaders=1,
    )

    start_time = time.time()
    data_processor.run(data_recipe)
    elapsed_time = time.time() - start_time
    print(f"Time taken: {elapsed_time:.2f} seconds")

if __name__ == "__main__":
    from jsonargparse import CLI

    CLI(prepare)
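
If the mismatch comes from samples being serialized with different dtypes, one possible mitigation (a sketch, not a confirmed fix, and assuming lit_gpt's Tokenizer.encode returns a torch.Tensor) is to cast every yielded sample to a single dtype inside prepare_item:

import torch

def cast_sample(sample: torch.Tensor, dtype: torch.dtype = torch.int64) -> torch.Tensor:
    """Hypothetical helper: force every yielded token tensor onto one dtype so
    that all workers should infer the same chunk config."""
    return sample.to(dtype)

# Inside prepare_item the yield would then become, for example:
#   yield cast_sample(self.tokenizer.encode(text, bos=False, eos=True))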

Besides, I am using my own tokenizer, which I trained with the code below following the Hugging Face tutorial.

def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["text"]

training_corpus = get_training_corpus()
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
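
As a small follow-up, the retrained tokenizer can be persisted with Hugging Face's save_pretrained so its files can later be passed via --tokenizer_path (the directory name below is only an example):

# Write tokenizer.json / tokenizer_config.json to a local directory that can
# be pointed to by --tokenizer_path in prepare_starcoder.py.
tokenizer.save_pretrained("checkpoints/simhash_dedup_tokenizer_1_5M")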
tchaton commented 2 months ago

@eljanmahammadli Do you think you could provide a fully reproducible script with synthetic data, so I can debug it?

eljanmahammadli commented 1 month ago

I am working in the "Pretrain LLMs - TinyLlama 1.1B" Studio on Lightning AI Studios.

First, download the custom tokenizer:

python lit-gpt/scripts/download.py \
   --repo_id eljanmahammadli/simhash_dedup_tokenizer_1_5M \
   --access_token HF_TOKEN_HERE \
   --tokenizer_only true

Then clone the data:

git clone https://huggingface.co/datasets/eljanmahammadli/sample_data data/sample-raw

You have to change the column name to "text" on the line below in lit-gpt/scripts/prepare_starcoder.py:

for batch in parquet_file.iter_batches(batch_size=8192, columns=["text"]):

Finally, use prepare_starcoder.py to convert the parquet files to .bin files:

python lit-gpt/scripts/prepare_starcoder.py \
  --input_dir data/sample-raw/data \
  --output_dir data/sample \
  --tokenizer_path checkpoints/simhash_dedup_tokenizer_1_5M

I want to know what effect this error has on training the model, since @tchaton pointed out that there is an inconsistency between data types.

tchaton commented 1 month ago

Hey @eljanmahammadli,

This error indicates that the StreamingDataset won't know which de-serializers to use during training, and it would fail at some point when it reaches the outlier samples.

The optimize script prints the inferred types when it starts processing. Did you see any anomalies?
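
One way to check what was inferred after the fact is to look at the config recorded in the merged index file (a sketch, assuming the merged index is named index.json inside the output_dir passed to prepare_starcoder.py):

import json

# Example path; point it at the output_dir used when preparing the data.
with open("data/sample/index.json") as f:
    index = json.load(f)

# "config" records the inferred data format that the StreamingDataset relies
# on to pick de-serializers during training.
print(json.dumps(index["config"], indent=2))
print(f"number of chunks: {len(index['chunks'])}")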

Do you think you could invite me (thomasgridai) to your Teamspace, so I can duplicate your Studio and try to figure out the source of the bug?

Best, T.C

eljanmahammadli commented 1 month ago

I don't see any option to specify the username for sharing. Could you please elaborate?

tchaton commented 1 month ago

Hey @eljanmahammadli

If you go to your Teamspace Settings and click on the Members tab, you can invite people to your Teamspace.

eljanmahammadli commented 1 month ago

Hey @tchaton. Regarding "would fail at some point when reaching the outlier samples": until you spot any bugs, am I good to go ahead and train the model as long as it does not fail?