allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

IndexError in OLMo-7B pre-training dataset #538

Open Bread0288 opened 5 months ago

Bread0288 commented 5 months ago

❓ The question

Hello, while using the code below to check which sequences appear at a specific batch index of the OLMo-7B pre-training dataset, I hit IndexError: 925801835 is out of bounds for dataset of size 925201012, so I would like to ask about it.

1. Preparation

2. Executing the code

# Imports needed by this snippet (TrainConfig and build_memmap_dataset come from the olmo package):
import os
import numpy as np
import torch
from cached_path import cached_path
from transformers import AutoTokenizer
from olmo.config import TrainConfig
from olmo.data import build_memmap_dataset

# FILE_PATH is assumed to be defined in step 1 (e.g. the path of this script).
data_order_file_path = cached_path("https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy")
train_config_path = os.path.join(os.path.dirname(FILE_PATH), "OLMo_config/OLMo-7B.yaml")
cfg = TrainConfig.load(train_config_path)
batch_size = cfg.global_train_batch_size
global_indices = np.memmap(data_order_file_path, mode="r", dtype=np.uint32)  # "r": read-only; "r+" would open the index file for writing
dataset = build_memmap_dataset(cfg, cfg.data)

def get_batch_instances(batch_idx: int) -> list[list[int]]:
    batch_start = batch_idx * batch_size
    batch_end = (batch_idx + 1) * batch_size
    batch_indices = global_indices[batch_start:batch_end]
    batch_instances = []
    for index in batch_indices:
        token_ids = dataset[index]["input_ids"].tolist()
        batch_instances.append(token_ids)
    return batch_instances

def main():
    steps = [1]
    results = [False for i in range(len(steps))]

    tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
    for i, step in enumerate(steps):
        batch = torch.tensor(get_batch_instances(batch_idx=step))
        # batch holds one full training batch: 2048 instances of token ids
        batch_in_text = tokenizer.batch_decode(batch, skip_special_tokens=True)
        for sequence in batch_in_text:
            if "apple" in sequence.lower():
                results[i] = True
                break  # found a match; no need to scan the rest of the batch
    print(results)

if __name__=="__main__":
    main()

3. Detailed Error Message

> Traceback (most recent call last):
>   File "test.py", line 96, in <module>
>     main()
>   File "test.py", line 83, in main
>     batch = torch.tensor(get_batch_instances(batch_idx=step))
>   File "test.py", line 60, in get_batch_instances
>     token_ids = dataset[index]["input_ids"].tolist()
>   File "site-packages/olmo/data/memmap_dataset.py", line 176, in __getitem__
>     raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}")
> IndexError: 925801835 is out of bounds for dataset of size 925201012
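
A quick way to confirm the mismatch locally is to compare the largest index in the data-order file against the dataset length. This is a minimal sketch; it assumes the `global_indices` memmap and `dataset` objects from the snippet above are already loaded:

```python
import numpy as np

def indices_in_bounds(global_indices: np.ndarray, dataset_len: int) -> bool:
    """True if every index in the training-order file refers to a valid instance."""
    return int(global_indices.max()) < dataset_len

# With the objects from the snippet above this would be called as:
#   indices_in_bounds(global_indices, len(dataset))
# Given the error message it should return False: the order file references
# index 925801835, but the locally built dataset has only 925201012 instances,
# i.e. the order file was generated against a larger dataset than the one the
# local config assembles.
```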

Is the OLMo-7B pre-training corpus saved at these URLs wrong? Or is there a problem with the dataset saved at this URL, or did something go wrong on my side when downloading it?
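
One way to rule out a truncated download: each memmap data file holds a fixed number of instances, namely its size in bytes divided by (sequence length × bytes per token). The sketch below assumes a sequence length of 2048 and uint16 token storage (the OLMo vocabulary fits in 16 bits); both values are assumptions and should be checked against your config before trusting the numbers:

```python
import os

def instances_in_file(path: str, seq_len: int = 2048, token_bytes: int = 2) -> int:
    """Estimate how many fixed-length instances a token memmap file contains.

    seq_len and token_bytes are assumptions: verify them against the model's
    max_sequence_length and the dtype the data files were written with.
    """
    return os.path.getsize(path) // (seq_len * token_bytes)

# Summing this over every path in cfg.data.paths and comparing the total to
# 925201012 would show whether the local copy is incomplete, or whether the
# config simply lists fewer files than the data-order file was built from.
```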

4. Additional Question

txy77 commented 1 month ago

I wonder whether this issue has been resolved.

txy77 commented 1 month ago

Is the path you provided, https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy, correct?