AI-Hypercomputer / maxtext

A simple, performant and scalable Jax LLM!
Apache License 2.0

How to load tfrecords from local file system for Mlperf training? #844

Closed gramesh-amd closed 4 weeks ago

gramesh-amd commented 4 weeks ago

Hello,

I downloaded the C4 dataset to my local file system with something like this:

mkdir -p /storage/c4_jax/c4/en/3.0.1/
gsutil -u 'gcp_project_name' -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' /storage/c4_jax/c4/en/3.0.1/

I am now wondering how I use these local files to test MLPerf training. (I would also need to use the last 256 of the 1024 files in c4/en/3.0.1/train for training.)
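For the "last 256 of 1024 files" part, a minimal sketch of selecting those shards by filename. It assumes TFDS's usual shard naming, `c4-train.tfrecord-XXXXX-of-01024` with a zero-padded shard index; the helper and example names below are hypothetical:

```python
import re

# Assumed TFDS shard naming: "c4-train.tfrecord-XXXXX-of-01024",
# where XXXXX is a zero-padded shard index from 00000 to 01023.
SHARD_RE = re.compile(r"c4-train\.tfrecord-(\d{5})-of-01024$")

def is_last_256(filename: str) -> bool:
    """Return True for shards 00768..01023 (the last 256 of 1024)."""
    m = SHARD_RE.search(filename)
    return bool(m) and int(m.group(1)) >= 768

# Hypothetical filenames following the assumed pattern:
names = [
    "c4-train.tfrecord-00000-of-01024",
    "c4-train.tfrecord-00767-of-01024",
    "c4-train.tfrecord-00768-of-01024",
    "c4-train.tfrecord-01023-of-01024",
]
kept = [n for n in names if is_last_256(n)]
```

The same predicate could be used to copy or symlink only the wanted shards into a separate data directory before training.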

I tried:

dataset_type: "tfds"
dataset_path: "/storage/c4_jax" # also tried the full path instead
dataset_name: "c4/en:3.0.1" # tried some variation of this
split: "train"

but this runs into:

File "/pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorflow_datasets/core/dataset_builder.py", line 380, in _pick_version
AssertionError: Failed to construct dataset "c4", builder_kwargs "{'config': 'en', 'version': '3.0.1'}": Dataset c4 cannot be loaded at version 3.0.1, only: 3.1.0, 2.3.1, 2.3.0, 2.2.1, 2.2.0.
gramesh-amd commented 4 weeks ago

cc: @rwitten @aireenmei

aireenmei commented 4 weeks ago

The config looks correct to me. That error usually means the program was not able to read the files. It can be due to permissions, the path, or the env var "TFDS_DATA_DIR" (https://github.com/google/maxtext/blob/main/MaxText/train.py#L643)
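The three causes listed above (permissions, path, env var) can be checked up front before launching training. A minimal sketch, where `preflight_check` is a hypothetical helper and the dataset subpath matches the layout from this thread:

```python
import os

def preflight_check(data_dir: str, dataset_subpath: str = "c4/en/3.0.1") -> list[str]:
    """Hypothetical pre-flight check for the three failure causes above:
    missing path, missing read permission, and an unset TFDS_DATA_DIR."""
    problems = []
    full = os.path.join(data_dir, dataset_subpath)
    if not os.path.isdir(full):
        problems.append(f"path not found: {full}")
    elif not os.access(full, os.R_OK | os.X_OK):
        problems.append(f"no read permission on: {full}")
    if os.environ.get("TFDS_DATA_DIR") != data_dir:
        problems.append("TFDS_DATA_DIR is not set to the dataset root")
    return problems
```

An empty result means all three checks passed; otherwise the returned messages point at what to fix before retrying.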

gramesh-amd commented 4 weeks ago

Thanks, it does work after I also export TFDS_DATA_DIR.
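For anyone hitting the same error, a minimal sketch of the fix, using the paths from this thread (the `key=value` override form of the config keys is an assumption; the same keys can stay in the yml config instead):

```shell
# Point TFDS at the local dataset root, i.e. the directory that
# contains c4/en/3.0.1, before launching training.
export TFDS_DATA_DIR=/storage/c4_jax

# Then launch with the tfds config shown earlier, e.g.:
# python3 MaxText/train.py MaxText/configs/base.yml \
#   dataset_type=tfds dataset_path=/storage/c4_jax \
#   dataset_name=c4/en:3.0.1 split=train
```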