Cannot load the paxml gpt3 tokenizer

gramesh-amd commented 2 weeks ago

Hello, I followed the instructions here to download the paxml gpt3 weights and tokenizer (the vocab folder) and tried using it as tokenizer_path, but it results in the following error:

0: Traceback (most recent call last):
0:   File "/home/goramesh/maxtext/MaxText/train.py", line 682, in <module>
0:     app.run(main)
0:   File "/home/goramesh/.local/lib/python3.10/site-packages/absl/app.py", line 308, in run
0:     _run_main(main, args)
0:   File "/home/goramesh/.local/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
0:     sys.exit(main(argv))
0:   File "/home/goramesh/maxtext/MaxText/train.py", line 678, in main
0:     train_loop(config)
0:   File "/home/goramesh/maxtext/MaxText/train.py", line 590, in train_loop
0:     example_batch = load_next_batch(data_iterator, example_batch, config)
0:   File "/home/goramesh/maxtext/MaxText/train.py", line 99, in load_next_batch
0:     return next(train_iter)
0:   File "/home/goramesh/maxtext/MaxText/multihost_dataloading.py", line 119, in __next__
0:     return get_next_batch_sharded(self.local_iterator, self.global_mesh)
0:   File "/home/goramesh/maxtext/MaxText/multihost_dataloading.py", line 78, in get_next_batch_sharded
0:     local_data = next(local_iterator)
0:   File "/pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4700, in __next__
0:     return nest.map_structure(to_numpy, next(self._iterator))
0:   File "/pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 814, in __next__
0:     return self._next_internal()
0:   File "/pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 777, in _next_internal
0:     ret = gen_dataset_ops.iterator_get_next(
0:   File "/pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3028, in iterator_get_next
0:     _ops.raise_from_not_ok_status(e, name)
0:   File "/pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6656, in raise_from_not_ok_status
0:     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
0: tensorflow.python.framework.errors_impl.InternalError: {{function_node __wrapped__IteratorGetNext_output_types_6_device_/job:localhost/replica:0/task:0/device:CPU:0}} external/com_google_sentencepiece/src/sentencepiece_processor.cc(842) [!IsUnknown(PieceToId(absl::string_view(model_->bos_piece().data())))] id for `<s>` is not defined.
0:       [[{{node SentenceTokenizer/SentenceTokenizer/SentencepieceTokenizeOp}}]] [Op:IteratorGetNext] name:

So is the S3 path the right path to the tokenizer?

ZhiyuLi-goog commented 2 weeks ago

Thank you @gramesh-amd. It might be a permission issue.

Have you tried downloading the tokenizer file to your own bucket and changing tokenizer_path accordingly? In parallel, we are looking into the permissions on the bucket gs://mlperf-llm-public2/.

gramesh-amd commented 2 weeks ago

Thanks @ZhiyuLi-goog

Yes, I downloaded the tokenizer from the S3 bucket that was provided here, and its name seems to match what I see in Google's MLPerf submission scripts. I use my local path instead of the S3 bucket for tokenizer_path. The logs show that it's able to load the tokenizer correctly.

But I get the above error (id for <s> is not defined), so I'm not sure it's the right tokenizer.
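For anyone hitting the same message: the failing check in sentencepiece_processor.cc is that the model's BOS piece must not map to the unknown id. A minimal sketch of the same check from Python is below, assuming the sentencepiece package is installed and the tokenizer is a SentencePiece model file. The helper names `undefined_pieces` and `load_processor` are made up for illustration and are not part of MaxText.

```python
from typing import Iterable, List


def undefined_pieces(processor, pieces: Iterable[str] = ("<s>", "</s>")) -> List[str]:
    """Return the pieces that the model maps to its unknown id.

    `processor` is expected to behave like sentencepiece.SentencePieceProcessor:
    piece_to_id() returns unk_id() for pieces missing from the vocabulary.
    """
    unk = processor.unk_id()
    return [piece for piece in pieces if processor.piece_to_id(piece) == unk]


def load_processor(model_path: str):
    """Load a SentencePiece model from a local file (hypothetical helper)."""
    # Imported lazily so undefined_pieces() can be used without sentencepiece.
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)
```

Calling `undefined_pieces(load_processor(path))` on the local tokenizer file should return an empty list for a usable model. A result containing `<s>` would match the error above and suggest the file is not the tokenizer the training job expects.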

ZhiyuLi-goog commented 2 weeks ago

Hi, @gramesh-amd

We haven't seen this error before. Do you have a service account in your project? If so, we can grant you access to the original bucket gs://mlperf-llm-public2/.

gramesh-amd commented 2 weeks ago

gowtham.ramesh@amd.com is my email (I've also created a Google account with this same address).

ZhiyuLi-goog commented 2 weeks ago

You should now have access. Could you try again?

gramesh-amd commented 2 weeks ago

Thanks, will check

gramesh-amd commented 2 weeks ago

It works after I download the tokenizer from the gs://mlperf-llm-public2/ bucket.

Thank you
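The thread does not say how the two copies differ. One way to check whether the S3 download and the gs://mlperf-llm-public2/ download are byte-identical is to compare their SHA-256 digests once both are on local disk. This is a minimal sketch using only the standard library; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a: str, path_b: str) -> bool:
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If `same_file_contents` returns False for the two tokenizer files, the S3 copy is not the same model as the one in the GCS bucket.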

AI-Hypercomputer / maxtext, issue #875