ZitongYang / Synthetic_Continued_Pretraining

Code implementation of synthetic continued pretraining
https://arxiv.org/abs/2409.07431
Apache License 2.0

Data Preparation of Step2 Tokenization #5

Open ghtaro opened 1 week ago

ghtaro commented 1 week ago

Thank you very much for sharing all the code for your brilliant work.

I would like to replicate the results using the dataset on Hugging Face mentioned in the README:

The resulting synthetic data will be saved in data/dataset/raw/quality_entigraph_gpt-4-turbo/. We release the generated synthetic data at https://huggingface.co/datasets/zitongyang/entigraph-quality-corpus.

I copied the Parquet files from that dataset into data/dataset/raw/quality_entigraph_gpt-4-turbo/ and ran the following commands:

mkdir -p data/dataset/bins/
python data/tokenize_entigraph.py

I got the following error:

Writing to data/dataset/bins/quality_all-entigraphgpt-4-turbo.bin with length 0
Traceback (most recent call last):
  File "/Workspace/Users/<user>/Synthetic_Continued_Pretraining/data/tokenize_entigraph.py", line 59, in <module>
    tokenize_quality_graph('gpt-4-turbo')
  File "/Workspace/Users/<user>/Synthetic_Continued_Pretraining/data/tokenize_entigraph.py", line 55, in tokenize_quality_graph
    write_to_memmap_single(tokenize_list(quality), f'quality_all-entigraph{model_name}.bin')
  File "/Workspace/Users/<user>/Synthetic_Continued_Pretraining/data/tokenize_entigraph.py", line 38, in write_to_memmap_single
    arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
  File "/databricks/python/lib/python3.10/site-packages/numpy/core/memmap.py", line 267, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: cannot mmap an empty file

It looks like the script needs the input files in JSON format. Is that correct? If so, could you tell me how to convert the Parquet files to JSON?
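The "with length 0" line in the log suggests that no input documents were loaded before np.memmap was asked to create a zero-length file, which it cannot do. A quick way to check what the raw directory actually contains is sketched below. The helper name count_by_extension is hypothetical, and the idea that the tokenizer only picks up .json files is an assumption taken from the question above, not something confirmed in this thread.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(directory):
    """Return a Counter mapping file suffix (e.g. '.json') to number of files."""
    return Counter(
        p.suffix.lower() for p in Path(directory).iterdir() if p.is_file()
    )


if __name__ == "__main__":
    # Directory taken from the issue text; adjust if your layout differs.
    raw_dir = Path("data/dataset/raw/quality_entigraph_gpt-4-turbo")
    if raw_dir.is_dir():
        print(dict(count_by_extension(raw_dir)))
    else:
        print(f"{raw_dir} does not exist")
```

If the output lists only .parquet entries and no .json ones, that would be consistent with the empty token array in the traceback.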

ZitongYang commented 1 week ago

Hi,

Yes, the Hugging Face release is not in JSON format. We briefly describe its format here: https://huggingface.co/datasets/zitongyang/entigraph-quality-corpus/blob/main/README.md?code=true#L55.

To convert the huggingface format:

I hope this helps!

ghtaro commented 22 hours ago

Thanks! I will try the procedure you described and get back to you later.