akanyaani / gpt-2-tensorflow2.0

OpenAI GPT2 pre-training and sequence prediction implementation in Tensorflow 2.0
https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
MIT License

RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()] #4

Open vincsous opened 4 years ago

vincsous commented 4 years ago

Hi, first of all, thanks for your work.

When I try to do the preprocessing, I get the following error message: RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]

I am using a *.txt file uploaded to my Colab. I would like to know what this error means and how to fix it. Thanks

Vincent
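
For reference, SentencePiece raises this internal check failure when the training corpus it receives contains no sentences at all. A minimal reproduction, assuming only that the sentencepiece package is installed:

import sentencepiece as spm

# An empty corpus file trips the same [!sentences_.empty()] check
# inside trainer_interface.cc during training.
open("empty.txt", "w").close()

spm.SentencePieceTrainer.train(
    input="empty.txt", model_prefix="m", vocab_size=100
)
# -> RuntimeError: Internal: ... trainer_interface.cc ... [!sentences_.empty()]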

RomanPlusPlus commented 4 years ago

I have the same problem while doing preprocessing locally.

I cd'ed to the gpt-2-tensorflow2.0 dir and ran the following command: python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data" --vocab-size=32000

Tried it with the data from the "scraped" dir provided with the repo.

Please find the log in the attached file.

log.txt

I've installed the dependencies using conda, as follows:

conda install setuptools ftfy tqdm Click tensorflow numpy
pip install sentencepiece

conda list output:

packages_versions.txt

akanyaani commented 4 years ago

Hi @vincsous and @RomanPlusPlus

Thanks for reporting the issue. I have fixed it; please pull the latest code and test.

Thanks

vincsous commented 4 years ago

Hi @akanyaani and thank you. Preprocessing is working for me now, but I have another problem with training. First, as I am using Colab, I do not have multiple GPUs, so I chose --distributed=False. It seems that training starts but then stops ("Training Done....") at step 20 with 11% accuracy. Here is the log. log_train.txt

Thanks again

RomanPlusPlus commented 4 years ago

Hi @akanyaani, thank you for your speedy response.

Unfortunately, the problem persists. I still get the same [!sentences_.empty()] error.

Please find the log in the attached file.

log200517.txt

akanyaani commented 4 years ago

Hi @RomanPlusPlus

But it's working on my system. Could you please print the files in that directory?

Add a print in the pre_process.py train method:

text_files = glob.glob((data_dir + "/*.txt"))
print(text_files)  # Add this and check whether it prints your text files
process_text(text_files)
train_byte_pair_encoding(vocab_size)
create_tf_records(min_seq_len, max_seq_len)
print("Pre-processing is done............")

This error occurs when text_files does not contain any text files. If text_files is an empty list, resolve the path issue.
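
A fail-fast variant of the same check, as a rough sketch (the assert is an illustration added here, not code from the repo; process_text and the other calls are the repo's existing functions):

import glob

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob(data_dir + "/*.txt")
    # Stop with a readable message instead of letting SentencePiece
    # fail later with the opaque [!sentences_.empty()] error.
    assert text_files, "No .txt files found in " + data_dir + " -- check --data-dir"
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")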

akanyaani commented 4 years ago

Hi @vincsous

I will look into that.

Thanks

RomanPlusPlus commented 4 years ago

Hi @akanyaani ,

I added the line you suggested. It prints out the following:

['/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/processed.txt']

I also checked the "processed.txt" file. It's empty.

akanyaani commented 4 years ago

Hi @RomanPlusPlus

You are getting this error because you are passing the wrong data directory: /data itself contains no source text files, only the processed.txt that the script generates (which is why it came out empty). This repo has sample data in /data/scraped, so try this:

python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/scraped" --vocab-size=32000

apteryxlabs commented 4 years ago

I am also getting this error. My command: python pre_process.py --data-dir=/media/b/F:/patent_data_v2/patent_data_joined --vocab-size=50000

Checked the processed.txt file - it's got PLENTY of data.

Notably, this ran fine on my Mac (running Catalina). However, Macs don't have GPUs, so I'm moving all this over to a client's Linux machine.

My os: Linux Ubuntu (latest version, 20)

Running in a custom conda environment.

My conda env.yaml file: name: tf channels: ...

elbowdonkey commented 4 years ago

You can run into this error even if your path is correct, because the train method assumes your data files use the .txt extension. Files without a .txt extension won't be picked up, which causes the same error.

I'd recommend that the train method be changed to:

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob((data_dir + "/*"))
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")

In other words, change "/*.txt" to "/*".

Better yet, gather the file paths recursively like so:

text_files = glob.glob(data_dir + "/**/*", recursive=True)

This allows you to keep your data files in their own directories - useful if you have thousands of them and sometimes want to work with subsets of those thousands.
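
One caveat with the recursive pattern: glob only expands ** when recursive=True is passed, and the result also includes the directories themselves, which should be filtered out before handing the list to process_text. A small sketch (gather_text_files is a name invented here for illustration):

import glob
import os

def gather_text_files(data_dir):
    # "**" only recurses when recursive=True; without it, it behaves like "*".
    paths = glob.glob(os.path.join(data_dir, "**", "*"), recursive=True)
    # Drop directories so downstream code only sees actual files.
    return [p for p in paths if os.path.isfile(p)]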

tkahn commented 2 years ago

I encountered this error when running the code on Windows. I fixed it by adding an explicit encoding to all the with open calls, like this:

with open(PROCESS_DATA_PATH, 'r', encoding = 'utf-8') as f:
with open(BPE_TSV_PATH, 'w', encoding = 'utf-8', newline='') as f_output:

The files that are read need to be encoded in UTF-8, but I guess that goes without saying.
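
As a general pattern (the paths below are placeholders, not necessarily the repo's exact values): passing encoding explicitly makes open behave the same on Windows, which otherwise defaults to a legacy locale code page such as cp1252, as it does on Linux:

import csv

PROCESS_DATA_PATH = "data/processed.txt"  # placeholder path
BPE_TSV_PATH = "data/bpe_spm.tsv"         # placeholder path

# Explicit encoding avoids Windows' locale-dependent default;
# newline='' keeps the csv module from inserting blank lines on Windows.
with open(PROCESS_DATA_PATH, "r", encoding="utf-8") as f, \
     open(BPE_TSV_PATH, "w", encoding="utf-8", newline="") as f_output:
    writer = csv.writer(f_output, delimiter="\t")
    for line in f:
        writer.writerow([line.strip()])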