akanyaani / gpt-2-tensorflow2.0

OpenAI GPT2 pre-training and sequence prediction implementation in Tensorflow 2.0
https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
MIT License

RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()] #4

Open vincsous opened 4 years ago

vincsous commented 4 years ago

Hi, first of all, thanks for your work.

When I try to do the preprocessing, I get the following error message: RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]

I am using a *.txt file uploaded to my Colab. I would like to know what this error means and how to fix it. Thanks

Vincent
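
For reference, SentencePiece raises this internal check failure when the training corpus it receives contains no sentences at all. A minimal reproduction, assuming only that the sentencepiece package is installed:

import sentencepiece as spm

# An empty corpus file trips the same [!sentences_.empty()] check
# inside trainer_interface.cc during training.
open("empty.txt", "w").close()

spm.SentencePieceTrainer.train(
    input="empty.txt", model_prefix="m", vocab_size=100
)
# -> RuntimeError: Internal: ... trainer_interface.cc ... [!sentences_.empty()]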

RomanPlusPlus commented 4 years ago

I have the same problem while doing preprocessing locally.

I cd'ed to the gpt-2-tensorflow2.0 dir and ran the following command: python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data" --vocab-size=32000

Tried it with the data from the "scraped" dir provided with the repo.

Please find the log in the attached file.

log.txt

I've installed the dependencies using conda, as follows:

conda install setuptools ftfy tqdm Click tensorflow numpy
pip install sentencepiece

conda list output:

packages_versions.txt

akanyaani commented 4 years ago

Hi @vincsous and @RomanPlusPlus

Thanks for reporting the issue. I have fixed it; please pull the latest code and test.

Thanks

vincsous commented 4 years ago

Hi @akanyaani and thank you. Preprocessing is working for me now, but I have another problem with training. First, as I am using Colab, I do not have multiple GPUs, so I chose --distributed=False. It seems that training starts but then stops ("Training Done....") at step 20 with 11% accuracy. Here is the log. log_train.txt

Thanks again

RomanPlusPlus commented 4 years ago

Hi @akanyaani, thank you for your speedy response.

Unfortunately, the problem persists. I still get the same [!sentences_.empty()] error.

Please find the log in the attached file.

log200517.txt

akanyaani commented 4 years ago

Hi @RomanPlusPlus

But it's working on my system. Could you please print the files in that directory?

Add a print in the pre_process.py train method:

text_files = glob.glob((data_dir + "/*.txt"))
print(text_files)  # Add this and check whether it prints your text files
process_text(text_files)
train_byte_pair_encoding(vocab_size)
create_tf_records(min_seq_len, max_seq_len)
print("Pre-processing is done............")

This error occurs when text_files does not contain any text files. If text_files is an empty list, resolve the path issue.
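
A fail-fast variant of the same check, as a rough sketch (the assert is an illustration added here, not code from the repo; process_text and the other calls are the repo's existing functions):

import glob

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob(data_dir + "/*.txt")
    # Stop with a readable message instead of letting SentencePiece
    # fail later with the opaque [!sentences_.empty()] error.
    assert text_files, "No .txt files found in " + data_dir + " -- check --data-dir"
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")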

akanyaani commented 4 years ago

Hi @vincsous

I will look into that.

Thanks

RomanPlusPlus commented 4 years ago

Hi @akanyaani ,

I added the line you suggested. It prints out the following:

['/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/processed.txt']

I also checked the "processed.txt" file. It's empty.

akanyaani commented 4 years ago

Hi @RomanPlusPlus

You are getting this error because you are passing the wrong data directory: /data itself contains no source text files, only the processed.txt that the script generates (which is why it came out empty). This repo has sample data in /data/scraped, so try this:

python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/scraped" --vocab-size=32000

apteryxlabs commented 4 years ago

I am also getting this error. My command: python pre_process.py --data-dir=/media/b/F:/patent_data_v2/patent_data_joined --vocab-size=50000

Checked the processed.txt file - it's got PLENTY of data.

Notably, this ran fine on my Mac (running Catalina). However, Macs don't have GPUs, so I'm moving all this over to a client's Linux machine.

My os: Linux Ubuntu (latest version, 20)

Running in a custom conda environment.

My conda env.yaml file: name: tf channels: ...

elbowdonkey commented 4 years ago

You can run into this error even if your path is correct, because the train method assumes your data files use the .txt extension. Files without a .txt extension won't be picked up, which causes the same error.

I'd recommend that the train method be changed to:

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob((data_dir + "/*"))
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")

In other words, change "/*.txt" to "/*".

Better yet, gather the file paths recursively like so:

text_files = glob.glob(data_dir + "/**/*", recursive=True)

This allows you to keep your data files in their own directories - useful if you have thousands of them and sometimes want to work with subsets of those thousands.
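
One caveat with the recursive pattern: glob only expands ** when recursive=True is passed, and the result also includes the directories themselves, which should be filtered out before handing the list to process_text. A small sketch (gather_text_files is a name invented here for illustration):

import glob
import os

def gather_text_files(data_dir):
    # "**" only recurses when recursive=True; without it, it behaves like "*".
    paths = glob.glob(os.path.join(data_dir, "**", "*"), recursive=True)
    # Drop directories so downstream code only sees actual files.
    return [p for p in paths if os.path.isfile(p)]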

tkahn commented 2 years ago

I encountered this error when running the code on Windows. I fixed it by adding an explicit encoding to all the with open calls, like this:

with open(PROCESS_DATA_PATH, 'r', encoding = 'utf-8') as f:
with open(BPE_TSV_PATH, 'w', encoding = 'utf-8', newline='') as f_output:

The files that are read need to be encoded in UTF-8, but I guess that goes without saying.
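
As a general pattern (the paths below are placeholders, not necessarily the repo's exact values): passing encoding explicitly makes open behave the same on Windows, which otherwise defaults to a legacy locale code page such as cp1252, as it does on Linux:

import csv

PROCESS_DATA_PATH = "data/processed.txt"  # placeholder path
BPE_TSV_PATH = "data/bpe_spm.tsv"         # placeholder path

# Explicit encoding avoids Windows' locale-dependent default;
# newline='' keeps the csv module from inserting blank lines on Windows.
with open(PROCESS_DATA_PATH, "r", encoding="utf-8") as f, \
     open(BPE_TSV_PATH, "w", encoding="utf-8", newline="") as f_output:
    writer = csv.writer(f_output, delimiter="\t")
    for line in f:
        writer.writerow([line.strip()])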