Closed nicexw closed 3 years ago
Hi -- Danny (@daniel-perry) will be looking into this and get back to you on this thread.
Thanks,
-Adrian
@nicexw - Thanks for trying Bort and sorry you have run into this issue. While we work on addressing the issue, here is a temporary workaround: try running with --num_workers 1 on each file in turn.
For example, if you are running in bash, you can do something like:
i=0
mkdir output
for file in train/*txt*
do
  python create_pretraining_data.py --input_file ${file} --output_dir tmp --dupe_factor 1 --num_workers 1 --num_outputs 1
  mv tmp/part-000.npz output/part-${i}.npz
  i=$((i+1))
done
rm -r tmp
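Before running the loop, it may help to find which input file triggers the problem. A quick sanity check (a sketch; it assumes the BERT-style input convention in which blank lines separate documents within a file) could be:

```python
import glob

# Count documents per input file, assuming the BERT-style convention that
# blank lines separate documents. A file with fewer than two documents
# gives NSP nothing to draw a negative example from.
def count_documents(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return sum(1 for d in text.split("\n\n") if d.strip())

for path in sorted(glob.glob("train/*txt*")):
    n = count_documents(path)
    flag = "" if n >= 2 else "  <-- only one document"
    print(f"{path}: {n} documents{flag}")
```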
Thank you very much! The error above is caused by the input containing only one document: the NSP task needs negative examples drawn from a different document. When I put blank lines between different documents, there is no error.
This issue can be closed.
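The root cause can be reproduced in a few lines. `create_instances_from_document` picks a random other document for the NSP negative example via `randint(0, len(all_documents) - 2)`; with a single document this becomes `randint(0, -1)`, which is the empty range in the traceback. A minimal illustration (the blank-line document splitting is an assumption based on the BERT-style input format):

```python
import random

def split_documents(text):
    # Split raw text into documents on blank lines (assumed input format).
    return [d.strip().split("\n") for d in text.split("\n\n") if d.strip()]

one_doc = "sentence a\nsentence b\nsentence c"
two_docs = "sentence a\nsentence b\n\nsentence c\nsentence d"

# One document: randint(0, -1) raises ValueError (empty range).
try:
    random.randint(0, len(split_documents(one_doc)) - 2)
except ValueError as e:
    print("failed as in the issue:", e)

# Two documents: randint(0, 0) returns a valid index.
print(random.randint(0, len(split_documents(two_docs)) - 2))
```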
Super! Glad to know you got it solved.
I tried to use create_pretraining_data.py for Bort pretraining:
python create_pretraining_data.py --input_file ./train/train.txt0,./train/train.txt1,./train/train.txt2,./train/train.txt3,./train/train.txt4,./train/train.txt5,./train/train.txt6,./train/train.txt7,./train/train.txt8,./train/train.txt9 --output_dir output --dupe_factor 1
INFO:root:Namespace(dataset_name='openwebtext_ccnews_stories_books_cased', dupe_factor=1, input_file='./train/train.txt0,./train/train.txt1,./train/train.txt2,./train/train.txt3,./train/train.txt4,./train/train.txt5,./train/train.txt6,./train/train.txt7,./train/train.txt8,./train/train.txt9', masked_lm_prob=0.15, max_predictions_per_seq=80, max_seq_length=512, num_outputs=1, num_workers=8, output_dir='output', random_seed=12345, short_seq_prob=0.1, verbose=False, whole_word_mask=False)
INFO:root: ./train/train.txt0
INFO:root: ./train/train.txt1
INFO:root: ./train/train.txt2
INFO:root: ./train/train.txt3
INFO:root: ./train/train.txt4
INFO:root: ./train/train.txt5
INFO:root: ./train/train.txt6
INFO:root: ./train/train.txt7
INFO:root: ./train/train.txt8
INFO:root: ./train/train.txt9
INFO:root: Reading from 10 input files
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "create_pretraining_data.py", line 304, in create_training_instances
    vocab, tokenizer)))
  File "create_pretraining_data.py", line 385, in create_instances_from_document
    0, len(all_documents) - 2)
  File "/export/sdb/xiongwei/tfmxnet/lib64/python3.6/random.py", line 221, in randint
    return self.randrange(a, b+1)
  File "/export/sdb/xiongwei/tfmxnet/lib64/python3.6/random.py", line 199, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (0,0, 0)
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "create_pretraining_data.py", line 691, in <module>
    main()
  File "create_pretraining_data.py", line 597, in main
    pool.map(create_training_instances, process_args)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: empty range for randrange() (0,0, 0)