ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

Pretraining script can't find train data. #21

Closed airogachev closed 3 years ago

airogachev commented 3 years ago

I'm running this command:

!python3 pretrain_transformers.py \
    --output_dir ="/content/output" \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt2large \
    --do_train \
    --train_data_file=./dataset/train/train.txt \
    --do_eval \
    --eval_data_file=./dataset/validation/validation.txt \
    --fp16

And I get this error:

10/26/2020 19:24:43 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1000000000000, cache_dir=None, config_name=None, device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='./dataset/validation/validation.txt', evaluate_during_training=False, fp16=True, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=False, mlm_probability=0.15, model_name_or_path='sberbank-ai/rugpt2large', model_type='gpt2', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='=/content/output', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=500, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name=None, train_data_file='./dataset/train/train.txt', warmup_steps=0, weight_decay=0.01)
10/26/2020 19:24:43 - INFO - __main__ - Creating features from dataset file at ./dataset/train
10/26/2020 19:24:53 - INFO - __main__ - Saving features into cached file ./dataset/train/gpt2_cached_lm_1000000000000_train.txt
Traceback (most recent call last):
  File "pretrain_transformers.py", line 782, in <module>
    main()
  File "pretrain_transformers.py", line 731, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "pretrain_transformers.py", line 212, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 96, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Meanwhile, !cat ./dataset/train/train.txt prints the file just fine, so the training data is definitely there.
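From the log, num_samples=0 means the constructed training dataset is empty: the script slices the tokenized file into fixed-size blocks, and with the default block_size of 1000000000000 not a single block fits. A minimal sketch of that chunking behaviour, assuming a TextDataset-style loader (the names below are illustrative, not the actual pretrain_transformers.py code):

```python
# Sketch of TextDataset-style chunking (illustrative, not the real script).
# Only complete blocks are kept, so an absurdly large block_size yields zero
# examples, and RandomSampler then fails with num_samples=0.
def build_examples(token_ids, block_size):
    examples = []
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        examples.append(token_ids[i:i + block_size])
    return examples

tokens = list(range(50_000))                        # pretend train.txt tokenizes to 50k ids
print(len(build_examples(tokens, 512)))             # 97 training examples
print(len(build_examples(tokens, 1000000000000)))   # 0 -> the reported error
```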

fen0s commented 3 years ago

You need to set the block size: add the --block_size argument. A reasonable value is 512.
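So the rerun should look roughly like this (512 is just a sensible choice, anything up to the model's context length works). Note also the stray space in --output_dir =, which is why the log shows output_dir='=/content/output':

!python3 pretrain_transformers.py \
    --output_dir="/content/output" \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt2large \
    --do_train \
    --train_data_file=./dataset/train/train.txt \
    --do_eval \
    --eval_data_file=./dataset/validation/validation.txt \
    --block_size=512 \
    --fp16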

airogachev commented 3 years ago

You need to set the block size: add the --block_size argument. A reasonable value is 512.

Thanks