ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0
2.08k stars 442 forks

Error when trying to fine-tune GPT3XL #58

Closed: exelents closed 3 years ago

exelents commented 3 years ago

When I try to run fine-tuning on a small debug dataset, I get an error during the forward pass. What could be the problem?

Traceback (most recent call last):
  File "../pretrain_gpt3.py", line 832, in <module>
    main()
  File "../pretrain_gpt3.py", line 812, in main
    tokenizer)
  File "../pretrain_gpt3.py", line 472, in train
    args, timers, tokenizer, iteration, tb_writer)
  File "../pretrain_gpt3.py", line 406, in train_step
    lm_loss = forward_step(sample, model, args, timers, tokenizer, iteration, tb_writer)
  File "../pretrain_gpt3.py", line 298, in forward_step
    output = model(tokens, position_ids, attention_mask)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/DeepSpeed-triton2/deepspeed/runtime/engine.py", line 972, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/model/distributed.py", line 79, in forward
    return self.module(*inputs, **kwargs)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/fp16/fp16.py", line 72, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/model/gpt3_modeling.py", line 108, in forward
    transformer_output = self.transformer(embeddings, attention_mask)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 445, in forward
    hidden_states, attention_mask)
  File "/export/DeepSpeed-triton2/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 682, in checkpoint
    CheckpointFunction.apply(function, all_outputs, *args)
  File "/export/DeepSpeed-triton2/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 486, in forward
    outputs = run_function(*inputs_cuda)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 434, in custom_forward
    x_ = layer(x_, inputs[1])
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 301, in forward
    attention_output = self.attention(layernorm_output, ltor_mask)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 116, in forward
    mixed_x_layer = self.query_key_value(hidden_states)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/layers.py", line 243, in forward
    output_parallel = F.linear(input_parallel, self.weight, self.bias)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

exelents commented 3 years ago

@mgrankin Have you run into this error?

exelents commented 3 years ago

Notably, the smaller models train fine when launched from their corresponding scripts. Only the largest model crashes.

exelents commented 3 years ago

I found the cause: this error comes from running out of memory on the GPU.
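
For anyone hitting the same thing, here is a rough back-of-the-envelope sketch of why the largest model can run out of GPU memory while the smaller ones train fine. It is not taken from the ru-gpts code. It assumes mixed-precision training with Adam at roughly 16 bytes of model state per parameter and no optimizer partitioning or offloading. The parameter counts in the usage part (about 1.3 billion for the XL model, 125 million for a small one) and the 16 GiB card are assumptions for illustration, not values from this thread.

```python
# Rough estimate of GPU memory taken by model states during
# mixed-precision Adam training on a single GPU.
#
# Assumption (not from the thread): 16 bytes per parameter, i.e.
#   fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4)
#   + fp32 Adam momentum (4) + fp32 Adam variance (4).
# Activations, the CUDA context and cuBLAS workspaces come on top of this.

BYTES_PER_PARAM_MIXED_ADAM = 2 + 2 + 4 + 4 + 4  # = 16


def estimate_model_state_gib(num_params, bytes_per_param=BYTES_PER_PARAM_MIXED_ADAM):
    """Return the estimated size of the model states in GiB."""
    return num_params * bytes_per_param / 2**30


def fits_on_gpu(num_params, gpu_memory_gib, bytes_per_param=BYTES_PER_PARAM_MIXED_ADAM):
    """Return True if the model states alone fit into the given GPU memory."""
    return estimate_model_state_gib(num_params, bytes_per_param) < gpu_memory_gib


if __name__ == "__main__":
    # Hypothetical parameter counts and GPU size, for illustration only.
    assumed_models = {"small": 125_000_000, "XL": 1_300_000_000}
    assumed_gpu_gib = 16

    for name, num_params in assumed_models.items():
        needed = estimate_model_state_gib(num_params)
        verdict = "fits" if fits_on_gpu(num_params, assumed_gpu_gib) else "does not fit"
        print(f"{name}: ~{needed:.1f} GiB of model states, {verdict} in {assumed_gpu_gib} GiB")
```

Under these assumptions the XL model needs about 19.4 GiB for model states alone, before any activations, while the small model needs under 2 GiB. That would be consistent with only the largest model failing, and with the failure surfacing at the first `F.linear` call, where cuBLAS tries to create its handle and finds no memory left.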