allenai / RL4LMs

A modular RL library to fine-tune language models to human preferences
https://rl4lms.apps.allenai.org/
Apache License 2.0
2.21k stars 191 forks

CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` #3

Closed tatiana-iazykova closed 2 years ago

tatiana-iazykova commented 2 years ago

@jmhessel @dirkgr @schmmd @iellenberger

Ran `python scripts/training/train_text_generation.py --config_path scripts/training/task_configs/iwslt2017/t5_ppo.yml`

with the following config:

```yaml
tokenizer:
  model_name: "google/mt5-base"
  padding_side: right
  truncation_side: right
  truncation: True
  padding: True
  max_length: 128
  # pad_token_as_eos_token: False

reward_fn:
  id: meteor

datapool:
  id: wmt16
  args:
    train_path: "data/train.csv"
    eval_path: "data/eval.csv"
    test_path: "data/test.xlsx"

env:
  n_envs: 10
  args:
    max_prompt_length: 128
    max_episode_length: 128
    terminate_on_eos: True
    prompt_truncation_side: "right"
    context_start_token: 0

alg:
  id: ppo
  args:
    n_steps: 2
    batch_size: 20
    verbose: 2
    learning_rate: 0.000001
    n_epochs: 5
    ent_coef: 0.0
  kl_div:
    coeff: 0.001
    target_kl: 0.2
  policy:
    id: seq2seq_lm_actor_critic_policy
    args:
      model_name: "google/mt5-base"
      apply_model_parallel: True
      prompt_truncation_side: "right"
      generation_kwargs:
        do_sample: True
        num_beams: 3
        max_length: 128
        length_penalty: 0.85
        repetition_penalty: 2.0
        max_new_tokens: 128

train_evaluation:
  eval_batch_size: 1
  n_iters: 10
  eval_every: 10
  save_every: 1
  metrics:
    - id: meteor
      args: {}
    - id: sacre_bleu
      args:
        tokenize: "intl"
  generation_kwargs:
    do_sample: True
    num_beams: 3
    max_length: 128
    length_penalty: 0.85
    max_new_tokens: 128
    repetition_penalty: 2.0
```

data_pool:

```python
import pandas as pd
from tqdm import tqdm

from rl4lms.data_pools.text_generation_pool import Sample, TextGenPool


class WMT(TextGenPool):

    @classmethod
    def prepare(cls,
                split: str,
                train_path: str,
                eval_path: str,
                test_path: str
                ):
        if split == 'train':
            dataset = pd.read_csv(train_path, nrows=100)
        elif split == 'val':
            dataset = pd.read_csv(eval_path, nrows=100)
        elif split == 'test':
            dataset = pd.read_excel(test_path, engine='openpyxl')
        else:
            raise ValueError(f"unknown split: {split}")

        samples = []
        for ix, item in tqdm(dataset.iterrows(),
                             desc="Preparing dataset",
                             total=len(dataset)):
            prompt = item['prefix'] + item['input_text']
            reference = item['target_text']
            sample = Sample(id=f"{split}_{ix}",
                            prompt_or_input_text=prompt,
                            references=[reference])
            samples.append(sample)

        pool_instance = cls(samples)
        return pool_instance
```

However, running it fails with:

```
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [595,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [595,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [596,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [596,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [596,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [596,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Evaluating:   0%|                                                                                            | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "scripts/training/train_text_generation.py", line 71, in <module>
    args.log_to_wandb)
  File "scripts/training/train_text_generation.py", line 42, in main
    trainer.train_and_eval()
  File "/home/jovyan/yazykova-tv/rl_allen/RL4LMs/rl4lms/envs/text_generation/training_utils.py", line 198, in train_and_eval
    self._evaluate_on_datapools(epoch=iter_start)
  File "/home/jovyan/yazykova-tv/rl_allen/RL4LMs/rl4lms/envs/text_generation/training_utils.py", line 193, in _evaluate_on_datapools
    gen_kwargs=self._eval_gen_kwargs)
  File "/home/jovyan/yazykova-tv/rl_allen/RL4LMs/rl4lms/envs/text_generation/evaluation_utils.py", line 41, in evaluate_on_samples
    dt_control_token, gen_kwargs)
  File "/home/jovyan/yazykova-tv/rl_allen/RL4LMs/rl4lms/envs/text_generation/evaluation_utils.py", line 99, in generate_text
    gen_kwargs=gen_kwargs)["gen_texts"]
  File "/home/jovyan/yazykova-tv/rl_allen/RL4LMs/rl4lms/envs/text_generation/policy.py", line 304, in generate
    **generation_kwargs_)
  File "/home/user/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/yazykova-tv/rl_allen/RL4LMs/rl4lms/envs/text_generation/hf_generation_utils.py", line 1199, in generate
    inputs_tensor, model_kwargs, model_input_name
  File "/home/jovyan/yazykova-tv/rl_allen/RL4LMs/rl4lms/envs/text_generation/hf_generation_utils.py", line 535, in _prepare_encoder_decoder_kwargs_for_generation
    **encoder_kwargs)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 1044, in forward
    output_attentions=output_attentions,
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 675, in forward
    output_attentions=output_attentions,
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 581, in forward
    output_attentions=output_attentions,
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 500, in forward
    query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
```
rajcscw commented 2 years ago

Could you maybe run your program with `CUDA_LAUNCH_BLOCKING=1 python scripts/training/train_text_generation.py --config_path scripts/training/task_configs/iwslt2017/t5_ppo.yml` so that we can get a better error report?

dirkgr commented 2 years ago

How did you create the Python environment?

tatiana-iazykova commented 2 years ago

The problem somehow resolved itself.

promiseve commented 2 years ago

Hey Tatiana, could you send me an email? I am testing this and perhaps we can become collaborators; here is my address: promisevekpo1@gmail.com. Anyone else interested in collaborating is welcome to email as well, since we may be working on similar problems.