aub-mind / arabert

Pre-trained Transformers for Arabic Language Understanding and Generation (Arabic BERT, Arabic GPT2, Arabic ELECTRA)
https://huggingface.co/aubmindlab

AraBERT-training error (CUDA error: device-side assert triggered) #24

Closed (En-J-A closed this issue 4 years ago)

En-J-A commented 4 years ago

I want to apply AraBERT to sentiment analysis with 3 classes (positive, negative, and neutral). I want to make sure I can use the AraBERT_PyTorch_Demo.ipynb file after making the following changes:

  1. In the compute_metrics function I updated these values (see the per-class F1 sketch after this list):

    f1_Positive = f1_score(labels,preds,average='samples') 
    f1_Negative = f1_score(labels,preds,average='samples')
    f1_Neutral  = f1_score(labels,preds,average='samples')
  2. Can I reuse the BinaryProcessor(DataProcessor) class after updating its get_labels function like this:

    def get_labels(self):
        return ["0", "1", "2"]

    and rename it to class MnliProcessor(DataProcessor), then update the processors mapping:

    processors = {
        "binary": MnliProcessor
    }
  3. I updated num_labels in config = config_class.from_pretrained(args['model_name'], num_labels=3, finetuning_task=args['task_name'])

  4. In the model parameters I found 'task_name': 'binary'. Should I replace it with another value?
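On step 1: the three lines compute the same value, and average='samples' is intended for multilabel indicator targets, not a 3-class problem. A minimal per-class F1 sketch with scikit-learn, assuming labels and preds are the 1-d arrays from the notebook's compute_metrics:

    from sklearn.metrics import f1_score

    # average=None returns one F1 score per class, ordered as in labels=[0, 1, 2]
    per_class = f1_score(y_true=labels, y_pred=preds, labels=[0, 1, 2], average=None)
    f1_Positive, f1_Negative, f1_Neutral = per_class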

When I tried to train the model at this step:

global_step, tr_loss = train(model, tokenizer)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

After making all of these changes, I still get the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-26-f96ffee0923d> in <module>()
----> 1 global_step, tr_loss = train(model, tokenizer)
      2 logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

<ipython-input-16-fcd875a8c180> in train(model, tokenizer)
     64 
     65             loss = outputs[0]  # model outputs are always tuple in transformers (see doc)
---> 66             print("\r%f" % loss, end='')
     67 
     68             if args['fp16']:

RuntimeError: CUDA error: device-side assert triggered

What can I do to fix this error?
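A general note, not from the thread: a device-side assert is usually reported far from the op that actually failed, because CUDA kernels launch asynchronously. Forcing synchronous launches at the top of the notebook makes the traceback point at the real failing operation (often the loss computation when num_labels does not match the label values):

    import os

    # Make CUDA errors surface at the line that actually triggered them
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"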

WissamAntoun commented 4 years ago

When doing multilabel classification you need to one-hot encode the labels in the label processor (https://github.com/kaushaltrivedi/fast-bert/blob/master/fast_bert/data_cls.py#L306) and follow through with the other changes related to num_labels.

My suggestion is to use the Fast-BERT demo notebook; you can pass whatever metrics you need to the BertLearner (e.g. metrics = [{'name': 'accuracy', 'function': accuracy}, {'name': 'f1', 'function': F1}]).
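A minimal sketch of passing custom metrics to fast-bert's learner, assuming a databunch was already built; the model name and output directory are placeholders:

    import logging
    import torch
    from fast_bert.learner_cls import BertLearner
    from fast_bert.metrics import accuracy, F1

    logger = logging.getLogger()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    learner = BertLearner.from_pretrained_model(
        databunch,                                       # a BertDataBunch built beforehand
        pretrained_path="aubmindlab/bert-base-arabert",  # placeholder model name
        metrics=[{"name": "accuracy", "function": accuracy},
                 {"name": "f1", "function": F1}],
        device=device,
        logger=logger,
        output_dir="output",
        multi_label=False,
    )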

WissamAntoun commented 4 years ago

If you are running on Colab, make sure the runtime is set to GPU.

If you are running on your own machine, replace torch.device("cuda") with torch.device("cpu").
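A common pattern that covers both cases (a generic PyTorch idiom, not from the notebook):

    import torch

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")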

En-J-A commented 4 years ago

I tried to run the Fast-BERT demo notebook, but I get an error at this step:

learner.fit(epochs=5,
            lr=2e-5,
            validate=True,  # Evaluate the model after each epoch
            schedule_type="warmup_linear",
            optimizer_type="adamw")

and the error is

[2020-08-30 22:03:42,452 - INFO]: ***** Running training *****
[2020-08-30 22:03:42,454 - INFO]:   Num examples = 13815
[2020-08-30 22:03:42,455 - INFO]:   Num Epochs = 5
[2020-08-30 22:03:42,455 - INFO]:   Total train batch size (w. parallel, distributed & accumulation) = 16
[2020-08-30 22:03:42,457 - INFO]:   Gradient Accumulation steps = 1
[2020-08-30 22:03:42,459 - INFO]:   Total optimization steps = 4320

 0.00% [0/5 00:00<00:00]

 100.00% [864/864 05:55<00:00]
[2020-08-30 22:09:38,025 - INFO]: Running evaluation
[2020-08-30 22:09:38,026 - INFO]:   Num examples = 1936
[2020-08-30 22:09:38,028 - INFO]:   Batch size = 32

 100.00% [61/61 00:15<00:00]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-17-e58633857728> in <module>()
      3                         validate=True,  # Evaluate the model after each epoch
      4                         schedule_type="warmup_linear",
----> 5             optimizer_type="adamw")

3 frames
/usr/local/lib/python3.6/dist-packages/fast_bert/metrics.py in fbeta(y_pred, y_true, thresh, beta, eps, sigmoid)
     56     y_pred = (y_pred > thresh).float()
     57     y_true = y_true.float()
---> 58     TP = (y_pred * y_true).sum(dim=1)
     59     prec = TP / (y_pred.sum(dim=1) + eps)
     60     rec = TP / (y_true.sum(dim=1) + eps)

RuntimeError: The size of tensor a (3) must match the size of tensor b (1936) at non-singleton dimension 1
En-J-A commented 4 years ago

I was working in Colab, and the runtime was set to GPU.

WissamAntoun commented 4 years ago

Are your labels one-hot encoded?

This is an example from a script that I used for multilabel classification to build the dataframe before saving it as CSV; the sentences_filtered['label'] column originally had the labels as text:

    sentences_filtered = pd.concat([sentences_filtered['text'], pd.get_dummies(sentences_filtered['label'])], axis=1)

Then save it to CSV:

    sentences_filtered.to_csv("data/train.csv", index=True, columns=sentences_filtered.columns, sep=',', header=True)

And don't forget to set multi_label to True in both the DataBunch and Learner classes.

WissamAntoun commented 4 years ago

If you are using the AJGT dataset there is no need for one-hot encoding, since it is a binary classification dataset. Just run the notebook as it is and it should work without any issue (I just tried it).

Now, if you want to use your own multilabel dataset, then you should one-hot encode your labels.

En-J-A commented 4 years ago

Ok. I will try to use one-hot encoding (I have 3 classes, not 2).

En-J-A commented 4 years ago

# Cast the labels to strings so pd.get_dummies treats them as categories
train_AJGT = pd.DataFrame(train_AJGT)
train_AJGT['label'] = train_AJGT['label'].apply(str)

test_AJGT = pd.DataFrame(test_AJGT)
test_AJGT['label'] = test_AJGT['label'].apply(str)

# One-hot encode: one 0/1 column per class next to the text column
train_AJGT = pd.concat([train_AJGT['text'], pd.get_dummies(train_AJGT['label'])], axis=1)
test_AJGT = pd.concat([test_AJGT['text'], pd.get_dummies(test_AJGT['label'])], axis=1)

# Write the train/dev splits and the list of label column names
!mkdir data
train_AJGT.to_csv("data/train.csv", index=True, columns=train_AJGT.columns, sep=',', header=True)
test_AJGT.to_csv("data/dev.csv", index=True, columns=test_AJGT.columns, sep=',', header=True)
with open('data/labels.csv', 'w') as f:
    f.write("0\n1\n2")

After this step, the data looks like this:

       text  0  1  2
    0   XXX  0  1  0
    1   XXX  1  0  0

So I now have three label columns (0, 1, 2) instead of the single label column. When I create the BertDataBunch object I get an error; I think the reason is that there is no column called label to match label_col='label'. What should I put there instead?

WissamAntoun commented 4 years ago

Yes, label_col should be a list of the label column names: [0,1,2].

En-J-A commented 4 years ago

I am grateful for your help, @WissamAntoun. I made all the changes:

databunch = BertDataBunch(
    './data/',
    './data/',
    tokenizer=tokenizer,
    train_file='train.csv',
    val_file='dev.csv',
    label_file='labels.csv',
    text_col='text',
    label_col=[0, 1, 2],
    batch_size_per_gpu=16,
    max_seq_length=512,  # 256
    multi_gpu=True,
    multi_label=True,
    model_type='bert',
)

I created labels.csv as

with open('data/labels.csv','w') as f:
  f.write("0\n1\n2")

and I ran into a problem:

ValueError                                Traceback (most recent call last)
<ipython-input-22-9e1bdac0802c> in <module>()
     18                           multi_gpu=True,
     19                           multi_label=True,
---> 20                           model_type='bert',)

2 frames
/usr/local/lib/python3.6/dist-packages/fast_bert/data_cls.py in convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, logger)
    180             label_id = []
    181             for label in example.label:
--> 182                 label_id.append(float(label))
    183         else:
    184             if example.label is not None:

ValueError: could not convert string to float: 'إن +ها ل+ ال+ قلب مصدر سعاد +ة'

إن +ها ل+ ال+ قلب مصدر سعاد +ة is the first row of the text column in the training dataset. Previously, I converted the label column from int to str to apply one-hot encoding. Also, in https://github.com/kaushaltrivedi/fast-bert/blob/c91c72327a4150c25645802ffe9175e64cc61fca/fast_bert/data_cls.py#L58 I found the note "cls_token_segment_id: define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)", yet the code passes cls_token_segment_id=1. Is that correct, and is it related to the error?

WissamAntoun commented 4 years ago

Can you try label_col=['0','1','2']? I think the function is accessing the first column by position instead of the column named '0'.
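For reference, applied to the snippet above, only the label_col argument changes:

    databunch = BertDataBunch(
        './data/',
        './data/',
        tokenizer=tokenizer,
        train_file='train.csv',
        val_file='dev.csv',
        label_file='labels.csv',
        text_col='text',
        label_col=['0', '1', '2'],  # column names as strings, matching the CSV header
        batch_size_per_gpu=16,
        max_seq_length=512,
        multi_gpu=True,
        multi_label=True,
        model_type='bert',
    )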

En-J-A commented 4 years ago

Yes, it works now. Thank you very much ^_^

WissamAntoun commented 4 years ago

Great. You can close the issue if you want.

En-J-A commented 4 years ago

Hello. When I tried to execute the code again, at this step:

learner.fit(epochs=10,
            lr=2e-5,
            validate=True,  # Evaluate the model after each epoch
            schedule_type="warmup_linear",
            optimizer_type="adamw")

I got this error

TypeError                                 Traceback (most recent call last)
<ipython-input-18-78eff0b78623> in <module>()
      3                         validate=True,  # Evaluate the model after each epoch
      4                         schedule_type="warmup_linear",
----> 5             optimizer_type="adamw")

7 frames
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in __array__(self, dtype)
    478     def __array__(self, dtype=None):
    479         if dtype is None:
--> 480             return self.numpy()
    481         else:
    482             return self.numpy().astype(dtype, copy=False)

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I have no idea about the cause. If you do, please let me know.

WissamAntoun commented 4 years ago

Can you copy the whole stack trace of the error?

En-J-A commented 4 years ago
[2020-09-02 06:51:34,975 - INFO]: ***** Running training *****
[2020-09-02 06:51:34,978 - INFO]:   Num examples = 13815
[2020-09-02 06:51:34,979 - INFO]:   Num Epochs = 10
[2020-09-02 06:51:34,981 - INFO]:   Total train batch size (w. parallel, distributed & accumulation) = 16
[2020-09-02 06:51:34,984 - INFO]:   Gradient Accumulation steps = 1
[2020-09-02 06:51:34,984 - INFO]:   Total optimization steps = 8640

 0.00% [0/10 00:00<00:00]

 100.00% [864/864 05:40<00:00]
[2020-09-02 06:57:15,245 - INFO]: Running evaluation
[2020-09-02 06:57:15,249 - INFO]:   Num examples = 1936
[2020-09-02 06:57:15,250 - INFO]:   Batch size = 32

 100.00% [61/61 00:15<00:00]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-78eff0b78623> in <module>()
      3                         validate=True,  # Evaluate the model after each epoch
      4                         schedule_type="warmup_linear",
----> 5             optimizer_type="adamw")

7 frames
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in __array__(self, dtype)
    478     def __array__(self, dtype=None):
    479         if dtype is None:
--> 480             return self.numpy()
    481         else:
    482             return self.numpy().astype(dtype, copy=False)

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
WissamAntoun commented 4 years ago

I mean, expand the error output to get the full stack trace.

En-J-A commented 4 years ago
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-78eff0b78623> in <module>()
      3                         validate=True,  # Evaluate the model after each epoch
      4                         schedule_type="warmup_linear",
----> 5             optimizer_type="adamw")

7 frames
/usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in fit(self, epochs, lr, validate, return_results, schedule_type, optimizer_type)
    421             # Evaluate the model against validation set after every epoch
    422             if validate:
--> 423                 results = self.validate()
    424                 for key, value in results.items():
    425                     self.logger.info(

/usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in validate(self, quiet, loss_only)
    515             for metric in self.metrics:
    516                 validation_scores[metric["name"]] = metric["function"](
--> 517                     all_logits, all_labels
    518                 )
    519             results.update(validation_scores)

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    183 
    184     # Compute accuracy for each possible representation
--> 185     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    186     check_consistent_length(y_true, y_pred, sample_weight)
    187     if y_type.startswith('multilabel'):

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
     79     """
     80     check_consistent_length(y_true, y_pred)
---> 81     type_true = type_of_target(y_true)
     82     type_pred = type_of_target(y_pred)
     83 

/usr/local/lib/python3.6/dist-packages/sklearn/utils/multiclass.py in type_of_target(y)
    245         raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
    246 
--> 247     if is_multilabel(y):
    248         return 'multilabel-indicator'
    249 

/usr/local/lib/python3.6/dist-packages/sklearn/utils/multiclass.py in is_multilabel(y)
    136     """
    137     if hasattr(y, '__array__') or isinstance(y, Sequence):
--> 138         y = np.asarray(y)
    139     if not (hasattr(y, "shape") and y.ndim == 2 and y.shape[1] > 1):
    140         return False

/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in __array__(self, dtype)
    478     def __array__(self, dtype=None):
    479         if dtype is None:
--> 480             return self.numpy()
    481         else:
    482             return self.numpy().astype(dtype, copy=False)

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.  
En-J-A commented 4 years ago

I think I found the error: I had added accuracy_score to the metrics, and it was the cause. Now that I have removed it, the code runs well.

    metrics = [{'name': 'accuracy_m', 'function': accuracy_multilabel},
               {'name': 'accuracy_th', 'function': accuracy_thresh},
               # {'name': 'accuracy_sc', 'function': accuracy_score},  # --> this was the culprit
               {'name': 'F1', 'function': F1}]
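If the sklearn metric is still wanted, one option (a sketch, not something done in the thread) is a small wrapper: the stack trace above shows fast-bert calls each metric as function(all_logits, all_labels) with CUDA tensors, so the wrapper moves them to host memory and reduces the one-hot rows to class indices before scoring:

    from sklearn.metrics import accuracy_score

    def accuracy_sc(y_pred, y_true):
        # Copy the CUDA tensors to the CPU and take the argmax over the classes
        y_pred = y_pred.detach().cpu().numpy().argmax(axis=1)
        y_true = y_true.detach().cpu().numpy().argmax(axis=1)
        return accuracy_score(y_true, y_pred)

    metrics.append({'name': 'accuracy_sc', 'function': accuracy_sc})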

Thank you, @WissamAntoun!