ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Classification with Camembert - always predict a 0 #126

Closed LisaBanana closed 4 years ago

LisaBanana commented 4 years ago

Describe the bug I train a CamemBERT + classification model with the following lines of code. My training data is a dataset with 10 lines (it's a very small test before a bigger experiment): 5 lines labeled 0 and 5 lines labeled 1. Each line is between 90 and 500 words long.

from simpletransformers.classification import ClassificationModel

# Create a ClassificationModel (sliding_window splits long texts into overlapping windows)
model = ClassificationModel('camembert', 'camembert-base', sliding_window=True, args=train_args, use_cuda=False)

# Train the model
model.train_model(data_train, eval_df=dataeval)

(the training args used are the defaults)
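For context, a minimal sketch of how the training frame is put together (Simple Transformers expects a text column and a labels column; the sentences below are placeholders, not my actual data):

import pandas as pd

# Placeholder rows standing in for my 10 lines (5 labeled 0, 5 labeled 1).
data_train = pd.DataFrame(
    [["premier exemple de texte ...", 0],
     ["deuxième exemple de texte ...", 1]],  # ...plus 8 more rows in the real data
    columns=["text", "labels"],
)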

After training is complete, I get a few warnings in my console, like:

C:\Users\bezl\Envs\torch\lib\site-packages\torch\serialization.py:292: UserWarning: Couldn't retrieve source code for container of type CamembertForSequenceClassification. It won't be checked for correctness upon loading. "type " + obj.__name__ + ". It won't be checked "

(I don't really understand what this implies.)

Then it starts converting features for the prediction phase (test_data is made of 8 lines of my data, unlabeled, but 4 are extracted from lines labeled 1 and 4 from lines labeled 0; it's easier for me to check how the model handles the classification this way).
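For reference, the prediction step is just the standard Simple Transformers call (test_texts below is a placeholder for my 8 raw lines):

# 'model' is the ClassificationModel trained above.
test_texts = ["première ligne de test ...", "deuxième ligne de test ..."]
predictions, raw_outputs = model.predict(test_texts)
print(predictions)  # I expected a mix of 0s and 1s, but only ever get 0s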

Expected behavior

Predictions are supposed to be 0 or 1, but I only get 0s. [screenshot of the all-zero predictions]

Anyway, any help or suggestions are welcome; I can copy/paste all my code if required. Thanks :)

ThilinaRajapakse commented 4 years ago

The warning should be safe to ignore as discussed here.

The results you are getting are likely because the training dataset is far too small for the model to learn anything useful. However, the model is "training" (so to speak), as the predicted values are changing.

LisaBanana commented 4 years ago

Hi,

Thanks for your answer. I wasn't very worried about the warning messages, as we discussed previously; I thought maybe they could offer an insight into my problem, as I couldn't understand why they were popping up here. But, anyway, not a big deal.

I previously had the same result with a bigger dataset (~7000 lines of ~200 words each); that's why I tried a smaller one, to check whether the issue came from my code.

I have an even bigger dataset that might give a result. I'll try it and tell you if there are any changes. Thanks again for your quick answer.

Kind regards,

Lisa


pommedeterresautee commented 4 years ago

Hi, I get the same issue with CamemBERT too. With mBERT it works... (not as well as I'd like, but not 0.5 accuracy :-) )

pommedeterresautee commented 4 years ago

I have played a bit with the source code, and it appears that by the line logits = self.classifier(sequence_output) in the forward function, the values are already very low/negative... and end up classified as 0.

Any idea, @ThilinaRajapakse, of what we can do to help? Did you see CamemBERT return anything other than 0?

I imagine that even with English data, you could do a quick check that we could then try to reproduce on our machines.
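If it helps, the check I did at the logits line can be reproduced without editing library code, using a plain PyTorch forward hook (a sketch; it assumes the underlying Hugging Face model is reachable as model.model and exposes its classification head as .classifier, as the RoBERTa/CamemBERT classes do):

# Print the classification-head output on every forward pass.
def log_logits(module, inputs, output):
    print("classifier logits:", output.detach().cpu())

hook = model.model.classifier.register_forward_hook(log_logits)
# ... run model.predict(...) or model.eval_model(...) here ...
hook.remove()  # detach the hook when done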

ThilinaRajapakse commented 4 years ago

I'm running this now. I'll get back to you guys when I have something.

pommedeterresautee commented 4 years ago

In case you are looking for French data, check this repo

https://github.com/getalp/Flaubert

Like CamemBERT, FlauBERT is a BERT for French; it comes with FLUE, a kind of GLUE... in French!

ThilinaRajapakse commented 4 years ago

I tested it on a small English dataset and it seems to work. It's definitely returning more than just 0's. I did notice that it would give all 0's or all 1's at the beginning before eventually giving more reasonable outputs. I am not sure whether this is due to the model or because I was using English data.

You can see the results here.

pommedeterresautee commented 4 years ago

🤔 Strange. I let it run for 10 epochs and am still getting 100% zeros. I will redo it for 100 epochs, but I definitely think something is off, as I use CamemBERT a lot on other datasets and have seen it work like any other BERT-based model. Will keep you informed.

pommedeterresautee commented 4 years ago

After 100 epochs... still 100% 1!

Features loaded from cache at ./output/cache_simple_transformer/cached_dev_camembert_128_2_256
{'mcc': 0.0, 'tp': 128, 'tn': 0, 'fp': 128, 'fn': 0, 'acc': 0.5, 'eval_loss': 0.6931496541947126}

My code:

import random

from simpletransformers.classification import ClassificationModel
import pandas as pd
import sklearn.metrics  # 'import sklearn' alone does not expose sklearn.metrics

def load(path: str):
    # Read a TSV of (sentence_a, sentence_b, label) rows into a shuffled DataFrame.
    result = list()
    with open(path) as f:
        for line in f.readlines():
            s1, s2, label = line.split("\t")
            result.append((s1, s2, int(float(label))))
    random.shuffle(result)
    return pd.DataFrame(result, columns=['text_a', 'text_b', 'labels'])

train_df = load("*****.tsv")
eval_df = load("*****.tsv")

train_args = {
    'reprocess_input_data': False,  # True
    'overwrite_output_dir': True,
    'num_train_epochs': 50,
    'fp16': False,
    'silent': True,
    'evaluate_during_training': True,
    'evaluate_during_training_steps': 0,
    'output_dir': "./output/simple_transformer",
    'cache_dir': './output/cache_simple_transformer/',
    # 'do_lower_case': True,
    'use_multiprocessing': False,
}

model = ClassificationModel('camembert', 'camembert-base', use_cuda=True, args=train_args)
# model = ClassificationModel('bert', 'bert-base-multilingual-cased', use_cuda=True, args=train_args)
# model = ClassificationModel('distilbert', 'distilbert-base-multilingual-cased', use_cuda=True, args=train_args)

# Train the model
model.train_model(train_df, eval_df=eval_df, show_running_loss=False, acc=sklearn.metrics.accuracy_score)

# Evaluate the model
scores, model_outputs, wrong_predictions = model.eval_model(eval_df, acc=sklearn.metrics.accuracy_score, verbose=True)

Another interesting thing: when I uncomment 'do_lower_case': True, the first epoch is 100% 0, and from epoch 2 onward I get 100% 1.

Again, when I try mBERT instead of CamemBERT... it learns from the first epoch and keeps learning afterwards. The same goes for distilled mBERT (with slightly lower results than mBERT). The dataset is perfectly balanced and a few thousand examples large.

I still have no idea where the issue is. Do you see a problem with the code above?


Edit: Another interesting thing: the loss is very stable... as if it isn't learning anything at all.

ThilinaRajapakse commented 4 years ago

> Strange. I let it run for 10 epochs and am still getting 100% zeros. I will redo it for 100 epochs, but I definitely think something is off, as I use CamemBERT a lot on other datasets and have seen it work like any other BERT-based model. Will keep you informed.

Do you mean that CamemBERT (Simple Transformers implementation) works when used with other datasets?

I can't spot any issues in your code either. This is puzzling indeed!

pommedeterresautee commented 4 years ago

I use it in another lib (mainly Flair) and I know it works well there, much better than mBERT on French, for instance. That's why I'm trying to understand what is happening here, because it clearly should not behave this way. How can I help debug this? Is there somewhere inside the lib I can check the behaviour? Clearly the logits variable is too late in the pipeline; maybe it's something related to the tokenization done by CamemBERT, or a very low learning rate. I have no idea.

I checked the forward input_ids too. I can see that the first and last token of each example in each batch are always the same, which is expected, and the tokens in between vary, so everything looks OK to me. The attention_mask seems to be always 1; I don't get why it looks like there is no padding, but it's the same with mBERT, so that's OK with me. token_type_ids=None, position_ids=None, head_mask=None are all undefined. I am using transformers 2.3.0.

Do those observations seem OK to you?
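For anyone who wants to repeat the input_ids check, a small sketch with the plain transformers tokenizer (the sentence is just an example):

from transformers import CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
ids = tokenizer.encode("Le fromage est délicieux.")  # special tokens added by default
print(tokenizer.convert_ids_to_tokens(ids))  # expect <s> first and </s> last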

ThilinaRajapakse commented 4 years ago

Thank you for the detailed information!

The CamemBERT model was a community addition, but the implementation looked fine to me. I think the issue may have been caused by the model subclassing the RoBERTa model from the Hugging Face library directly, rather than the Simple Transformers implementation. If so, the fix I pushed just now should clear it up. Can you run it and let me know?

I am no longer seeing the weird all 0's to all 1's behaviour at the beginning after making this change.
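For the curious, a rough sketch of the shape of that change (module and attribute names assumed for illustration; see the actual commit for the real fix): subclass the Simple Transformers RoBERTa classification model, which carries its own classification head, instead of Hugging Face's RobertaForSequenceClassification.

from transformers import CamembertConfig
from simpletransformers.custom_models.models import RobertaForSequenceClassification  # path assumed

class CamembertForSequenceClassification(RobertaForSequenceClassification):
    # Reuse the RoBERTa-style model and head; only the config and weight prefix differ.
    config_class = CamembertConfig
    base_model_prefix = "camembert"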

pommedeterresautee commented 4 years ago

It's running now. I can already tell you that it's starting to learn something.

Converting to features started. Cache is not used.
Features loaded from cache at ./output/cache_simple_transformer/cached_dev_camembert_128_2_256
{'mcc': 0.06362847629757777, 'tp': 6, 'tn': 125, 'fp': 3, 'fn': 122, 'acc': 0.51171875, 'eval_loss': 0.689890056848526}
Features loaded from cache at ./output/cache_simple_transformer/cached_dev_camembert_128_2_256
{'mcc': 0.11642436803197997, 'tp': 11, 'tn': 124, 'fp': 4, 'fn': 117, 'acc': 0.52734375, 'eval_loss': 0.688759284093976}
Features loaded from cache at ./output/cache_simple_transformer/cached_dev_camembert_128_2_256
{'mcc': 0.21352448376514868, 'tp': 125, 'tn': 18, 'fp': 110, 'fn': 3, 'acc': 0.55859375, 'eval_loss': 0.7145331678912044}
Features loaded from cache at ./output/cache_simple_transformer/cached_dev_camembert_128_2_256
{'mcc': 0.3923530128589653, 'tp': 83, 'tn': 95, 'fp': 33, 'fn': 45, 'acc': 0.6953125, 'eval_loss': 0.6186056612059474}
Features loaded from cache at ./output/cache_simple_transformer/cached_dev_camembert_128_2_256
{'mcc': 0.36007896737396095, 'tp': 91, 'tn': 83, 'fp': 45, 'fn': 37, 'acc': 0.6796875, 'eval_loss': 0.6091095320880413}
Features loaded from cache at ./output/cache_simple_transformer/cached_dev_camembert_128_2_256
{'mcc': 0.38298414139387743, 'tp': 75, 'tn': 101, 'fp': 27, 'fn': 53, 'acc': 0.6875, 'eval_loss': 0.6124491011723876}

The scores are similar to mBERT's for now, so I would say you fixed the bug!!!!

Thanks a lot.

I also tested with xlmroberta and got 100% 0s for a few epochs too.

model = ClassificationModel('xlmroberta', 'xlm-roberta-base', use_cuda=True, args=train_args)  # xlm-roberta-large

I'm training with CamemBERT right now, so I didn't try to dig into the code myself, but maybe it's the same story.

ThilinaRajapakse commented 4 years ago

I ran it with xlmroberta, but I didn't get the weird behaviour. Also, the CamemBERT fix was to make it mirror the XLMRoBERTa implementation. Deleting the cache directory and rerunning might help.
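For what it's worth, instead of deleting the directory by hand, the cached features can also be bypassed through the existing args:

train_args['reprocess_input_data'] = True  # rebuild features instead of loading cached ones
train_args['overwrite_output_dir'] = True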

pommedeterresautee commented 4 years ago

FYI, regarding XLMRoBERTa: I decreased the learning rate significantly... and it started to learn something. The results are quite low, but not 100% 0 or 1, so it's probably not a bug but the cost of multi-language support. I am quite surprised, as Adam is supposed to adjust the LR per parameter...
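(For reference, that is just an args change; the value below is illustrative rather than the exact one I used:)

train_args['learning_rate'] = 1e-5  # the Simple Transformers default is 4e-5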

Anyway, I think you can close this issue as the bug is fixed :-)

ThilinaRajapakse commented 4 years ago

Ok, I'll close this then. Let me know if something else comes up.