SinaLab / ArabicNER

Arabic nested named entity recognition
MIT License

Only O and <pad> in inference mode #7

Closed AnnaKholkina closed 10 months ago

AnnaKholkina commented 10 months ago

Hi! Your work looks great! I tried to train my own model for Russian. I prepared train/val/test splits in the same format as yours and swapped the pretrained BERT for another one. These are my args:

python arabiner/bin/train.py --output_path ./ArabicNER/output \
    --train_path ./ArabicNER/data/train.txt \
    --val_path ./ArabicNER/data/val.txt \
    --test_path ./ArabicNER/data/test.txt \
    --batch_size 8 \
    --data_config '{"fn":"arabiner.data.datasets.NestedTagsDataset","kwargs":{"max_seq_len":512}}' \
    --trainer_config '{"fn":"arabiner.trainers.BertNestedTrainer","kwargs":{"max_epochs":50}}' \
    --network_config '{"fn":"arabiner.nn.BertNestedTagger","kwargs":{"dropout":0.1,"bert_model":"DeepPavlov/rubert-base-cased-conversational"}}' \
    --optimizer '{"fn":"torch.optim.AdamW","kwargs":{"lr":0.0001}}'

The model trained well with these args (see the attached screenshot of test-set metrics). But when I run inference on text, the output contains only 'O' or <pad>, even on an example from train.txt (see the attached screenshot). In that example the second word is B-PER, and on no other example did the model predict an entity. Command to run inference:

python -u ./ArabicNER/arabiner/bin/infer.py \
    --model_path ./ArabicNER/output \
    --text "привет андрей"

Can you help me with this?

AnnaKholkina commented 10 months ago

Problem found: when training the model, bert_model was not set in data_config:

--data_config '{"fn":"arabiner.data.datasets.NestedTagsDataset","kwargs":{"max_seq_len":512}}'

As a result, the tokenizer's model name was not recorded in args.json after training, and inference fell back to the default model in NestedTagsDataset:

class NestedTagsDataset(Dataset):
    def __init__(
        self,
        examples=None,
        vocab=None,
        bert_model="aubmindlab/bert-base-arabertv2",
        max_seq_len=512,
    ):
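The failure mode is plain default-argument fallback: the dataset is instantiated from the fn/kwargs pair in data_config, so any kwarg missing from that JSON silently takes the default shown above. A minimal sketch of the pattern (the Dataset class here is a stand-in, not the repo's actual loader):

```python
import json

# Stand-in mimicking NestedTagsDataset's default-argument pattern
class Dataset:
    def __init__(self, max_seq_len=512, bert_model="aubmindlab/bert-base-arabertv2"):
        self.max_seq_len = max_seq_len
        self.bert_model = bert_model  # the tokenizer is built from this name

# data_config as it was saved in args.json -- bert_model missing from kwargs
config = json.loads('{"fn": "NestedTagsDataset", "kwargs": {"max_seq_len": 512}}')

ds = Dataset(**config["kwargs"])
print(ds.bert_model)  # falls back to the Arabic model at inference time
```

With an Arabic tokenizer applied to Russian text, nearly every token is out of vocabulary, which is consistent with the model emitting only 'O' and <pad>.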

To fix this, specify the BERT model name in --data_config when you start training the model:

--data_config '{"fn":"arabiner.data.datasets.NestedTagsDataset","kwargs":{"max_seq_len":512, "bert_model": "DeepPavlov/rubert-base-cased-conversational"}}'

or add it manually to args.json:

    "data_config": {
        "fn": "arabiner.data.datasets.NestedTagsDataset",
        "kwargs": {
            "max_seq_len": 512,
            "bert_model": "DeepPavlov/rubert-base-cased-conversational"
        }
    },
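For an already-trained model, the manual edit can also be scripted: load args.json, add the missing key, and write it back. A minimal sketch (it builds a throwaway args.json in a temp directory as a stand-in for ./ArabicNER/output/args.json):

```python
import json
import tempfile
from pathlib import Path

# Minimal args.json as train.py would have saved it (bert_model missing)
saved = {
    "data_config": {
        "fn": "arabiner.data.datasets.NestedTagsDataset",
        "kwargs": {"max_seq_len": 512},
    }
}
args_path = Path(tempfile.mkdtemp()) / "args.json"
args_path.write_text(json.dumps(saved, indent=4))

# Patch: record the tokenizer model that training actually used
args = json.loads(args_path.read_text())
args["data_config"]["kwargs"]["bert_model"] = "DeepPavlov/rubert-base-cased-conversational"
args_path.write_text(json.dumps(args, indent=4))
```

After this, infer.py reads the patched args.json and builds the tokenizer from the correct model.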
AnnaKholkina commented 10 months ago

Fix #8