131250208 / TPlinker-joint-extraction


What are the causes of predicting weird char span of entities? #44

Closed jarork closed 3 years ago

jarork commented 3 years ago

Hi there, I've noticed some really weird entity char spans in my model predictions. For example, the text length is only 250, yet some entities are predicted with a span of [3510, 3512], which clearly makes no sense; the char span of an entity can also be predicted as [0, 0], which represents nothing. The ent_f1 and rel_f1 for my model are 0.81 and 0.78.

What could cause these weird entity spans? Thank you.

There are no token span errors in the preprocessing and training phases. The evaluation config is as follows:

```python
eval_config = {
    "model_state_dict_dir": "./wandb",  # "./wandb" if using wandb, or "./default_log_dir" with the default logger
    "run_ids": ["2n26hvto", ],
    "last_k_model": 1,
    "test_data": "test.json",

    "save_res": True,
    "save_res_dir": "../datasets/result_data",

    "score": True,

    "hyper_parameters": {
        "batch_size": 32,
        "force_split": False,
        "max_seq_len": 240,
        "sliding_len": 50,
    },
}
```
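One quick sanity check before digging deeper is to filter predictions whose char spans cannot index into the text. This is a hypothetical helper, not part of TPlinker; the span format `[start, end)` matches the examples above:

```python
def find_invalid_spans(text, spans):
    """Return char spans that cannot index into `text`:
    empty spans like (0, 0) or spans past the end of the text."""
    bad = []
    for start, end in spans:
        if start >= end or end > len(text):
            bad.append((start, end))
    return bad

# A 250-char text with the two problem spans from above plus one valid span.
text = "a" * 250
spans = [(3510, 3512), (0, 0), (10, 15)]
print(find_invalid_spans(text, spans))  # → [(3510, 3512), (0, 0)]
```

Counting how many predictions this flags per document can show whether the bad spans cluster in long texts that were split into windows.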

131250208 commented 3 years ago

I have never run into this situation before. Maybe something is wrong with the offset mapping list; you could check whether there is a problem with the predicted token spans. If not, the problem might come from the mapping process from token spans to char spans. You need to debug it yourself.
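The mapping step above can be sketched as follows. This is a minimal stand-in, not TPlinker's actual code: a toy whitespace tokenizer plays the role of the fast tokenizer's offset mapping, and `token_span_to_char_span` shows where a wrong or stale offset list would yield exactly the kind of out-of-range char spans described in this issue:

```python
def token_span_to_char_span(tok_span, offset_map):
    """Map a [start, end) token span to a [start, end) char span.
    offset_map[i] = (char_start, char_end) of token i in the original text."""
    tok_start, tok_end = tok_span
    return (offset_map[tok_start][0], offset_map[tok_end - 1][1])

def tokenize_with_offsets(text):
    """Toy whitespace tokenizer that records each token's char offsets,
    mimicking the offset_mapping a fast BERT tokenizer returns."""
    offsets, start = [], 0
    for tok in text.split(" "):
        offsets.append((start, start + len(tok)))
        start += len(tok) + 1  # +1 skips the separating space
    return offsets

text = "Alice works at Acme Corp"
offsets = tokenize_with_offsets(text)
span = token_span_to_char_span((3, 5), offsets)  # tokens "Acme" and "Corp"
print(text[span[0]:span[1]])  # → Acme Corp
```

If the predicted token spans look correct but `text[char_start:char_end]` prints garbage, the offset mapping list is the place to look.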

jarork commented 3 years ago

Hi mate, I haven't dived into what you mentioned yet, but I've found the issue is highly related to max_seq_len: the smaller max_seq_len is, the more entities with those out-of-range spans appear. The issue was solved once I set max_seq_len back to 512 (the max input length for BERT), which is the default value you set.
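That behaviour is consistent with the sliding-window splitting: a small max_seq_len cuts the text into more windows, and every span predicted inside a window must be shifted back by that window's char offset before it is reported, so a missed or doubled shift produces spans far beyond the text length. A minimal sketch of the shift (hypothetical names, not TPlinker's actual code):

```python
def restore_char_span(local_span, window_char_offset):
    """Shift a char span predicted inside a sliding window
    back to coordinates in the full text."""
    start, end = local_span
    return (start + window_char_offset, end + window_char_offset)

# A window starting at char 200 of the full text predicts a
# local span (10, 15); the true span in the full text is (210, 215).
print(restore_char_span((10, 15), 200))  # → (210, 215)
```

With max_seq_len=512 most texts fit in a single window (offset 0), which would explain why the out-of-range spans disappear.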

However, there are still some entities with empty spans. I'll check it out in a few days when I'm free.

Many thanks