Evaluating other datasets

pvcastro commented 3 years ago

Hi @131250208 !

I was interested in evaluating other datasets (such as conll04 and scierc, as evaluated by SpERT). I was wondering wich other settings I'd have to change in your model to try them. So far, for conll04, the only change I made was changing match_pattern to "whole_text". I wrote a script to change conll04 dataset to CasRel's format, so I could use your build_data script to convert it to your format. As far as I can see, everything looks good with this procedure. Since conll04 is such a smaller dataset (931 training sentences compared to nyt's 56k sentences, and around 200 validation/test sentences), I also considered changing loss_weight_recover_steps to 100. With a batch size of 12, there are only 79 training steps per epoch. For nyt_star I was able to get good results with your model, without changing anything. However, for conll04, I had no success....after 20 epochs the scores are still 0. For scierc I get similar results, with the difference that the samples accuracies (ent_seq, head_rel, tail_rel) stay around 30%. What else do you recommend changing for these benchmarks?

Thanks!

131250208 commented 3 years ago

Hi, @pvcastro. Don't worry. In my experience, it is because the dataset is small and it needs more steps to converge. For example, on NYT, one epoch equals 2000+ steps (batch_size = 24), and you might need 2-3 epochs (it is an example, I forget how many epochs it needs) to get a positive score. That's around 6000 steps. If you want to try other datasets, make sure they have been trained for the same steps. 79 * 20 = 1580 steps, far less than 6000. So, it is not enough to get a score. In short, do not look at epochs, look at steps.

FYI, set the learning rate from 1e-5 to 5e-5 will speed up the training (but might hurt the final performance). And smaller batch_size makes it faster to converge. If you care about the convergence speed so much, try TPLinkerPlus and set shaking_type to “cln_plus”, inner_enc_type to "lstm". If you continue to use TPLinker, loss_weight_recover_steps also interfere with the convergence speed so much. Do not set it too small, keep it thousands of steps. Set it big if you use a big batch size. I recommend you to keep it the same as you tried on NYT and wait for the same steps.

Why in some datasets the accuracies stay around a score and do not increase after several epochs? Check how many negative samples in your dataset. Because 0 tags are easy to classify, so the model will quickly achieve a specific tag sequence accuracy, e.g. 30%. But it needs far more steps to learn the classification of positive samples. So, do not care about the accuracies too much, see whether the loss is still in decreasing.

131250208 commented 3 years ago

@pvcastro And we would appreciate that so much if you could share your experimental results of TPLinker on other datasets!

liutianling commented 3 years ago

@131250208 Can you provide a .py file to eval the model ? In the tplinker/Evaluation.ipynb load the config.py, and load some parameters such as data_home, exp_name, but they are in eval_config.yaml and not in config.py

131250208 commented 3 years ago

@liutianling data_home and exp_name are also in config.py. Please check it again. eval_config.yaml is deprecated.

liutianling commented 3 years ago

Thanks for your fast reply! I'm sorry. I have found it.

131250208 commented 3 years ago

@liutianling No problem

pvcastro commented 3 years ago

So, it is not enough to get a score. In short, do not look at epochs, look at steps.

Ok, so per your statement, I should expect around 75 epochs until it starts converging? 6000 steps / 79 steps per epoch = 75 epochs.

FYI, set the learning rate from 1e-5 to 5e-5 will speed up the training (but might hurt the final performance).

I'm already using 5e-5 learning rate, since it's default in the master source code. I read all your parameter options in config.py and I'm understanding that I should keep all defaults (considering dataset, of course).

And smaller batch_size makes it faster to converge.

I'm using 12, since it's the one that fits my GTX-1070.

If you care about the convergence speed so much

It's not that I care about convergence speed, but about convergence at all. I was under the impression that after 20 epochs I should be getting at least a slightly increase in the evaluation, but I'll do as you instructed and wait for at least 75 epochs to see.

If you continue to use TPLinker, loss_weight_recover_steps also interfere with the convergence speed so much. Do not set it too small, keep it thousands of steps. Set it big if you use a big batch size. I recommend you to keep it the same as you tried on NYT and wait for the same steps.

OK, I'll leave 6000 for the next try. Can you explain to me what exactly this parameter do? I saw where it is used, but didn't get it's purpose exactly:

total_steps = hyper_parameters["loss_weight_recover_steps"] + 1  # + 1 avoid division by zero error
current_step = steps_per_ep * ep + batch_ind
w_ent = max(1 / z + 1 - current_step / total_steps, 1 / z)
w_rel = min((len(rel2id) / z) * current_step / total_steps, (len(rel2id) / z))
loss_weights = {"ent": w_ent, "rel": w_rel}

loss, ent_sample_acc, head_rel_sample_acc, tail_rel_sample_acc = train_step(batch_train_data, optimizer, loss_weights)

Check how many negative samples in your dataset. Because 0 tags are easy to classify, so the model will quickly achieve a specific tag sequence accuracy, e.g. 30%.

conll04 and scierc don't have an explicit "no relation" label. I don't think NYT has it either, right? I think the only "negative samples" in your training process are the empty points in the matrix that contain no association between entities or entities labels for most tokens, right?

So, do not care about the accuracies too much, see whether the loss is still in decreasing.

OK, got it...will get back to you after waiting for 75 epochs.

Just to be clear, here are the parameters I'll be using next time:

import string
import random

common = {
    "exp_name": "conll04",
    "rel2id": "rel2id.json",
    "device_num": 0,
    "workers": 6,
#     "encoder": "BiLSTM",
    "encoder": "BERT", 
    "hyper_parameters": {
        "shaking_type": "cat", # cat, cat_plus, cln, cln_plus; Experiments show that cat/cat_plus work better with BiLSTM, while cln/cln_plus work better with BERT. The results in the paper are produced by "cat". So, if you want to reproduce the results, "cat" is enough, no matter for BERT or BiLSTM.
        "inner_enc_type": "lstm", # valid only if cat_plus or cln_plus is set. It is the way how to encode inner tokens between each token pairs. If you only want to reproduce the results, just leave it alone.
        "dist_emb_size": -1, # -1: do not use distance embedding; other number: need to be larger than the max_seq_len of the inputs. set -1 if you only want to reproduce the results in the paper.
        "ent_add_dist": False, # set true if you want add distance embeddings for each token pairs. (for entity decoder)
        "rel_add_dist": False, # the same as above (for relation decoder)
        "match_pattern": "whole_text", # only_head_text (nyt_star, webnlg_star), whole_text (nyt, webnlg), only_head_index, whole_span
    },
}
common["run_name"] = "{}+{}+{}".format("TP1", common["hyper_parameters"]["shaking_type"], common["encoder"]) + ""

run_id = ''.join(random.sample(string.ascii_letters + string.digits, 8))
train_config = {
    "train_data": "conll04_train.json",
    "valid_data": "conll04_dev.json",
    "rel2id": "rel2id.json",
    "logger": "wandb", # if wandb, comment the following four lines

#     # if logger is set as default, uncomment the following four lines
#     "logger": "default", 
#     "run_id": run_id,
#     "log_path": "./default_log_dir/default.log",
#     "path_to_save_model": "./default_log_dir/{}".format(run_id),

    # only save the model state dict if F1 score surpasses <f1_2_save>
    "f1_2_save": 0, 
    # whether train_config from scratch
    "fr_scratch": True,
    # write down notes here if you want, it will be logged 
    "note": "start from scratch",
    # if not fr scratch, set a model_state_dict
    "model_state_dict_path": "",
    "hyper_parameters": {
        "batch_size": 12,
        "epochs": 100,
        "seed": 2333,
        "log_interval": 10,
        "max_seq_len": 100,
        "sliding_len": 20,
        "loss_weight_recover_steps": 6000, # to speed up the training process, the loss of EH-to-ET sequence is set higher than other sequences at the beginning, but it will recover in <loss_weight_recover_steps> steps.
        "scheduler": "CAWR", # Step
    },
}

eval_config = {
    "model_state_dict_dir": "./wandb", # if use wandb, set "./wandb", or set "./default_log_dir" if you use default logger
    "run_ids": ["12t77zlz", ],
    "last_k_model": 1,
    "test_data": "*test*.json", # "*test*.json"

    # where to save results
    "save_res": True,
    "save_res_dir": "../results",

    # score: set true only if test set is annotated with ground truth
    "score": True,

    "hyper_parameters": {
        "batch_size": 32,
        "force_split": False,
        "max_test_seq_len": 512,
        "sliding_len": 50,
    },
}

bert_config = {
    "data_home": "../data4bert",
    "bert_path": "bert-base-cased",
    "hyper_parameters": {
        "lr": 5e-5,
    },
}
bilstm_config = {
    "data_home": "../data4bilstm",
    "token2idx": "token2idx.json",
    "pretrained_word_embedding_path": "../../pretrained_emb/glove_300_nyt.emb",
    "hyper_parameters": {
         "lr": 1e-3,
         "enc_hidden_size": 300,
         "dec_hidden_size": 600,
         "emb_dropout": 0.1,
         "rnn_dropout": 0.1,
         "word_embedding_dim": 300,
    },
}

cawr_scheduler = {
    # CosineAnnealingWarmRestarts
    "T_mult": 1,
    "rewarm_epoch_num": 2,
}
step_scheduler = {
    # StepLR
    "decay_rate": 0.999,
    "decay_steps": 100,
}

# ---------------------------dicts above is all you need to set---------------------------------------------------
if common["encoder"] == "BERT":
    hyper_params = {**common["hyper_parameters"], **bert_config["hyper_parameters"]}
    common = {**common, **bert_config}
    common["hyper_parameters"] = hyper_params
elif common["encoder"] == "BiLSTM":
    hyper_params = {**common["hyper_parameters"], **bilstm_config["hyper_parameters"]}
    common = {**common, **bilstm_config}
    common["hyper_parameters"] = hyper_params

hyper_params = {**common["hyper_parameters"], **train_config["hyper_parameters"]}
train_config = {**train_config, **common}
train_config["hyper_parameters"] = hyper_params
if train_config["hyper_parameters"]["scheduler"] == "CAWR":
    train_config["hyper_parameters"] = {**train_config["hyper_parameters"], **cawr_scheduler}
elif train_config["hyper_parameters"]["scheduler"] == "Step":
    train_config["hyper_parameters"] = {**train_config["hyper_parameters"], **step_scheduler}

hyper_params = {**common["hyper_parameters"], **eval_config["hyper_parameters"]}
eval_config = {**eval_config, **common}
eval_config["hyper_parameters"] = hyper_params

131250208 commented 3 years ago

@pvcastro

What the loss_weight_recover_steps do?

I give a high weight to the entity classification loss (from EH-to-ET sequence) first, very low weights to relation classification loss. And the weights will recover to even in steps. In my experiments, it helped speed up the convergence. I guess it is because entities are easier to identify. So, It is a faster way to learn by starting at an easier task and then getting into harder ones.

conll04 and scierc don't have an explicit "no relation" label. I don't think NYT has it either, right? I think the only "negative samples" in your training process are the empty points in the matrix that contain no association between entities or entities labels for most tokens, right?

No, I do not mean the empty points in the matrix. Note that the splitting process will generate many negative samples without any relations. A short segment in a text may not contain a complete SPO triplet.

Just to be clear, here are the parameters I'll be using next time

By see this, I remember one thing very important! Did you build the data by my BuildData.ipynb? Or you just transform the data format to CasRel and did not touch the rel2id.json. I forgot to mention that BuildData.ipynb will generate rel2id.json for training. If you used the rel2id.json of another dataset, it will never get a positive score... You might want to check this first.

pvcastro commented 3 years ago

Hi @131250208 , I got the results for training the first 100 epochs:

Validating: 100%|██████████| 20/20 [00:03<00:00,  5.04it/s]
project: conll04, run_name: TP1+cat+BERT, Epoch: 97/100, batch: 79/79, train_loss: 3.290012761992122e-05, t_ent_sample_acc: 0.9293249068380911, t_head_rel_sample_acc: 0.8934599289411231, t_tail_rel_sample_acc: 0.902953599827199,lr: 2.5e-05, batch_time: 0.2803502082824707, total_time: 44.086851358413696 -------------{'time': 3.968517780303955,
 'val_ent_seq_acc': 0.4928571581840515,
 'val_f1': 0.599681020683742,
 'val_head_rel_acc': 0.5011904940009118,
 'val_prec': 0.648275862068742,
 'val_recall': 0.557863501483514,
 'val_tail_rel_acc': 0.5065476328134537}
Current avf_f1: 0.599681020683742, Best f1: 0.599681020683742
project: conll04, run_name: TP1+cat+BERT, Epoch: 98/100, batch: 79/79, train_loss: 3.135921244531254e-05, t_ent_sample_acc: 0.9293249045746236, t_head_rel_sample_acc: 0.9156118299387679, t_tail_rel_sample_acc: 0.918776386146304,lr: 5e-05, batch_time: 0.33971571922302246, total_time: 44.55414390563965 -------------{'time': 4.252135276794434,
 'val_ent_seq_acc': 0.45892858505249023,
 'val_f1': 0.584615384565273,
 'val_head_rel_acc': 0.5089285895228386,
 'val_prec': 0.6070287539934163,
 'val_recall': 0.5637982195844025,
 'val_tail_rel_acc': 0.5059523969888687}
Current avf_f1: 0.584615384565273, Best f1: 0.599681020683742
Validating: 100%|██████████| 20/20 [00:04<00:00,  4.71it/s]
project: conll04, run_name: TP1+cat+BERT, Epoch: 99/100, batch: 79/79, train_loss: 3.3024236875497056e-05, t_ent_sample_acc: 0.918776386146304, t_head_rel_sample_acc: 0.8892405267003216, t_tail_rel_sample_acc: 0.883966261827493,lr: 2.5e-05, batch_time: 0.2803835868835449, total_time: 44.20893597602844 -------------{'time': 4.236193895339966,
 'val_ent_seq_acc': 0.45059525445103643,
 'val_f1': 0.5760709009838042,
 'val_head_rel_acc': 0.5035714462399483,
 'val_prec': 0.5735294117645372,
 'val_recall': 0.5786350148366236,
 'val_tail_rel_acc': 0.4827381119132042}
Current avf_f1: 0.5760709009838042, Best f1: 0.599681020683742
Validating: 100%|██████████| 20/20 [00:04<00:00,  4.72it/s]
Validating: 100%|██████████| 20/20 [00:04<00:00,  4.87it/s]
project: conll04, run_name: TP1+cat+BERT, Epoch: 100/100, batch: 79/79, train_loss: 2.8681669080677102e-05, t_ent_sample_acc: 0.9388185777241671, t_head_rel_sample_acc: 0.9208860910391505, t_tail_rel_sample_acc: 0.9198312382154827,lr: 5e-05, batch_time: 0.33092474937438965, total_time: 44.49380707740784 -------------{'time': 4.111032485961914,
 'val_ent_seq_acc': 0.4714285895228386,
 'val_f1': 0.5837037036535309,
 'val_head_rel_acc': 0.5077381119132042,
 'val_prec': 0.5828402366862181,
 'val_recall': 0.584569732937512,
 'val_tail_rel_acc': 0.49642858654260635}
Current avf_f1: 0.5837037036535309, Best f1: 0.599681020683742

You were totally right, at some point, convergence started :smile:

Not sure why It's only showing 800 steps in the log. If I trained for 100 epochs, and there were 79 steps per epoch, shouldn't be 7900 steps total?

I resumed training for another 1000 epochs now, to see how far the model goes with this dataset.

pvcastro commented 3 years ago

No, I do not mean the empty points in the matrix. Note that the splitting process will generate many negative samples without any relations. A short segment in a text may not contain a complete SPO triplet.

Oh, I didn't realize this. Good to know. Let me check this code to understand it better.

By see this, I remember one thing very important! Did you build the data by my BuildData.ipynb? Or you just transform the data format to CasRel and did not touch the rel2id.json. I forgot to mention that BuildData.ipynb will generate rel2id.json for training. If you used the rel2id.json of another dataset, it will never get a positive score... You might want to check this first.

Yes, I used exactly your build data script :+1: Data should be ok.

131250208 commented 3 years ago

@pvcastro Congratulations！It only shows 800 steps on wandb because the default setting of the log interval is 10, which means log every 10 steps. You could log every step by setting "log_interval" to 1. But I do not recommend doing this, cause it will increase the cost of network IO.

131250208 commented 3 years ago

@pvcastro To continue your training. You could set "fr_scratch" to False and set "model_state_dict_path" to the path of the model state you have already saved. It will save you time.

pvcastro commented 3 years ago

@pvcastro Congratulations！It only shows 800 steps on wandb because the default setting of the log interval is 10, which means log every 10 steps. You could log every step by setting "log_interval" to 1. But I do not recommend doing this, cause it will increase the cost of network IO.

Oh, ok...I see.

@pvcastro To continue your training. You could set "fr_scratch" to False and set "model_state_dict_path" to the path of the model state you have already saved. It will save you time.

I'm already doing this! Yesterday I interrupted the training, since I needed the GPU for another thing. After an additional 111 epochs (211 total), the best f1 so far was 64.85%. I also took some time to check why different models were counting a different total of relations for each dataset. TPlinker was counting over 1400 items for CoNLL04 and spert was counting 1231. I compared the lists and noticed that build data was adding duplicate triples for some items. So I made a minor fix there to prevent adding duplicate triples. But still, after doing this, number of items was 1361, and looks like it's due to adding triples for char spans that match entities that are not actual entities (like Haiti x Haitian). Do you consider doing something to prevent this?

131250208 commented 3 years ago

@pvcastro A suggestion. You could also try to set the shaking type to cln_plus. In my experiments, it usually improves performance when you use BERT. Because NYT and WebNLG do not provide character spans, so I add them automatically. Given a triplet (S, P, O), I assume that all S and O in the text have the relation P, so I think this is where the duplicate relations come from. But I do not think it would hurt the performance too much, so I just left it alone. And I do not have ideas to prevent this.

looks like it's due to adding triples for char spans that match entities that are not actual entities (like Haiti x Haitian)

Set ignore_subword to true in build_data_config.yaml. Then, it would never match inner subwords. The default parameter is for Chinese datasets.

pvcastro commented 3 years ago

Hi @131250208 ! I tried running a different training using cln_plus. I trained for around 273 epochs (over 21k steps) and interrupted it, since I needed the GPU for other stuff. I got around 66% for the best F1. Here are the rest of the data: Doesn't look like it could get substantially better with additional epochs, what do you think?

131250208 commented 3 years ago

@pvcastro Yes, seems more epochs do not help. Thanks for your information! I will try this dataset after finishing my work. Maybe adjusting other parameters would help. In my experience, TPLinker is easy to overfit, I usually got a big gap between training sequence accuracy and validation sequence accuracy. The model structure and decoding strategy need more designing consideration.

pvcastro commented 3 years ago

@131250208 I'll close this issue, since I think I won't be doing any further experiments with TPLinker for the time being. Since I'm also studying this task, would you mind leaving me your e-mail so we can keep discussing this? I couldn't find your name and e-mail in the paper's author list. Thanks!

131250208 commented 3 years ago

@pvcastro Yes, of course. I am the first author Yucheng Wang. My e-mail is wangyucheng@iie.ac.cn.

131250208 / TPlinker-joint-extraction

Evaluating other datasets #8