Closed: 14H034160212 closed this issue 2 years ago
Your data needs to have ::snt <sentence> in the AMR metadata header for each graph. The 10_Collect_AMR_Data.py script simply copies this, and all the other data, from the original English AMR-3 files to the new file. If you look at that script, the only things it does are collating multiple training files into one file (with some ASCII character filtering) and creating a version with the :wiki edges stripped. The :wiki edges are stripped because the model is not trained to produce them. If needed, they are added to the predicted graphs in a post-processing step.
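For illustration, the stripping amounts to deleting the :wiki relation wherever it appears in the graph text. A minimal regex sketch of the idea (this is not the project's actual code, and the exact edge pattern is my assumption about standard AMR formatting, where :wiki takes either a quoted article title or '-'):

```python
import re

def strip_wiki(graph_text: str) -> str:
    """Remove :wiki edges (a quoted article title or '-') from AMR graph text."""
    return re.sub(r'\s*:wiki\s+("(?:[^"\\]|\\.)*"|-)', "", graph_text)

amr = '(c / city :wiki "New_York_City" :name (n / name :op1 "New" :op2 "York"))'
print(strip_wiki(amr))  # the :wiki edge is removed, the rest is unchanged
```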
Be sure you are using the config file model_parse_t5.json to train the parse (sentence to graph) model. The model_generate_t5.json config is used to train a generate (graph to sentence) model.
Hi Brad, thanks a lot for your reply. Merry Christmas! The dataset I am using is AMR-3 (LDC2020T02). Here is a screenshot of the dataset. What I am trying to do is replicate the T5 AMR parser training from that link using the AMR-3 dataset.
What the current AMR metadata header has is something like # ::tok Establishing Models in Industrial Innovation. Do you mean I need to replace the ::tok with ::snt?
In the released corpus, the data you're showing (with 'alignments' in the filename) is in amr_annotation_3.0/data/alignments/split. The training data that is typically used is in amr_annotation_3.0/data/amrs/split/ and has filenames like amr-release-3.0-amrs-training-bolt.txt (no 'alignment' in the filename). The standard files have ':snt' in the metadata.
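For comparison, an entry in the standard files starts roughly like this (graph body abbreviated and purely illustrative; the sentence is the one from your ::tok example):

```
# ::snt Establishing Models in Industrial Innovation
(e / establish-01
   :ARG1 (m / model ...))
```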
The data above is annotated with token surface alignments (the ~e.32 after the node names), which is why it has 'tok' instead of 'snt'. I'm not sure what would happen if you tried to train with this data by just renaming the tok field. I think the surface alignments would be stripped during the training linearization process, but when you test with smatch most of your nodes will fail to match, so you'll get very low scores. I'd recommend using the amrs directory data. If you don't have access to that, then I'd recommend pre-stripping the surface alignments with the penman library.
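If you go the penman route, the library exposes the alignments through the graph's surface epidata. An equivalent plain-text sketch of the same clean-up (the ~e.N,N pattern matches what the alignment files use; the regex itself is my assumption, not code from the project):

```python
import re

def strip_alignments(graph_text: str) -> str:
    """Drop token surface alignments like '~e.32' or '~e.1,2' after node/attribute names."""
    return re.sub(r'~e\.\d+(?:,\d+)*', '', graph_text)

print(strip_alignments('(e / establish-01~e.0 :ARG1 (m / model~e.1,2))'))
```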
Thank you so much. When I use the correct version of the amrs directory, the program works! If I want to replicate the result for that T5 parser, are there any other things I need to change? Can I use the default hyperparameters from model_parse_t5.json? I am currently using the default hyperparameters from model_parse_t5.json and one RTX8000 GPU with 48 GB of memory.
Here is the current progress. I saw you got an 82 SMATCH score with LDC2020T02. What does the 82 mean? I got a series of SMATCH numbers for the current stage: SMATCH -> P: 0.829, R: 0.793, F: 0.811.
If you use the scripts and config exactly as they appear in the project, you will get a 0.831 on the dev set during training and 0.819 on the test set (beams=4) afterwards. This is the no-wiki corpus. If you add the wiki tags with the post-processing BLINK scripts in the training directory you'll get 0.818 smatch (I generally don't bother to add the wiki tags because it's a pain to set up and doesn't change the overall score much).
If you're interested in scores, I'd recommend trying t5-large. That model is too big to train on 12GB GPUs, but with 48GB you won't have any issues. All you have to do is change the config file's 'model_name_or_path' from t5-base to t5-large.
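That is, the one-field edit in model_parse_t5.json would look like this (fragment only; every other setting stays as-is):

```json
{
    "model_name_or_path": "t5-large"
}
```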
Thanks a lot! What are the wiki tags? Is there any difference with the no-wiki corpus?
Wiki tags (any edge with :wiki) are links from AMR (named) entities to Wikipedia articles. To add them you need to do an article search for the named entities on Wikipedia. This is very different from parsing, so it's common to strip them from the corpus and ignore them.
BTW... I notice your training is fairly slow for that GPU. If you happen to train this again with that 48GB GPU, you should be able to change your "per_gpu_batch_size" to 16 and then drop the "gradient_accumulation_step" to 1. In theory this trains exactly the same as the params in the config (batch size 4 x grad accum 4 ==> an effective batch of 16 per optimizer step). You can probably cut training time in half.
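In config terms, the change would be the following fragment (field names as quoted above; the effective batch of 16 samples per optimizer step is unchanged):

```json
{
    "per_gpu_batch_size": 16,
    "gradient_accumulation_step": 1
}
```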
Thank you so much! I have finished the training using the config (batch size 4, grad accum 4). Here is the result. I have one more question: how can I save the checkpoint with the best dev accuracy? From the saved checkpoints, I cannot tell which one has the highest dev accuracy. Also, does the 82 SMATCH score from that link mean the F1 score?
Here are the saved checkpoints.
The smatch score is printed out in the training log, right above where it saves the checkpoint. I can only see epochs 14-16 on the screen, but it looks like epoch 15 is slightly higher with a score of 0.830. That's checkpoint-51720. Assuming that's the highest score, you can just delete the rest of the checkpoints.
Yes, the smatch score is the F1 score (precision and recall are typically ignored). However, note that during training you're scoring on the dev set with a beam size of 1. For whatever reason the test set gets a slightly lower score, even with a beam of 4. You should get a 0.819 (or maybe 0.818) if you run 22_Test_Model.py. This is the score typically reported.
Hi, thanks for the feedback. For the training log, I found two different log files, namely train_model_parse_t5.log under amrlib/logs/ and trainer_state.json under checkpoint-51720/. But it seems neither of them records the smatch score.
I found that in trainer_state.json there are best_metric and best_model_checkpoint fields, but both of them are null. Perhaps we can set some parameters in the config file to make those track the best dev-accuracy checkpoint?
{
"best_metric": null,
"best_model_checkpoint": null,
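For what it's worth, those fields only get populated when the underlying HuggingFace Trainer is told to track a metric, with training arguments along these lines (these are the standard HuggingFace argument names; whether amrlib's config forwards them, and whether its smatch evaluation is wired into the Trainer's metric reporting, are assumptions on my part):

```json
{
    "evaluation_strategy": "epoch",
    "load_best_model_at_end": true,
    "metric_for_best_model": "eval_loss",
    "save_total_limit": 2
}
```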
Also, I found run_tensorboard.sh under 31_Model_Parse_T5/, and it seems to expect a runs folder after training. But I did not find one. Do you know how I can get the runs folder? Do I need to set some parameters in model_parse_t5.json for that?
I got a 0.821 test F1 score using the checkpoint from model_parse_t5/checkpoint-51720/.
Glad it's working for you. Good luck.
Hi,
I got the KeyError: 'snt' when I run 20_Train_Model.py under scripts/31_Model_Parse_T5/. I have run 10_Collect_AMR_Data.py to get the whole training, dev, and test datasets. Also, does anyone know the difference between train.txt and train.txt.nowiki? They seem quite similar, but in model_generate_t5.json I only see train.txt used.
Here is scripts/31_Model_Parse_T5/. What is the meaning of the number in each file name? For example, why does 10_Collect_AMR_Data.py have 10? My understanding is that it indicates the order in which to run the scripts, i.e. run 10_Collect_AMR_Data.py first, then 20_Train_Model.py, and so on. Am I correct?
Here is the processed dataset after I run 10_Collect_AMR_Data.py.
I use the default hyperparameters from model_generate_t5.json.
Here is the detailed error when I run 10_Collect_AMR_Data.py.