bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.
MIT License
219 stars 34 forks

Add token positions to GSII model, and add fine tuning model by passing T5 model to trainer #28

Closed plandes closed 2 years ago

plandes commented 2 years ago

This pull request has two contributions:

bjascob commented 2 years ago

Taking a quick look at the changes, I have a question about parse_t5/trainer.py. By passing in a model parameter you are trying to change the type (class) of the model to something other than a T5. If you just want to continue training the existing amrlib model, you just need to put the local name (i.e. the directory name) in the trainer's config file, e.g.:

in <root>/config/model_parse_t5.json modify the line...
model_name_or_path = '/home/you/amrlib/data/your_copy_of_model_parse_t5'

The one complication is that I didn't create a local copy of the tokenizer files, so normally the code uses what's in your HF cache (~/.cache/huggingface/transformers/xx), but when you change the model name it can't find them. The simplest thing to do is to copy the two files into your model directory. You can find them in the cache by grepping for t5-base among the very long hash-named files. Here are quick links to download them if that's easier:

https://huggingface.co/t5-base/resolve/main/tokenizer.json
https://huggingface.co/t5-base/resolve/main/spiece.model
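If you'd rather locate the cached copies than re-download them, the sidecar `.json` metadata files in the (older-style) transformers cache record the original URL of each blob. A rough sketch of the "grep for t5-base" idea; the helper name and cache layout are our assumptions, not part of amrlib:

```python
import json
from pathlib import Path

def find_cached_files(cache_dir, needle):
    """Return (original_url, blob_path) pairs for cached files whose
    source URL contains `needle` (e.g. 't5-base').

    In the older transformers cache layout, each blob `<hash>` has a
    sidecar metadata file `<hash>.json` containing the source URL.
    """
    hits = []
    for meta in Path(cache_dir).glob("*.json"):
        try:
            info = json.loads(meta.read_text())
        except (ValueError, OSError):
            continue  # not a metadata sidecar
        url = info.get("url", "")
        if needle in url:
            # the blob has the same hash name, minus the .json suffix
            hits.append((url, meta.with_suffix("")))
    return hits
```

Copy the blobs reported for tokenizer.json and spiece.model into your model directory under those names.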
plandes commented 2 years ago

> Taking a quick look at the changes, I have a question about parse_t5/trainer.py. By passing in a model parameter you are trying to change the type (class) of the model to something other than a T5. If you just want to continue training the existing amrlib model, you just need to put the local name (i.e. the directory name) in the trainer's config file.

No, I am not trying to change the class. The T5ForConditionalGeneration type hint on the initializer comes from inspecting the type of amrlib.models.parse_t5.inference.Inference.model (via isinstance).

Yes, I want to continue training the existing AMR model_stog model. Looking at the code now, I'm doing exactly what you did in amrlib.models.parse_t5.inference.Inference.model to get the model, so that change can be reverted. Not sure how I missed that--sorry for the static.

> The one complication is that I didn't create a local copy of the tokenizer files, so normally the code uses what's in your HF cache (~/.cache/huggingface/transformers/xx), but when you change the model name it can't find them. The simplest thing to do is to copy the two files into your model directory.

I didn't have this issue--I simply used T5ForConditionalGeneration.from_pretrained and the trainer you provided. I retrained on the bio-medical corpus, then evaluated both the pretrained model you distribute and my newly trained model: the SMATCH score went from 0.517 to 0.646, so it seems to be working. I'm a little worried those scores are so low, but that's a different issue. (Also, I was getting a SMATCH score of 0.75 with the pretrained AMR model you distribute, which also seems low--again, a different issue.)

bjascob commented 2 years ago

One quick note on SMATCH scores... the models don't generate the wiki tags, and the scoring code here is set up to score against a version of the AMR test corpus with these tags stripped out. If you score the generated test graphs against a test corpus that still has wiki tags, you're going to get lower scores because the wiki triples won't match anything.

Adding the wiki tags is a post-processing operation that involves setting up a local copy of the Spotlight DB server, so it's a bit painful and not useful to most folks. There aren't many wiki triples, and their SMATCH score alone is about 0.8, so testing without them doesn't change the overall score much, as long as you test against a corpus with those tags stripped.
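For anyone who wants to strip the :wiki attributes from a corpus themselves, a simple text substitution is usually enough. A minimal sketch (our own helper, not code from amrlib; it assumes :wiki values are either a quoted string or the bare `-` used for "no wiki entry"):

```python
import re

def strip_wiki(amr_text):
    """Remove :wiki attributes (e.g. ':wiki "Earth"' or ':wiki -')
    from AMR graph text so SMATCH isn't penalized for missing them."""
    return re.sub(r'\s*:wiki\s+(?:"[^"]*"|-)', "", amr_text)
```

Run this over the test corpus before scoring so both sides of the comparison are wiki-free.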

If this doesn't answer your question and you'd like some help looking at the scores, open another issue so we don't get too far off topic here.

plandes commented 2 years ago

> One quick note on SMATCH scores... the models don't generate the wiki tags, and the scoring code here is set up to score against a version of the AMR test corpus with these tags stripped out. If you score the generated test graphs against a test corpus that still has wiki tags, you're going to get lower scores because the wiki triples won't match anything.

I haven't done anything with the wiki tags, so I'd think it would test higher (~0.75 SMATCH) with The Little Prince corpus.

> If this doesn't answer your question and you'd like some help looking at the scores, open another issue so we don't get too far off topic here.

Should I revert the trainer commit and recreate the pull request?

One more thing about the token position tags: if a word appears at more than one position, only one is given. I created a completely separate algorithm that produces a comma-separated list of indexes as a string, which I use for the T5 model. I have that baked into my library, but I could look at extracting it and adding it to amrlib.
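The multi-position idea described above could look roughly like this (our own sketch of the comma-separated-index representation, not the code from either library):

```python
def token_positions(tokens):
    """Map each distinct token to a comma-separated string of every
    index where it occurs, so repeated words keep all their positions,
    e.g. ['a', 'b', 'a'] -> {'a': '0,2', 'b': '1'}."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    return {tok: ",".join(str(i) for i in idxs)
            for tok, idxs in positions.items()}
```

The string form keeps the positions easy to splice into a seq2seq input, while still recording every occurrence of a repeated word.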

bjascob commented 2 years ago

Please remove the parse_t5/trainer.py commit from this PR.

It's not clear to me how you intend this to be used by other users of the library. Please provide a small snippet of code here on how to exercise this new functionality (generally the user is expected to interact with inference::parse_sents).

Please also confirm that you have run the test script to get a final SMATCH score for this model and that the new code doesn't impact the results (F=0.768).

bjascob commented 2 years ago

No response from author on issues above. Closing.