Closed: plandes closed this 2 years ago
Taking a quick look at the changes, I have a question about parse_t5/trainer.py. By passing in a model parameter you are trying to change the type (class) of the model to something other than a T5. If you just want to continue training the existing amrlib model, you just need to put the local name (i.e. the directory name) in the trainer's config file. For example, in <root>/config/model_parse_t5.json modify the line:

"model_name_or_path": "/home/you/amrlib/data/your_copy_of_model_parse_t5"
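A minimal sketch of that config edit (the path and config keys here are placeholders, not the library's full config):

```python
import json
import os
import tempfile

def set_local_model(config_path, model_dir):
    """Load the JSON config, point model_name_or_path at model_dir, and save it back."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["model_name_or_path"] = model_dir
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)
    return cfg

# Demo against a throwaway copy of the config (keys reduced for brevity)
cfg_path = os.path.join(tempfile.mkdtemp(), "model_parse_t5.json")
with open(cfg_path, "w") as f:
    json.dump({"model_name_or_path": "t5-base"}, f)

cfg = set_local_model(cfg_path, "/home/you/amrlib/data/your_copy_of_model_parse_t5")
print(cfg["model_name_or_path"])
```

The same edit can of course be made by hand in any text editor; the point is only that the trainer picks up whatever directory `model_name_or_path` names.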
The one complication is that I didn't create a local copy of the tokenizer info. Normally the code uses what's in your HF cache (~/.cache/huggingface/transformers/xx), but when you change the name it can't find it. The simplest thing to do is to copy the two files to your model directory. You can find them by grepping for t5-base in the very long hash names in the cache. Here are the quick links to download them if that's easier:
https://huggingface.co/t5-base/resolve/main/tokenizer.json
https://huggingface.co/t5-base/resolve/main/spiece.model
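One way to script the copy described above (a sketch, not amrlib code; the model directory is a placeholder). The function only builds the (url, destination) pairs; to actually fetch the files, pass each pair to `urllib.request.urlretrieve(url, dest)`:

```python
import os

# The two tokenizer files the T5 model needs, hosted in the t5-base repo on the HF hub
TOKENIZER_FILES = ["tokenizer.json", "spiece.model"]
BASE_URL = "https://huggingface.co/t5-base/resolve/main"

def tokenizer_download_plan(model_dir):
    """Return (url, destination_path) pairs for the tokenizer files."""
    return [(f"{BASE_URL}/{name}", os.path.join(model_dir, name))
            for name in TOKENIZER_FILES]

plan = tokenizer_download_plan("/home/you/amrlib/data/your_copy_of_model_parse_t5")
for url, dest in plan:
    print(url, "->", dest)
```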
> Taking a quick look at the changes, I have a question about parse_t5/trainer.py. By passing in a model parameter you are trying to change the type (class) of the model to something other than a T5. If you just want to continue training the existing amrlib model you just need to put the local name (i.e. the directory name) in the trainer's config file.
No, I am not trying to change the class. The T5ForConditionalGeneration type hint on the initializer method comes from doing an isinstance check on amrlib/models/parse_t5.inference.Inference.model.

Yes, I want to continue training the existing AMR model_stog model. Looking at the code now, I'm doing exactly what you did in amrlib/models/parse_t5.inference.Inference.model to get the model, so that change can be reverted. Not sure how I missed that--sorry for the static.
> The one complication is that I didn't create a local copy of the tokenizer info and normally the code will use what's in your HF cache (~/.cache/huggingface/transformers/xx) but when you change the name it can't find it. The simplest thing to do is to just copy the two files to your model directory. You can find them by grepping for t5-base in the very long hash names in the cache. Here's the quick link to download them if that's easier.
I didn't have this issue--I simply used T5ForConditionalGeneration.from_pretrained and used the trainer you provided. I retrained on the bio-medical corpus, then evaluated both the pretrained model you distributed and my newly trained model, and the SMATCH score went from 0.517 to 0.646, so it seems to be working. I'm a little worried those scores are so low, but that's a different issue. (Also, with the pretrained AMR model you distributed I was getting a SMATCH score of 0.75, which also seems low--again, a different issue.)
One quick note on SMATCH scores... the models don't generate the wiki tags, and the scoring code here is set up to score against a version of the AMR test corpus with these tags stripped out. If you try to score the generated test graphs against a test corpus with wiki tags present, you're going to get lower scores because the wiki triples won't match anything.

Adding the wiki tags is a post-process operation that involves setting up a local copy of the spotlight db server, so it's a bit painful and not useful to most folks. There aren't a lot of wiki triples, and their SMATCH score alone is about 0.8, so testing without them doesn't impact the final overall score much, as long as you test against a corpus with those tags stripped.
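To score on the same footing, the :wiki triples can be stripped from a PENMAN-style graph string before running SMATCH. A hedged sketch (this is not amrlib's own preprocessing, just one way to do it):

```python
import re

def strip_wiki(graph):
    """Remove ':wiki "Some_Title"' attributes and the unlinked ':wiki -' form
    from a PENMAN-style AMR string."""
    return re.sub(r'\s*:wiki\s+(?:"[^"]*"|-)', '', graph)

g = '(p / person :wiki "Barack_Obama" :name (n / name :op1 "Obama"))'
print(strip_wiki(g))  # -> (p / person :name (n / name :op1 "Obama"))
```

Applying this to both the generated graphs and the reference corpus keeps the comparison fair.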
If this doesn't answer your question and you'd like some help looking at the scores, open another issue so we don't get too far off topic here.
> One quick note on SMATCH scores... the models don't generate the wiki tags and the scoring code here is setup to score against a version of the AMR test corpus with these tags stripped out. If you try to score the generated test graphs against a test corpus with wiki tags present, you're going to get lower scores because the wiki triples won't match to anything.
I haven't done anything with the wiki tags, so I'd expect it to test higher (~0.75 SMATCH) with The Little Prince corpus.
If this doesn't answer your question and you'd like some help looking at the scores, open another issue so we don't get too far off topic here.
Should I revert the trainer commit and recreate the pull request?
One more thing about the token position tags: if a word occurs at more than one position, only one is given. I created a completely separate algorithm that produces a comma-separated list of indexes as a string, which I use for the T5 model. I have that baked into my library, but I could look at ripping it out and adding it to amrlib.
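A hypothetical sketch of the idea described above (not the library's actual code, and the function name is made up): for each word, record every token position it occupies and join the indexes into a comma-separated string:

```python
from collections import defaultdict

def position_tags(tokens):
    """Map each token to a comma-separated string of every index where it occurs."""
    positions = defaultdict(list)
    for idx, tok in enumerate(tokens):
        positions[tok].append(str(idx))
    return {tok: ",".join(idxs) for tok, idxs in positions.items()}

tags = position_tags(["the", "dog", "bit", "the", "man"])
print(tags["the"])  # -> 0,3  (the word appears at indexes 0 and 3)
```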
Please remove the parse_t5/trainer.py commit from this PR.
It's not clear to me how you intend this to be used by other users of the library. Please provide a small snippet of code here on how to exercise this new functionality (generally the user is expected to interact with inference::parse_sents).
Please also confirm that you have run the test script to get a final SMATCH score for this model and that the new code doesn't impact the results (F=0.768).
No response from author on issues above. Closing.
This pull request has two contributions: