Heidelberg-NLP / ancient-language-models


Documentation for the use of T5 model #1

Open gcelano opened 1 month ago

gcelano commented 1 month ago

I'm trying to use GreTa following the commands here https://huggingface.co/bowphs/GreTa , but it does not work. AutoModelForConditionalGeneration seems to have been replaced by T5ForConditionalGeneration (I am using transformers 4.42.4), but, most importantly, it is not clear how to use the model:

tokenized = tokenizer(sentences)
model(**tokenized)

This returns the error "The following keyword arguments are not supported by this model: ['token_type_ids']".
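Even if I drop token_type_ids from the tokenizer output, e.g.

tokenized = tokenizer(sentences, return_tensors="pt")
tokenized.pop("token_type_ids", None)  # T5 models do not take token type ids
model(**tokenized)

the forward pass still fails, because no decoder inputs are provided.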

If I adapt the following example (https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model.forward):

from transformers import AutoTokenizer, T5Model

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = T5Model.from_pretrained("google-t5/t5-small")

input_ids = tokenizer(
    "Studies have been shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

# preprocess: Prepend decoder_input_ids with start token which is pad token for T5Model.
# This is not needed for torch's T5ForConditionalGeneration as it does this internally using labels arg.
decoder_input_ids = model._shift_right(decoder_input_ids)
# forward pass
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_states = outputs.last_hidden_state

it works, but it is not clear what decoder_input_ids should correspond to or how the attention masks should be passed.

Could you please provide an example of how to use the model?

bowphs commented 1 month ago

Hi,

Thank you very much for opening the issue. AutoModelForConditionalGeneration does indeed seem to be outdated, good catch!

The fundamental problem you are running into is that GreTa (like T5 in general) is an encoder-decoder model, where the encoder and the decoder each expect their own input_ids. You can see the inputs for the encoder here, and for the decoder here.
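To make the encoder/decoder split concrete, here is a rough sketch of a training-style forward pass; the input strings are just placeholders. Passing the tensors explicitly, rather than unpacking the whole tokenizer output, also sidesteps the token_type_ids error you saw:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("bowphs/GreTa")
model = T5ForConditionalGeneration.from_pretrained("bowphs/GreTa")

# Encoder side: the source sequence; the tokenizer also produces the attention mask.
enc = tokenizer("source text here", return_tensors="pt")
# Decoder side: the target sequence. Passed as labels, it is shifted right
# internally to build decoder_input_ids, exactly as in the docs example you quoted.
labels = tokenizer("target text here", return_tensors="pt").input_ids

outputs = model(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    labels=labels,
)
print(outputs.loss)  # cross-entropy loss over the target tokens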

What a useful example of how to use the model looks like depends, of course, on the use case. For inference in machine translation, for example, you could do something like this:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("bowphs/ancient-t5-translation")
model = T5ForConditionalGeneration.from_pretrained("bowphs/ancient-t5-translation")

# The T5-style task prefix states the translation direction as part of the prompt.
input_ids = tokenizer("translate english to greek: the man took the bowl with the intention of drinking wine.", return_tensors="pt").input_ids
# generate() builds the decoder inputs token by token, so only the encoder input is needed.
outputs = model.generate(input_ids, num_beams=3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
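The same pattern extends to batched inference; with padding you should also pass the attention mask explicitly. A sketch reusing the tokenizer and model from above, with placeholder sentences:

batch = tokenizer(
    [
        "translate english to greek: the man took the bowl.",
        "translate english to greek: the woman drank the wine.",
    ],
    padding=True,
    return_tensors="pt",
)
outputs = model.generate(
    input_ids=batch.input_ids,
    attention_mask=batch.attention_mask,  # masks out the padding tokens
    num_beams=3,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))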

The translation example above uses a fine-tuned variant of the model. I have a fine-tuning script for lemmatization here. In general, GreTa should work with any code that works for the original T5 model, so the "canonical" Hugging Face notebooks are also good pointers. If you have any further questions or a different use case in mind, don't hesitate to follow up.
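In case a starting point is useful before you look at the script, here is a generic sketch of the usual seq2seq fine-tuning recipe applied to GreTa. The dataset file, column names, task prefix, and hyperparameters below are placeholders, not the settings from my script:

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("bowphs/GreTa")
model = T5ForConditionalGeneration.from_pretrained("bowphs/GreTa")

# Placeholder dataset: assumed to have "text" and "lemmas" string columns.
dataset = load_dataset("json", data_files={"train": "train.jsonl"})

def preprocess(examples):
    # return_token_type_ids=False avoids the token_type_ids error from above.
    model_inputs = tokenizer(
        ["lemmatize: " + t for t in examples["text"]],  # placeholder task prefix
        truncation=True,
        return_token_type_ids=False,
    )
    # Target tokens go in as labels; the model shifts them right internally.
    model_inputs["labels"] = tokenizer(examples["lemmas"], truncation=True).input_ids
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["text", "lemmas"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="greta-lemmatizer", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads labels with -100
)
trainer.train()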