exasol / transformers-extension

An Exasol extension for using state-of-the-art pretrained machine learning models via the Hugging Face Transformers API.
MIT License

Calling save_pretrained on a generic AutoModel class loses the model specifics #213

Closed ahsimb closed 2 months ago

ahsimb commented 2 months ago

Problem

Downloading a model and saving it locally using the following code loses the model specialization.

    from pathlib import Path
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir, **kwargs)
    model = AutoModel.from_pretrained(model_name, cache_dir=cache_dir, **kwargs)

    save_dir = Path(pretrained_dir) / model_name
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)

One way to verify this is to inspect the config.json file. For example, consider the gaunernst/bert-tiny-uncased model. Here is the beginning of its config.json file:

{
  "architectures": [
    "BertForMaskedLM"
  ],
  ...

After running the above code for this model, config.json begins like this instead:

{
  "_name_or_path": "gaunernst/bert-tiny-uncased",
  "architectures": [
    "BertModel"
  ],
  ...
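This check can also be made repeatable by reading the `architectures` entry programmatically. A minimal sketch; the stand-in `config.json` written below is hypothetical and merely substitutes for one produced by `save_pretrained`:

```python
import json
from pathlib import Path
from tempfile import mkdtemp

def saved_architectures(save_dir: Path) -> list:
    """Return the 'architectures' entry of a saved model's config.json."""
    with open(save_dir / "config.json") as f:
        return json.load(f).get("architectures", [])

# Stand-in for a directory written by save_pretrained (hypothetical content):
save_dir = Path(mkdtemp())
(save_dir / "config.json").write_text(json.dumps({"architectures": ["BertModel"]}))

print(saved_architectures(save_dir))  # ['BertModel'] -> the MaskedLM head is gone
```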

The proper way to save this model is to use the specialized model class, as in the code below.

    from pathlib import Path
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir, **kwargs)
    model = AutoModelForMaskedLM.from_pretrained(model_name, cache_dir=cache_dir, **kwargs)

    save_dir = Path(pretrained_dir) / model_name
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)

When loaded correctly, this model should produce output similar to the following for the request

"I [MASK] you so much."

[
  {'score': 0.21148031949996948, 'token': 2293, 'token_str': 'love', 'sequence': 'i love you so much.'},
  {'score': 0.07706509530544281, 'token': 2113, 'token_str': 'know', 'sequence': 'i know you so much.'},
  {'score': 0.06537336856126785, 'token': 2215, 'token_str': 'want', 'sequence': 'i want you so much.'},
  {'score': 0.04397880658507347, 'token': 2342, 'token_str': 'need', 'sequence': 'i need you so much.'},
  {'score': 0.03759443759918213, 'token': 2425, 'token_str': 'tell', 'sequence': 'i tell you so much.'}
]

At the moment it returns gibberish.
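With `transformers` installed, a correctly saved model could be exercised roughly via `pipeline("fill-mask", model=str(save_dir))`. Since the result is a list of plain dicts, its structure can be checked without downloading the model; a sketch using scores copied from the expected output above:

```python
# Scores copied from the expected output above; no model download needed.
predictions = [
    {"score": 0.2115, "token_str": "love", "sequence": "i love you so much."},
    {"score": 0.0771, "token_str": "know", "sequence": "i know you so much."},
    {"score": 0.0654, "token_str": "want", "sequence": "i want you so much."},
]

# fill-mask results arrive sorted by score; the top entry is the model's best guess
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])  # love
```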

Solution

Acceptance Criteria

### Tasks
- [ ] Refactor the model path generation of HuggingFaceHubBucketFSModelTransferSP and LoadLocalModel into a single shared function
- [ ] Refactor PredictionUDFs and LoadLocalModel so that LoadLocalModel constructs the BucketFS file path to the model
- [ ] Change the upload CLI to use HuggingFaceHubBucketFSModelTransferSP
- [ ] Add a task parameter with a default value to HuggingFaceHubBucketFSModelTransferSP and LoadLocalModel
- [ ] Add a version parameter with a default value to HuggingFaceHubBucketFSModelTransferSP and LoadLocalModel; use it to pin the model version in the tests
- [ ] Add task and version parameters to the download UDF
- [ ] Add a version and an optional seed parameter to the prediction UDFs; use them to pin the version and the seed in the tests
- [ ] Update the documentation
ahsimb commented 2 months ago

Referring to the previous example, it's worth noting that AutoModel.from_pretrained('gaunernst/bert-tiny-uncased') returns an instance of the BertModel class, while AutoModelForMaskedLM.from_pretrained('gaunernst/bert-tiny-uncased') returns an instance of the BertForMaskedLM class.
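Assuming `transformers` (with `torch`) is installed, this class difference can be observed without any download by instantiating the models from a bare config; `from_config` builds randomly initialised weights, so no network access is needed. A minimal sketch with an arbitrarily small hypothetical config:

```python
from transformers import AutoModel, AutoModelForMaskedLM, BertConfig

# Tiny hypothetical config, just large enough to instantiate the architecture
cfg = BertConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
                 intermediate_size=64, vocab_size=100)

plain = AutoModel.from_config(cfg)            # base encoder, no task head
masked = AutoModelForMaskedLM.from_config(cfg)  # encoder plus masked-LM head

print(type(plain).__name__)   # BertModel
print(type(masked).__name__)  # BertForMaskedLM
```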

MarleneKress79789 commented 2 months ago

I created tickets to address these issues: #216, #217, #218, #219, #220, #221, #222.

MarleneKress79789 commented 2 months ago

Closing this issue; the work will be done in the issues linked above.