elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

`eland_import_hub_model --task-type text_classification` fails #609

Closed: pquentin closed this issue 10 months ago

pquentin commented 10 months ago

Using transformers 4.27.4 (our current upper bound), we get:

$ eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id distilbert-base-uncased-finetuned-sst-2-english --task-type text_classification
2023-09-25 21:30:12,644 INFO : Establishing connection to Elasticsearch
2023-09-25 21:30:13,729 INFO : Connected to cluster named '9fcead8be8d2469fa096128e20496cee' (version: 8.9.1)
2023-09-25 21:30:13,730 INFO : Loading HuggingFace transformer tokenizer and model 'distilbert-base-uncased-finetuned-sst-2-english'
Traceback (most recent call last):
  File "/home/q/.virtualenvs/eland/bin/eland_import_hub_model", line 33, in <module>
    sys.exit(load_entry_point('eland', 'console_scripts', 'eland_import_hub_model')())
  File "/home/q/src/eland/eland/cli/eland_import_hub_model.py", line 241, in main
    tm = TransformerModel(
  File "/home/q/src/eland/eland/ml/pytorch/transformers.py", line 632, in __init__
    self._traceable_model = self._create_traceable_model()
  File "/home/q/src/eland/eland/ml/pytorch/transformers.py", line 787, in _create_traceable_model
    model = transformers.AutoModelForSequenceClassification.from_pretrained(
  File "/home/q/.virtualenvs/eland/lib64/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/q/.virtualenvs/eland/lib64/python3.10/site-packages/transformers/modeling_utils.py", line 2498, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
TypeError: DistilBertForSequenceClassification.__init__() got an unexpected keyword argument 'token'
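
For reference, the failure reproduces outside eland. On 4.27.4, `from_pretrained()` does not yet accept `token` (the authentication kwarg was still `use_auth_token` back then), so the unknown keyword is treated as a model kwarg and forwarded to the model constructor, which rejects it. A minimal sketch, assuming transformers==4.27.4 is installed:

```python
# Minimal reproduction independent of eland (assumes transformers==4.27.4).
# The unrecognized `token` kwarg falls through to
# DistilBertForSequenceClassification.__init__, raising the TypeError above.
from transformers import AutoModelForSequenceClassification

AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    token=None,  # the keyword eland passes through; rejected on 4.27.4
)
```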

But with 4.33.2 (the latest version as of today), the same command works:

$ eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id distilbert-base-uncased-finetuned-sst-2-english --task-type text_classification
2023-09-25 21:27:54,366 INFO : Establishing connection to Elasticsearch
2023-09-25 21:27:55,192 INFO : Connected to cluster named '9fcead8be8d2469fa096128e20496cee' (version: 8.9.1)
2023-09-25 21:27:55,193 INFO : Loading HuggingFace transformer tokenizer and model 'distilbert-base-uncased-finetuned-sst-2-english'
/home/q/.virtualenvs/eland/lib64/python3.10/site-packages/transformers/models/distilbert/modeling_distilbert.py:223: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask, torch.tensor(torch.finfo(scores.dtype).min)
2023-09-25 21:28:00,059 INFO : Creating model with id 'distilbert-base-uncased-finetuned-sst-2-english'
2023-09-25 21:28:00,272 INFO : Uploading model definition
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [01:51<00:00,  2.29 parts/s]
2023-09-25 21:29:52,051 INFO : Uploading model vocabulary
2023-09-25 21:29:52,494 INFO : Model successfully imported with id 'distilbert-base-uncased-finetuned-sst-2-english'

We need to figure out when this was fixed (there's nothing obvious in the release notes, so I'll have to bisect) and then update our bounds for transformers.
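
A rough sketch of such a bisect over published releases, using the failing call as the probe. The version list is abridged, the probe re-downloads the model on each step, and none of this is eland tooling; it assumes the first listed version fails and the last one works:

```python
# Bisect published transformers releases to find the first one where the
# `token` kwarg is accepted. Sketch only; requires network access and pip.
import subprocess
import sys

VERSIONS = ["4.28.1", "4.29.2", "4.30.2", "4.31.0", "4.32.1", "4.33.2"]

PROBE = (
    "from transformers import AutoModelForSequenceClassification as M; "
    "M.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', token=None)"
)

def works(version: str) -> bool:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", f"transformers=={version}"],
        check=True,
    )
    # Run the probe in a fresh interpreter so the new install is picked up.
    return subprocess.run([sys.executable, "-c", PROBE]).returncode == 0

lo, hi = 0, len(VERSIONS) - 1
while lo < hi:
    mid = (lo + hi) // 2
    if works(VERSIONS[mid]):
        hi = mid  # mid works, so the first good version is mid or earlier
    else:
        lo = mid + 1  # mid fails, so the first good version is after mid
print("first working version:", VERSIONS[lo])
```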

pquentin commented 10 months ago

Turns out GitHub was truncating the release notes. I confirmed that this was fixed in transformers==4.31.0 with https://github.com/huggingface/transformers/pull/24306 and https://github.com/huggingface/transformers/pull/24862.
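
If that holds, the bounds change on our side is a one-line pin. A hypothetical sketch; the actual declaration lives in eland's packaging metadata and the upper bound here is just the latest minor verified above:

```python
# Hypothetical dependency pin; the surrounding file and the exact bounds
# are assumptions, not eland's real setup.py.
install_requires = [
    # >=4.31.0: first release that accepts `token`; <4.34.0: cap at the
    # latest minor tested above (4.33.2).
    "transformers[torch]>=4.31.0,<4.34.0",
]
```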

pquentin commented 10 months ago

The problem is that 4.31.0 drops support for Python 3.7, which reached end-of-life on June 27th. Thankfully, Python 3.7 accounts for <5% of our downloads today, so dropping it sounds fine.

Edit: oh, we dropped Python 3.7 in #512 already.