elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Changing max_sequence_length and other parameters of SentenceTransformer #527

Closed. temiwale88 closed this issue 1 year ago.

temiwale88 commented 1 year ago

Hi Eland Team -

I'm re-indexing my data in Elastic Cloud with embeddings for semantic search, and I'm currently using this multilingual model. It seems some parameters can be changed via the SentenceTransformer class, e.g. setting max_seq_length to 512 and do_lower_case to True. How do we achieve the same with the TransformerModel class from eland.ml.pytorch.transformers?
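
For illustration, here is roughly what I mean when using the SentenceTransformer class directly (a minimal sketch; the model id below is a placeholder, not the actual model linked above):

```python
# Minimal sketch: how these settings are adjusted with sentence-transformers
# directly. The model id is a placeholder, not the actual model in question.
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("your-org/your-multilingual-model")
model.max_seq_length = 512  # truncation length used when encoding

# do_lower_case is a tokenizer-level option and can be passed when building
# the underlying Transformer module explicitly:
word_embedding = models.Transformer(
    "your-org/your-multilingual-model",
    max_seq_length=512,
    do_lower_case=True,
)
```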

Thanks.

P.S. thanks for allowing us to quantize :-D

davidkyle commented 1 year ago

max_seq_length is a hard limit set by the model. You can change it to a lower value if you want, but you cannot increase it beyond the model's input size.

If you are using a cased model, why do you want to set do_lower_case: True?

Those settings are read from tokenizer_config.json. If you really want to, you can clone the model repo, edit the file locally, and then upload it as described in https://github.com/elastic/eland/issues/502#issuecomment-1335193128, but I can't imagine why you would want to do that.
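
Roughly, that workflow looks like the sketch below. This is only an outline: the paths, credentials and config keys are placeholders, the exact keys in tokenizer_config.json depend on the model, and the TransformerModel / PyTorchModel signatures depend on your eland version (this also assumes TransformerModel accepts a local directory path).

```python
# Sketch only: clone the model repo, edit tokenizer_config.json locally, then
# trace and upload the edited copy with eland. Placeholders throughout.
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

local_repo = Path("./my-cloned-model-repo")  # git clone of the Hugging Face repo

# Edit the tokenizer settings before tracing the model
cfg_file = local_repo / "tokenizer_config.json"
cfg = json.loads(cfg_file.read_text())
cfg["do_lower_case"] = True       # only sensible for an uncased model
cfg["model_max_length"] = 256     # may be lowered, never raised above the model's input size
cfg_file.write_text(json.dumps(cfg, indent=2))

# Trace the edited model and import it, following the pattern from the eland docs
tm = TransformerModel(model_id=str(local_repo), task_type="text_embedding", quantize=True)
model_path, config, vocab_path = tm.save("/tmp/traced-model")

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "<password>"))
ptm = PyTorchModel(es, tm.elasticsearch_model_id())
ptm.import_model(
    model_path=model_path,
    config_path=None,
    vocab_path=vocab_path,
    config=config,
)
ptm.start()  # optionally deploy the model after upload
```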

joshdevins commented 1 year ago

This value should not be changed after training. Closing this issue as there's nothing to do for eland.