SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard
https://arxiv.org/abs/2309.12871
MIT License

How to use the encoding with tiktoken ? #28

Closed: rishabhgupta93 closed 6 months ago

rishabhgupta93 commented 8 months ago

Hey,

I am trying to get the encoding using tiktoken to initialize a token counter:

```python
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler

enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
token_counter = TokenCountingHandler(tokenizer=enc.encode)
```

But I am getting the following error:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 3
      1 import tiktoken
      2 from llama_index.callbacks import CallbackManager, TokenCountingHandler
----> 3 enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
      4 token_counter = TokenCountingHandler(tokenizer=enc.encode)

File f:\pycharmprojects\llamaindex\venv\lib\site-packages\tiktoken\registry.py:68, in get_encoding(encoding_name)
     65 assert ENCODING_CONSTRUCTORS is not None
     67 if encoding_name not in ENCODING_CONSTRUCTORS:
---> 68     raise ValueError(
     69         f"Unknown encoding {encoding_name}. Plugins found: {_available_plugin_modules()}"
     70     )
     72 constructor = ENCODING_CONSTRUCTORS[encoding_name]
     73 enc = Encoding(**constructor())

ValueError: Unknown encoding WhereIsAI/UAE-Large-V1. Plugins found: ['tiktoken_ext.openaipublic']
```

Is there any way to use this model's encoding with tiktoken?

Thanks

SeanLee97 commented 8 months ago

tiktoken only supports the encodings of GPT-like models, but UAE is a BERT-based model, so its tokenizer is not registered with tiktoken.

Could you use the `tokenizers` package in your application instead? You can use `tokenizers` to load UAE's tokenizer.
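In case it helps, here is a minimal sketch of the `tokenizers` API. The vocabulary below is a toy one made up for illustration; for the real model you would instead load the published tokenizer with `Tokenizer.from_pretrained("WhereIsAI/UAE-Large-V1")`:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Toy WordPiece vocab for illustration only; UAE's real vocab comes from
# Tokenizer.from_pretrained("WhereIsAI/UAE-Large-V1")
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

enc = tok.encode("hello world")
print(enc.ids)  # token IDs, one per WordPiece token
```

The number of tokens in a string is then simply `len(enc.ids)`.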

SeanLee97 commented 8 months ago

By the way, do you want to get the tokenized IDs of sentences, or to obtain their sentence embeddings? If you want sentence embeddings, please follow the usage example.

rishabhgupta93 commented 8 months ago

Thanks for the prompt response!

I am able to create embeddings.

I just want to count the total number of tokens for which embeddings are generated, and also the number of tokens used while running the query engine.

I am using llama-index to build a RAG pipeline.