AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For the latest Indic-BERT v2, see: https://github.com/AI4Bharat/IndicBERT
https://indicnlp.ai4bharat.org
MIT License
276 stars, 41 forks

How to decode token embeddings into token ids? #46

Closed. VimalMollyn closed this issue 2 years ago

VimalMollyn commented 2 years ago

I'm trying to build a machine translation model that uses IndicBERT as its embedding layer. I can obtain token embeddings from a tokenized sentence as follows:

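from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
model = AutoModel.from_pretrained('ai4bharat/indic-bert')

# The input embedding layer maps token ids to embedding vectors
vocab_to_embedding_convertor = model.get_input_embeddings()
# padding=True so both sentences fit in one tensor even if their token counts differ
tokens = tokenizer(["హలో","పేరు"], return_tensors="pt", padding=True)['input_ids']

# Shape: (batch, sequence_length, embedding_dim)
embeddings = vocab_to_embedding_convertor(tokens)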

However, I can't find a way to go in the other direction and recover token ids from these embeddings. How would I go about doing this?

Thanks! Vimal
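
One possible approach, offered as a rough sketch rather than a confirmed answer from the maintainers: the input embedding layer is a lookup table, so an embedding can be mapped back to an id by finding the nearest row of that table. The helper name `embeddings_to_ids` and the toy `nn.Embedding` below are hypothetical stand-ins; with the real model, the layer would presumably be the one returned by `model.get_input_embeddings()`. This only inverts the lookup-table embeddings, not the contextual hidden states the full model outputs.

```python
import torch
from torch import nn


def embeddings_to_ids(embeddings: torch.Tensor, embedding_layer: nn.Embedding) -> torch.Tensor:
    """Map each embedding vector to the id of its nearest row in the embedding table.

    Hypothetical helper: assumes `embeddings` came from (or lies close to)
    the rows of `embedding_layer.weight`.
    """
    weight = embedding_layer.weight.detach()                  # (vocab_size, dim)
    flat = embeddings.detach().reshape(-1, weight.shape[1])   # (n_tokens, dim)
    distances = torch.cdist(flat, weight)                     # (n_tokens, vocab_size)
    ids = distances.argmin(dim=1)                             # nearest row per token
    return ids.reshape(embeddings.shape[:-1])                 # restore (batch, seq_len)


# Toy stand-in for model.get_input_embeddings(); sizes are arbitrary assumptions.
torch.manual_seed(0)
toy_layer = nn.Embedding(num_embeddings=50, embedding_dim=8)
toy_ids = torch.tensor([[2, 17, 3], [2, 41, 3]])

recovered = embeddings_to_ids(toy_layer(toy_ids), toy_layer)
```

If this holds up for the real model, `tokenizer.convert_ids_to_tokens` could then turn each row of recovered ids back into tokens. For embeddings produced by a decoder rather than copied from the table, the nearest row is only an approximation, and cosine similarity may be a better fit than Euclidean distance.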