This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accpeted in Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: NAACL-2022.
232
stars
31
forks
source link
Can I extract word embeddings using BanglaBERT ? #2
Hi,
Is it possible to extract/generate word embeddings using BanglaBERT?
I have tokenized my Bangla sentence using BanglaBERT. Now I want to generate Word Embeddings from my tokenized sentence.
!pip install transformers
!pip install git+https://github.com/csebuetnlp/normalizer
from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize
import torch
model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
text = 'দেশদ্রোহিতার মামলা স্বর্ণ মন্দিরের ভিতর ও বৈশাখী উৎসবের মিছিলে খলিস্তানপন্থী স্লোগান দেওয়ার জন্য কয়েকজন বিশ্ব যুবকের বিরুদ্ধে দেশদ্রোহিতার মামলা দায়ের করা হয়েছে ।'
text = normalize(text)
text = tokenizer_bbert.tokenize(text)
print(text)
# >> ['দেশদ্রোহ', '##িতার', 'মামলা', 'স্বর্ণ', 'মন্দিরের', 'ভিতর', 'ও', 'বৈশাখী', 'উৎসবের', 'মিছিলে', 'খলি', '##স্তান', '##পন্থী', 'স্লোগান', 'দেওয়ার','জন্য', 'কয়েকজন', 'বিশ্ব', 'যুবকের', 'বিরুদ্ধে', 'দেশদ্রোহ', '##িতার', 'মামলা', 'দায়ের', 'করা', 'হয়েছে', '।']
Hi, Is it possible to extract/generate word embeddings using BanglaBERT? I have tokenized my Bangla sentence using BanglaBERT. Now I want to generate Word Embeddings from my tokenized sentence.
I have find out how to generate Word Embeddings using BERT. Here is the link (https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958). Will it be same for BanglaBERT or Bangla Language or it will be better to use a different Bangla Language specific approach?
Any kind of suggestion or advice will be helpful for me. Thanks in advance.