csebuetnlp / banglabert

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: NAACL-2022.

Can I extract word embeddings using BanglaBERT? #2

Closed MusfiqDehan closed 2 years ago

MusfiqDehan commented 2 years ago

Hi, is it possible to extract or generate word embeddings using BanglaBERT? I have tokenized my Bangla sentence with the BanglaBERT tokenizer, and now I want to generate word embeddings from the tokenized sentence.

!pip install transformers
!pip install git+https://github.com/csebuetnlp/normalizer

from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize
import torch

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

text = 'দেশদ্রোহিতার মামলা স্বর্ণ মন্দিরের ভিতর ও বৈশাখী উৎসবের মিছিলে খলিস্তানপন্থী স্লোগান দেওয়ার জন্য কয়েকজন বিশ্ব যুবকের বিরুদ্ধে দেশদ্রোহিতার মামলা দায়ের করা হয়েছে ।'

text = normalize(text)

text = tokenizer_bbert.tokenize(text)

print(text)

# >>  ['দেশদ্রোহ', '##িতার', 'মামলা', 'স্বর্ণ', 'মন্দিরের', 'ভিতর', 'ও', 'বৈশাখী', 'উৎসবের', 'মিছিলে', 'খলি', '##স্তান', '##পন্থী', 'স্লোগান', 'দেওয়ার','জন্য', 'কয়েকজন', 'বিশ্ব', 'যুবকের', 'বিরুদ্ধে', 'দেশদ্রোহ', '##িতার', 'মামলা', 'দায়ের', 'করা', 'হয়েছে', '।']

I have found a description of how to generate word embeddings using BERT. Here is the link (https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958). Will the same method work for BanglaBERT and the Bangla language, or would it be better to use a different, Bangla-specific approach?

Any suggestions or advice would be helpful. Thanks in advance.

Tahmid04 commented 2 years ago

Hi, the method you showed should work just fine.
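
For readers who want something concrete, here is a minimal sketch of that general approach applied to BanglaBERT. It is not code from the maintainers or from the linked thread. It makes several assumptions: `AutoModel` is used instead of `AutoModelForPreTraining` so that the output is the encoder's hidden states rather than the pretraining head's logits; the tokenizer that gets loaded is a "fast" tokenizer, which `word_ids()` requires; and sub-word pieces are combined by averaging the last hidden layer, which is one common choice among several. The helper names `pool_subwords` and `word_embeddings` are hypothetical.

```python
from collections import defaultdict


def pool_subwords(word_ids, vectors):
    """Average the vectors of sub-word pieces that share a word id.

    Entries whose word id is None (special tokens such as [CLS]/[SEP]) are skipped.
    """
    groups = defaultdict(list)
    for wid, vec in zip(word_ids, vectors):
        if wid is not None:
            groups[wid].append(vec)
    # One averaged vector per word, in sentence order.
    return [
        [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for _, vecs in sorted(groups.items())
    ]


def word_embeddings(text, model_name="csebuetnlp/banglabert"):
    """Return one contextual vector per word of `text` (hypothetical helper)."""
    # Heavy dependencies are imported here so pool_subwords stays usable alone.
    import torch
    from normalizer import normalize
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # bare encoder, no pretraining head
    model.eval()

    # Assumption: a "fast" tokenizer is loaded, which word_ids() requires.
    encoded = tokenizer(normalize(text), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]  # (seq_len, hidden_size)
    return pool_subwords(encoded.word_ids(), hidden.tolist())
```

Calling `word_embeddings(text)` on the raw (not yet tokenized) sentence should return a list with one vector per word, where "word" means a unit produced by the tokenizer's own pre-tokenization, so punctuation such as "।" gets its own vector. These vectors are contextual: the same word in a different sentence will generally get a different embedding.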