amaiya / ktrain

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply
Apache License 2.0

Ktrain Bi-Lstm Bert NER (SciBert and BioBert) cased models, preprocessing text to lowercase #425

Closed: dummynov1 closed 2 years ago

dummynov1 commented 2 years ago

@amaiya: Thanks for sharing the info on the SciBERT cased model tokenizer config issue (#422). Just to confirm, should I use your workaround like this:

from ktrain import text
from transformers import AutoTokenizer

TDATA = 'train2.txt'
VDATA = 'test2.txt'

# load CoNLL-2003-formatted training and validation data
(trn, val, preproc) = text.entities_from_conll2003(TDATA, val_filepath=VDATA)

# replace the default tokenizer with the cased one
preproc.p.te.tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased',
                                                       do_lower_case=False)

WV_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz'
model = text.sequence_tagger('bilstm-bert', preproc,
                             bert_model='allenai/scibert_scivocab_cased',
                             wv_path_or_url=WV_URL)

Is this how the preproc tokenizer should be initialized before running the learner code? I think I'm doing something wrong; I'm not sure how to pass the option do_lower_case=False.

amaiya commented 2 years ago

The tokenizer override should be applied after creating the model, not before:

from transformers import AutoTokenizer

# create the model first ...
model = text.sequence_tagger('bilstm-bert', preproc,
                             bert_model='allenai/scibert_scivocab_cased')

# ... then replace the tokenizer with the cased version
preproc.p.te.tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased',
                                                       do_lower_case=False)