AdamStein97 / Semi-Supervised-BERT-NER


Two questions about applying your amazing model to a Traditional Chinese dataset #1

Open marcusau opened 4 years ago

marcusau commented 4 years ago

Thanks for your amazing work. I appreciate this model very much.

I have a Chinese dataset of ~60k sentences. The NER labelling is done, but about 50% of the labels need further polishing, as the manual NER labelling introduced some noise.

Based on your instructions in "bert_ner_data_dist_kl_config.yaml",

model_start_weights_filename: 'BERT_NER_final'

So, in order to train a bert_ner_data_dist_kl model, we have to train a bert_ner baseline model first, and then use the weights of this baseline model to train the bert_ner_data_dist_kl model.

Is my understanding correct?

Additionally, the purpose of your Semi-Supervised-BERT-NER is to tackle the issue of having only limited labelled NER data for training a NER model, right?

In my case, my firm provides a huge "raw dataset", but the labelled data is limited.

Thanks a lot.

Marcus

AdamStein97 commented 4 years ago

Hi Marcus,

You are correct. You first train the BERT_NER model using the bert_ner_trainer.py file. After this is trained, you use bert_ner_trainer_data_dist_kl.py to fine-tune these weights. With around 60k sentences, I would recommend a high batch size to make the fine-tuning as effective as possible.

Yes exactly. The goal of the model is to be able to leverage unlabelled data to improve accuracy. If you can get this "raw dataset" in the same format as your labelled examples then hopefully this approach will be successful for you.
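As a rough sketch, the two stages just amount to running the two trainer scripts in order. This assumes both scripts are launched directly from the repo root and takes no CLI arguments into account, since none are shown in this thread:

```python
# Rough sketch of the two-stage run, assuming both scripts are launched
# directly from the repo root; any CLI arguments they expect are not
# shown in this thread.
import subprocess

# Stage 1: train the baseline BERT_NER model. This produces the weights that
# bert_ner_data_dist_kl_config.yaml points at via
# model_start_weights_filename: 'BERT_NER_final'.
subprocess.run(["python", "bert_ner_trainer.py"], check=True)

# Stage 2: fine-tune those weights with the data-distribution KL objective,
# using both the labelled and unlabelled data.
subprocess.run(["python", "bert_ner_trainer_data_dist_kl.py"], check=True)
```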

Let me know if you have any more questions!

Thanks, Adam

marcusau commented 4 years ago

Hi Adam,

Thanks for your prompt response. Just one follow-up question. My approach would be to:

  1. use 40% of my raw dataset (~25k sentences) to train the BERT_NER model with the bert_ner_trainer.py file (because I am highly confident in the quality of the NER labels on this 40% of the data), and

  2. use the full dataset (~60k sentences) with bert_ner_trainer_data_dist_kl.py to fine-tune these weights.

Do you think my approach is correct?

Thanks a lot.

Marcus

AdamStein97 commented 4 years ago

Hi Marcus,

Yes, this is the correct approach. Please note that in the bert_ner_trainer_data_dist_kl.py file, you will also pass the 40% of your raw dataset as the "labelled_ds" argument and the remaining 60% as the "unlabelled_ds" argument.
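As a rough sketch of that split (assuming your full dataset is already a tf.data.Dataset and the trusted 40% can be separated up front; the repo's actual function signatures may differ):

```python
# Minimal sketch of the labelled/unlabelled split, assuming the full dataset
# is a tf.data.Dataset and the trusted ~25k sentences come first (in practice
# you would filter on whatever marks an example as trusted).
import tensorflow as tf

def split_labelled_unlabelled(full_ds: tf.data.Dataset, n_labelled: int):
    labelled_ds = full_ds.take(n_labelled)      # trusted, manually checked labels
    unlabelled_ds = full_ds.skip(n_labelled)    # remaining examples, labels ignored
    return labelled_ds, unlabelled_ds

# e.g. ~25k trusted sentences out of ~60k total:
# labelled_ds, unlabelled_ds = split_labelled_unlabelled(full_ds, 25_000)
```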

Also do not forget that you will need to estimate the true probability distribution of your labels and place that in the "bert_ner_data_dist_kl_config.yaml" config file.
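To estimate that distribution, something like the following sketch should work, assuming your labelled data is in a CSV with a 'tag_id' column (as in the repo's preprocessed dataset format); the exact config key to paste the values into is not shown here:

```python
# Rough sketch: estimate the tag distribution from the trusted labelled CSV.
# The 'tag_id' column name is an assumption based on the config fields quoted
# in this thread, as is the CSV filename.
import pandas as pd

df = pd.read_csv("preprocessed_ner_dataset.csv")            # trusted labelled portion
counts = df["tag_id"].value_counts().sort_index()           # count of each tag id
label_distribution = (counts / counts.sum()).round(4)       # normalised probabilities

print(label_distribution.tolist())  # paste these values into bert_ner_data_dist_kl_config.yaml
```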

Good luck! Adam

marcusau commented 4 years ago

Hi Adam,

Sure, thanks a lot.

I will do so in the middle of this week.

Thanks.

Marcus

marcusau commented 4 years ago

Hi Adam,

I would like to share the progress of my work on your amazing library.

I am training the BERT_NER model using the bert_ner_trainer.py file with 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/2' on Google Colab.

My dataset looks something like this:

sentence_id,Word,word_id,tag_id
0,[CLS],101,0
0,花,5709,0
0,旗,3186,0
0,發,4634,22
0,表,6134,22
0,報,1841,3
0,告,1440,3
0,指,2900,22
0,,,8024,22
0,中,704,0
0,升,1285,0
0,控,2971,0
0,股,5500,0
0,旗,3186,22
0,下,678,22
0,L,154,0
0,e,147,0
0,x,166,0
0,u,163,0
0,s,161,0
0,新,3173,22
0,款,3621,22
0,M,155,3
0,P,158,3
0,V,164,3
0,L,154,3
0,M,155,3
0,3,124,3
0,0,121,3
0,0,121,3
0,市,2356,15
0,場,1842,15
0,接,2970,22
0,受,1358,22
0,度,2428,22
0,極,3513,22
0,佳,881,22
0,,,8024,22
0,有,3300,22
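For reference, a minimal pandas sketch (not from the repo) of grouping this flat CSV back into per-sentence sequences for inspection:

```python
# Quick spot-check (not from the repo): group the flat CSV back into
# per-sentence sequences to make sure each sentence lines up as expected.
import pandas as pd

df = pd.read_csv("preprocessed_ner_dataset.csv")
for sentence_id, group in df.groupby("sentence_id", sort=True):
    word_ids = group["word_id"].tolist()
    tag_ids = group["tag_id"].tolist()
    print(sentence_id, word_ids[:10], tag_ids[:10])   # first few tokens of each sentence
    if sentence_id >= 2:                               # only look at the first few sentences
        break
```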

I have about 22 NER tags, including: {"0": "ORG", "1": "LOC", "2": "FAC", "3": "PRODUCT", "4": "LANGUAGE", "5": "NORP", "6": "WORK_OF_ART", "7": "QUANTITY", "8": "PERSON", "9": "LAW", "10": "EVENT", "11": "TITLE", "12": "TIME", "13": "IDIOM", "14": "ENGLISH", "15": "J", "16": "FIN", "17": "TERM", "18": "UNIT", "19": "CONCEPT", "20": "POLICY", "21": "SLOGAN"}

Notes on some of the tags:
ORG = firm/organization
FIN = financial instruments, e.g. stock indices, bonds, options, etc.
CONCEPT = ideas, concepts
TERM = professional terms outside the financial scope, e.g. XX accounting standards
UNIT = kg, lots of stock, etc.
SLOGAN = most Chinese listed companies and policies are named with slogans; this is a characteristic of the China equity market
POLICY = government policies or schemes
LAW = rules or laws
J = short names of some stocks, names or policies

[screenshot]

Configs:

csv_filename: 'preprocessed_ner_dataset.csv'
max_seq_length: 216
BATCH_SIZE: 128
BUFFER_SIZE: 2048
test_set_batches: 75
labelled_train_batches: 20
categories: 22

word_id_field: 'word_id'
mask_field: 'mask'
segment_id_field: 'segment_id'
tag_id_field: 'tag_id'

EPOCHS: 20
latent_dim: 32
rate: 0.0
mlp_dims: [256, 128, 64]
lr: 0.001
model_save_weights_name: 'BERT_NER'

Let's see the result once it finishes.

Marcus

marcusau commented 4 years ago

Also, I use the 'BMES' label format instead of the 'BIO' format.
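For example (illustrative only, not taken from the dataset above), the same entity span under the two schemes:

```python
# Illustrative only: the same tokens tagged under BIO vs BMES.
# BIO  uses Begin / Inside / Outside.
# BMES uses Begin / Middle / End / Single (plus Outside).
tokens    = ["花",    "旗",    "發"]
bio_tags  = ["B-ORG", "I-ORG", "O"]   # two-character entity in BIO
bmes_tags = ["B-ORG", "E-ORG", "O"]   # same entity in BMES; a single-character entity would be S-ORG
```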

marcusau commented 4 years ago

Hi Adam,

I may need your help with my first training run using bert_ner_trainer.py.

For my initial training exercise, all procedures were run on Google Colab, but the training result is strange and below expectation.

[screenshot of the training metrics]

I don't know what mistake I have made with the dataset. The validation accuracy is capped at 10% even after running for 100 epochs, and the accuracy excluding 'O' stays at 0%.

Here is a sample of my 'processed_dataset.csv' and the parameters I used for training:

[screenshot of the processed_dataset.csv sample]

I do think my preprocessed dataset format is correct and that it strictly follows your requirements.

The parameters I used are:

In config.yaml:

csv_filename: 'preprocessed_ner_dataset.csv'
max_seq_length: 128  (I changed this to 128 to fit the news articles from my data source)
BATCH_SIZE: 128
BUFFER_SIZE: 2048
test_set_batches: 75
labelled_train_batches: 22
categories: 22  (there are 22 NER categories in my dataset)

word_id_field: 'word_id'
mask_field: 'mask'
segment_id_field: 'segment_id'
tag_id_field: 'tag_id'

In bert_ner.yaml:

EPOCHS: 20
latent_dim: 32
rate: 0.0
mlp_dims: [256, 128, 64]
lr: 0.001
model_save_weights_name: 'BERT_NER'

For the pretrained BERT model, I used multilingual BERT. [screenshot]

Please give me some hints about any mistakes I may have made.

Thanks a lot.

Marcus

AdamStein97 commented 4 years ago

Hi Marcus,

So your loss seems to be NaN from the beginning, which implies the input to the model is wrong. Have you made sure that the tokenizer which generates the word ids is compatible with Chinese? (This is the most likely issue.) I suspect you may also need to pull a different version of the BERT layer from TF Hub in both the preprocessor and the model.

Following that I would recommend checking the batches being passed to the model and just ensuring they seem sensible.
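As a rough sketch of both checks (assuming the TF Hub bert_zh module you linked and the FullTokenizer from the TensorFlow Models package; the repo's own preprocessing code may build its tokenizer differently):

```python
# Rough sketch of both sanity checks, assuming the TF Hub bert_zh module and
# the FullTokenizer from the TensorFlow Models (official.nlp.bert) package;
# the repo's own preprocessing code may construct its tokenizer differently.
import tensorflow_hub as hub
from official.nlp.bert import tokenization

BERT_URL = "https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/2"
bert_layer = hub.KerasLayer(BERT_URL, trainable=False)

# Build the tokenizer from the vocab shipped with the hub module.
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

# Check 1: Chinese characters should map to real vocab ids, never [UNK].
tokens = tokenizer.tokenize("花旗發表報告指")
print(tokens, tokenizer.convert_tokens_to_ids(tokens))

# Check 2: inspect one batch actually fed to the model and make sure the word
# ids, masks and tag ids look sensible (no out-of-range values, no NaNs).
# Here labelled_ds stands in for whatever dataset object the trainer consumes.
# for batch in labelled_ds.take(1):
#     print(batch)
```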

Thanks, Adam

marcusau commented 4 years ago

OK, I may try the bert-Chinese-base model from TensorFlow Hub first. Let's see if that makes a difference. Thanks.