Sequence labeling and multi-class classification problem.

geo47 commented 3 years ago

Hello @dsindex

I need some advice on a combined sequence labeling and multi-class classification problem.

Given a dataset having:

1- Sentence: each sentence contains entities.
2- Category: a category to which the sentence belongs.
3- Entities: list of entities in the sentence.
4- Labels: list of labels for the entities.

For a single NER task, we tokenize the sentence and apply labels to those tokens, usually in a format similar to CoNLL2003. But when I also have to categorize the sentence, what format and strategy should be applied, considering BERT as a means of feature extraction?

Thanks.

dsindex commented 3 years ago

@geo47

There is a similar problem: intent classification and slot tagging. Multi-task learning is well suited to it.

BERT for Joint Intent Classification and Slot Filling https://arxiv.org/pdf/1902.10909.pdf

1. To train this, we need to prepare a dataset something like:

"i want to make a reservation A-hotel restaurant at 9 p.m"
=>
category#reqReservation
i O
want O
to O
...
A-hotel B-Place
restaurant I-Place
at O
9 B-Time
p.m I-Time

Except for the first line, the format is the same as the CoNLL style.
There should be two kinds of y labels for each example:
- first: category class id
- second: sequence of entity label ids
- we can combine the two y labels using the '[CLS]' token, i.e. the category label belongs to '[CLS]' and the entity labels belong to the other tokens.
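
As a rough illustration of the format and the two kinds of y labels described above, here is a minimal sketch in plain Python. It is not ntagger's actual preprocessing code: `parse_example` and `build_vocab` are hypothetical helper names, and the `at O` line is an assumption, since the example above does not show a label for "at".

```python
def parse_example(lines):
    """Split one example into (category, tokens, labels).

    Assumes the first line holds the sentence category and every following
    line is a CoNLL-style "token label" pair separated by whitespace.
    """
    category = lines[0].strip()
    tokens, labels = [], []
    for line in lines[1:]:
        token, label = line.split()
        tokens.append(token)
        labels.append(label)
    return category, tokens, labels


def build_vocab(items):
    """Map each distinct item to an integer id in order of first appearance."""
    vocab = {}
    for item in items:
        vocab.setdefault(item, len(vocab))
    return vocab


# Tail of the example sentence; "at O" is assumed, not taken from the source.
example = [
    "category#reqReservation",
    "A-hotel B-Place",
    "restaurant I-Place",
    "at O",
    "9 B-Time",
    "p.m I-Time",
]

category, tokens, labels = parse_example(example)
category_vocab = build_vocab([category])
label_vocab = build_vocab(labels)

y_category = category_vocab[category]                # first y: category class id
y_labels = [label_vocab[label] for label in labels]  # second y: entity label ids
print(y_category, y_labels)
```

In a real setup the vocabularies would be built over the whole training set, and the label ids would still have to be aligned with BERT's subword tokens.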

2. Multi-task loss

total loss = sequence classification loss + token classification loss
- sequence classification loss: cross-entropy loss for the first token (e.g. [CLS])
- token classification loss: cross-entropy loss for each token
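
The loss above can be sketched in plain Python so the arithmetic is visible. This is only an illustration under assumptions: the two terms are weighted equally, the token losses are averaged over the sentence, and the function names are hypothetical. A real model would compute the same quantity with a deep learning framework on the BERT outputs.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one softmax prediction against a gold class id."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


def multi_task_loss(cls_logits, category_id, token_logits, label_ids):
    """Sentence-level loss at [CLS] plus the mean token-level loss."""
    seq_loss = cross_entropy(cls_logits, category_id)
    token_losses = [cross_entropy(l, y) for l, y in zip(token_logits, label_ids)]
    tok_loss = sum(token_losses) / len(token_losses)  # averaging is an assumption
    return seq_loss + tok_loss


# Toy scores: 3 categories at the [CLS] position, 2 tokens with 4 labels each.
cls_logits = [2.0, 0.5, -1.0]
token_logits = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
loss = multi_task_loss(cls_logits, 0, token_logits, [0, 2])
print(round(loss, 4))
```

Padding positions would normally be excluded from the token term, and a weighting factor between the two terms is a common variation.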

geo47 commented 3 years ago

@dsindex

Thank you so much, I really appreciate it. I was looking for exactly this, and there is also an open-source implementation, JointBERT.

Could this feature be added to the ntagger repo?

Thanks again :-)

dsindex commented 3 years ago

@geo47

Just updated: see the --bert_use_mtl option, documented at https://github.com/dsindex/ntagger/blob/master/MULTI-TASK.md

geo47 commented 3 years ago

Thank you so much, I will give it a try. Much appreciated :-)

geo47 commented 3 years ago

Hi @dsindex

I ran the code on the ATIS data with the same settings described here, and I was able to train the model. However, I got the following error from evaluate.py:

evaluate.py --config=configs/config-bert.json --data_dir=data/atis --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --bert_use_mtl

INFO:__main__:{'emb_class': 'bert', 'enc_class': 'bilstm', 'n_ctx': 180, 'lowercase': True, 'token_emb_dim': 300, 'pad_token': '<pad>', 'pad_token_id': 0, 'unk_token': '<unk>', 'unk_token_id': 1, 'dsa_num_attentions': 4, 'dsa_dim': 300, 'dsa_r': 2, 'pos_emb_dim': 100, 'pad_pos': '<pad>', 'pad_pos_id': 0, 'char_n_ctx': 50, 'char_vocab_size': 262, 'char_padding_idx': 261, 'char_emb_dim': 25, 'char_num_filters': 30, 'char_kernel_sizes': [3, 9], 'dropout': 0.1, 'lstm_hidden_dim': 200, 'lstm_num_layers': 2, 'lstm_dropout': 0.0, 'mha_num_attentions': 8, 'pad_label': '<pad>', 'pad_label_id': 0, 'default_label': 'O', 'prev_context_size': 64, 'args': Namespace(batch_size=1, bert_disable_lstm=False, bert_output_dir='bert-checkpoint', bert_use_doc_context=False, bert_use_feature_based=False, bert_use_mtl=True, bert_use_pos=False, bert_use_subword_pooling=False, bert_use_word_embedding=False, config='configs/config-bert.json', convert_onnx=False, data_dir='data/atis', device='cuda', elmo_options_file='embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json', elmo_weights_file='embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5', enable_dqm=False, enable_ort=False, model_path='pytorch-model-bert.pt', num_examples=0, num_threads=0, onnx_opset=11, onnx_path='pytorch-model.onnx', quantize_onnx=False, quantized_onnx_path='pytorch-model.onnx-quantized', use_char_cnn=False, use_crf=False, use_mha=False, use_ncrf=False)}

INFO:dataset.dataset:[data/atis/test.txt.fs data loaded]
Traceback (most recent call last):
  File "/ntagger/evaluate.py", line 551, in <module>
    main()
  File "/ntagger/evaluate.py", line 548, in main
    evaluate(args) 
  File "/ntagger/evaluate.py", line 323, in evaluate
    model = load_model(config, checkpoint)
  File "/ntagger/evaluate.py", line 70, in load_model
    bert_model = AutoModel(bert_config)
TypeError: __init__() takes 1 positional argument but 2 were given

dsindex commented 3 years ago

@geo47

you can modify it to:

bert_model = AutoModel.from_config(bert_config)
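
For context, `AutoModel` in the Hugging Face transformers library is a factory class that is not meant to be constructed directly; a model is built through `AutoModel.from_config(...)` or `AutoModel.from_pretrained(...)`. The sketch below uses a tiny, made-up `BertConfig` (the sizes are arbitrary assumptions, not ntagger's settings) so that it runs without downloading anything.

```python
from transformers import AutoModel, BertConfig

# Tiny hypothetical config so the model builds quickly and needs no download.
bert_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

# from_config picks the concrete architecture from the config type and
# creates it with randomly initialised weights (nothing is loaded here).
bert_model = AutoModel.from_config(bert_config)
print(type(bert_model).__name__)
```

Because `from_config` does not load any weights, the trained parameters still have to be restored afterwards, presumably from the checkpoint that is passed to `load_model` in evaluate.py.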

geo47 commented 3 years ago

Thank you so much, it worked :-)

dsindex / ntagger

Sequence labeling and multi-class classification problem. #5