dmis-lab / BERN2

BERN2: an advanced neural biomedical named-entity recognition and normalization tool
http://bern2.korea.ac.kr
BSD 2-Clause "Simplified" License

Support NER of a new category of terms? #27

Closed · hh1985 closed this issue 1 year ago

hh1985 commented 2 years ago

Any ideas on extending it to support new classes, such as microbiota? Thanks.

mjeensung commented 2 years ago

Thanks for reaching out to us.

As long as training datasets for the new types exist, the supported types of BERN2 can be expanded. We will consider providing instructions on how to train our NER model on new datasets.

jaredcthomas commented 2 years ago

I am also interested in identifying new entity types. I would very much appreciate a tutorial on how to do this.

mjeensung commented 1 year ago

We uploaded a tutorial on how to train our NER model for the supported entity types: https://github.com/dmis-lab/BERN2/tree/main/multi_ner/training

By preprocessing the dataset of the new entity type and adding it as a training set, you will be able to get an NER model for the new entity type. If you have any follow-up questions, please re-open this issue.
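For readers following that tutorial, below is a minimal sketch of one way to convert character-offset annotations for a new entity type into token-per-line BIO tags, the format used by many BioNER corpora (e.g., NCBI-disease). The to_bio function, the whitespace tokenization, and the "microbiota" label are illustrative assumptions; check the example data shipped with multi_ner/training for the exact format BERN2 expects.

    # Sketch only: convert (text, [(start, end), ...]) annotations into
    # CoNLL-style "token<TAB>tag" lines. Verify the exact format against the
    # example data in multi_ner/training before relying on it.
    import re

    def to_bio(text, spans, entity_type="microbiota"):
        """spans: character-offset (start, end) pairs for the new entity type."""
        lines = []
        for match in re.finditer(r"\S+", text):             # whitespace tokenization
            tok_start, tok_end = match.start(), match.end()
            tag = "O"
            for ent_start, ent_end in spans:
                if tok_start >= ent_start and tok_end <= ent_end:
                    # first token of a span gets B-, the rest get I-
                    tag = ("B-" if tok_start == ent_start else "I-") + entity_type
                    break
            lines.append(f"{match.group(0)}\t{tag}")
        return "\n".join(lines) + "\n"                       # blank line ends a sentence

    if __name__ == "__main__":
        sentence = "Lactobacillus reuteri modulates gut inflammation."
        print(to_bio(sentence, [(0, 21)]))

The output pairs each token with O, B-microbiota, or I-microbiota, which can then be split into train/dev/test files alongside the existing datasets.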

liwenqingi commented 1 year ago

Hi! If I want to train a new type, should I modify the modeling.py file to set up a separate classifier for training, and then add the modified classifier layer back into the original modeling.py file?

minstar commented 1 year ago

Hi @liwenqingi

If you have data for a new type, then you could modify modeling.py to set up a separate classifier and train it.
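To make the idea concrete, here is an illustrative single-type token-classification head built with Hugging Face transformers. It is not BERN2's actual modeling.py; the encoder name, the three-label BIO scheme, and the dropout value are assumptions for the sketch.

    # Illustrative sketch of a separate classifier for one new entity type,
    # not BERN2's modeling.py. Requires: pip install torch transformers
    import torch.nn as nn
    from transformers import AutoModel

    class SingleTypeNER(nn.Module):
        def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1", num_labels=3):
            super().__init__()
            # num_labels=3 covers O / B-NEW / I-NEW for a single entity type
            self.encoder = AutoModel.from_pretrained(encoder_name)
            self.dropout = nn.Dropout(0.1)
            self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask, labels=None):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            logits = self.classifier(self.dropout(hidden))
            if labels is not None:
                loss = nn.CrossEntropyLoss(ignore_index=-100)(
                    logits.view(-1, logits.size(-1)), labels.view(-1))
                return loss, logits
            return logits

Training such a head on the BIO data for the new type is a quick way to sanity-check the data before touching BERN2's multi-task code.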

liwenqingi commented 1 year ago

@minstar I see that you train several entity types together. I want to integrate new entity types into bern2_ner. Should I add a classifier to modeling.py and do joint training with the original entities, or just train the new entities separately and then add them to bern2_ner? I ask because the F1 scores I get when training entities separately (e.g., "disease") are low (around 0.6 after 50 epochs). Thanks for your reply!

minstar commented 1 year ago

I would prefer the latter, because with joint training you would have to find optimal training settings, which can be time-consuming and labor-intensive. May I ask why training the entities separately gives a low F1 score?

liwenqingi commented 1 year ago

> I would prefer the latter, because with joint training you would have to find optimal training settings, which can be time-consuming and labor-intensive. May I ask why training the entities separately gives a low F1 score?

I modified the structure of modeling.py just to test the feasibility of training the classifier separately, using the NERdata datasets (e.g., gene, disease, ...) for separate training, but the results are not very good.
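One thing worth double-checking in that setting is whether the reported F1 is entity-level rather than token-level; boundary errors make the two diverge a lot. A quick check with the seqeval package (assuming BIO tags) looks like this:

    # Entity-level F1 check with seqeval (pip install seqeval).
    # A partially matched entity counts as fully wrong, so entity-level F1
    # can be much lower than token-level accuracy on the same predictions.
    from seqeval.metrics import classification_report, f1_score

    y_true = [["B-disease", "I-disease", "O", "O"]]
    y_pred = [["B-disease", "O", "O", "O"]]    # boundary error: no credit for the partial match

    print(f1_score(y_true, y_pred))            # 0.0
    print(classification_report(y_true, y_pred))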

minstar commented 1 year ago

Then, how about adding your new entity classification system through socket communication, as we did?

In bern2.py, lines 361-363, we separately get the results of tmvar, gnormplus, and our multi-ner classifier:

        for ner_type in ['tmvar', 'gnormplus', 'mtner']:
            arguments_for_coroutines.append([ner_type, pubtator_file, output_mtner, base_name, loop])
        async_result = loop.run_until_complete(self.async_ner(arguments_for_coroutines))

You could train your entities on their own, which could be better than just modifying the classifier separately.
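As a rough illustration of that socket approach (the port, the newline-delimited message format, and the run_my_ner placeholder below are hypothetical, not BERN2's actual protocol), a separately trained model could be exposed as its own service like this:

    # Hypothetical sketch of a stand-alone NER service reachable over a socket.
    # BERN2's real components use their own message format, so adapt this to
    # whatever async_ner expects before wiring it in.
    import json
    import socketserver

    def run_my_ner(pubtator_path):
        # Placeholder: load the PubTator file, run the separately trained model,
        # and return a list of {"start", "end", "mention", "type"} dicts.
        return []

    class NERHandler(socketserver.StreamRequestHandler):
        def handle(self):
            path = self.rfile.readline().decode("utf-8").strip()
            annotations = run_my_ner(path)
            self.wfile.write(json.dumps(annotations).encode("utf-8") + b"\n")

    if __name__ == "__main__":
        with socketserver.TCPServer(("127.0.0.1", 18899), NERHandler) as server:
            server.serve_forever()

bern2.py could then treat such a service like a fourth ner_type in the loop above, with the caller sending the PubTator file path and reading back the JSON annotations.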

liwenqingi commented 1 year ago

> Then, how about adding your new entity classification system through socket communication, as we did?
>
> In bern2.py, lines 361-363, we separately get the results of tmvar, gnormplus, and our multi-ner classifier:
>
>         for ner_type in ['tmvar', 'gnormplus', 'mtner']:
>             arguments_for_coroutines.append([ner_type, pubtator_file, output_mtner, base_name, loop])
>         async_result = loop.run_until_complete(self.async_ner(arguments_for_coroutines))
>
> You could train your entities on their own, which could be better than just modifying the classifier separately.

Thanks for your reply! But I want to use BERN2 locally and may not need socket interaction, because a large amount of data needs to be processed. Also, I just found that training the "species" entity type works very well, which may be related to the annotation quality of the data.