BaderLab / saber

Saber is a deep-learning based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress. Many things are broken, and the codebase is not stable.
https://baderlab.github.io/saber/
MIT License

DISO model #100

Closed rishabh279 closed 5 years ago

rishabh279 commented 5 years ago

Hello,

Thank you for sharing the trained model. I tested it today on a few sentences and found that it is able to catch most single-term symptoms/signs. I should also say the performance was pretty good. I was getting results within a minute for each sentence. However, the model was missing certain multi-word symptoms/signs, such as the ones below:

a) "One case of anaphylactic dermatitis manifested as erythema with swelling on the face & neck, and others as erosive and scaly erythema on the fold of skin, or red macules, papules, plaques and pigmentation on the whole body." -> It missed the two terms "pigmentation" and "anaphylactic"; we were looking to capture "anaphylactic dermatitis".

b) "In our patient case, worsening vision combined with persistent CME and development of retinal pigmentation epithelial changes after the initial 4 weeks of Paclitaxel cessation indicated that irreversible reduced in vision was a real possibility if we persisted with the drug cessation treatment plan alone." -> It missed "retinal pigmentation" and "worsening vision".

c) "In response to invading microbes, an acute inflammatory response was mounted with the release of pro-inflammatory mediators and the recruitment of inflammatory cells into the lungs, which resulted to increased vascular permeability and pulmonary edema." -> It missed "vascular permeability" and "inflammatory response".

Do you suppose this is because of the smaller word embeddings? It would be great if you could share the model trained on your best word embeddings (~4 GB). I have a feeling that model will be able to capture signs/symptoms more accurately. Great efforts. Thank you.

JohnGiorgi commented 5 years ago

I will answer this!! Just need a little time to go over the examples and decide whether they are indeed false negatives or not. Thanks.

rishabh279 commented 5 years ago

Sure, take your time. Thank you.

JohnGiorgi commented 5 years ago

Okay great! Thanks for this feedback. It is super useful.

I should also say the performance was pretty good. I was getting results within a minute for each sentence.

On my local machine, a full abstract takes <1 second. For example, this abstract takes only 587 ms on average. Hopefully the same is true for you.

a) "One case of anaphylactic dermatitis manifested as erythema with swelling on the face & neck, and others as erosive and scaly erythema on the fold of skin, or red macules, papules, plaques and pigmentation on the whole body." -> It missed "pigmentation" and "anaphylactic" two terms..we were looking to capture "anaphylactic dermatitis".

So the first example strikes me as a genuine false negative: the model missed the left boundary of the entity ("anaphylactic").

However, in the latter case, I went and checked the training sets: when the word "pigmentation" appears without some kind of adjective (e.g. in "hepatomegaly skin pigmentation"), it was not annotated. The annotators must have agreed that the token "pigmentation" by itself does not constitute a disease entity. I think Saber actually made the right call in this case. Indeed, this BioNER tool seems to agree, labelling it as a "biological process".

b) "In our patient case, worsening vision combined with persistent CME and development of retinal pigmentation epithelial changes after the initial 4 weeks of Paclitaxel cessation indicated that irreversible reduced in vision was a real possibility if we persisted with the drug cessation treatment plan alone." -> It missed "retinal pigmentation", "worsening vision"

Okay, same story here. I think "retinal pigmentation" is a genuine false negative. However, "worsening vision" is dubious; indeed, another BioNER tool doesn't label it either.

c) "In response to invading microbes, an acute inflammatory response was mounted with the release of pro-inflammatory mediators and the recruitment of inflammatory cells into the lungs, which resulted to increased vascular permeability and pulmonary edema." -> It missed "vascular permeability" and "inflammatory response".

I am noticing a trend! I will give you "inflammatory response" (and arguably this should be annotated as "acute inflammatory response"), but "vascular permeability" is again a little dubious. It seems more like a "biological process" than a disease entity to me. Again, this tool seems to agree.

So, I think you found some genuine mistakes, and thank you for that! It helps me diagnose the model. It is important to remember the difficulty of deciding what should and should not be counted as a biomedical entity. You can always browse through the training sets here. I also read the annotation guidelines for a given dataset to get my head around what should be considered an entity and what shouldn't.

Also, Saber supports transfer learning / online learning, so you could compile your own small dataset and fine-tune the pre-trained model for a few epochs:

from saber.saber import Saber

# load the pre-trained DISO model
saber = Saber()
saber.load('DISO')

# the right number of epochs depends on how big the new dataset is
saber.config.epochs = 10

# fine-tune on your own annotated data, then save the updated model
saber.load_dataset('path/to/my/small/dataset')
saber.train()
saber.save('path/to/save/updated/model')
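
Once trained, you could sanity-check the updated model on one of your example sentences with Saber's annotate method, the same way you would use the pre-trained model:

# run the fine-tuned model on new text
annotation = saber.annotate("One case of anaphylactic dermatitis manifested as erythema with swelling on the face & neck.")
print(annotation)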

Do you suppose this is because of the smaller word embeddings?

As for the full model, I do suspect it would improve performance. The reason is that, right now, the model only contains embeddings for words that appeared in the training sets. This means that any time you feed it a word that didn't appear in one of the training sets, its word embedding is just the zero vector (essentially, the model knows nothing about that word).

The full model contains embeddings for every token in the pre-trained embedding file, which in our case was trained on all of PMC, PubMed, and Wikipedia, meaning we are far less likely to encounter a word we have no information for.
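
To make that concrete, here is a minimal sketch of what the lookup amounts to (toy code, not Saber's actual implementation; the vocabulary and embedding size are made up):

import numpy as np

EMB_DIM = 200  # illustrative embedding size

# toy vocabulary learned from the training sets
embeddings = {
    "erythema": np.random.rand(EMB_DIM),
    "edema": np.random.rand(EMB_DIM),
}

def lookup(token):
    # words seen during training get their learned vector; everything
    # else falls back to the zero vector, i.e. no information at all
    return embeddings.get(token.lower(), np.zeros(EMB_DIM))

print(lookup("erythema").any())      # True: a real, learned vector
print(lookup("anaphylactic").any())  # False: all zeros, unseen word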

I will work on getting the model up. I think I will host it on Google Drive as DISO-LRG and have Saber automatically download it when you call:

saber.load('DISO-LRG')

rishabh279 commented 5 years ago

Thank you for the detailed clarification. I agree with you that some of them are genuine false negatives, which your model will be able to pick up over time with more training and better word embeddings. Regarding the rest of the items, it seems to be very use-case dependent: in some cases we would need "worsening vision" and in others we would not. If I understand correctly, you are attempting to identify terms that map to some standard like UMLS concept IDs, SNOMED, etc.?

I noticed that the training dataset available at your link is only 110 MB. Is this enough data for training a CRF? Or are you restricted in how much annotated data you can share outside of Bader Lab?

With regards to accuracy and word embeddings, your reasoning is correct; the zero vectors for unknown words will lead to issues. I look forward to receiving your link for the full model. Just curious: how are you training your word embeddings? Gensim (fastText, word2vec), ELMo, BERT?

Also, could you confirm whether your training data in Biomedical-Corpora/corpora/NER/CoNLL includes all of the data from the different sources listed under "Table of Corpora" on your GitHub page? I was just wondering how much effort would go into compiling all of that annotated data (in different formats) into a single format like the one you have. Great work, and thank you for the prompt responses.

JohnGiorgi commented 5 years ago

Based on your questions, I think our paper would be of interest to you: Transfer learning for biomedical named entity recognition with neural networks.

Thank you for the detailed clarification. I agree with you that some of them are genuine false negatives, which your model will be able to pick up over time with more training and better word embeddings. Regarding the rest of the items, it seems to be very use-case dependent: in some cases we would need "worsening vision" and in others we would not.

This is one of the primary difficulties in text mining / information extraction of biomedical text: we (meaning the experts) aren't in agreement on what constitutes an entity in many cases.

If I understand correctly, you are attempting to identify terms that map to some standard like UMLS concept IDs, SNOMED, etc.?

Essentially, yes. Each dataset follows its own annotation guidelines. One of the datasets I trained on (NCBI Disease) follows UMLS, I believe.

I noticed that the training dataset available at your link is only 110 MB.

I used multi-task learning and trained on 9 datasets (if you unpack the model, you can see the datasets listed in the config.ini file under dataset_folder).
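
Roughly, the relevant entry looks like this (illustrative only; the paths are placeholders, not the exact values shipped with the model):

; config.ini: comma-separated list of datasets used for multi-task training
; (the shipped model lists 9 such paths here)
dataset_folder = path/to/NCBI_Disease, path/to/another_dataset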

Is this enough data for training a CRF?

This question doesn't actually have a straightforward answer. For starters, the CRF is just the final classification layer; the heavy lifting in this model is done by the BiLSTM networks. Some papers seem to suggest that you don't actually need that much data at all to train a CRF for NER. My experience says otherwise, which is why I use multi-task learning.
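
For intuition, this is essentially the standard BiLSTM-CRF architecture for sequence labelling. A bare-bones Keras sketch (illustrative only, not Saber's actual model code; sizes are made up, and the CRF layer comes from keras-contrib) might look like:

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras_contrib.layers import CRF  # CRF is not in core Keras

VOCAB_SIZE, EMB_DIM, NUM_TAGS = 20000, 200, 9  # illustrative sizes

model = Sequential()
# word embeddings feed a bidirectional LSTM, which does the heavy lifting
model.add(Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(TimeDistributed(Dense(64, activation="relu")))
# the CRF is only the final classification layer over the tag sequence
crf = CRF(NUM_TAGS)
model.add(crf)
model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])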

Or are you restricted in how much annotated data you can share outside of Bader Lab?

Nope. These are not our datasets, and they are all open source. They were originally collected here.

With regards to accuracy and word embeddings, your reasoning is correct; the zero vectors for unknown words will lead to issues. I look forward to receiving your link for the full model.

Yes, exactly. However, we also learn word embeddings based on the characters of a word, so there is always some information for every word, even if it didn't appear in the training sets.

I think the full model will still improve performance, but I don't expect the increase to be dramatic.
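
To illustrate why the character-level trick helps: every word is spelled with characters the model has seen, so even an unseen word gets a meaningful, non-zero representation. A toy sketch of the idea (a real model composes the character vectors with a CNN or LSTM rather than averaging them):

import numpy as np

CHAR_DIM = 25  # illustrative character-embedding size
rng = np.random.default_rng(0)

# one learned vector per character
char_embeddings = {c: rng.random(CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz"}

def char_representation(token):
    # average the character vectors; crude, but unlike a word lookup
    # it never maps an unseen word to the zero vector
    vectors = [char_embeddings[c] for c in token.lower() if c in char_embeddings]
    return np.mean(vectors, axis=0)

print(char_representation("anaphylactic").shape)  # (25,), despite being unseen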

Just curious: how are you training your word embeddings? Gensim (fastText, word2vec), ELMo, BERT?

Word2vec. I didn't train them myself, though; I got them from here.

Also, could you confirm whether your training data in Biomedical-Corpora/corpora/NER/CoNLL includes all of the data from the different sources listed under "Table of Corpora" on your GitHub page?

Almost, but no: some datasets in the table are not in the repository. The repository needs some love; I started it a while ago and never updated it. I will try to get around to it.

You can always look at pubannotation.org and the repository I linked above for more biomedical corpora.

I was just wondering how much effort would go into compiling all of that annotated data (in different formats) into a single format like the one you have.

I might need you to clarify this question. By "into a single format" do you mean a single dataset? Or do you mean a single repo (or otherwise)?
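
Either way, in case it helps: the CoNLL-style format used in the repository is just one token per line with its tag, and a blank line between sentences. For example (the tags here are illustrative):

Anaphylactic    B-DISO
dermatitis      I-DISO
manifested      O
as              O
erythema        B-DISO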

By the way, we are open to any help we can get with the tool. If you feel like working on a feature or collecting data or whatever, just let me know, or go ahead and open a pull request!