covidgraph / documentation


Implement bio-med named entity recognition (NER) #46

Open seangrant82 opened 4 years ago

seangrant82 commented 4 years ago

As we get more data streaming through, we should implement named entity recognition (NER) to identify medical entities such as drugs, genes, and cells, in order to link them to nodes in our current graph and gather more insights.

Here is an example of identifying entities in CORD-19 data (see attached image).

COVID-19 NER data is available here: https://uofi.box.com/s/k8pw7d5kozzpoum2jwfaqdaey1oij93x

Training of BioBERT would also apply here: https://github.com/dmis-lab/biobert

Conceptual schema for entity and entity_types (see attached image).

This relates to the following data sources:

1 (CORD-19)

motey commented 4 years ago

Would it make sense to link :Entity to the text fragment (somehow) where it originates from? We could achieve that by linking it to :Body_text and storing the position coordinates on the relationship.
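For illustration, a minimal sketch of what that could look like with the official Python driver. The :MENTIONS relationship type, the begin/end properties, and the name property on :Entity are placeholders, not part of the current schema:

from neo4j import GraphDatabase

# Placeholder connection details for a local covidgraph instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_entity_to_body_text(tx, entity_name, body_text_id, begin, end):
    # Hypothetical :MENTIONS relationship carrying the position coordinates
    # of the mention inside the :Body_text node's text.
    tx.run(
        """
        MATCH (e:Entity {name: $entity_name})
        MATCH (b:Body_text) WHERE id(b) = $body_text_id
        MERGE (e)-[m:MENTIONS]->(b)
        SET m.begin = $begin, m.end = $end
        """,
        entity_name=entity_name, body_text_id=body_text_id, begin=begin, end=end,
    )

with driver.session() as session:
    session.write_transaction(link_entity_to_body_text, "PIK3CA", 12345, 902, 906)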

mpreusse commented 4 years ago

We also have the :Fragment nodes (full text split up by .). Not sure if that is useful though 😄

mpreusse commented 4 years ago

I've also been in touch with https://www.scibite.com/ who build https://www.scibite.com/platform/termite/

amalic commented 4 years ago

Interesting paper: A Survey on Deep Learning for Named Entity Recognition

mpreusse commented 4 years ago

Interesting paper: A Survey on Deep Learning for Named Entity Recognition

Yes, I think everything is going BERT now. Large pretrained models that you fine-tune for your use cases.

MFreidank commented 4 years ago

Hi, would it make sense to fine-tune a BioBERT model on the NER data above as a first step? :) I could look into that if it would help. However, I would certainly need help with the integration/linking piece; who would be the best person to collaborate with?

I would suggest using huggingface/transformers to fine-tune BioBERT, as it has full integration for NER, Q&A, and similar tasks and a nice interface. Checkpoints can be converted and used from that framework without much additional work.
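To make the idea concrete, here is a rough sketch of loading a BioBERT checkpoint with a token-classification head in huggingface/transformers. The dmis-lab/biobert-base-cased-v1.1 model id and the label set are assumptions and would need to be adapted to the NER data linked above:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed checkpoint name on the Hugging Face model hub; any BERT-like model
# supported by the library could be substituted.
model_name = "dmis-lab/biobert-base-cased-v1.1"

# Placeholder label set; the real labels come from the COVID-19 NER data linked above.
labels = ["O", "B-GENE", "I-GENE", "B-DRUG", "I-DRUG", "B-DISEASE", "I-DISEASE"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Tokenize an example sentence and run the (not yet fine-tuned) token classifier;
# actual fine-tuning would then go through the library's Trainer API on the labeled data.
encoded = tokenizer("Remdesivir inhibits SARS-CoV-2 replication in vitro.", return_tensors="pt")
outputs = model(**encoded)
predicted_label_ids = outputs.logits.argmax(dim=-1)  # per-token label ids, one per wordpiece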

Update: I've started work on a Google Colab notebook to fine-tune BioBERT (or technically any other BERT-like/transformer model - basically anything supported by huggingface/transformers) on the NER data linked above.

Colab Notebook

Current status:

Missing:

Will finalize in the colab notebook for quick prototyping - we can run the actual training on separate infrastructure.

@mpreusse @seangrant82 Could you have a look to see if I'm off on a tangent or an at least remotely productive direction?

mpreusse commented 4 years ago

@MFreidank that looks really interesting. To be honest I don't get all the details of the Colab notebook. huggingface/transformers is a framework that you use to fine-tune BioBERT, right? Out of general curiosity: why huggingface/transformers, and what would be an alternative approach?

Regarding preprocessing: the text is stored on several different nodes (such as Abstract, PatentDescription) in Neo4j. The entities we want to identify are also stored in Neo4j. With a few lines of code the text can be extracted (see the sketch below). In what format do you need the data? Do you need both text and entities, or does BioBERT know all the genes already?
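For example, pulling the abstracts out with the Python driver only takes a few lines. The node labels follow the comment above; the text property name is an assumption about the schema:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Node labels follow the comment above (Abstract, PatentDescription);
# the `text` property name is an assumption about the schema.
query = """
MATCH (a:Abstract)
RETURN id(a) AS node_id, a.text AS text
LIMIT 100
"""

with driver.session() as session:
    abstracts = [(record["node_id"], record["text"]) for record in session.run(query)]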

Last question: what does the output of BioBERT look like for NER?

tomasonjo commented 4 years ago

I have tried out BioBERT on various articles. There is a public endpoint exposed that one can use to analyze papers. Input can be either a PubMed ID or text; details are available at https://bern.korea.ac.kr/.
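As a quick way to try it from Python, something along these lines should work (it uses the same JSON endpoint as the apoc.load.json example further down, with a PubMed ID as input):

import requests

# Same JSON endpoint as in the APOC example below; input is a PubMed ID.
pmid = "29446761"
response = requests.get(f"https://bern.korea.ac.kr/pubmed/{pmid}/json", timeout=60)
response.raise_for_status()

for annotation in response.json():
    for denotation in annotation.get("denotations", []):
        span = denotation["span"]
        mention = annotation["text"][span["begin"]:span["end"]]
        print(denotation["obj"], denotation["id"], mention)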

An example output looks like:

[
   {
      "project":"BERN",
      "sourcedb":"PubMed",
      "sourceid":"29446761",
      "text":"3D printed stretchable capacitive sensors for highly sensitive tactile and electrochemical sensing. Developments of innovative strategies for the fabrication of stretchable sensors are of crucial importance for their applications in wearable electronic systems. In this work, we report the successful fabrication of stretchable capacitive sensors using a novel 3D printing method for highly sensitive tactile and electrochemical sensing applications. Unlike conventional lithographic or templated methods, the programmable 3D printing technique can fabricate complex device structures in a cost-effective and facile manner. We designed and fabricated stretchable capacitive sensors with interdigital and double-vortex designs and demonstrated their successful applications as tactile and electrochemical sensors. Especially, our stretchable sensors exhibited a detection limit as low as 1 x 10-6 M for NaCl aqueous solution, which could have significant potential applications when integrated in electronics skins.",
      "denotations":[
         {
            "id":[
               "CHEBI:26710",
               "BERN:314219703"
            ],
            "span":{
               "begin":902,
               "end":906
            },
            "obj":"drug"
         }
      ],
      "timestamp":"Fri Apr 24 08:44:20 +0000 2020",
      "logits":{
         "disease":[

         ],
         "gene":[

         ],
         "drug":[
            [
               {
                  "start":902,
                  "end":906,
                  "id":"CHEBI:26710\tBERN:314219703"
               },
               0.9999877214431763
            ]
         ],
         "species":[

         ]
      },
      "elapsed_time":{
         "tmtool":0.851,
         "ner":0.349,
         "normalization":0.003,
         "total":1.204
      }
   }
]

All entities are available under the "denotations" object. AFAIK there are 5 types available: disease, drug, species, mutation, and gene. Each entity also has one or more IDs assigned from biomedical databases like Ensembl, MIM, HGNC, MeSH, and some more. This is very useful, as I have noticed that genes with the same Ensembl ID might not have exactly the same name in the output, so it is IMO better to use the IDs and then enrich them with information from those biomedical databases.

You can check the results directly with Neo4j and APOC:

CALL apoc.load.json('https://bern.korea.ac.kr/pubmed/29446767/json') YIELD value
UNWIND value.denotations as el
RETURN el.obj as type,
       [x in el.id where x STARTS WITH 'Ensembl' | x] as ensembl_id,
       [x in el.id where x STARTS WITH 'MESH' | x] as mesh_id,
       [x in el.id where x STARTS WITH 'MIM' | x] as mim_id,
       [x in el.id where x STARTS WITH 'BERN' | x] as bern_id, 
       substring(value.text,el.span.begin,el.span.end - el.span.begin) as entity

Returns

type ensembl_id mesh_id mim_id bern_id entity
"disease" [] ["MESH:C567763"] [] ["BERN:262813101"] "CLAPO syndrome"
"gene" ["Ensembl:ENSG00000121879"] [] ["MIM:171834"] ["BERN:324295302"] "PIK3CA"
"disease" [] ["MESH:C567763"] [] ["BERN:262813101"] "CLAPO syndrome"
"disease" [] ["MESH:D014652"] [] ["BERN:256572101"] "vascular disorder"
"disease" [] ["MESH:C567763"] [] ["BERN:262813101"] "capillary malformation of the lower lip"
"disease" [] ["MESH:C567763"] [] ["BERN:262813101"] "lymphatic malformation predominant on the face and neck"
"disease" [] ["MESH:C567763"] [] ["BERN:262813101"] "CLAPO"
"gene" ["Ensembl:ENSG00000121879"] [] ["MIM:171834"] ["BERN:324295302"] "PIK3CA gene"
"mutation" [] [] [] [] "p.Phe83Ser"
"disease" [] [] [] ["BERN:257523801"] "developmental disorders"
"gene" [] [] [] [] "PIK3CA mutations"
"disease" [] ["MESH:C567763"] [] ["BERN:262813101"] "CLAPO"

I am currently in the process of analyzing 57k articles that are available in PubMed; I will share more once the process completes.

tomasonjo commented 4 years ago

Ok, to give a bit more detail about the BioBERT BERN extraction:

When we use the PubMed ID as an input, it only evaluates the abstract of the article for NER extraction. On the other hand, it looks like it pulls abstracts directly from PubMed, as all articles have them available. Out of 55,000 articles evaluated, only around 32,000 have one or more entities detected, which is only about 60%. It makes sense to run the BioBERT extraction on the whole body text to get better results.

If we look at the results of the NER extraction, using the entity text as the unique identifier of an entity, we get 105,000 different entities. Some of them could be grouped into a single entity by biomedical IDs like MeSH or Ensembl, as sketched below.
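A rough sketch of how such grouping could work, preferring a database ID as the key and falling back to the lowercased entity text (this is just one possible heuristic, and the input format is assumed):

from collections import defaultdict

def entity_key(entity_ids, entity_text):
    # Prefer a stable biomedical identifier (Ensembl, MESH, MIM) as the grouping key;
    # fall back to the normalized surface form when no id is available.
    for prefix in ("Ensembl", "MESH", "MIM"):
        for identifier in entity_ids:
            if identifier.startswith(prefix):
                return identifier
    return entity_text.lower()

# `extracted` is assumed to be a list of (ids, text) pairs produced by the BERN extraction.
extracted = [
    (["Ensembl:ENSG00000121879", "MIM:171834"], "PIK3CA"),
    (["Ensembl:ENSG00000121879", "MIM:171834"], "PIK3CA gene"),
    ([], "CLAPO"),
]

grouped = defaultdict(set)
for ids, text in extracted:
    grouped[entity_key(ids, text)].add(text)

print(grouped)  # both PIK3CA mentions collapse onto the same Ensembl id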

If we group results by the entity type we get:

entity_type count medicinal_id_count id_ratio
"gene" 63666 11527 0.181054251876983
"disease" 24875 22175 0.8914572864321608
"drug" 9685 5580 0.5761486835312338
"species" 5266 2929 0.556209646790733
"mutation" 1037 90 0.08678881388621022
"pathway" 15 15 1.0
"miRNA" 23 23 1.0

Genes are by far the most common entity type, but unfortunately only around 20% of them have a valid ID. I imagine there could also be some false positives among the genes without a valid biomedical ID.

I want to run this NER extraction on the whole body text to see what comes out, but it will definitely take some time. I'll try to split the workload to make it faster.
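One simple way to split the workload is to fan the requests out over a thread pool, roughly as below; the concurrency level would need tuning so the public endpoint is not overloaded:

import requests
from concurrent.futures import ThreadPoolExecutor

def annotate(pmid):
    # Same BERN JSON endpoint as above; returns the parsed annotations for one article.
    response = requests.get(f"https://bern.korea.ac.kr/pubmed/{pmid}/json", timeout=120)
    response.raise_for_status()
    return pmid, response.json()

pubmed_ids = ["29446761", "29446767"]  # placeholder list; in practice the ~57k article ids

# Keep the worker count modest so the public endpoint is not hammered.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = dict(executor.map(annotate, pubmed_ids))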

LegalBIGuy commented 4 years ago

I would be interested in participating in this task. I recently ran an experiment where I used two spaCy NER models (one domain-specific and one general) and then cross-referenced the entities, which allowed me to eliminate or clarify some bad results. For example, if the legal model identified a JUDGE but the general model did not identify the same entity as a PERSON, it was a bad result. I also stored the text around the entity (these were clauses in a legal contract) and maintained the sequence of all entities for further analysis with a sequence model. This is where it would be valuable to have labeled data for training an LSTM. Do you have labeled entities in any of the documents you are processing?
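A minimal sketch of that cross-referencing idea with spaCy; en_legal_ner is a placeholder name for the domain-specific model, and en_core_web_sm stands in for the general model:

import spacy

# `en_legal_ner` is a placeholder name for the domain-specific model;
# en_core_web_sm stands in for the general-purpose English model.
general_nlp = spacy.load("en_core_web_sm")
domain_nlp = spacy.load("en_legal_ner")

def cross_referenced_entities(text):
    general_spans = {(ent.start_char, ent.end_char): ent.label_ for ent in general_nlp(text).ents}
    confirmed = []
    for ent in domain_nlp(text).ents:
        general_label = general_spans.get((ent.start_char, ent.end_char))
        # A domain JUDGE that the general model does not also see as a PERSON is treated as a bad result.
        if ent.label_ == "JUDGE" and general_label != "PERSON":
            continue
        confirmed.append((ent.text, ent.label_, general_label))
    return confirmed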

mpreusse commented 4 years ago

@LegalBIGuy can you join our chat for more details: https://github.com/covidgraph/documentation/wiki/Getting-started-with-the-community