flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Semantic tag embedding based on external knowledge bases like UMLS on tokens #2042

Closed KRSTD closed 3 years ago

KRSTD commented 3 years ago

Hi, I am new to Flair. I have a question about stacked embeddings, where we can combine multiple embeddings like GloVe and BERT. Can we add a third, custom embedding based on dictionary lookup against an external knowledge base like UMLS, and then combine all three embeddings?

Thanks

alanakbik commented 3 years ago

Hello @KRSTD, this sounds interesting and should be very doable. The question is: what does the UMLS lookup return? Is it some sort of vector representation?

If so, one option is to precompute vectors for all words and put them into a standard word embedding format, like the one used for GloVe embeddings. Then you can just load them with the class WordEmbeddings.
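As a rough illustration of the precompute route, the sketch below serializes a word-to-vector table into GloVe-style text lines (one token per line, followed by space-separated floats). The function name `to_glove_lines` and the sample vector are made up for illustration, and depending on the Flair version the resulting file may first need converting to a gensim format before WordEmbeddings can load it.

```python
from typing import Dict, List


def to_glove_lines(vectors: Dict[str, List[float]]) -> List[str]:
    """Format each (word, vector) pair as one GloVe-style text line."""
    lines = []
    for word, vec in vectors.items():
        # The format is whitespace-delimited, so tokens containing
        # whitespace cannot be represented and are skipped here.
        if not word or any(ch.isspace() for ch in word):
            continue
        lines.append(word + " " + " ".join(f"{x:.6f}" for x in vec))
    return lines


# Hypothetical precomputed vector, not real UMLS output.
example = {"dementia": [0.5, -1.0]}
glove_lines = to_glove_lines(example)
```

Writing `"\n".join(glove_lines)` to a text file gives a file in the plain-text layout described above.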

If the knowledge base keeps changing and you always want to do a query against the current state, you would need to write your own word embeddings class. It's not difficult. You need to write a class that inherits from TokenEmbeddings and write your own _add_embeddings_internal method. You could start by copying the class HashEmbeddings and changing its _add_embeddings_internal method to perform this lookup.

Once you have an embedding class that inherits from TokenEmbeddings you can use StackedEmbeddings as always to combine different embeddings.
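Conceptually, stacking amounts to concatenating the per-token vectors produced by each embedding. The following is a plain-Python sketch of that idea only, not Flair's API: `make_table_embedder` and `stack` are hypothetical helpers, and the tables stand in for a GloVe-like embedding and a knowledge-base lookup. In Flair itself, the lookup would live in the `_add_embeddings_internal` method of a TokenEmbeddings subclass, as described above.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Embedder = Callable[[str], Vector]


def make_table_embedder(table: Dict[str, Vector], dim: int) -> Embedder:
    """Return a function mapping a token to its vector, or zeros if unknown."""
    def embed(token: str) -> Vector:
        # Case-insensitive lookup; unknown tokens get an all-zero vector.
        return list(table.get(token.lower(), [0.0] * dim))
    return embed


def stack(token: str, embedders: Sequence[Embedder]) -> Vector:
    """Concatenate the vectors from every embedder for one token."""
    out: Vector = []
    for embedder in embedders:
        out.extend(embedder(token))
    return out


# Toy tables with made-up values.
word_like = make_table_embedder({"dementia": [0.1, 0.2]}, dim=2)
kb_like = make_table_embedder({"dementia": [1.0]}, dim=1)
combined = stack("Dementia", [word_like, kb_like])
```

The length of the combined vector is the sum of the individual embedding lengths, so a downstream model sees all sources side by side.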

KRSTD commented 3 years ago

Thanks @alanakbik. This information is really helpful. UMLS returns a concept code for every token lookup, not a vector representation. For example, "Dementia" returns a code like C049XXXX. Any thoughts on how to use this?

Thanks

alanakbik commented 3 years ago

How many distinct codes are there? Perhaps it could be used like a gazetteer?

(@mariosaenger @leonweber: any ideas?)

KRSTD commented 3 years ago

There are a lot of codes. A gazetteer sounds like an option, maybe with a one-hot encoding for these concept codes. Does Flair have any class for handling gazetteers in general? I will be happy to hear thoughts from @mariosaenger @leonweber.
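A minimal sketch of the one-hot idea, assuming the set of concept codes is known up front. The codes below are placeholders rather than real UMLS identifiers, and `build_index` and `one_hot` are hypothetical helper names. Slot 0 is reserved for tokens with no code or an unseen code.

```python
from typing import Dict, List, Optional


def build_index(codes: List[str]) -> Dict[str, int]:
    """Assign each distinct code a position, starting at 1 (0 = no concept)."""
    return {code: i + 1 for i, code in enumerate(sorted(set(codes)))}


def one_hot(code: Optional[str], index: Dict[str, int]) -> List[float]:
    """Return a vector with a single 1.0 at the position of the code."""
    vec = [0.0] * (len(index) + 1)
    # Missing (None) or unseen codes fall back to the reserved slot 0.
    position = index.get(code, 0) if code is not None else 0
    vec[position] = 1.0
    return vec


# Placeholder codes for illustration only.
index = build_index(["C0000002", "C0000001", "C0000002"])
vector = one_hot("C0000002", index)
```

With a very large code inventory the vectors grow accordingly, so mapping each code to an integer and learning a dense embedding for it, or hashing codes into a fixed number of buckets, may be more practical.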

mariosaenger commented 3 years ago

Hi @KRSTD,

there are several projects and publications that learn semantic embeddings for biomedical concepts, e.g.:

Maybe you are able to utilize some of the pre-trained embeddings, e.g. by mapping them to UMLS identifiers. The exact implementation depends, of course, on the specific task. Are you given just plain text, or do you have access to further information about your input texts?

If you are given plain text only, you first need to recognize mentions of these biomedical concepts in the text and then link/normalize them to identifiers in the taxonomy used (e.g. MeSH or UMLS). Biomedical concept recognition and normalization is in itself very complex.
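To make the two steps concrete, here is a deliberately naive sketch: a greedy longest-match lookup of token sequences in a small lexicon, tagging every matched token with the concept identifier. The lexicon entries and codes are invented for illustration, and `link_mentions` is a hypothetical helper. Real recognition and normalization must also handle abbreviations, spelling variants, and ambiguity, which this does not attempt.

```python
from typing import Dict, List, Optional, Tuple


def link_mentions(
    tokens: List[str],
    lexicon: Dict[Tuple[str, ...], str],
    max_len: int = 3,
) -> List[Optional[str]]:
    """Tag each token with a concept code via greedy longest match."""
    tags: List[Optional[str]] = [None] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in lexicon:
                for j in range(i, i + n):
                    tags[j] = lexicon[key]
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Invented lexicon with placeholder codes.
lexicon = {("dementia",): "C0000001", ("heart", "failure"): "C0000002"}
tokens = ["Patient", "has", "heart", "failure", "and", "Dementia"]
tags = link_mentions(tokens, lexicon)
```

The per-token tags could then feed a one-hot or learned embedding that is stacked with the other token embeddings.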

KRSTD commented 3 years ago

Hello @mariosaenger,

Thanks for sharing this valuable information and the literature. I agree that concept recognition and normalization is in itself a complex task. I only have discharge notes as .txt files, which I might have to process based on the above studies.

Thanks

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.