Closed (KRSTD closed 3 years ago)
Hello @KRSTD, this sounds interesting and should be very doable. The question is: what does the UMLS lookup return? Is it some sort of vector representation?
If so, you could precompute vectors for all words and store them in a standard word embedding format, like the one used for GloVe embeddings. Then you can just load them with the class WordEmbeddings.
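As a rough illustration of the precompute route, the sketch below writes token vectors to the plain-text layout commonly used for GloVe files (one token per line, followed by its space-separated values) and reads them back. The function names and the example vectors are hypothetical, and depending on the FLAIR version the resulting file may still need converting into whatever format WordEmbeddings expects for custom embeddings.

```python
import io


def write_glove_text(vectors, handle):
    """Write {token: [floats]} as 'token v1 v2 ...' lines (GloVe-style text)."""
    dims = {len(vec) for vec in vectors.values()}
    if len(dims) > 1:
        raise ValueError("all vectors must have the same dimension")
    for token, vec in vectors.items():
        if " " in token:
            raise ValueError(f"token must not contain spaces: {token!r}")
        values = " ".join(f"{x:.6f}" for x in vec)
        handle.write(f"{token} {values}\n")


def read_glove_text(handle):
    """Read the same layout back into a {token: [floats]} dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(p) for p in parts[1:]]
    return vectors


# Hypothetical precomputed vectors; real ones would come from the lookup.
example_vectors = {"dementia": [0.1, -0.2], "fever": [0.0, 1.0]}
buffer = io.StringIO()
write_glove_text(example_vectors, buffer)
buffer.seek(0)
round_tripped = read_glove_text(buffer)
```

In practice the handle would be a file opened with `open(path, "w", encoding="utf-8")` rather than an in-memory buffer.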
If the knowledge base keeps changing and you always want to do a query against the current state, you would need to write your own word embeddings class. It's not difficult. You need to write a class that inherits from TokenEmbeddings and write your own _add_embeddings_internal method. You could start by copying the class HashEmbeddings and changing its _add_embeddings_internal method to perform this lookup.
Once you have an embedding class that inherits from TokenEmbeddings, you can use StackedEmbeddings as always to combine different embeddings.
Thanks @alanakbik. This information is really helpful. UMLS returns a concept code for every token lookup that is requested; for example, "Dementia" returns a code C049XXXX, not a vector representation. Any thoughts on how to use this?
Thanks
How many distinct codes are there? Perhaps it could be used like a gazetteer?
(@mariosaenger @leonweber: any ideas?)
There are a lot of codes. A gazetteer sounds like an option, maybe with a one-hot encoding for these concept codes. Does FLAIR have any class for handling gazetteers in general? I would be happy to hear thoughts from @mariosaenger @leonweber.
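To make the one-hot idea concrete, here is a minimal sketch in plain Python, independent of FLAIR. It assumes a dictionary from lowercased tokens to concept codes is already available; the codes `CODE_A` and `CODE_B` are placeholders rather than real UMLS identifiers, and all function names are hypothetical. One extra slot at the end of each vector marks tokens with no known code.

```python
def build_code_index(codes):
    """Assign each distinct concept code a stable position."""
    return {code: i for i, code in enumerate(sorted(set(codes)))}


def one_hot(code, index):
    """One-hot vector for a code; the last slot means 'no known concept'."""
    vec = [0.0] * (len(index) + 1)
    vec[index.get(code, len(index))] = 1.0
    return vec


def featurize(tokens, lookup, index):
    """Map each token to the one-hot vector of its concept code, if any."""
    return [one_hot(lookup.get(tok.lower()), index) for tok in tokens]


# Placeholder gazetteer: token -> concept code (not real UMLS codes).
lookup = {"dementia": "CODE_A", "fever": "CODE_B"}
index = build_code_index(lookup.values())
features = featurize(["Dementia", "and", "fever"], lookup, index)
```

With a very large number of codes the vectors become correspondingly large, so it may be worth restricting the index to codes that actually occur in the training data.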
Hi @KRSTD,
there are several projects and publications that learn semantic embeddings for biomedical concepts, e.g.:
Maybe you are able to utilize some of the pre-trained embeddings, e.g. by mapping them to UMLS identifiers. The exact implementation depends, of course, on the specific task. Are you given just plain text, or do you have access to further information about your input texts?
If you are given plain text only, you first need to recognize mentions of these biomedical concepts in the text and then link / normalize them to the taxonomy in use (e.g. MeSH or UMLS identifiers). Biomedical concept recognition and normalization is in itself very complex.
Hello @mariosaenger,
Thanks for sharing this valuable information and the literature. I agree concept recognition and normalization in itself is a complex task. I just have discharge notes as .txt files that I might have to process based on the above studies.
Thanks
Hi, I am new to FLAIR. I have a question about stacked embeddings, where we can combine multiple embeddings like GloVe and BERT. Can we add another, custom embedding based on a dictionary lookup against an external knowledge base like UMLS, and then combine all 3 embeddings?
Thanks