chanzuckerberg / MedMentions

A corpus of biomedical papers annotated with mentions of UMLS entities.

Discrepancies in semantic types of CUI/mentions #1

Closed by ajaynagesh 6 years ago

ajaynagesh commented 6 years ago

Hi,

I see discrepancies in the semantic type labels of some data points:

1) Some entities have only one type in UMLS but are annotated with a different one. For example, when I search for C0854135 (Pseudomonas aeruginosa infection) in UMLS (2017AA), I get the type T047 [Disease or Syndrome], but the mention is marked with T038 [Biologic Function]. This CUI has only one type, so there is no question of taking a least common ancestor. Could you please clarify?


2) In some cases, the type I find in UMLS (2017AA) differs from the annotated one, or I find no type at all. For example, C0444245 (Deoxyribonucleic acid sample) has type T026 [Cell Component] in UMLS but is marked as T017 [Anatomical Structure].


3) In some cases, mentions are marked with UnknownType even though they have a semantic type in UMLS (2017AA). For example, C0563034 (Aquatic environment) has T067 [Phenomenon or Process] in UMLS.

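The kind of check described in the three points above could be automated roughly as follows. This is only a sketch: it assumes access to lines in the pipe-delimited layout of the UMLS `MRSTY.RRF` file, where the first field is the CUI and the second is the TUI, and the helper names `load_mrsty` and `find_mismatches` are made up for illustration. It flags only exact mismatches, so a type that was deliberately mapped to an ancestor would also be reported and would still need manual review.

```python
from collections import defaultdict


def load_mrsty(lines):
    """Map each CUI to the set of TUIs listed for it.

    Assumes MRSTY.RRF-style lines: pipe-delimited, CUI first, TUI second.
    """
    cui_to_tuis = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        cui_to_tuis[fields[0]].add(fields[1])
    return dict(cui_to_tuis)


def find_mismatches(annotations, cui_to_tuis):
    """Return (cui, annotated_type, umls_types) for each annotation whose
    type is not among the types UMLS lists for that CUI."""
    mismatches = []
    for cui, annotated_type in annotations:
        umls_types = cui_to_tuis.get(cui, set())
        if annotated_type not in umls_types:
            mismatches.append((cui, annotated_type, sorted(umls_types)))
    return mismatches


# Illustrative rows only: the CUI/TUI pairs are the ones reported in this
# issue, and the remaining MRSTY columns are left blank on purpose.
sample_rows = [
    "C0854135|T047||||",
    "C0563034|T067||||",
]
lookup = load_mrsty(sample_rows)
annotated = [("C0854135", "T038"), ("C0563034", "UnknownType")]
for cui, got, expected in find_mismatches(annotated, lookup):
    print(f"{cui}: annotated {got}, UMLS lists {expected}")
```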

czi-sunil commented 6 years ago

Ajay, thanks for catching this issue. The data, along with the description and statistics, have been corrected. The format is the same as before, with one exception: when a mention is annotated with a concept linked to multiple semantic types, the type field contains a comma-separated list of all of these types. There is an example in the main README.md file.
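For anyone updating a loader after this change, a minimal sketch of reading the comma-separated type field is shown below. The column layout in `parse_mention_line` (tab-separated PMID, start offset, end offset, mention text, type field, entity ID) is an assumption here and should be checked against the README. The function names and the sample line are hypothetical.

```python
def parse_semantic_types(type_field):
    """Split a type field such as "T116,T123" into a list of type IDs.

    A single-type field such as "T047" yields a one-element list.
    """
    return [t.strip() for t in type_field.split(",") if t.strip()]


def parse_mention_line(line):
    """Parse one mention line.

    Assumed layout (verify against the README): six tab-separated columns,
    PMID, start, end, mention text, type field, entity ID.
    """
    pmid, start, end, text, type_field, entity_id = line.rstrip("\n").split("\t")
    return {
        "pmid": pmid,
        "start": int(start),
        "end": int(end),
        "text": text,
        "types": parse_semantic_types(type_field),
        "entity_id": entity_id,
    }


# Hypothetical line, not taken from the corpus.
example = "00000000\t0\t7\texample\tT116,T123\tC0000000"
mention = parse_mention_line(example)
print(mention["types"])
```

Returning a list in every case, including the single-type case, lets downstream code treat old and new annotations uniformly.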