bridgedb / create-bridgedb-metabolites

Create BridgeDb identity mapping files from HMDB, ChEBI, and Wikidata

Extend the code to handle a selection of Wikidata items differently #30

Open egonw opened 3 years ago

egonw commented 3 years ago

Starting with cortisol, which has two Wikidata items because there are two Wikipedia pages for it.

The code should check whether a mapping is added for wikidata:Q26981430 and, if so, replace it with wikidata:Q190875.
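
A minimal sketch of that check, in Python for illustration only; it is not the repository's actual code. The names `REPLACEMENTS`, `normalize_wikidata_id`, and `normalize_mappings` are hypothetical, as is the assumption that mappings are available as (source identifier, Wikidata item) pairs. The only values taken from this issue are the two Wikidata item IDs.

```python
# Hypothetical replacement table: key is the Wikidata item to avoid,
# value is the item to use instead (both IDs come from this issue).
REPLACEMENTS = {
    "Q26981430": "Q190875",
}


def normalize_wikidata_id(item_id, replacements=REPLACEMENTS):
    """Return the preferred Wikidata item ID, or item_id itself if no rule applies."""
    return replacements.get(item_id, item_id)


def normalize_mappings(mappings, replacements=REPLACEMENTS):
    """Rewrite (source_id, wikidata_id) pairs using the replacement table.

    Pairs that become identical after the rewrite are emitted only once,
    and the original order is preserved.
    """
    seen = set()
    result = []
    for source_id, wikidata_id in mappings:
        pair = (source_id, normalize_wikidata_id(wikidata_id, replacements))
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


if __name__ == "__main__":
    # "ex:1" and "ex:2" are placeholder source identifiers, not real database IDs.
    example = [("ex:1", "Q26981430"), ("ex:1", "Q190875"), ("ex:2", "Q42")]
    for source_id, wikidata_id in normalize_mappings(example):
        print(source_id, "wikidata:" + wikidata_id)
```

Keeping the exceptions in one table means further cases can be added without touching the mapping logic, although each entry still has to be curated by hand.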

Chris-Evelo commented 3 years ago

Wikipedia actually has a (user) mechanism to disambiguate such double pages. Is it possible to use that to automate this process? It doesn't sound like a good idea to enter all these exceptions manually.

egonw commented 3 years ago

Please check the Wikipedia pages.

Chris-Evelo commented 3 years ago

I think we are talking about different things. I meant a general mechanism using Wikipedia's disambiguation methods, if these are available to us, not specific pages. But I am also not sure which specific pages you want me to check.

egonw commented 3 years ago

Sorry, Chris, you lost me. I guess I do not understand your comment. Can you explain why Wikipedia disambiguation pages are relevant here?

Chris-Evelo commented 3 years ago

What I understood is that the problem of multiple Wikidata items occurs because there are multiple Wikipedia entries for the same compound. I thought that Wikipedia's own mechanism for disambiguation (the mechanism that creates those pages, not the pages themselves, I would think) might be useful to detect such instances and to find out what the main Wikipedia entry should be, and thus which Wikidata entry to use and which not to use. I have no clue whether that is feasible, though.

egonw commented 3 years ago

They made a deliberate choice here. Wikipedia writes: "Hydrocortisone is the name for the hormone cortisol when supplied as a medication." and on the other page "When used as a medication, it is known as hydrocortisone."

Chris-Evelo commented 3 years ago

Which makes it a "scientific lenses" problem, right? If looking at a biological pathway, you would use "cortisol"; if looking at drug extensions in a network, e.g. via CyTargetLinker, you would use "hydrocortisone".

But you probably meant to say, "it is not practical to automate tracing such cases". Yes, if this is typical then I agree.