OneZoom / OZtree

OneZoom Tree of Life Explorer

Additional Wikidata tab on leaves' description #863

Open oolonek opened 4 months ago

oolonek commented 4 months ago

It would be a nice addition to be able to access the Wikidata page of a given taxon when clicking on its leaf.

For example, from https://www.onezoom.org/life/@Aloe_ferox=608115 one could reach https://www.wikidata.org/wiki/Q1194889

Using Qlever, all pairs of Open Tree of Life IDs and Wikidata QIDs can be retrieved in milliseconds: https://qlever.cs.uni-freiburg.de/wikidata/MjoDT0?exec=true. The query currently yields 2,034,851 pairs.

hyanwong commented 4 months ago

We have the wikidata ID anyway, in the ordered_leaves table, so we don't need to use the Qlever site (although I'm intrigued how that site works).
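
A minimal sketch of the requested link, assuming the Wikidata ID is available per leaf as a bare integer without the `Q` prefix (that storage format and the helper name `wikidata_url` are assumptions, not taken from the OZtree code):

```python
from typing import Optional


def wikidata_url(qid: Optional[int]) -> Optional[str]:
    """Build the Wikidata item URL for a numeric QID (hypothetical helper).

    Returns None when the leaf has no Wikidata mapping.
    """
    if qid is None:
        return None
    return f"https://www.wikidata.org/wiki/Q{int(qid)}"


# The Aloe ferox example from the original request.
print(wikidata_url(1194889))
```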

davidebbo commented 4 months ago

As a workaround, note that from the Wikipedia page, you can choose Tools / Wikidata item to go to that. So it's indirect, but there is a path to it...

hyanwong commented 4 months ago

Aha, of course, the OTT IDs are now on wikidata (they didn't use to be; I argued for their introduction), so we can find the mapping using a SPARQL query. Neat.

davidebbo commented 4 months ago

> the OTT IDs are now on wikidata

Oh, I didn't know that! It's P9157.
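
As a rough illustration of the SPARQL route, here is a sketch of a query over P9157 plus a parser for the standard SPARQL JSON results format. The exact query behind the Qlever link is not shown in this thread, so `OTT_QUERY` is an assumption, as are the names `parse_pairs` and `sample`. No network call is made; the sample row is built from the Aloe ferox example above.

```python
import json

# Assumed query: every item that carries an Open Tree of Life ID (P9157).
OTT_QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?ott WHERE { ?item wdt:P9157 ?ott }
"""


def parse_pairs(sparql_json: str) -> dict:
    """Map OTT id (int) -> list of QIDs from a SPARQL JSON results document."""
    pairs = {}
    for row in json.loads(sparql_json)["results"]["bindings"]:
        # Item URIs look like http://www.wikidata.org/entity/Q1194889
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        ott = row["ott"]["value"]
        if ott.isdigit():  # ignore malformed, non-numeric values
            pairs.setdefault(int(ott), []).append(qid)
    return pairs


# Hand-written sample response, not fetched from any endpoint.
sample = json.dumps({
    "head": {"vars": ["item", "ott"]},
    "results": {"bindings": [{
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1194889"},
        "ott": {"type": "literal", "value": "608115"},
    }]},
})
print(parse_pairs(sample))
```

Keeping a list of QIDs per OTT id means duplicates on the Wikidata side stay visible rather than being silently overwritten.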

hyanwong commented 4 months ago

Yes, I noticed it the other day. It's new, I think (created 2021)

hyanwong commented 4 months ago

It could be that this is a better way to get the mappings now, rather than going via the ncbi IDs etc.

We could probably check how accurate and comprehensive our mapping is, versus the one on wikidata. If we can simply move to using wikidata, it would probably simplify the code considerably. However, my suspicion is that there are lots of OTT taxa that have NCBI / GBIF ids but which aren't currently on wikidata.

davidebbo commented 4 months ago

> It could be that this is a better way to get the mappings now, rather than going via the ncbi IDs etc.
>
> We could probably check how accurate and comprehensive our mapping is, versus the one on wikidata. If we can simply move to using wikidata, it would probably simplify the code considerably. However, my suspicion is that there are lots of OTT taxa that have NCBI / GBIF ids but which aren't currently on wikidata.

Yes, that was my first thought when I saw that. It has the potential to simplify things a lot. For now, it would be easy to add some instrumentation that checks whether the QID we find via other paths maps back to the same ott.
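
A minimal sketch of such a round-trip check, assuming two in-memory dictionaries: the mapping found via the existing paths (OTT id to QID) and the OTT id that Wikidata records for each QID. The function name `classify` and both input shapes are hypothetical, not the actual OneZoom instrumentation.

```python
from collections import Counter


def classify(our_mapping: dict, wikidata_ott_by_qid: dict) -> Counter:
    """Count how often the QID we found maps back to the same OTT id.

    our_mapping:          ott id -> QID found via other paths
    wikidata_ott_by_qid:  QID -> ott id recorded on Wikidata (absent if none)
    """
    counts = Counter()
    for ott, qid in our_mapping.items():
        wd_ott = wikidata_ott_by_qid.get(qid)
        if wd_ott is None:
            counts["missing"] += 1    # Wikidata item has no OTT id
        elif wd_ott == ott:
            counts["match"] += 1      # round trip agrees
        else:
            counts["mismatch"] += 1   # Wikidata points at a different OTT id
    return counts


# Toy data for illustration only.
ours = {1: "Q10", 2: "Q20", 3: "Q30"}
wikidata = {"Q10": 1, "Q20": 99}
print(classify(ours, wikidata))
```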

Anyway, we're digressing a bit from @oolonek's request 😄

oolonek commented 4 months ago

> It could be that this is a better way to get the mappings now, rather than going via the ncbi IDs etc.
>
> We could probably check how accurate and comprehensive our mapping is, versus the one on wikidata. If we can simply move to using wikidata, it would probably simplify the code considerably. However, my suspicion is that there are lots of OTT taxa that have NCBI / GBIF ids but which aren't currently on wikidata.

It would be interesting to find out which are missing. Do you expect these taxa to have no WD entry at all, or to be present on WD but simply lack an OTT id on their WD page? In both cases it would be worth finding out and eventually pushing the missing info to WD. I will also look at this on my side. Thanks for your quick feedback :)

davidebbo commented 4 months ago

It's probably going to be a combination of things.

But for the last two, we really don't know right now because we've never looked at the WD OTT field. But it would be interesting to get that data.

hyanwong commented 4 months ago

Good summary. Thanks @davidebbo . And yes, it would be interesting to see how this compares to what wikidata think is the correct mapping.

mdrishti commented 4 months ago

Hi,

I have also been working on getting the taxonomic ids from ott, and from 11 other dbs (gbif, ncbi, eol, itis etc), corresponding to wikidata ids. I found that ~2,032,649 wd ids have an ott id and 1,435,238 wd ids don't. The latter map to other databases.
On the other hand, out of a total of 4,528,302 ott ids, 2,530,549 don't have wd ids.

There are 3,826,740 ott ids at either species or strain level. I was wondering about the criteria used for keeping the ott id in OneZoom. Also, do all 2,235,475 leaf taxa in OneZoom have an ott id?

Too many numbers above! Sorry!

hyanwong commented 4 months ago

> I was wondering about the criteria used for keeping the ott id in OneZoom

We tend to retain all the OTTs that are present in the synthetic OpenTree (give or take some that differ because of using bespoke trees in particular areas of the tree, mostly mammals / birds)

davidebbo commented 4 months ago

3,826,740 - 2,235,475 = 1,591,265. That's a huge number of species otts that are not in the OneZoom tree. But I do see the same thing if I filter taxonomy.tsv for only species.

I guess that means that all these are incertae sedis, and hence not in the synthetic tree?
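
A rough sketch of that kind of filter, assuming the OpenTree `taxonomy.tsv` layout of `uid`, `parent_uid`, `name`, `rank`, `sourceinfo`, `uniqname`, `flags` separated by tab-pipe-tab, and assuming the relevant flags start with `incertae_sedis`. Both are assumptions to check against the actual release, and the rows below are made up. The count only shows how many species rows carry such a flag; it does not by itself confirm why they are absent from the synthetic tree.

```python
SEP = "\t|\t"  # assumed field separator of the OpenTree taxonomy.tsv


def count_species(lines):
    """Return (species rows, species rows carrying an incertae_sedis* flag)."""
    total = flagged = 0
    for line in lines:
        fields = line.rstrip("\n").split(SEP)
        if len(fields) < 7 or fields[0] == "uid":
            continue  # skip the header and malformed rows
        rank, flags = fields[3], fields[6]
        if rank != "species":
            continue
        total += 1
        if any(f.startswith("incertae_sedis") for f in flags.split(",")):
            flagged += 1
    return total, flagged


def row(*fields):
    """Format one made-up row with the trailing separator."""
    return SEP.join(fields + ("",)) + "\n"


# Made-up rows for illustration only (not real OTT records).
sample = [
    row("uid", "parent_uid", "name", "rank", "sourceinfo", "uniqname", "flags"),
    row("1", "", "Exampleae", "family", "ncbi:10", "", ""),
    row("2", "1", "Examplea alpha", "species", "ncbi:11", "", ""),
    row("3", "1", "Examplea beta", "species", "gbif:12", "", "incertae_sedis"),
]
print(count_species(sample))
```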

davidebbo commented 4 months ago

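I did some instrumentation. Out of 1,817,682 OneZoom otts that we are mapping to a Wikidata item:

  • 1,607,691 (~88%) have an ott in Wikidata, and it matches our ott
  • 2,893 (<1%) have an ott in Wikidata that does not match our ott
  • 207,098 (~11%) don't have an ott in Wikidata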

hyanwong commented 4 months ago

Nice. Thanks @davidebbo. It's good there aren't many wrong matches. Seems like we could switch at some point to using wikidata to provide all our mapping then. What we would be missing is data to do with other identifiers, like NCBI, which we get automatically from the opentree.

However, I think it would be fine to omit all the ncbi -> wikidata mapping, and just go straight to mapping OTT from the wikidata JSON dump to the WD qID.
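
A minimal sketch of reading that mapping straight from the dump, assuming the usual Wikidata JSON dump layout (one entity per line inside a JSON array, each line ending in a comma) and the standard claim structure. The function names are hypothetical, and the sample lines are hand-written rather than taken from a real dump.

```python
import json


def ott_from_entity(entity: dict):
    """Return the first P9157 (Open Tree of Life ID) value of an entity, or None."""
    for claim in entity.get("claims", {}).get("P9157", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None


def iter_ott_qid(lines):
    """Yield (ott id, QID) pairs from dump lines, skipping entities without an OTT id."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # array brackets and blank lines carry no entity
        entity = json.loads(line)
        ott = ott_from_entity(entity)
        if ott is not None and ott.isdigit():
            yield int(ott), entity["id"]


# Hand-written sample in the assumed dump layout.
with_ott = {"id": "Q1194889", "claims": {"P9157": [
    {"mainsnak": {"snaktype": "value",
                  "datavalue": {"value": "608115", "type": "string"}}}]}}
without_ott = {"id": "Q1", "claims": {}}
dump = ["[\n", json.dumps(with_ott) + ",\n", json.dumps(without_ott) + "\n", "]\n"]
print(list(iter_ott_qid(dump)))
```

Processing the dump line by line avoids loading the whole file into memory, which matters for a dump of this size.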

oolonek commented 4 months ago

> I did some instrumentation. Out of 1,817,682 OneZoom otts that we are mapping to a Wikidata item:
>
>   • 1,607,691 (~88%) have an ott in Wikidata, and it matches our ott
>   • 2,893 (<1%) have an ott in Wikidata that does not match our ott
>   • 207,098 (~11%) don't have an ott in Wikidata

Hi @davidebbo, are these files somewhere in the OneZoom repo, or were they generated elsewhere? Would you mind sharing them? Also, I guess it is the case, but just to be sure, could you confirm it's OTT 3.6 you are using?

oolonek commented 4 months ago

> Nice. Thanks @davidebbo. It's good there aren't many wrong matches. Seems like we could switch at some point to using wikidata to provide all our mapping then. What we would be missing is data to do with other identifiers, like NCBI, which we get automatically from the opentree.
>
> However, I think it would be fine to omit all the ncbi -> wikidata mapping, and just go straight to mapping OTT from the wikidata JSON dump to the WD qID.

Why not also rely on WD to retrieve the NCBI ids? WD could be the single source for all taxon ids; that way there would be a single place to work on to improve the mappings.

See https://qlever.cs.uni-freiburg.de/wikidata/QObdaz?exec=true

davidebbo commented 4 months ago

> Why not also rely on WD to retrieve the NCBI ids? WD could be the single source for all taxon ids; that way there would be a single place to work on to improve the mappings.

Yes, that would be a good end state if the data quality is sufficient. In such a world, we may not need to use the OpenTree taxonomy file at all. We could also do away with all the EOL logic.

Basically, we'd have:

If we went in that direction, we should probably do a rewrite of the tree building logic, rather than iteratively move it in that direction.

I don't think we're quite ready for that yet, but it is a direction.

hyanwong commented 4 months ago

> Why not also rely on WD to retrieve the NCBI ids?

I'm not sure that's so sensible, because the OTT IDs are based on NCBI, GBIF, etc. So the OTT taxonomy.tsv file is the canonical source of the NCBI ids that go into generating an OTT.

I.e. the mappings in the taxonomy.tsv file are the definition of an OTT, for a given OpenTree release.
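
To illustrate reading those source ids, here is a small sketch that assumes the `sourceinfo` column of `taxonomy.tsv` is a comma-separated list of `prefix:id` references such as `ncbi:...` or `gbif:...`. That format and the helper name `parse_sourceinfo` are assumptions to verify against the release in use, and the example string is made up.

```python
def parse_sourceinfo(sourceinfo: str) -> dict:
    """Group source ids by database prefix, e.g. {'ncbi': ['111'], 'gbif': ['222']}."""
    ids = {}
    for ref in sourceinfo.split(","):
        prefix, sep, ident = ref.strip().partition(":")
        if sep and ident:  # ignore entries that are not of the form prefix:id
            ids.setdefault(prefix, []).append(ident)
    return ids


# Made-up sourceinfo value for illustration only.
print(parse_sourceinfo("ncbi:111,gbif:222,ncbi:333"))
```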