OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images
MIT License

Find Open Tree species that are missing from OneZoom #41

Open davidebbo opened 6 months ago

davidebbo commented 6 months ago

@hyanwong @jrosindell I wrote a simple script to locate those. I filtered out all the names that don't contain exactly one underscore, since we only care about species.

I ended up with 15304. You can see the whole list here. They are in the order in which they appear in the Newick, which is post-order traversal.
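A minimal sketch of the filtering described above, not the actual script. It assumes the Newick labels are unquoted (no spaces or quoted punctuation) and that the OneZoom names are available as a set of strings in the same `Genus_species` form; the function name is hypothetical.

```python
import re


def species_missing_from_onezoom(newick, onezoom_names):
    """Return species-like labels found in the Newick but not in OneZoom.

    Assumption: labels are unquoted, so any run of characters other than
    Newick punctuation is a label (or a branch length, which never has
    an underscore and is dropped by the filter below).
    """
    seen = set()
    missing = []
    for label in re.findall(r"[^(),:;\s]+", newick):
        # Keep only names with exactly one underscore (Genus_species).
        if label.count("_") != 1:
            continue
        if label in onezoom_names or label in seen:
            continue
        seen.add(label)
        # Appending in scan order preserves the order of the Newick string.
        missing.append(label)
    return missing
```

Because the labels are scanned left to right, the result comes out in the same order as the names appear in the Newick.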

I have not tried to analyze it yet.

davidebbo commented 6 months ago

Case 1: Looking randomly at Hylomyscus, a genus of rodents:

So that looks like a legit set of missing species in this case.

hyanwong commented 6 months ago

Neat, thanks. I suspect this is a mix of (mostly) OpenTree mistakes / synonyms / subspecies / misspellings, and OneZoom absences.

For example, the hare genus (Lepus) appears a lot in that list: wikipedia lists 33 species, and OneZoom has 32. I've looked up some information for the first few:

Lepus_crawshayi (synonym for _Lepus victoriae_)
Lepus_angolensis (subspecies: _[Lepus microtis angolensis](https://www.wikidata.org/wiki/Q20905190)_)
Lepus_ansorgei (synonym for _Lepus victoriae_)
Lepus_atlanticus (subspecies of _Lepus capensis_)
Lepus_canopus (unclear, not on wikipedia)
Lepus_chadensis (unclear, probably not a species)
Lepus_cordeauxi (probably a variety, or maybe a form, e.g. cited in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180137)
Lepus_creticus (subspecies in Crete?)

.... (I haven't looked at the others)

Lepus_crispii
Lepus_cyprius
Lepus_douglasii
Lepus_formosus
Lepus_harterti
Lepus_hawkeri
Lepus_innesi
Lepus_judeae
Lepus_kabylicus
Lepus_lilfordi
Lepus_longicaudatus
Lepus_ochropus
Lepus_omanensis
Lepus_pallidior
Lepus_pamirensis
Lepus_primaevus
Lepus_przewalskii
Lepus_saxatalis
Lepus_sefranus
Lepus_siamensis
Lepus_swinhoei
Lepus_talai
Lepus_tigrensis
Lepus_tunetae
Lepus_vassali
Lepus_verae-crucis
Lepus_veter
Lepus_whitakeri
Lepus_zuluensis

hyanwong commented 6 months ago

> Case 1: Looking randomly at Hylomyscus, a genus of rodents:
>
> So that looks like a legit set of missing species in this case.

Wikipedia has 21 species, so I suspect that (unlike Lepus) there are a number missing from OneZoom that we should have.

davidebbo commented 6 months ago

Ah yes, synonyms are probably a big part of it. e.g. I was looking at the Jaguarundi. Open Tree has both Puma yagouaroundi and Puma yaguarondi, with different OTTs. I'm guessing this is bogus data, referring to the same animal.

OneZoom just has Puma yaguarondi.

Interestingly, Wikipedia says it's been reclassified into a new genus: Herpailurus yagouaroundi (https://en.wikipedia.org/wiki/Jaguarundi).

So the early conclusion is that Open Tree is not in very good shape, and that trying to use it directly is unwise? Should this be discussed with them?

davidebbo commented 6 months ago

Another thing is that Open Tree has some extinct species, so we'd need to filter those out. e.g. Acinonyx_pardinensis is an extinct cheetah.

To make things worse, it is not marked as extinct in Open Tree, so it would be hard to filter out in an automated way.

hyanwong commented 6 months ago

> Ah yes, synonyms are probably a big part of it. e.g. I was looking at the Jaguarundi. Open Tree has both Puma yagouaroundi and Puma yaguarondi, with different OTTs. I'm guessing this is bogus data, referring to the same animal.

Yes, this is a typo (not in the OpenTree, but in the source data from which the open tree has been taken)

> So the early conclusion is that Open Tree is not in very good shape, and that trying to use it directly is unwise? Should this be discussed with them?

They do know this, but it's a never-ending task keeping up with synonyms etc.

There's no canonical way to find out which names are good or bad, sadly. And often old synonyms are resurrected with better taxonomy.

In the case of the mice that you found (Hylomyscus), OpenTree is pretty good. They are only missing 4 recently described species that are on wikipedia (no active links yet):

Mahale wood mouse, Hylomyscus mpungamachagorum Demos, Hutterer & Kerbis Peterhans, 2020
Pygmy wood mouse, Hylomyscus pygmaeus Kerbis Peterhans, Hutterer & Demos, 2020
Stanley’s wood mouse, Hylomyscus stanleyi Kerbis Peterhans, Hutterer & Demos, 2020
Mother Ellen’s wood mouse, Hylomyscus thornesmithae (Kerbis Peterhans, Hutterer & Demos, 2020)

They are also missing the Volcano wood mouse, Hylomyscus vulcanorum. I'm not sure why they don't have that.

hyanwong commented 6 months ago

> Another thing is that Open Tree has some extinct species, so we'd need to filter those out. e.g. Acinonyx_pardinensis is an extinct cheetah.

Yes, that's true (even in the synthetic tree). Ideally we'd remove the extinct species from the comparison. That should be easy if they are properly flagged as extinct in the OpenTree, which they usually (but not always) are.

davidebbo commented 6 months ago

> should be easy, if they are properly flagged up as extinct in the OpenTree

In this particular case, it's not flagged.

hyanwong commented 6 months ago

> > should be easy, if they are properly flagged up as extinct in the OpenTree
>
> In this particular case, it's not flagged.

Oh, yes, there are many of those. There is a special flag for issues about these in the OpenTree. I'll bring up an issue (but spotting them by hand is very tiresome!)

davidebbo commented 6 months ago

I filtered out all the ones marked as extinct in the taxonomy. It went down by 262, to 15042. Same link above has the updated list, and you can see the diff here.
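For illustration, here is a sketch of how such a filter could look. It is not the tool used above. It assumes the Open Tree taxonomy is read from a `taxonomy.tsv`-style file whose columns are separated by `|` (padded with tabs), with a header row naming `name` and `flags` columns, and that extinct taxa carry a flag starting with `extinct` in a comma-separated flags field. Both the layout and the flag spelling should be checked against the OTT release in use; the function names are hypothetical.

```python
def extinct_names(taxonomy_lines):
    """Collect Genus_species style names flagged as extinct.

    Assumption: the first line is a header naming the columns, fields
    are separated by '|' with tab padding, and the flags field is a
    comma-separated list.
    """
    lines = iter(taxonomy_lines)
    header = [field.strip() for field in next(lines).split("|")]
    name_col = header.index("name")
    flags_col = header.index("flags")
    extinct = set()
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) <= max(name_col, flags_col):
            continue  # skip malformed or truncated rows
        flags = fields[flags_col].split(",")
        # Assumed flag spelling: anything starting with "extinct".
        if any(flag.startswith("extinct") for flag in flags):
            extinct.add(fields[name_col].replace(" ", "_"))
    return extinct


def drop_extinct(missing, extinct):
    """Remove extinct names from the missing list, keeping its order."""
    return [name for name in missing if name not in extinct]
```

In practice the lines would come from an open file handle, e.g. `extinct_names(open(path, encoding="utf-8"))`, where `path` points at the taxonomy file.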

hyanwong commented 6 months ago

For some areas, like birds and mammals, wikipedia is almost the best source for species lists. But for most of the rest of the tree it is not.

It might be useful to look at species status on wikidata. In fact, it's almost certainly going to be better to find extinct / extant information on wikidata than elsewhere. I wonder how easy it is to get information out of the wikidata JSON about whether something is extinct or not?

davidebbo commented 6 months ago

> I wonder how easy it is to get information out of the wikidata JSON about whether something is extinct or not?

It's probably easy, since https://www.wikidata.org/wiki/Q2272925 (Acinonyx pardinensis) shows it as an instance of 'fossil taxon'. So it's in the json somewhere. I'll look for it later.

We could also use this to flag any taxon that is marked as extinct in Wikidata but not in Open Tree, so we can report them all.
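A rough sketch of what that lookup might look like, assuming each record is a standard Wikidata entity JSON object, where "instance of" statements live under `claims` → `P31` and the target item id sits in `mainsnak.datavalue.value.id`. The QID used for 'fossil taxon' below is an assumption to verify on Wikidata, and the function names are hypothetical.

```python
# Assumed QID for the "fossil taxon" item; verify on Wikidata before use.
FOSSIL_TAXON_QID = "Q23038290"


def instance_of_ids(entity):
    """Return the set of item ids this entity is an 'instance of' (P31)."""
    ids = set()
    for statement in entity.get("claims", {}).get("P31", []):
        snak = statement.get("mainsnak", {})
        # "novalue" / "somevalue" snaks carry no datavalue, hence the .get()s.
        value = snak.get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            ids.add(value["id"])
    return ids


def is_fossil_taxon(entity):
    """True if the entity is marked as an instance of 'fossil taxon'."""
    return FOSSIL_TAXON_QID in instance_of_ids(entity)


def extinct_in_wikidata_only(wikidata_extinct_otts, opentree_extinct_otts):
    """OTT ids marked extinct in Wikidata but not in Open Tree."""
    return set(wikidata_extinct_otts) - set(opentree_extinct_otts)
```

The last helper is the report described above: given two sets of OTT ids, it returns the candidates for a missing extinct flag in Open Tree.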

hyanwong commented 6 months ago

> We could also use this to flag any taxon that is marked as extinct in Wikidata but not in Open Tree, so we can report them all.

Yeah, that would be great. There might be a number in the inverts (for example, I seem to remember editing out by hand a number of fossil nautilus species).

davidebbo commented 6 months ago

I did this and found 14807 taxa that are marked as extinct in Wikidata, but not in Open Tree. You can see the list here.

However, I suspect that many are incorrectly marked as extinct in Wikidata. e.g. the first one, Ictiobus, is marked as a fossil taxon, when it's clearly extant.

So it doesn't seem like we can trust either Open Tree or Wikidata when it comes to the extinct flag :(

hyanwong commented 6 months ago

Re extinctness data, it's probably better going with species-level assignments. Marking a whole genus as extinct/extant is notoriously error-prone (since the same genus can contain both extinct and extant members).

Looking at that list, most of the species marked (rather than the genera) seem truly extinct, some very recently so.

davidebbo commented 6 months ago

Ah yes, great point about extinct for higher taxa. I have updated that same list to only contain species, bringing the count to 11651.

So it sounds like a large number of these are indeed extinct, and hence represent a missing extinction flag in the Open Tree taxonomy. Clearly, we're not going to report them one by one. Is it worth trying to do something about it?

hyanwong commented 6 months ago

I can mention it to OpenTree, but I think we are likely to want to maintain our own additional list where we override the extinct/extant flag given by OpenTree. I imagine this could be in the form of a text file that we commit to the GitHub repo so that OpenTree can get a list to look at from us, if need be.

There might also be one or two species that we want to add to that list by hand. And in addition we might want to identify those species that are "recently" extinct (e.g. in the last 2000 years), as we tend to include those on the OneZoom tree too (e.g. dodo, elephant bird, moas).

Thinking about how to maintain and modify such a list, as well as how to incorporate it into the OneZoom trees in the future, could be a useful thing?
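As one possible starting point, here is a sketch of how such an override file could be read and applied. The file format is entirely hypothetical: one taxon per line, tab-separated, with the OTT id, then `extinct` or `extant`, then an optional name for human readers, and `#` for comments. The function names are hypothetical too.

```python
def parse_overrides(lines):
    """Parse a hypothetical override file into {ott_id: is_extinct}.

    Assumed line format: "<ott_id>\t<extinct|extant>\t<optional name>".
    Blank lines and lines starting with '#' are ignored.
    """
    overrides = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        ott_id, status = line.split("\t")[:2]
        if status not in ("extinct", "extant"):
            raise ValueError(f"unknown status {status!r} for OTT {ott_id}")
        overrides[int(ott_id)] = status == "extinct"
    return overrides


def is_extinct(ott_id, opentree_extinct, overrides):
    """Use our override when present, else fall back to the OpenTree flag."""
    return overrides.get(ott_id, ott_id in opentree_extinct)
```

Keying the file on OTT id rather than name would keep it stable across synonym changes, and a plain text file diffs cleanly when committed to the repo.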

davidebbo commented 6 months ago

I added a comment to https://github.com/OpenTreeOfLife/feedback/issues/578 to let them know of the more general issue.

I also updated the list to include the OTT and the scientific name (instead of the less precise Wikidata name).

Right now, it's in a gist, but we can add it more formally to the repo later. And I have a (sort of dirty) tool to regenerate the list.

Identifying those that are recently extinct is an interesting challenge, because as far as I can tell, Wikidata does not have that information. Instead, we could go to English Wikipedia, and scrape it from the taxobox.

e.g. for Acinonyx pardinensis, the Wikipedia page taxobox shows a temporal range of Late Pliocene–Middle Pleistocene, and we could scrape that and map it to a date (say 1.3 Ma).

But for the Thylacine, it says "Early Pleistocene–Holocene", which is useless here (it gives no extinction date). Lower down in the box, it says 'Extinct (1936) (IUCN 3.1)', but it looks ad hoc, and not easily scrapable in an automated way.
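To make the difficulty concrete, here is a small sketch that pulls a year out of a status string shaped like the Thylacine one and applies the "last 2000 years" cutoff mentioned earlier. It assumes the taxobox text has already been extracted to a plain string and only handles the `Extinct (YYYY)` pattern, so any other wording returns `None`; the function names are hypothetical.

```python
import re

# Matches e.g. "Extinct (1936)" or "Extinct (c. 1662)"; other phrasings
# are not handled and would need their own patterns.
_EXTINCT_YEAR = re.compile(r"Extinct\s*\(\s*(?:c\.\s*)?(\d{3,4})\s*\)", re.IGNORECASE)


def extinction_year(status_text):
    """Return the extinction year found in the status text, or None."""
    match = _EXTINCT_YEAR.search(status_text)
    return int(match.group(1)) if match else None


def recently_extinct(status_text, current_year, window_years=2000):
    """True if an extinction year is found and falls within the window."""
    year = extinction_year(status_text)
    return year is not None and current_year - year <= window_years
```

A temporal range string such as the one for Acinonyx pardinensis yields `None` here, which is the gap that a curated source like the Holocene extinctions timeline would have to fill.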

As an alternative, we could scrape this very nice Timeline of extinctions in the Holocene page. I hate scraping, but it seems that's often the only way to get good data in this business...

davidebbo commented 6 months ago

Popping back to the original topic of this thread, if I exclude all species that are marked as extinct on either OT or Wikidata, we end up with 14588 species that are in the Open Tree but not in OneZoom. Full list here. So we further reduced it by 454.

Not sure how much further I can take this, so I will leave it there for now.