OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images
MIT License
1 stars 2 forks source link

New tree does not have the correct species name for the dog #56

Closed hyanwong closed 4 months ago

hyanwong commented 4 months ago

After a bit of scrambling, I managed to upload the new OpenTree to OneZoom (hurrah!), but it appears as if we are not correctly recognising the name "Canis lupus familiaris" as a subspecies, so we are allow it to be the tip of the tree (see https://www.onezoom.org/life/@Canis_lupus_familiaris=247333). In OneZoom, we don't allow subspecies to be shown, and there should be code in the tree-building pipeline to recognise this as a subspecies (e.g. because there are 3 words), and remove the node(s) until we are only left with a terminal tip labelled "Canis lupus". Something is going wrong here, and the tree parsing behaviour looks like it has changed.

Screenshot 2024-05-08 at 07 39 44

Note that, however, we need to leave the _Canis_lupusfamiliaris node in the tree until almost the last step, so that we can pick up the popularity of the subspecies nodes into the species node. We should then trim off the subspecies tip to be left with only the species in the final output tree.

hyanwong commented 4 months ago

I'm going to hack this on the DB on the main site, but we should fix it long-term

davidebbo commented 4 months ago

I will look at this today. But didn't the gray wolf use to have a separate leaf from the dog, maybe as a sister?

AllLife_full_tree.phy has this. I assume this is correct, but that the problems happens later, in the CSV_base_table_creator phase:

  (
    (
      Canis_simensis_ott752755:1.123458,
      Canis_latrans_ott247331:1.123458
    ):0.0,
    (
      Canis_lupus_familiaris_ott247333:1e-06
    )Canis_lupus_ott247341:1.123457
  ):0.160494,
hyanwong commented 4 months ago

Thanks. No hurry, but it would be good to sort.

Most authorities treat the dog a a subspecies of wolf, so on previous OneZoom instances, we had only a single leaf for both, like this:

Screenshot 2024-05-08 at 08 48 54
  (
    (
      Canis_simensis_ott752755:1.123458,
      Canis_latrans_ott247331:1.123458
    ):0.0,
    (
      Canis_lupus_familiaris_ott247333:1e-06
    )Canis_lupus_ott247341:1.123457
  ):0.160494,

Yes, this is right. The Canis_lupus_familiaris_ott247333 node should be pruned off near the end of the script, leaving just Canis_lupus_ott247341 at time=0. The same should happen for any other subspecies in the tree: they should be removed and the leaf or leaves collapsed to a single "species tip".

davidebbo commented 4 months ago

Can you confirm that remove_unifurcations_keeping_higher_taxa is the logic that's supposed to do this pruning? I don't recall modifying this, but I will debug it to try to understand why Canis_lupus_familiaris is not getting replaced by Canis_lupus.

hyanwong commented 4 months ago

I'm not sure that's where the subspecies removal happens. Let me have a look through to remind myself.

hyanwong commented 4 months ago

I think it's probably done in prune_children_of_otts. That function takes a set of OTT ids (that represent all the species in the tree) and deletes any children (e.g. subspecies) below them. So the original tree will have e.g.

(Canis_lupus_familiaris_ott247333:1e-06)Canis_lupus_ott247341:1.123457)

But since the node Canis_lupus_ott247341 is a species, we will identify it as a species in the taxonomy, then remove all its children before outputting.

davidebbo commented 4 months ago

Thank you, I will start debugging this now and see what I can find.

hyanwong commented 4 months ago

Great, thanks so much.

davidebbo commented 4 months ago

Ok, I see the problem:

    # check for extinction props
    if (
        (nd.num_child_nodes() == 1)
        and (list(nd.child_nodes())[0].num_child_nodes() == 0)
        and (list(nd.child_nodes())[0].edge.length)
    ):
        trim_me = False  # this is an extinction prop

Here:

So we treat it as an extinction prop and don't trim it.

But this 1e-06 is completely bogus, as it means 1 year! This was caused by my change a54dea83ab8421e417887d05f746e9f92d9f3ad0 to make the tree fully ultrametric. To preserve ultrametricity, I think the right thing to do is to give the year back to Canis lupus, changing:

(Canis_lupus_familiaris_ott247333:1e-06)Canis_lupus_ott247341:1.123457)

to

(Canis_lupus_familiaris_ott247333:0)Canis_lupus_ott247341:1.123458)
hyanwong commented 4 months ago

Ah, nicely figured out. Yes, a length of 0 makes much more sense here, as the species is still extant. I guess I just edit that file.

hyanwong commented 4 months ago

I think this might be the only case where we have allocated a non-zero branch between a species and a subspecies? I guess it might be helpful to warn in the ultrametricity-forcing algorithm if this happens? But I don;t know if we will use it again, at all.

davidebbo commented 4 months ago

I guess it might be helpful to warn in the ultrametricity-forcing algorithm if this happens?

Yes, it would be good. The code I have in fix_ultrametricity has:

https://github.com/OneZoom/tree-build/blob/66ed8a31a99ad4bcec1dfd18b481c1a5e14f04f3/oz_tree_build/newick/fix_ultrametricity.py#L20-L22

But here, the length is 0, not None, so we end up adjusting it and messing it up. I guess if it is either 0 or None, we should leave it alone.

Though come to think of it, is having Canis_lupus_familiaris_ott247333:0 even correct, versus having Canis_lupus_familiaris_ott247333?

I think part of the problem is our extinct props semantic is kind of clashing with this odd case of having a single dog subspecies.

Most authorities treat the dog a a subspecies of wolf

And maybe going off a tangent, but I think we should special case this here. It may be a subspecies, but from a cultural and conservation standpoint, treating Gray Wolves and dogs as one node does not make much sense. The same could be said for the dingo, to a lesser extinct. Plus we're missing out on some likely sponsorships! :)

davidebbo commented 4 months ago

Something along the lines of (with proper edge lengths added):

  (
    (
      Canis_simensis_ott752755,
      Canis_latrans_ott247331
    ),
    (
      Canis_lupus_ott247341,
      (
        Canis_lupus_familiaris_ott247333,
        Canis_lupus_dingo_ott380529
      )"Domesticated_gray_wolves"
    )
  );

And yes, it violates the 'no subspecies' tenet, but what's the harm?

hyanwong commented 4 months ago

You are right that we could simply remove the length altogether (not even have a 0).

There are many other cases in the OpenTree where there are one or even multiple subspecies of a single species. In all cases, I think we should have a null or 0-length branch to the subspecies, as this reflects biological reality (the species is not extinct).

I think we should special case this here

We debated this at the time (and e.g. cats are considered separate species). The problem is that there are lots of animals with interesting subspecies (e.g. giraffes), and there are also quite a lot of dog subspecies (https://tree.opentreeoflife.org/opentree/argus/ottol@247341/Canis-lupus). Once you get to the subspecies level, different people have different naming conventions, and you get quite a lot of taxonomic churn. It's also much less clear whether a tree is the right metaphor between species, as there's often a lot of hybridisation. So generally, having a rigid cutoff at species makes our lives quite a lot easier. But as you say, maybe dogs are an exception.

The sponsorship argument is a bit moot, as we have banned the dog from sponsorship anyway (or at least, we require a very high price in the thousands of pounds/euros/dollars).

hyanwong commented 4 months ago

Something along the lines of (with proper edge lengths added):

  (
    (
      Canis_simensis_ott752755,
      Canis_latrans_ott247331
    ),
    (
      Canis_lupus_ott247341,
      (
        Canis_lupus_familiaris_ott247333,
        Canis_lupus_dingo_ott380529
      )"Domesticated_gray_wolves"
    )
  );

Yep, but why stop at 2 subspecies? I think we would at least also have to have Canis lupus lupus otherwise wolves would not be a tip on the tree, and then it's debatable about whether we should also have some of the other wolf subspecies too? We would need to work out a sensible branching order for the ones we chose too.

davidebbo commented 4 months ago

why stop at 2 subspecies

I guess I view species resulting from ultra fast pace human breeding as different, but I see that would open a can of worms.

Anyway, no need to debate the point here, and the discussion can always be revived later.

So I think we can simply remove the edge length from Canis_lupus_familiaris, and all will be well:

I'll make the change.

hyanwong commented 4 months ago

Anyway, no need to debate the point here, and the discussion can always be revived later.

Yep, we could open another issue and chat about it with James. A quick fix now is all that we need, I think, and we can discuss more complex subspecies ideas later.

So I think we can simply remove the edge length from Canis_lupus_familiaris, and all will be well:

Perfect, thanks.