fathomnet / community-feedback

Flustrina occurs twice in the worms taxon #203

Closed hohonuuli closed 1 week ago

hohonuuli commented 1 week ago

From @kevinsbarnard:

This query to the fast WoRMS API is causing a 504 Gateway Timeout response:

GET https://database.fathomnet.org/worms/ancestors/Smithsonius%20dorothea

Any idea why? Joost brought this to me.

hohonuuli commented 1 week ago

I put together the following Python script to check for a circular relationship in the WoRMS tree:

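#!/usr/bin/env python
"""Walk up the WoRMS tree one parent at a time and report any cycle."""

import sys

import requests

BASE_URL = "http://database.fathomnet.org:8888/parent/"


def get_parent(name):
    """Return the parent name of `name`, or None if the lookup fails.

    A failed lookup (HTTP error or a body that is not JSON) is treated as
    having reached the top of the tree.
    """
    try:
        r = requests.get(BASE_URL + name, timeout=30)
        r.raise_for_status()
        return r.json()
    except (requests.RequestException, ValueError):
        return None


def main(name):
    # Names seen so far, in order; a repeat means the tree has a cycle.
    seen = [name]
    while True:
        parent = get_parent(name)
        if parent is None:
            print("done")
            return
        print(parent)
        if parent in seen:
            print("Circular relation ... " + parent + " already in ancestor tree")
            return
        seen.append(parent)
        name = parent


if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit("usage: walk_parents NAME")
    print("Walking up the tree, one parent at a time")
    print(sys.argv[1])
    main(sys.argv[1])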

Running it produced:

❯ ./walk_parents 'Smithsonius dorothea'
Walking up the tree, one parent at a time
Smithsonius dorothea
Smithsonius
Tessaradomidae
Lepralielloidea
Flustrina
Flustridae
Flustroidea
Flustrina
Circular relation ... Flustrina already in ancestor tree

hohonuuli commented 1 week ago

I grepped the taxon.txt file from WoRMS for Flustrina and attached the output. I think the issue is the following two lines (I edited them for brevity). There are two rows with the same scientific name (but different accepted names and AphiaIDs):

taxonID scientificNameID    acceptedNameUsageID parentNameUsageID   namePublishedInID   scientificName  acceptedNameUsage   parentNameUsage 
urn:lsid:marinespecies.org:taxname:153575   urn:lsid:marinespecies.org:taxname:153575   urn:lsid:marinespecies.org:taxname:153575   urn:lsid:marinespecies.org:taxname:110722           Flustrina       Flustrina       Cheilostomatida
urn:lsid:marinespecies.org:taxname:759713   urn:lsid:marinespecies.org:taxname:759713   urn:lsid:marinespecies.org:taxname:110909   urn:lsid:marinespecies.org:taxname:110749       Flustrina   Carbasea    Flustridae

We use the scientific name to help us resolve former names of taxa. As an example, I ran cat taxon.txt | grep 'Loligo opalescens', which returned a single row:

taxonID scientificNameID    acceptedNameUsageID parentNameUsageID   namePublishedInID   scientificName  acceptedNameUsage   parentNameUsage
urn:lsid:marinespecies.org:taxname:341883   urn:lsid:marinespecies.org:taxname:341883   urn:lsid:marinespecies.org:taxname:574540   urn:lsid:marinespecies.org:taxname:138139       Loligo opalescens   Doryteuthis opalescens  Loligo
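
The same check can be run over the whole file rather than one name at a time. The following is a minimal sketch, assuming taxon.txt is tab-separated with a header row that includes the taxonID and scientificName columns shown above. The function name find_duplicate_names and the in-memory sample are illustrative only, and the sample rows are trimmed to three columns.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_names(tsv_file):
    """Map each scientificName that occurs on more than one row to its taxonIDs."""
    ids_by_name = defaultdict(list)
    # Assumption: tab-separated columns, no quoting, first line is the header.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        ids_by_name[row["scientificName"]].append(row["taxonID"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Tiny sample built from the rows quoted in this thread (columns trimmed).
PREFIX = "urn:lsid:marinespecies.org:taxname:"
SAMPLE = (
    "taxonID\tacceptedNameUsageID\tscientificName\n"
    f"{PREFIX}153575\t{PREFIX}153575\tFlustrina\n"
    f"{PREFIX}759713\t{PREFIX}110909\tFlustrina\n"
    f"{PREFIX}341883\t{PREFIX}574540\tLoligo opalescens\n"
)

duplicates = find_duplicate_names(io.StringIO(SAMPLE))
for name, ids in duplicates.items():
    print(name, ids)
```

To run it against the real file, pass an open file handle instead of the sample, for example open("taxon.txt", newline="", encoding="utf-8").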

Flustrina.tsv.zip

hohonuuli commented 1 week ago

It looks like the issue is in Data.scala. Some issues to resolve:

  1. The namesMap inserts all names. Because WoRMS uses some names more than once, inserting a duplicate evicts the earlier entry for that name from the Map.
    • Change the logic in the data class to look up the WormsConcept by its AphiaID, which is stored in the WormsConcept as id, then use that for node resolution. Currently we look up by name.
  2. Sort out what to do about endpoints that take a name and return a single instance. Options include:
    • All name APIs return an array of values
    • Do some resolution to return the best (most accepted?) value.
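
The eviction problem and the two options above can be illustrated with a short Python sketch. This is not the code in Data.scala: the Concept class is a hypothetical stand-in for WormsConcept, and its field names other than id are assumptions. The numeric ids are the AphiaIDs from the two Flustrina rows quoted earlier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Hypothetical stand-in for WormsConcept; only the fields needed here."""
    id: int           # AphiaID of this row
    name: str         # scientific name
    accepted_id: int  # AphiaID of the accepted name
    parent_id: int    # AphiaID of the parent


concepts = [
    Concept(153575, "Flustrina", 153575, 110722),
    Concept(759713, "Flustrina", 110909, 110749),
]

# Keyed by name: the second Flustrina silently replaces the first.
by_name = {c.name: c for c in concepts}

# Keyed by AphiaID: every row survives, and each name maps to a list of ids.
by_id = {c.id: c for c in concepts}
ids_by_name = defaultdict(list)
for c in concepts:
    ids_by_name[c.name].append(c.id)


def lookup_all(name):
    """Option 1: return every concept that carries this name."""
    return [by_id[i] for i in ids_by_name.get(name, [])]


def lookup_best(name):
    """Option 2: prefer the concept whose own id is its accepted id."""
    matches = lookup_all(name)
    accepted = [c for c in matches if c.id == c.accepted_id]
    return (accepted or matches or [None])[0]


print(len(by_name), len(by_id))
print(lookup_best("Flustrina"))
```

With the name-keyed dictionary only one Flustrina remains, which is how a parent lookup by name can land on the wrong row and loop.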

hohonuuli commented 1 week ago

@BGWoodward @kevinsbarnard This is resolved in worms-server 0.7.0, and I've deployed the changes to production. Be aware that duplicate names that aren't accepted names (i.e. the taxon was renamed) will have a number appended, so one of the Flustrina entries is now Flustrina 1.
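
For readers wondering what such renaming might look like, here is a rough Python sketch of one possible scheme. It is an illustration only and not the worms-server 0.7.0 implementation. The function name disambiguate, the tuple layout, and the numbering order are all assumptions.

```python
from collections import Counter


def disambiguate(rows):
    """Return {aphia_id: display_name} for (aphia_id, name, accepted_id) rows.

    A row keeps its name if the name is unique or the row is the accepted
    one (aphia_id == accepted_id). Other rows that share a name get a
    number appended: " 1", " 2", and so on, in input order.
    """
    rows = list(rows)
    name_counts = Counter(name for _, name, _ in rows)
    next_suffix = Counter()
    display = {}
    for aphia_id, name, accepted_id in rows:
        is_accepted = aphia_id == accepted_id
        if name_counts[name] > 1 and not is_accepted:
            next_suffix[name] += 1
            display[aphia_id] = f"{name} {next_suffix[name]}"
        else:
            display[aphia_id] = name
    return display


# AphiaIDs taken from the rows quoted earlier in this thread.
rows = [
    (153575, "Flustrina", 153575),
    (759713, "Flustrina", 110909),
    (341883, "Loligo opalescens", 574540),
]
display_names = disambiguate(rows)
print(display_names)
```

In this sketch Loligo opalescens is left alone even though it is not an accepted name, because no other row shares it.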