TreeBASE / treebase

Source code for TreeBASE web application and database
http://www.treebase.org
BSD 3-Clause "New" or "Revised" License

Taxon validation uses wrong matching algorithm #65

Open ghost opened 15 years ago

ghost commented 15 years ago

BP says:

> I did a taxon validation thing and got the attached results at the first pass.
>
> 3- When I follow one of the "validate by hand" links, I see a list like this <http://8ball.sdsc.edu:6666/treebase-web/user/editTaxonLabel.html?taxonlabelid=237178> one for Actinomucor elegans: but it lists three identical items to choose from, each having identical ncbi_taxids. We should never really see a list of multiple hits each with the same ncbi_taxid, but we can see multiple hits on the same ubio_namebankid.

[ Bill's output page is attached below, and I reported the triple display of Actinomucor elegans as bug 2712234. However, Bill goes on to say: ]

> ... so on the face of it, it's just a display bug, and that there is probably an easy fix to make each of the three options show a distinct taxon name and ncbi_taxid. However, I don't think this should be happening either. The label "Actinomucor elegans" ought to only match with one of the three. Is this happening because we have implemented a wildcard search? (i.e. LIKE 'Actinomucor elegans%'). If so, that doesn't seem right to me -- we should be doing something like this:
>
> 1- remove any suffixes from the taxon_label that don't look like they are part of the name string (i.e. remove suffixes that contain numbers or that have upper case letters or that have a very short length).
>
> 2- take what remains in the taxon_label and try to match it against the taxon_variant fullnamestring. If it hits, then count the number of related taxa. If there is more than one, then make the "match by hand" warning and list the multiple names and ncbi_taxids to choose from.
>
> 3- If you don't get a taxon_variant fullnamestring match, then SOAP over to uBIO and see if it's there & if so try to collect namebank and ncbi_taxids.
>
> Notice that this does not use any wildcard searching of the fullnamestring -- so in general "Actinomucor elegans" should not find a match with "Actinomucor elegans var. elegans" (etc).
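For discussion, here is a minimal Python sketch of the three steps Bill describes. It is not the TreeBASE implementation. Every name in it (`Taxon`, `MatchResult`, `strip_suffixes`, `validate_label`, `ubio_lookup`) is hypothetical, the taxon_variant table is stood in for by a plain dict keyed on fullnamestring, the uBIO SOAP call is an injected callable, and the "very short" threshold of three characters is an assumption.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Taxon(NamedTuple):
    name: str
    ncbi_taxid: int


class MatchResult(NamedTuple):
    status: str          # "matched", "ambiguous", "ubio", or "unmatched"
    taxa: List[Taxon]


MIN_TOKEN_LENGTH = 3     # assumption: what counts as "very short" is not specified


def looks_like_suffix(token: str) -> bool:
    """Step 1 heuristic: digits, upper-case letters, or very short length."""
    return (
        len(token) < MIN_TOKEN_LENGTH
        or any(c.isdigit() for c in token)
        or any(c.isupper() for c in token)
    )


def strip_suffixes(label: str) -> str:
    """Drop trailing tokens that do not look like part of the name string."""
    tokens = label.split()
    # Always keep the first token (the capitalised genus name).
    while len(tokens) > 1 and looks_like_suffix(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)


def validate_label(
    label: str,
    variants: Dict[str, List[Taxon]],
    ubio_lookup: Optional[Callable[[str], List[Taxon]]] = None,
) -> MatchResult:
    name = strip_suffixes(label)

    # Step 2: exact lookup on fullnamestring, no wildcard / LIKE matching.
    hits = variants.get(name, [])
    # Collapse hits that share an ncbi_taxid so the same taxon is never listed twice.
    unique = list({t.ncbi_taxid: t for t in hits}.values())
    if len(unique) == 1:
        return MatchResult("matched", unique)
    if len(unique) > 1:
        return MatchResult("ambiguous", unique)   # "match by hand" case

    # Step 3: fall back to the external name service (stand-in for the uBIO call).
    if ubio_lookup is not None:
        remote = ubio_lookup(name)
        if remote:
            return MatchResult("ubio", remote)
    return MatchResult("unmatched", [])
```

Because step 2 is an exact dictionary lookup, a label of "Actinomucor elegans" cannot pick up the entry for "Actinomucor elegans var. elegans", and a label such as "Actinomucor elegans TB123" is reduced to "Actinomucor elegans" before the lookup.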

Reported by: mjdominus

ghost commented 15 years ago

Original comment by: mjdominus

ghost commented 15 years ago

Original comment by: rvosa

ghost commented 15 years ago

Bill said this part can be done post-beta.

It is the ultimate cause of #2712234, which is more urgent.

Original comment by: mjdominus

ghost commented 15 years ago

Original comment by: mjdominus