Closed by jar398 7 years ago
We are thinking about introducing all kinds of vocabularies to deal with things like that. And yes, it is often a precision vs. recall decision: some things we do not introduce for fear of a drop in precision, or, the other way around, a drop in recall, as for example with authors in all caps.
Obviously, retitle the issue if you would like a title that's more informative.
From @mdoering ---
This is indeed a never-ending challenge. I am interested in populating this class as well as I can:
Specifically, that means either flagging names as virus names via NameType, in which case the name is not parsed at all but kept as the full string, or trying to extract the strain information into an extra field, similar to cultivar names.
The current name type enumeration is this: It is a rather pragmatic list as I needed it to distinguish classes of names.
There are some tests, including this one for strain parsing:
But we also have an open issue about parsing strains better which you might want to make use of:
egrep " [A-Z][a-z ]+.*[A-Z].* (19|20)[0-9][0-9] " ~/a/ot/repo/reference-taxonomy/tax/ncbi/taxonomy.tsv | fgrep -v "." >tmp.tmp
(should also work with the names.txt file that ships with NCBI)
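For anyone who wants to reproduce the filter without egrep/fgrep, here is a rough Python sketch of the same two steps: keep lines matching the pattern, then drop any line containing a period. The function name `candidate_lines` is made up for illustration. The pattern is copied literally from the command above, including the trailing space after the year; if the file separates fields with tabs, that last character would presumably need adjusting.

```python
import re

# Literal translation of the egrep pattern above: a capitalized word,
# a later capital letter, then a space-delimited year 1900-2099.
PATTERN = re.compile(r" [A-Z][a-z ]+.*[A-Z].* (19|20)[0-9][0-9] ")


def candidate_lines(lines):
    """Yield lines that match PATTERN and contain no period.

    The period check mirrors `fgrep -v "."`, which drops every line
    containing a literal '.' anywhere in it.
    """
    for line in lines:
        if PATTERN.search(line) and "." not in line:
            yield line
```

It can be fed any iterable of strings, for example an open file handle on taxonomy.tsv or names.txt.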
This yields 72 results, many of which are parsed incorrectly. Unfortunately any rules you make for heuristically dealing with these are going to be baroque, and increasingly so as you try to get the false positive and false negative rates down. So I really don't expect this issue to be addressed. But I thought you should know.
Kuraishia capsulata CBS 1993
is a strain, 1993 is not a year
Leishmania donovani Ld 2001
2001 is a strain; Ld is short for 'L. donovani', not an authority
but also I'm impressed by how many gnparse gets right, e.g.
Bat coronavirus China 2005
(China is not an author)
Lumpy skin disease virus Nigeria 1996