GlobalNamesArchitecture / gnparser

Split scientific names to meaningful elements with meta information
https://parser.globalnames.org/
MIT License

Detect Bacteria and figure out if we can distinguish strains from authorships #322

Closed jar398 closed 7 years ago

jar398 commented 7 years ago

egrep " [A-Z][a-z ]+.*[A-Z].* (19|20)[0-9][0-9] " ~/a/ot/repo/reference-taxonomy/tax/ncbi/taxonomy.tsv | fgrep -v "." >tmp.tmp

(should also work with the names.txt file that ships with NCBI)
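For anyone who would rather reproduce this filter without the shell tools, here is a rough Python sketch of the same two steps. The regular expression is copied from the egrep call above; the function name `candidate_lines` is made up for illustration, and the input file path is left to the caller.

```python
import re

# Pattern copied from the egrep call: a capitalised word, some later capital
# letter, then a four-digit year (1900-2099) with a space on either side.
CANDIDATE = re.compile(r" [A-Z][a-z ]+.*[A-Z].* (19|20)[0-9][0-9] ")


def candidate_lines(lines):
    """Yield lines that match the pattern and contain no period.

    The period check mirrors the `fgrep -v "."` step of the pipeline.
    `candidate_lines` is a hypothetical helper name, not part of gnparser.
    """
    for line in lines:
        if "." not in line and CANDIDATE.search(line):
            yield line
```

It can be fed an open file object, e.g. `for line in candidate_lines(open(path)): ...`, where `path` points at taxonomy.tsv or the NCBI names file.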

This yields 72 results, many of which are parsed incorrectly. Unfortunately any rules you make for heuristically dealing with these are going to be baroque, and increasingly so as you try to get the false positive and false negative rates down. So I really don't expect this issue to be addressed. But I thought you should know.

Examples:

but also I'm impressed by how many gnparse gets right, e.g.

dimus commented 7 years ago

We are thinking about introducing all kinds of vocabularies to deal with things like that. And yes, these are often precision vs. recall decisions: some things we do not introduce for fear of lowering precision, and sometimes it is the other way around, as for example with authors in all caps.

jar398 commented 7 years ago

Obviously, retitle the issue if you would like a title that's more informative.

dimus commented 7 years ago

From @mdoering ---

This is indeed a never-ending challenge. I am interested in populating this class as well as I can: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/model/checklistbank/ParsedName.java#L48

Specifically, that means either flagging a name as a virus name via NameType, in which case the name is not parsed at all but kept as the full string, or trying to extract the strain information into an extra field, similar to cultivar names.
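To make the two options concrete, here is a minimal Python sketch. It is not the GBIF implementation: `ParsedNameSketch`, `classify`, the "virus" keyword test and the "strain" / "str." suffix pattern are all assumptions made up for illustration, and the name type strings are only illustrative labels.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedNameSketch:
    """Hypothetical stand-in; not the GBIF ParsedName class."""
    name_type: str                # illustrative labels: "VIRUS" or "SCIENTIFIC"
    scientific_name: str
    strain: Optional[str] = None  # extra field, analogous to a cultivar epithet


# Assumed heuristics, for illustration only.
VIRUS_HINT = re.compile(r"virus\b", re.IGNORECASE)
STRAIN_SUFFIX = re.compile(r"^(?P<name>.+?)\s+(?:strain|str\.)\s+(?P<strain>.+)$")


def classify(raw: str) -> ParsedNameSketch:
    raw = raw.strip()
    # Option 1: flag as a virus and keep the full string unparsed.
    if VIRUS_HINT.search(raw):
        return ParsedNameSketch("VIRUS", raw)
    # Option 2: split a trailing strain designation into its own field.
    m = STRAIN_SUFFIX.match(raw)
    if m:
        return ParsedNameSketch("SCIENTIFIC", m.group("name"), m.group("strain"))
    return ParsedNameSketch("SCIENTIFIC", raw)
```

A real parser would need far richer rules than these two patterns; the sketch only shows where the strain ends up relative to the rest of the name.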

The current name type enumeration is this: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/NameType.java#L23 It is a rather pragmatic list as I needed it to distinguish classes of names.

mdoering commented 7 years ago

There are some tests, including this one for strain parsing: https://github.com/gbif/name-parser/blob/master/name-parser-gbif/src/test/java/org/gbif/nameparser/NameParserTest.java#L5211

But we also have an open issue about parsing strains better which you might want to make use of: http://dev.gbif.org/issues/browse/POR-2699