gbif / name-parser

The core GBIF scientific name parser library
Apache License 2.0

Parse phrase names #79

Open charvolant opened 3 years ago

charvolant commented 3 years ago

I'm interested in integrating parsing of the phrase names used in Australian herbaria (https://www.anbg.gov.au/chah/phrase-names/index.html). Integrating this functionality would remove the need in the ALA for an extended parser.

I have a sample implementation branch in https://github.com/charvolant/name-parser/tree/phrase-names See

What would be needed to turn this into a useful pull request?

mdoering commented 3 years ago

That's a very welcome extension! If you could prepare a pull request with just the phrase name changes, including tests, that would be great. Please keep the additional ranks separate, as these mean some work for integration into COL & GBIF. Happy to add these through a separate PR at a later stage, though.

charvolant commented 3 years ago

I'll send a PR in a couple of days. I've been discovering some extra horrors that require recognition, since a number of examples don't quite follow the formula.

By rights, phrase names shouldn't need extra ranks. Something like "Acacia sp. West Hollow (D.Palmer 7654) WA Herbarium" is implicitly a species. Phrase names are closer to structured informal or placeholder names. It's basically "The WA Herbarium nominates a new, as yet undescribed, species of Acacia, identified as such by D.Palmer with ID 7654 and labelled West Hollow, where we found it. We'll get around to describing and publishing it properly, we promise."

mdoering commented 3 years ago

I wonder how well these names are delimited from other names currently parsed as informal:

Some of these have their phrase parsed into the strain field. Maybe it would be good to merge strain and phrase into a single phrase field.

I am also unclear whether a new PHRASE name type is needed or if we should simply apply INFORMAL while having a special field for the value(s). The ANBG page also calls them informal, so I lean towards the latter right now:

Standardised Informal Names (phrasenames)

Informal names have various forms

  • latinised names qualified by "manuscript", "ms" or "ined.";
  • letters or numbers;
  • phrases, with or without citation of a voucher specimen signified by a sheet number, or the collector name followed by the collector number, collection date or combination of herbarium acronym and sheet number.

charvolant commented 3 years ago

It should be relatively straightforward. A phrase name should have a specific pattern:

  1. Uninomial or binomial
  2. rank marker
  3. Phrase with upper case first letters and no quotation marks (which distinguishes it from a cultivar name)
  4. Voucher in parentheses. The voucher is a combination of a name similar to an author's name and an identifier (usually a letter followed by digits, or just digits). This is the person who "vouches" for a specimen being a distinct species, and the identifier should uniquely identify the voucher. It's distinguishable from a zoological author by there being no comma between the name and the id. This is one of the most wobbly things, since the voucher id is sometimes a date instead, and there are sometimes other notations. But that it's a voucher is generally recognisable.
  5. An optional nominating party, the institution that houses the specimen and wants it recorded.

None of the names listed match that form, so it should be possible to distinguish them.
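
As a rough illustration of the five-part pattern above, here is a minimal Python sketch. It is not the code in the sample branch or in the name parser itself: the regular expression, the short list of rank markers, and the names `PhraseName` and `parse_phrase_name` are all assumptions made for this example, and it deliberately ignores the irregular vouchers mentioned in point 4 beyond allowing any comma-free text in the parentheses.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately short list of rank markers for this sketch.
RANK = r"(?:sp|subsp|ssp|var|f)\."

PHRASE_NAME = re.compile(
    r"^(?P<taxon>[A-Z][a-z]+(?: [a-z]+)?)"            # 1. uninomial or binomial
    r" (?P<rank>" + RANK + r")"                        # 2. rank marker
    r" (?P<phrase>[A-Z][\w'.-]*(?: [A-Z][\w'.-]*)*)"   # 3. capitalised words, no quotes
    r" \((?P<voucher>[^(),]+)\)"                       # 4. voucher: no comma inside
    r"(?: (?P<nominating_party>.+))?$"                 # 5. optional nominating party
)


@dataclass
class PhraseName:
    """Hypothetical container with separate slots for voucher and nominating party."""
    taxon: str
    rank: str
    phrase: str
    voucher: str
    nominating_party: Optional[str]


def parse_phrase_name(name: str) -> Optional[PhraseName]:
    """Return the parts of a phrase name, or None if the pattern does not match."""
    m = PHRASE_NAME.match(name.strip())
    return PhraseName(**m.groupdict()) if m else None


parsed = parse_phrase_name("Acacia sp. West Hollow (D.Palmer 7654) WA Herbarium")
print(parsed)
# A quoted (cultivar-style) phrase and a comma-separated author/year are rejected.
print(parse_phrase_name("Acacia sp. 'West Hollow'"))
print(parse_phrase_name("Acacia sp. West Hollow (Palmer, 1900)"))
```

The sketch keeps the voucher as a single string rather than splitting it into a name and an id, since the id can be a date or some other notation.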

I'm agnostic about merging phrase and strain and I'm quite happy to leave the name as informal, provided that an actual phrase name can be recognised, since we need to be careful about handling them in some cases. Separate slots for voucher and nominating party seem to be essential, since they're different to an author.

Anyway, I'll give it a spin.

charvolant commented 3 years ago

@mdoering Have a look at the draft PR at https://github.com/gbif/name-parser/pull/81

I'm fine with the informal name type but merging strain and phrase makes things unclear and in need of special cases. It works but I would prefer to introduce a separate slot for the phrase.

mdoering commented 3 years ago

Thanks. The reason to merge strain with phrase is that it is hardly possible to distinguish a strain from a phrase in the wider sense. The ANBG phrase names are very controlled, with a rather rigid syntax. But there are other phrase names out there, and currently these are often parsed into strains or even epithets:

Actually all of them end up in the epithet currently, not nice. It seems to be a COL data model problem though, the name parser tests put those into strain...

My point is that I don't see a clear difference between the ANBG phrase names and most of those names, apart from the ANBG phrase names having more structure and the need for some more fields.