Open charvolant opened 3 years ago
That's a very welcome extension! If you could prepare a pull request with just the phrase name changes, including tests, that would be great. Please keep the additional ranks separate, as these mean some work for integration into COL & GBIF. Happy to add these through a separate PR at a later stage, though.
I'll send a PR in a couple of days. I've been discovering some extra horrors that require recognition, since a number of examples don't quite follow the formula.
By rights, phrase names shouldn't need extra ranks. Something like "Acacia sp. West Hollow (D.Palmer 7654) WA Herbarium" is implicitly a species. Phrase names are closer to structured informal or placeholder names. It's basically "The WA Herbarium nominates a new, as yet undescribed, species of Acacia, identified as such by D.Palmer with ID 7654 and labelled West Hollow, where we found it. We'll get around to describing and publishing it properly, we promise."
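To make that reading concrete, here is a minimal Python sketch that splits the example above into its parts. It is not the parser's implementation: the regular expression, the `parse_phrase_name` helper and the group names are assumptions made for illustration, and it only covers the species-level form shown in this comment (genus, "sp.", phrase, voucher in parentheses, optional nominating party).

```python
import re

# Hypothetical pattern, modelled only on the example in this thread:
#   Genus sp. Phrase (Voucher) Nominating party
PHRASE_NAME = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"   # capitalised genus
    r"(?P<rank>sp\.)\s+"            # rank marker; only "sp." is handled here
    r"(?P<phrase>[^()]+?)\s*"       # free-text phrase, up to the parenthesis
    r"\((?P<voucher>[^()]+)\)"      # voucher: collector plus number
    r"(?:\s+(?P<party>.+))?$"       # optional nominating party
)


def parse_phrase_name(name):
    """Return the named parts as a dict, or None if the name does not fit."""
    match = PHRASE_NAME.match(name.strip())
    return match.groupdict() if match else None


parts = parse_phrase_name("Acacia sp. West Hollow (D.Palmer 7654) WA Herbarium")
print(parts)
```

A name without a parenthesised voucher, such as "Acacia sp. A", does not match and returns `None`, so it would stay with whatever informal handling already exists.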
I wonder how well these names are delimited from other names currently parsed as informal:
Some of these have their phrase parsed into the strain field. Maybe it would be good to merge strain and phrase into a single phrase field.
I am also unclear whether a new PHRASE name type is needed or if we should simply apply INFORMAL while having a special field for the value(s). The ANBG page also calls them informal, so I lean towards the latter right now:
Standardised Informal Names (phrasenames)
Informal names have various forms
- latinised names qualified by "manuscript", "ms" or "ined.";
- letters or numbers;
- phrases, with or without citation of a voucher specimen signified by a sheet number, or the collector name followed by the collector number, collection date or combination of herbarium acronym and sheet number.
It should be relatively straightforward. A phrase name should have a specific pattern:
None of the names listed match that form, so it should be possible to distinguish them.
I'm agnostic about merging phrase and strain and I'm quite happy to leave the name as informal, provided that an actual phrase name can be recognised, since we need to be careful about handling them in some cases. Separate slots for voucher and nominating party seem to be essential, since they're different to an author.
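As a rough illustration of what separate slots could look like, the Python sketch below keeps the name type as informal and stores the phrase, voucher and nominating party apart from authorship. The `InformalName` class, its field names and the rule used in `is_phrase_name` are all hypothetical and do not reflect the parser's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InformalName:
    """Hypothetical record; the field names are illustrative only."""
    genus: str
    rank: str
    name_type: str = "INFORMAL"             # no separate PHRASE type
    phrase: Optional[str] = None            # e.g. "West Hollow"
    voucher: Optional[str] = None           # e.g. "D.Palmer 7654"
    nominating_party: Optional[str] = None  # e.g. "WA Herbarium"
    strain: Optional[str] = None            # kept apart from the phrase
    authorship: Optional[str] = None        # not the same thing as a voucher

    def is_phrase_name(self) -> bool:
        # Assumption: a strict phrase name carries both a phrase and a voucher.
        return self.phrase is not None and self.voucher is not None

    def label(self) -> str:
        # Rebuild the display form from the separate slots.
        parts = [self.genus, self.rank]
        if self.phrase:
            parts.append(self.phrase)
        if self.voucher:
            parts.append(f"({self.voucher})")
        if self.nominating_party:
            parts.append(self.nominating_party)
        return " ".join(parts)


name = InformalName("Acacia", "sp.", phrase="West Hollow",
                    voucher="D.Palmer 7654", nominating_party="WA Herbarium")
print(name.name_type, name.is_phrase_name(), name.label())
```

With this layout, downstream code can check `is_phrase_name()` when it needs the careful handling, while everything else keeps treating the record as an ordinary informal name.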
Anyway, I'll give it a spin.
@mdoering Have a look at the draft PR at https://github.com/gbif/name-parser/pull/81
I'm fine with the informal name type but merging strain and phrase makes things unclear and in need of special cases. It works but I would prefer to introduce a separate slot for the phrase.
Thanks. The reason to merge strain with phrase is that it is hardly possible to tell a strain from a phrase in the wider sense. The ANBG phrase names are very controlled, with a rather rigid syntax. But there are other phrase names out there, and currently these are often parsed into strains or even epithets:
Actually all of them end up in the epithet currently, not nice. It seems to be a COL data model problem though, the name parser tests put those into strain...
My point is that I don't see a clear difference between the ANBG phrase names and most of those names, apart from the ANBG phrase names having more structure and needing some more fields.
I'm interested in integrating parsing of the phrase names used in Australian herbaria (https://www.anbg.gov.au/chah/phrase-names/index.html). Integrating this functionality would remove the need in the ALA for an extended parser.
I have a sample implementation branch in https://github.com/charvolant/name-parser/tree/phrase-names See
What would be needed to turn this into a useful pull request?