CatalogueOfLife / data

Repository for COL content

"Author" in author string of some names #371

Open mdoering opened 2 years ago

mdoering commented 2 years ago

https://www.catalogueoflife.org/data/search?q=author

yroskov commented 2 years ago

Examples from WoRMS:

image

image

yroskov commented 2 years ago

Examples from Kew database:

image

image

yroskov commented 2 years ago

Examples from ILDIS:

image

image

Corrected: names marked as "provisionally accepted"

yroskov commented 2 years ago

Example from World Plants:

image

"Author not stated" in IPNI as well.

yroskov commented 2 years ago

I do not regard this as a problem. The CoL is publishing these data correctly, as they appear in the source databases. As soon as the authorstrings are resolved in WoRMS, ILDIS and World Plants, the corrected data will appear in the CoL.

In the case of the Kew checklist, the authorstring included a comment, which is a parser problem. However, I am always happy to see authorstrings in full, as they are presented by the data provider.

TonyRees commented 2 years ago

I can comment on this from the WoRMS + IRMNG perspective ... let us say you have a name Andromeda Smith (I just made it up, it probably does not exist...). Then you have an instance of Andromeda as a separate taxon somewhere else in the taxonomic tree, authored by someone else (i.e., a homonym); the source of the name does not specify the authorship, but you wish to enter it as a separate taxon anyway for whatever reason (e.g. so that its species can be included). Now, the data entry software created for WoRMS, also now used for IRMNG, considers "Andromeda Smith" and "Andromeda" (without authorship) as the same name and will not let you enter a second instance, so the second one has to be distinguished by some different content in the "authority" field ... so for IRMNG I might enter it as e.g. "Andromeda [author]", or "Andromeda [author?]", or similar... just a workaround for convenience, until the correct authorship can be determined. Or if I am aware that the second one is a published misspelling, but wish to enter it anyway, I might put "Andromeda [misspelling]", thus [misspelling] will then end up in the authorship field - as I say, something is needed there so that the system will accept the name via the data entry form. (Of course this only applies to homonyms; names that do not already occur in the system can be entered without authorship if needed, as has happened in some small % of cases in the past.)

However in other cases that you give, such as "not accepted by author", that is really a comment and should be parsed out separately...

The above is really suboptimal, I will admit; perhaps it would be better for any records that I create as (e.g.) "Andromeda [author?]" to have the "[author?]" portion stripped out again by the database administrators, leaving this field blank (which can be done on request). Same with "[misspelling]". Basically I am trying to force a comment into the authorship field, for software-controlled reasons; however, there probably is not anywhere else to put it...
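The stripping step described above could be sketched roughly as follows. This is only an illustration: the placeholder list is taken from the examples in this thread, and the function name is hypothetical, not part of the WoRMS or IRMNG software. It assumes the authorship field contains nothing but the bracketed workaround token.

```python
import re

# Hypothetical list of workaround tokens, taken from the examples in this thread.
PLACEHOLDERS = ("author", "author?", "misspelling")

# Matches an authorship value that consists only of one bracketed placeholder.
_PLACEHOLDER_RE = re.compile(
    r"^\[(?:" + "|".join(re.escape(p) for p in PLACEHOLDERS) + r")\]$",
    re.IGNORECASE,
)


def strip_placeholder_authorship(authorship):
    """Return (clean_authorship, note).

    If the value is only a placeholder, the authorship becomes blank and the
    removed token is returned as a note so the comment is not lost.
    """
    value = (authorship or "").strip()
    if _PLACEHOLDER_RE.match(value):
        return "", value
    return value, None
```

Real authorships such as "Smith" pass through unchanged, so the step would be safe to run over a whole export.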

Cheers - Tony

mdoering commented 2 years ago

> I do not regard this as a problem. The CoL is publishing these data correctly, as they appear in source databases. As soon as authorstrings will be resolved in WoRMS, ILDIS and World Plants, data will appear in the CoL.

Yes, but it is the role of COL to do editorial work on top of the supplied data. We do this in various places, so I would clearly expect that to happen also in those authorstring cases above. An editorial decision would help to remove these problems!

> In the case of Kew checklist, the authorstring included a comment - the parser problem. However, I always happy to see authorstrings in full, as they presented by the data provider.

Parser or editorial problem. It is impossible to parse out all problems automatically; that's why we have editorial decisions in the software to fix such things!


mdoering commented 2 years ago

@yroskov or should we take this to the taxonomic group and ask for advice? I usually see these problems from a user's perspective, and those cases are not expected.

yroskov commented 2 years ago

What can the TG or CoL do in such cases? It is entirely up to the GSDs and their taxonomists.

The more changes are applied on the CoL side (as well as in GBIF), the messier the data will soon become. Neither CoL nor GBIF is a "data creator". It's important to understand this.

dhobern commented 2 years ago

It is important that we understand the structure of the data we received and that we do all we can to ensure that users of these data - particularly those who rely on APIs and structured data exports - do not have to carry out needless extra processing to deal with issues that we can address once.

It seems to me that we are dealing here with a verbatim version as received from the source and that we then need to consider the appropriate way to handle and represent this in different contexts. It may be that a human readable view should just present whatever the source provided, maybe with an interpretative note. For data exports/downloads/APIs, we should avoid including avoidable noise in these channels. Computer software should be able to rely on direct string comparisons rather than having to preprocess the COL data to get it clean enough to compare.

So my view is that we need to have a clean version of these fields (without the extra author) in any context where there is an expectation of machines using the data.
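As a rough sketch of what keeping a verbatim and a clean version side by side could look like: the field names and the noise list below are assumptions made for illustration, based only on the examples in this issue, and are not the actual COL data model.

```python
from dataclasses import dataclass

# Assumed noise values, based on the examples mentioned in this issue.
NOISE_VALUES = {"author", "(author)", "author not stated"}


@dataclass(frozen=True)
class NameRecord:
    scientific_name: str
    verbatim_authorship: str  # exactly as supplied by the source

    @property
    def authorship(self) -> str:
        """Cleaned authorship for exports, downloads and APIs."""
        value = self.verbatim_authorship.strip()
        return "" if value.lower() in NOISE_VALUES else value


def same_name(a: NameRecord, b: NameRecord) -> bool:
    """Direct comparison on the cleaned fields, with no consumer-side preprocessing."""
    return (a.scientific_name, a.authorship) == (b.scientific_name, b.authorship)
```

A human-readable view could still show `verbatim_authorship`, while machine-facing channels would use `authorship`.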

yroskov commented 2 years ago

Do you believe machines cannot use the value (Author)?

yroskov commented 2 years ago

Actually, what is the big difference for a machine between Pterophorus xanthodactylus auct. [nec Treitschke, 1833] and Nymphaea gigantea f. hudsonii (Author)?

dhobern commented 2 years ago

Sorry - I misunderstood the examples (they were given as images without an explanation). And Tony's comments made me understand this as something different. I can see multiple issues that may need to be kept separate:

Random words like "author" should not appear as though they are part of the uninomial / binomial / trinomial itself. Any data fields we offer that claim not to include the author name should be as clean as possible. I thought this was what was being discussed.

Where comments are inserted in the name or authorship fields, we should aim to move these to appropriate other fields and tag the name with any relevant statuses.

We should have clear guidelines on how we expect misapplications of accepted names to be represented in terms of authorship fields and status. As far as possible, we should normalise these to make it easier for users and machines to filter them. I would love to know how to tag the Pterophorus xanthodactylus example so it is truly clear.

And we should not fool ourselves. Organising and tidying data is always a creative act if it is to be of any use, just as publishing and editing manuscripts does not mean taking the author's poorly formatted Word document and putting it between covers. Many incorrect names on the web are there because COL released them into the wild. We should not allow ourselves to be a channel for further pollution.

yroskov commented 2 years ago

Corrections should be made in the GSDs, in the source checklists, not in the CoL or GBIF. To achieve that, we need to establish appropriate IT infrastructure for feedback to GSD authors, and encourage their goodwill to do the work [via grants? proper credits? academic impact?].

dhobern commented 2 years ago

I agree that the changes need to be in the GSDs but there is a whole spectrum of tiny to major things that are easily intercepted at the COL level and prevented from propagating further. These should be logged for subsequent fixing but may be very hard to correct in whatever system the GSD uses or else may require more work than they can handle in any reasonable timeframe.
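To make the "intercept and log" idea concrete, here is a minimal sketch. All class and field names are hypothetical and only illustrate the workflow; nothing here reflects existing COL software. "Andromeda" is the made-up example name from earlier in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Correction:
    source: str      # e.g. "WoRMS"
    name: str        # scientific name the correction applies to
    field_name: str  # e.g. "authorship"
    original: str    # value as received from the source
    corrected: str   # value after the intercepting fix


@dataclass
class CorrectionLog:
    entries: list = field(default_factory=list)

    def intercept(self, source, name, field_name, original, corrected):
        """Return the corrected value, logging an entry only when it differs."""
        if original != corrected:
            self.entries.append(
                Correction(source, name, field_name, original, corrected)
            )
        return corrected

    def report_for(self, source):
        """List the corrections to feed back to one source database."""
        return [e for e in self.entries if e.source == source]
```

The per-source report is what would be sent back to each GSD, so the fix can eventually be made at the origin.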

bart-v commented 2 years ago

WoRMS cases fixed. Available in next COL export 2022-06-01

Bad habit of some editors. We'll spank them :)