CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

GLI, Global Lepidoptera Index (id 55434): test report #195

Open yroskov opened 2 years ago

yroskov commented 2 years ago

LepIndex ver. "2022-04-39 / 2022-04-30" on PROD: https://www.checklistbank.org/dataset/55434/classification

image

yroskov commented 2 years ago

Name and legal issues. “LepIndex” is identical to the name of the NHM dataset, and the full name “Global Lepidoptera Index” is very close to NHM's “The Global Lepidoptera Names Index” (https://www.nhm.ac.uk/our-science/data/lepindex/lepindex/). The two titles are almost identical for users. This will certainly create confusion in the information space: people will mix up LepIndex @ NHM with LepIndex @ TaxonWorks. An appropriate explanation of the new life for LepIndex data should appear on the NHM website before we include LepIndex @ TaxonWorks in CoL. 2022-05-06: "TaxonWorks Lepidoptera"

Version. “2022-04-39”, i.e. in the style of a date. Users will read this (as shown in the CoL/CLB interface) as two dates together, “2022-04-39 / 2022-04-30”, the first with a typo. Would you please consider changing the version style, for example to ver. “39, Apr 2022” or “2022-39” (if “39” is a linear iteration, not a typo)? 2022-05-06: "2022-04-30 / 2022-04-30"
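
The date-style concern can be checked mechanically. The following is a minimal Python sketch, not part of CLB; the function name `classify_version` and its three return labels are assumptions made for illustration. It reports whether a version string is styled like an ISO date and, if so, whether it is a real calendar date.

```python
import re
from datetime import date

# Matches strings shaped like YYYY-MM-DD, without checking calendar validity.
DATE_STYLE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def classify_version(version: str) -> str:
    """Return 'valid-date', 'invalid-date', or 'not-date-style'."""
    match = DATE_STYLE.match(version.strip())
    if match is None:
        return "not-date-style"
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for e.g. day 39
    except ValueError:
        return "invalid-date"
    return "valid-date"
```

A version such as “2022-04-39” is classified as `invalid-date`, which is exactly the case a user would read as a typo, while “2022-39” is not date-styled at all.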

Authors & editors. No names so far. Is there any delimitation between Creators, Editors & Contributors? Names from the original website: George Beccaloni, Malcolm Scoble, Ian Kitching, Thomas Simonsen, Gaden Robinson, Brian Pitkin, Adrian Hine & Chris Lyal. We definitely need names before we release the dataset as part of CoL.

Publisher. Do you consider Species Files Group to be the publisher of the dataset?

Abstract. I would suggest adding time intervals to two phrases to make the steps transparent: “LepIndex [which version?] was subsequently imported into TaxonWorks [in 20??-20??] and is now maintained as an online curated resource” and “Significant updates have been made [in 20??-20??] to some families…”. (It is important to know which version of LepIndex was imported into TaxonWorks. The NHM website says “Database last updated January 2018”; CoL used version 12.3 of Jan 2012. It would be good to know which data were imported into TW and when.)

yroskov commented 2 years ago

TASKS (2022-05-06) https://www.checklistbank.org/catalogue/3/dataset/55434/tasks image

yroskov commented 2 years ago

Examples: 3 genera Acanthobrahmaea, Brahmaeops, Brahmidia in the family Brahmaeidae https://www.checklistbank.org/dataset/55434/classification?taxonKey=234098 image

Subfamily Hibrildinae in the family Eupterotidae https://www.checklistbank.org/dataset/55434/classification?taxonKey=234110 image also: image

yroskov commented 2 years ago

[Hoplophanes] lithocolleta Turner, 1916 image

See family Incurvariidae with 9 unplaced species: image

Other examples:
[GENUS NOT SPECIFIED] brunnea Schaus, 1901
[GENUS NOT SPECIFIED] poltrona Schaus, 1910
image

Record [GENUS NOT SPECIFIED] (supergen. Ganisa) in the family Eupterotidae: https://www.checklistbank.org/dataset/55434/classification?taxonKey=235107 image
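
Names of this kind, with a bracketed genus or placeholder in front of the epithet, can be picked out of a name list with a simple pattern. This is a minimal Python sketch under stated assumptions: the pattern and the helper `split_bracketed_genus` are hypothetical and are not the search used in the CLB Workbench.

```python
import re
from typing import Optional, Tuple

# A leading "[...]" block followed by whitespace and the next token.
BRACKETED_GENUS = re.compile(r"^\[([^\]]+)\]\s+(\S+)")


def split_bracketed_genus(name: str) -> Optional[Tuple[str, str]]:
    """Return (bracket content, following token), or None if no leading bracket."""
    match = BRACKETED_GENUS.match(name.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For “[Hoplophanes] lithocolleta Turner, 1916” the helper returns the bracketed genus and the epithet; for a name without a leading bracket it returns `None`.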

yroskov commented 2 years ago

Other examples:

Alaria Duncan [& Westwood] , 1841
Aricia R.L. [Reichenbach] , 1817
Chloridea Duncan [& Westwood] , 1841

Acalyptris paradividua imkeviciute & Stonis, 2009

Synonyms with auctorum and nec, see below (yr: possibly Misapplied Names - check with Donald before final sync):
Alucita auctorum
Hypolamprus auctorum
Ligia auctorum
Cibyra (Philaenia) auctorum
Anops phaedrus (Boisduval, 1836 nec Drury, 1773)
Antithesia tripunctana (Stephens, 1834 nec Frölich, 1828)
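
Candidates for the Misapplied Names flag can be pre-selected by looking for the markers "auctorum", "nec" and "non" as whole words in the name string. This is only an illustrative Python sketch; the helper `looks_misapplied` is hypothetical, and every hit would still need an editorial check before a decision is applied.

```python
import re

# Whole-word, lower-case markers that usually indicate a misapplied name.
MISAPPLIED_MARKER = re.compile(r"\b(?:auctorum|nec|non)\b")


def looks_misapplied(name_with_authorship: str) -> bool:
    """True if the string contains 'auctorum', 'nec' or 'non' as a separate word."""
    return MISAPPLIED_MARKER.search(name_with_authorship) is not None
```

The match is case-sensitive on purpose, so that capitalised author names are less likely to trigger it.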

Full list of Unparsable Authorship https://www.checklistbank.org/dataset/55434/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&issue=unparsable%20authorship&limit=1000&offset=0&reverse=false

yroskov commented 2 years ago

ISSUES assessed 2022-05-26,27 image

yroskov commented 2 years ago

Sector(s), steps

yroskov commented 2 years ago

= FIXED 2022-05-31: flagged as Misapplied Names

yroskov commented 2 years ago

TASKS 2022-05-27 image

dhobern commented 2 years ago

The (nec) and (non) cases should all be flagged as misapplied names. Are there issues in the current LepIndex import that I should immediately attempt to fix?

mdoering commented 2 years ago

Subspecies Assigned, 151. For at least some of these names, CLB incorrectly parsed species names with a two-word epithet as subspecies. Examples: Agrias pseudo lesoudieri Le Moult, 1926; Agrias semi rileyi Michener; Cosmopteryx sancti vincentii Walsingham, 1892 https://www.checklistbank.org/catalogue/3/dataset/55434/workbench?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&issue=subspecies%20assigned&limit=100&offset=0 = NO ACTIONS IN CLB (@dhobern, should the solution be found in TW, or should Markus's script be adjusted?)

The verbatim data does place 2 words in the specificEpithet field, so we don't need to parse the names and the resulting name is just fine: https://api.checklistbank.org/dataset/55434/name/180232-0d0b0af472257a86e183be5385ee54e8

If we only had a scientificName string, the name parser provides the option of manual configurations for such cases. It is practically impossible to find a rule for these multi-word epithets. I am slightly surprised to see so many without hyphens.

But are we sure all these cases are indeed species? Agriades cuneo lunulata Tutt, 1909 or autonyms like Agriades cuneata cuneata Tutt, 1909. The issue flags potential errors and it would be good to check them all.

dhobern commented 2 years ago

AAAAAAARGH! These are a complete mess. I suspect these are all supposed to be hyphenated but they are not even properly discoverable for editing inside TaxonWorks. They are in a limbo state. These could probably all be ignored for now as ridiculously obscure synonyms.

dhobern commented 2 years ago

I may be able to get to these one at a time through one of the other TW interfaces. I'll try.

dhobern commented 2 years ago

Here is what I have found. The situations in which these names end up in the LepIndex dataset all seem to be based on names with original hyphens, which have mysteriously disappeared when importing into LepIndex. They include the hyphen in the older NHM import into COL and in the original index cards. For example:

https://www.nhm.ac.uk/our-science/data/lepindex/detail/?taxonno=57675&&snoc=nigro-plumbosa&search_type=starts&sort=snoc&indexed_from=1&page_no=1&page_size=30&path=search

TaxonWorks has taken these names with hyphenated epithet and:

  1. Filled the Verbatim name field with the epithet with a space where the hyphen should be.
  2. Used the section of the epithet following the hyphen in the main Name field
  3. (Apparently) assigned the Rank of species to every one of these regardless of whether the name was originally a specific or an infraspecific epithet. Most of them (but not all - that would be too easy!) were described as forms, aberrations or mutations.

An example of an aberration:

image

An example of a species:

image

The zoological code says that these names should be corrected to:

This is how the names should appear in COL, with the hyphenated forms as verbatim names.

dhobern commented 2 years ago

If every example was an aberration or form, I would suggest we simply automate updating them all to forms with complete epithets minus the spaces. However, some are clearly species. These are the ones that actually matter.

In COL today, these all appear, often nonsensically, as species names. If it is possible, the quickest temporary fix on the COL side may be to interpret all of these as if they still have the hyphen. This will leave the data in the same state as the current COL Checklist. Is that possible @mdoering ?

I can go through and steadily fix these names, but it will take some time. Since LepIndex does not know what rank they are (or therefore what the actual parent was), each one will need separate handling and interpretation to make them correct. It will take me time to go through and

dhobern commented 2 years ago

I'm going through these names in TaxonWorks, fixing as many as I can. I've started with the families with just a few cases. The butterfly families and Lasiocampidae have many because collectors obsessed about small differences.

The biggest problem is that doing anything piecemeal in LepIndex rapidly turns into a quicksand experience. For each of these cases where aberrations and forms that include hyphens are misinterpreted as species, there are usually also several where hyphen-free names are misinterpreted in the same way. And a depressing number of LepIndex records are simple misreadings of the text on the cards that have led to the wrong character strings being recorded as names, e.g.:

https://www.nhm.ac.uk/our-science/data/lepindex/detail/?taxonno=243555&&snoc=costimacula&search_type=starts&sort=snoc&indexed_from=1&page_no=3&page_size=30&path=search

https://www.nhm.ac.uk/our-science/data/lepindex/detail/?taxonno=62913&&snoc=albicauda-&search_type=starts&sort=snoc&indexed_from=1&page_no=1&page_size=30&path=search

Ah, well - every fix is a fix ...

dhobern commented 2 years ago

More on why LepIndex and GloBIS (GART) need much more work. 19 of the species identified above with a space in the verbatim specific epithet come from the same publication by Bright & Leeds, 1938. In fact, these 19 are only a tiny part of a total of 301 aberration names described for a single species in the same publication. Every single one of these is incorrectly shown in LepIndex as a species-rank name, and these don't come close to completing the full synonymy for the species in question. Behold how much taxonomist effort has been expended on Lysandra coridon: https://www.checklistbank.org/dataset/55434/taxon/281369

But, the most fun thing I've realised is that this over-worked species nevertheless does not appear at all in COL (even though there are >26,000 records of the species in GBIF). This is because COL imports the family Lycaenidae from GloBIS, even though the GloBIS site says that COL just takes the much more complete Pieridae and Papilionidae from that dataset. Big chunks of the Lycaenidae are therefore missing in COL. I think I need to recommend a switch to LepIndex rather than GloBIS for this family.

mdoering commented 2 years ago

In COL today, these all appear, often nonsensically, as species names. If it is possible, the quickest temporary fix on the COL side may be to interpret all of these as if they still have the hyphen. This will leave the data in the same state as the current COL Checklist. Is that possible @mdoering ?

During interpretation that is not possible I am afraid. But I guess we could update the current ColDP to apply those corrections. This would cut us off from TW updates though until these fixes are also applied there. Unless we can script it all and reapply - might be doable.

mdoering commented 2 years ago

Actually it is much simpler and more flexible if I allow the name interpreter to insert the hyphen. It will be a new feature, EPITHET_ADD_HYPHEN, that can be turned on dataset-wide on demand in the settings.
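
The effect of such a setting can be illustrated with a short Python sketch. This is not the ChecklistBank implementation: the functions `add_epithet_hyphen` and `build_binomial` and the `enabled` parameter are hypothetical stand-ins for whatever the interpreter does when EPITHET_ADD_HYPHEN is switched on.

```python
def add_epithet_hyphen(epithet: str, enabled: bool = True) -> str:
    """Join the words of a multi-word epithet with hyphens.

    `enabled` stands in for a dataset-wide setting; it is a hypothetical
    parameter, not the real ChecklistBank configuration.
    """
    words = epithet.split()
    if not enabled or len(words) < 2:
        return epithet  # single-word epithets and disabled datasets pass through
    return "-".join(words)


def build_binomial(genus: str, epithet: str, enabled: bool = True) -> str:
    """Assemble a binomial from a genus and a (possibly multi-word) epithet."""
    return f"{genus} {add_epithet_hyphen(epithet, enabled)}"
```

With the setting on, the verbatim epithet "sancti vincentii" is rendered as "sancti-vincentii". The sketch only restores the hyphen; it does not decide whether a name is really a species or an infraspecific form, which still needs checking case by case.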

dhobern commented 2 years ago

OK - if that works. I've managed to process about a third of them now in TaxonWorks and was going to keep plugging away at them tonight.

dhobern commented 2 years ago

I've carried on and processed them all now in TaxonWorks. I haven't worked out the species for which they would be aberrations or forms, but I have fixed all the hyphen/spaces.

@mdoering @yroskov Are there other critical aspects that should be fixed? Or should we reimport from TaxonWorks?

yroskov commented 2 years ago

Are there other critical aspects that should be fixed? Or should we reimport from TaxonWorks?

@dhobern, I would suggest doing a re-import. After that, I'll go ahead with the new version and resolve the TASKS.

The (nec) and (non) cases should all be flagged as misapplied names.

= FIXED in CLB 2022-05-31: flagged as Misapplied Names

yroskov commented 2 years ago

Task: all these accepted names need to be flagged as Provisionally Accepted via Workbench + RegEx Search. (Thousands of names. I am not able to get an exact figure, because the interface shows only the current and next page with 50 names per page; ~2,250 in the end.)

The task cannot be completed reliably due to reported problems: https://github.com/CatalogueOfLife/checklistbank/issues/1061. (1) Synonyms might be affected by an incorrect decision, (2) some names might be missed during the process, (3) the long list might not be completed (now completed up to page 25, [Macaria] zozinaria 46 (final?) [Xylina] ligniplena to [Zonosoma] prunelliaria) with search pattern [[A-Za-z]

yroskov commented 2 years ago

TASKS, ver 2022-04-30 / 2022-04-30, imported 2022-04-29 image

Resolved 2022-06-01 image

Synced (draft) 2022-06-10

yroskov commented 2 years ago

Tests of sync results:

Families Gelechiidae (in Gelechioidea), Papilionidae, Pieridae, Lycaenidae (in Papilionoidea), Gracillariidae (in Gracillarioidea), Tineidae (in Tineoidea) are in place. Anyway, re-synced 2022-06-01.

Preview release started 2022-06-01

yroskov commented 2 years ago

Import of the iteration 2022-06-02 (the same ver. 2022-04-30, as indicated in metadata)

TASKS image

Resolved 2022-06-02 image

Synced 2022-06-02

yroskov commented 2 years ago

@dhobern, 2022-06-06:

I've now processed around 30 papers on Gelechiidae from Zootaxa and one from Systematic Entomology, all from the period 2018-2022. I think this should cover the most significant additions since Klaus retired. It would therefore be great if you can reprocess the Gelechiidae as well as LepIndex.

Synced 2022-06-08

yroskov commented 2 years ago

Yes - please drop the empty genera. They exist for various reasons and are meaningless in COL terms, but will prove useful as we carry on cleaning up sections of the Global Lepidoptera Index, so I'd like to leave them in the underlying data source for now.

690 genera blocked (selected batch - blocked selected taxa).

Re-synced 2022-06-20.

yroskov commented 2 years ago

ver. 2022-06-30 / 2022-06-30 imported 2022-07-29

TASKS image

Resolved 2022-08-01: image

@dhobern, TASKS report "Identical genus" contains 76 names with the scientificName value "Group" and rank "Genus": https://www.checklistbank.org/catalogue/3/dataset/55434/duplicates?category=uninomial&limit=990&minSize=2&mode=STRICT&offset=0&rank=genus&status=accepted&withDecision=false image

I'll send you CSV file with the report.

Plus, 293 binomial combinations with the word “Group” as the genus.

= FIXED in the import 2022-08-02
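
A check for the "Group" placeholder can be run over an exported name list before the next import. This is a minimal Python sketch: it assumes each record is a dictionary with `scientificName` and `rank` keys, and the helper `find_group_placeholders` is hypothetical rather than part of CLB.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def find_group_placeholders(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split out records whose first name word is the placeholder 'Group'.

    Returns (genus-rank records named exactly 'Group',
             multi-word names that use 'Group' in the genus position).
    """
    genera: List[Record] = []
    combinations: List[Record] = []
    for record in records:
        words = record["scientificName"].split()
        if not words or words[0] != "Group":
            continue
        if len(words) == 1 and record["rank"] == "genus":
            genera.append(record)
        elif len(words) >= 2:
            combinations.append(record)
    return genera, combinations
```

The two returned lists correspond to the two symptoms reported in the TASKS: uninomials named "Group" with rank genus, and combinations that use "Group" as the genus.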

yroskov commented 2 years ago

ver. 2022-06-30 / 2022-06-30 imported 2022-08-02

TASKS image

image

Synced 2022-08-02

yroskov commented 1 year ago

Global Lepidoptera Index 2022-10-19 / 2022-10-19 (check version & date - need to be updated; metadata locked for me; = UPDATED); synced 2022-10-20

yroskov commented 1 year ago

Global Lepidoptera Index 2022-10-19 / 2022-10-19; imported 2022-11-28

Donald, 2022-11-28: The GLI includes a number of new species.

Synced 2022-11-28

image

ISSUES

image

TASKS

image

Resolved 2022-12-01

image

Synced 2022-12-01

yroskov commented 1 year ago

Global Lepidoptera Index 0.32.1 / 2023-04-03; imported 2023-04-03

Synced 2023-04-03

yroskov commented 1 year ago

Global Lepidoptera Index 0.32.2 / 2023-04-08; imported 2023-04-08

@dhobern, 2023-04-07: The GLI changes include complete updating of the tribe Micronoctuini (Noctuoidea: Erebidae: Hypenodinae), complete updates for several small families of fewer than 20 species, and modernisation of the taxonomy for Gelechioidea to reflect recent classifications.

image

ISSUES assessed 2023-04-10

image

TASKS

image

and

https://www.checklistbank.org/dataset/55434/taxon/440089 https://www.checklistbank.org/dataset/55434/taxon/443079

Resolved 2023-04-10:

image

Synced 2023-04-10

yroskov commented 1 year ago

Global Lepidoptera Index 0.32.3 / 2023-05-10; imported 2023-05-10

image

TASKS

image

Resolved 2023-05-19:

image

Synced 2023-05-19

yroskov commented 1 year ago

Global Lepidoptera Index 0.33.1 / 2023-05-30; imported 2023-05-31

Donald, 2023-06-01: several new papers processed for Gelechiidae and around 100 more species in GLI.

TASKS

image

Resolved 2023-06-02:

image

Synced 2023-06-02

yroskov commented 1 year ago

Global Lepidoptera Index 0.33.1 / 2023-07-16; imported 2023-07-17

image

TASKS

image

Resolved 2023-07-17:

image

Synced 2023-07-17

yroskov commented 1 year ago

Global Lepidoptera Index 0.33.1 / 2023-07-27; imported 2023-08-06

image

TASKS

image

Resolved 2023-08-11:

image

Synced 2023-08-11

yroskov commented 1 year ago

@dhobern,

yroskov commented 1 year ago

GLI 0.34.1 / 2023-08-15 re-synced 2023-08-17

yroskov commented 12 months ago

Global Lepidoptera Index 0.34.2 / 2023-09-05; imported 2023-09-06

TASKS

image

Resolved 2023-09-06:

image

Synced 2023-09-06

image

re-synced 2023-09-12

yroskov commented 10 months ago

Global Lepidoptera Index 1.1.23.269 / 2023-09-26; imported 2023-09-26

image

ISSUES assessed 2023-10-12:

image

TASKS

image

Resolved 2023-10-12:

image

Comments: SYN-SYN species (different accepted, same authors) 549 of 551: image

Synced 2023-10-12

yroskov commented 10 months ago
yroskov commented 9 months ago

Global Lepidoptera Index 1.1.23.316 / 2023-11-12; imported 2023-11-12

image

Synced 2023-11-13