CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

Systema Dipterorum (id 1101): test report #127

Open yroskov opened 3 years ago

yroskov commented 3 years ago

Version 3.1 received 2021-05-12. Imported to prod: https://data.catalogueoflife.org/dataset/1101/about

(previous reports are in https://github.com/CatalogueOfLife/testing/issues/6)

As a result, the assembly tree looks like this:

image

yroskov commented 3 years ago

ISSUES

Unusual Name Characters, 25 https://data.catalogueoflife.org/dataset/1101/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&issue=unusual%20name%20characters&limit=50&offset=0

Culex (Culcio0myia) cheni Dong, Wang & Lu, 2003 -> subgenus corrected as Culiciomyia in CoL

Microcephalops (') conspectus (Hardy, 1949) -> Microcephalops conspectus in CoL

Parsed Name Differs, 39 https://data.catalogueoflife.org/dataset/1101/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&issue=parsed%20name%20differs&limit=50&offset=0

Indetermined, 227 - names with "sp." blocked in CoL https://data.catalogueoflife.org/dataset/1101/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&issue=indetermined&limit=50&offset=0

Missing Genus, 3358 - all names blocked in CoL

image

Names Acerocnema macrocera (Meigen, 21826) - blocked & Acerocnema macrocera (Meigen, 1826)

yroskov commented 3 years ago

TASKS

image

ACC-ACC species (different authors)

image

image

image

image

image

image

image

image

image

Resolved 2021-05-17

image

yroskov commented 3 years ago

Synced with 4 blocked families in the assembly tree, 2021-05-17

4 blocked families need to be confirmed after the sync: confirmed, OK.

yroskov commented 3 years ago

See https://github.com/CatalogueOfLife/data/issues/273

Both the species and its parent genus are blocked in CoL. Systema Dipterorum re-synced 2021-06-14.

yroskov commented 3 years ago

Reported by @olafbanki 2021-06-16:

On Diptera I have a question. I see quite some family names that closely resemble each other (Archisargidaae & Archisargidae) and where duplicate genera exist (e.g. Archirhagio). I attach a screen shot from catalogueoflife.org as example. Looks like there are some data quality issues. What is your take on this?

In CoL: Archisargidaae "taxon blocked" in assembly tree 2021-06-17. Synced.

yroskov commented 3 years ago

Reported by @olafbanki 2021-06-16:

In addition the family Archisargidae is extinct, but it does not have a flag.

We agreed on the following action plan: (1) modify the script to take the available values from the Epoch field of Systema Dipterorum V3.1_2021-05-12 and fill the start & end periods in CoL (where necessary) for 4K accepted species; (2) apply the flag “extinct” to species with non-empty values in the Epoch field.
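A minimal sketch of step (2), assuming the SD export is read as a list of row dictionaries with an `Epoch` key. The function name, the `extinct` key and the sample rows are hypothetical and only illustrate the rule "non-empty Epoch means extinct"; this is not the actual conversion script.

```python
def flag_extinct(rows, epoch_field="Epoch"):
    """Return copies of the rows with an 'extinct' flag derived from the Epoch field."""
    flagged = []
    for row in rows:
        # Treat a missing value, None, or whitespace-only text as "empty".
        epoch = (row.get(epoch_field) or "").strip()
        new_row = dict(row)
        new_row["extinct"] = bool(epoch)
        flagged.append(new_row)
    return flagged


# Hypothetical sample rows, not taken from the SD export.
sample = [
    {"name": "Aus bus", "Epoch": "Jurassic"},
    {"name": "Aus cus", "Epoch": ""},
    {"name": "Aus dus", "Epoch": None},
]
for row in flag_extinct(sample):
    print(row["name"], row["extinct"])
```

Filling the start & end periods in step (1) would need a mapping from Epoch values to periods, which is not shown here.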

yroskov commented 3 years ago

V3.1_2021-05-12 synced 2021-06-18

Preview 2021-06-18 https://preview.catalogueoflife.org/ looks good. New spp stats: 145933 extant & 3759 extinct spp.

yroskov commented 3 years ago

ITIS offers global checklist for Culicidae family: https://github.com/CatalogueOfLife/testing/issues/8#issuecomment-872345001

Response from SD: keep Culicidae from SD.

yroskov commented 3 years ago

A set of correctly assigned species binomials is placed under an incorrect genus (a false parent) in the classification: https://github.com/gbif/checklistbank/issues/187 see also CatalogueOfLife/backend#1052

Example: image

A simple re-sync did not fix the problem.

Two hypotheses (@gdower):

(1) Option "Union", which was used for the sector Diptera, may cause a problem. = No

(2) Name Trigonometopus (Culex) canus (as in a dataset) may cause a problem. = Yes

Generalised problem: the same subgenus name may appear in different genera in SD:

Experiment 1: do not use "Union" in assembly. Steps to repair via re-assembly of sectors:
(1) delete all sectors in Diptera
(2) delete entire Diptera subtree
(3) establish order Diptera from SD
(4) block 4 families (which we'll take from CCW): Cylindrotomidae, Limoniidae, Pediciidae, Tipulidae
(5) add 4 CCW families as sectors in Diptera (skip superfamily Tipuloidea as a rank between family & order)
(6) sync SD and CCW

Bad news: the above steps did not fix the problem:

image

Experiment 2: fix "incorrect" name.
Steps to repair via complex decision over species binomial:
(1) complex decision Trigonometopus (Culex) canus --> Trigonometopus canus
(2) re-sync SD

Result successful: subgenus Culex is correctly placed in genus Culex image image

So the sync process is misinterpreting the original placement of "homonymic" subgenera.
The source dataset may contain a mistake (i.e. an incorrect subgenus in a set of species from another genus), but it may also use homonymic subgenera on purpose. I would expect the software to generate a report on detected problems for GSD authors, but also to carry over the placement of a subgenus in a genus as it occurs in the original dataset.
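As an illustration of the kind of report meant here, the sketch below lists subgenus names that occur under more than one genus. It assumes names are plain strings of the form "Genus (Subgenus) epithet"; the function name and the input list are illustrative, and this is not how ChecklistBank detects the problem.

```python
import re
from collections import defaultdict

# Matches "Genus (Subgenus) epithet ..." at the start of a name string.
NAME_RE = re.compile(r"^([A-Z][a-z]+) \(([A-Z][a-z]+)\) ([a-z-]+)")


def subgenera_in_multiple_genera(names):
    """Map each subgenus to the genera it appears in, keeping only conflicts."""
    genera_by_subgenus = defaultdict(set)
    for name in names:
        match = NAME_RE.match(name)
        if match:
            genus, subgenus, _epithet = match.groups()
            genera_by_subgenus[subgenus].add(genus)
    return {
        subgenus: sorted(genera)
        for subgenus, genera in genera_by_subgenus.items()
        if len(genera) > 1
    }


# Illustrative input; the second name is added only to create the conflict.
names = [
    "Trigonometopus (Culex) canus",
    "Culex (Culex) pipiens",
    "Culex (Culiciomyia) cheni",
]
print(subgenera_in_multiple_genera(names))
# {'Culex': ['Culex', 'Trigonometopus']}
```

Such a list only flags candidates: whether a repeated subgenus is a mistake or intentional still has to be decided by the GSD authors.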

yroskov commented 3 years ago

Report for Identical Subgenus contains 743 names: image

CoL cannot resolve all cases on our end.

yroskov commented 3 years ago

Another strange case of subgenus interpretation in the hierarchy: four species with assigned subspecies in the name have been placed in NotAssignedSubgenus node: image

Can it be addressed in the ChecklistBank code, @mdoering? (@gdower)

yroskov commented 3 years ago

Subgenus issues fixed in the code.

SD synced 2021-09-14.

2021-09-16: it looks like the technical problem is resolved and species are placed in the correct genera. Waiting for a new version of the checklist from the SD team without "homonymic" subgenera in different genera.

yroskov commented 2 years ago

Version 3.6 received 2022-02-14

Imported to DEV https://data.dev.catalogueoflife.org/dataset/1101/classification

yroskov commented 2 years ago

Checks of the view 2022-02-18 (few notes)

image

In the source (SPECIES table):

Family | Full Name Line | Full Name Line Range | Full Species Line
Chironomidae | Chironomus mixtus Holmgren, 1869: 45. TL: Norway. Bear I. (HT F NRS). | Chironomus mixtus Holmgren, 1869: 45. TL: Norway. Bear I. (HT F NRS). | Orthocladius (Orthocladius) mixtus. (PA: PA)

*) image

CONCLUSION: set of species in these fam & gen have no parent families in the source file (blank values). CLB INTERPRETATION IS CORRECT

Checked few against the source (SPECIES):

Ceratopogoninae have 3 parent families, plus blank family with Serromyia errata: image

Chironominae have 2 parent families, plus blank family with 3 spp Nandeva pudens, Parachironomus inageheus, Polypedilum (Polypedilum) xianjuensis : image

Chironomiinae (is it different from the above? - check with Neal) have 1 parent family, Chironomidae, plus blank family with species Yaeprimus balteatus: image

Clitellariinae have 2 parent families, plus blank family with species Adoxomyia hasbenlii image
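The checks above can be sketched as a small script over the SPECIES table. This is only an illustration: the `Family` column comes from the source table, while the `Subfamily` and `Species` keys, the function name and the first sample row are assumptions.

```python
from collections import defaultdict


def parent_families(rows):
    """Return (families, blanks): parent families seen per subfamily,
    and species per subfamily whose Family value is blank."""
    families = defaultdict(set)
    blanks = defaultdict(list)
    for row in rows:
        subfamily = row["Subfamily"]
        family = (row.get("Family") or "").strip()
        if family:
            families[subfamily].add(family)
        else:
            blanks[subfamily].append(row["Species"])
    return dict(families), dict(blanks)


rows = [
    # Illustrative rows only; "Aus bus" is a placeholder species name.
    {"Family": "Chironomidae", "Subfamily": "Chironomiinae", "Species": "Aus bus"},
    {"Family": "", "Subfamily": "Chironomiinae", "Species": "Yaeprimus balteatus"},
]
families, blanks = parent_families(rows)
print(families)
print(blanks)
```

A subfamily with more than one entry in `families`, or any entry in `blanks`, is a candidate for reporting back to the source.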

mdoering commented 2 years ago

Species with uncertain placement in the genus: the original genus goes in square brackets, as agreed with Neal. (Previously, such names had a question mark in the genus field, e.g. ? stercoreus.) How does CLB deal with such names?

I don't think it is a good idea to feed names with square bracket embraced genera into CLB to indicate uncertain placement. This is a very specific convention for SD only and not known to anyone nor the system itself.

We should try to change those names and rather follow the guidelines of ColDP, where we have discussed this problem and how to deal with it in a consistent way so both CLB and other users understand the data correctly.

Looking at the verbatim Taxon record of that example I see various problems:

In the linked Name record I would suggest to simply remove the square brackets.

yroskov commented 2 years ago

Damm! New url: https://www.dev.checklistbank.org/dataset/1101/classification

yroskov commented 2 years ago

In the linked Name record I would suggest to simply remove the square brackets.

I am comfortable with presentation of an original genus in square brackets where a new placement in a genus is not resolved yet. If CLB allows search for names with square bracket, I'll be happy to mark these accepted names as Provisionally Accepted in the CoL. See: https://github.com/CatalogueOfLife/backend/issues/1112

mdoering commented 2 years ago

Square brackets around genera are not something we support at this stage. They will be considered bad data and are likely to have impacts down the line when we assemble COL, e.g. when we make sure to have a genus record for every accepted species. Don't be surprised if you find new genera with brackets in COL.

yroskov commented 2 years ago

Version 3.6 (2022-02-14) imported to the PROD 2022-05-17

yroskov commented 2 years ago

ISSUES assessed 2022-05-17

image

yroskov commented 2 years ago

TASKS as 2022-05-17 image

!Remember!
ACC=ACC sp (diff auth): all names with genus in square brackets = Prov Acc
? what to do with names with authorstrings without year (they may have different synonyms - keep?)

image

yroskov commented 2 years ago

Version 3.6 (2022-02-14), new crawl iteration imported to the PROD 2022-05-19

yroskov commented 2 years ago

ISSUES assessed 2022-05-19 (many previous decisions remain in place) image

yroskov commented 2 years ago

Investigating bare names, 8,455 https://www.checklistbank.org/catalogue/3/dataset/1101/workbench?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&limit=100&offset=0&status=bare%20name

? ambigua Pankratova, 1950 = ok
? arcudae Botnariuc, 1956 = ok
? delicatula Botnariuc & Cure, 1956 = ok

However, many names become "bare" for reasons that are not yet clear:
Mesembrinella dorsimacula Aldrich, 1922, it is (Available, Valid) Current Status
Komisca nanensis (Chaiwong, Sukontason & Sukontason, 2009), it is (Available, Valid) Changed Combination / Rank
Abago rohdendorfi Grunin, 1966 (Available, Invalid) Junior SECONDARY Homonym

yroskov commented 2 years ago

TASKS as 2022-05-19 image

Resolved 2022-05-19: image

yroskov commented 2 years ago

NEW SEARCH OPTION IN Workbench @CLB: RegEx Search (Regular Expression Search) image

https://github.com/CatalogueOfLife/testing/issues/197

yroskov commented 2 years ago

Crawl iteration with pre-flagged "prov acc" names imported 2022-05-19 & 20. (The main problem: the Tasks page failed to display in CLB (spinning progress forever), and the Classification page either also failed or was too slow - multiple imports. = now resolved)

yroskov commented 2 years ago

TASKS as 2022-05-20 (all previous decisions re-applied successfully, new decisions added)

image

Synced 2022-05-20

yroskov commented 2 years ago

Checking Systema Dipterorum v 3.6 data in the portal, I have found that a lot of provisionally accepted species have been interpreted as synonyms of the parent family after a sync. https://github.com/CatalogueOfLife/backend/issues/1150

yroskov commented 2 years ago

Systema Dipterorum 3.8, May 2022 received 2022-05-22; GO converted & imported to DEV for further tests 2022-05-25: https://www.dev.checklistbank.org/dataset/77757/imports

yroskov commented 2 years ago

image

But I cannot find a single record with question mark in GO spreadsheet.

= see Issue Missing Genus, 3125

mdoering commented 2 years ago

A question mark is generated when a species is missing the genus, e.g. abbreviata only has the species epithet: https://www.dev.checklistbank.org/dataset/77757/verbatim/300284

You can spot all those 3000-something names without a genus by their issue, which is a red one to be taken seriously: https://www.dev.checklistbank.org/dataset/77757/verbatim?issue=missing%20genus

Also note that there are some other very serious ID problems, e.g. NameID Invalid: https://www.dev.checklistbank.org/dataset/77757/issues
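To make the question-mark behaviour concrete, here is a small sketch of how a display name ends up with "?" when the genus is empty, and how such records can be listed. The field names `genus` and `specificEpithet` and both function names are assumptions for illustration; this is not the ChecklistBank implementation.

```python
def display_name(genus, epithet, placeholder="?"):
    """Build a binomial, substituting a placeholder when the genus is missing."""
    genus = (genus or "").strip()
    return f"{genus or placeholder} {epithet}"


def missing_genus(records):
    """Return the records that have a species epithet but no genus."""
    return [
        rec for rec in records
        if rec.get("specificEpithet") and not (rec.get("genus") or "").strip()
    ]


records = [
    {"genus": "", "specificEpithet": "abbreviata"},
    {"genus": "Lonchaea", "specificEpithet": "discrepans"},
]
for rec in records:
    print(display_name(rec["genus"], rec["specificEpithet"]))
print(len(missing_genus(records)))
```

So the question mark will not be found in the spreadsheet itself; the thing to search for there is an empty genus cell.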

yroskov commented 2 years ago

Systema Dipterorum 3.8, May 2022 received 2022-05-22 (continue); imported to prod 2022-06-10

ISSUES assessed 2022-06-13 image

yroskov commented 2 years ago

TASKS image

Resolved 2022-06-13 image

Synced 2022-06-13

yroskov commented 2 years ago

Reported problems by GBIF (2022-08-31):

Neal: I see no problem with this record. It is unplaced within Asilidae, so we have done as you instructed and put the original genus in square brackets in the Present genus field. The record says it was revised on 7 May 2022, so it should not have shown up as “Asilidae cristatus” in CoL (or GBIF). I would think it should have shown up as “[Asilus] cristatus”.

YR: probably, a problem with interpretation on our side: https://github.com/CatalogueOfLife/data/issues/455#issuecomment-1234451772

Neal: This species (vicininervis Macquart 1844) has been corrected to place Opomyza in square brackets in the Present genus field -- as it is an unplaced species in Tephritidae.

In the export file, ver. 3.8_2022-05-20, table species Family field Tephritidae - @gdower, no mistake in our interpretation, as I can see. It should disappear with a new update.

Neal: I can’t find this problem in SD. All our records with the name “Lonchaea” map to Lonchaeidae.

YR: Only one species Lonchaea discrepans placed in the family Tachinidae in the CoL.

In the export file, ver. 3.8_2022-05-20, table species, it is in the family Lonchaeidae: https://github.com/CatalogueOfLife/data/issues/457#issuecomment-1234529319

YR: I cannot understand yet, what caused misplacement in the family.

Neal: Another one that is correct in SD. All our Rhagio and Rhagioniinae records map to Rhagionidae. Not sure how this got into CoL as Tabanidae as there are 206 records affected!

YR: Very unclear case to me. In the export file, ver. 3.8_2022-05-20, table species, I did not find(!) genus Rhagio in the family Tabanidae.

However, there are 206 species in the CoL portal, where genus Rhagio is placed in the family Tabanidae: https://www.catalogueoflife.org/data/search?TAXON_ID=8YQML&rank=species&status=accepted&status=provisionally%20accepted

See more at https://github.com/CatalogueOfLife/data/issues/458#issuecomment-1234558730

yroskov commented 1 year ago

Systema Dipterorum 3.10, Sep 2022, received 2022-09-30; imported to prod 2022-10-06

yroskov commented 1 year ago

image

@gdower, perhaps it makes sense to drop non-basic ranks in the SD conversion. It may reduce the noise caused by unresolved placements.

yroskov commented 1 year ago

ISSUES assessed 2022-10-12 image

yroskov commented 1 year ago

TASKS image

image

Synced 2022-10-25

mdoering commented 1 year ago

ISSUES assessed

Quite a lot of serious invalid IDs and duplicate IDs. Please investigate the cause; this has the potential to create lots of problems.

yroskov commented 1 year ago

@gdower, could you please have a look at the "technical" issues among those highlighted by @mdoering:
Id Not Unique, 135
Accepted Id Invalid, 725
Name Id Invalid, 8934
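A rough sketch of the kind of pre-import check that could catch these three issues in the conversion output. The record layout (names with `ID`; taxa with `ID` and `nameID`; synonyms with `ID` and `taxonID`) is an assumption loosely modelled on ColDP, and the function is hypothetical, not part of the conversion script or of CLB.

```python
from collections import Counter


def check_ids(names, taxa, synonyms):
    """Report duplicate taxon IDs and references to IDs that do not exist."""
    name_ids = {n["ID"] for n in names}
    taxon_counts = Counter(t["ID"] for t in taxa)
    taxon_ids = set(taxon_counts)
    return {
        # The same taxon ID used by more than one record.
        "id not unique": sorted(i for i, c in taxon_counts.items() if c > 1),
        # Taxa pointing at a name ID that is not in the name table.
        "name id invalid": sorted(
            t["ID"] for t in taxa if t["nameID"] not in name_ids
        ),
        # Synonyms pointing at an accepted taxon ID that does not exist.
        "accepted id invalid": sorted(
            s["ID"] for s in synonyms if s["taxonID"] not in taxon_ids
        ),
    }


# Hypothetical records for illustration.
names = [{"ID": "n1"}, {"ID": "n2"}]
taxa = [
    {"ID": "t1", "nameID": "n1"},
    {"ID": "t1", "nameID": "n2"},
    {"ID": "t2", "nameID": "n9"},
]
synonyms = [{"ID": "s1", "taxonID": "t1"}, {"ID": "s2", "taxonID": "t7"}]
print(check_ids(names, taxa, synonyms))
```

Running a check like this on the export before import would show whether the problems originate in the source tables or in the conversion.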

yroskov commented 1 year ago

GO & YR 2022-11-14: decisions in RegEx Search might be set up, but not displayed in the interface.

Experiment of 2022-11-14: It seems there are 715 genera in square brackets in total:

1st page of 500 per page: [Ablabesmyia] - [Phoraea]
2nd page: [Phorbia] - [Zygoneura]

Prov Acc status applied to all 715 genera in square brackets. No progress shown. No decisions appear in the report. However, decisions are shown in Project-CoL-Decisions (see mode = update): image

Synced 2 hours later, 2022-11-14

Bracketed genera checked in PREVIEW 2022-11-15:

Archalia - 1 in the RegEx report, 0 in the PREVIEW
Actia - 1 in the report, 1 from SD in the PREVIEW, accepted
Bibio - 4 in the report, 2 in the PREVIEW, both marked as prov acc
Ceroxys - 2 in the report, 1 in the PREVIEW, accepted
Dinera - 1 in the report, 1 in the PREVIEW, accepted
Lydella - 1 in the report, 1 in the PREVIEW, accepted
Voriella - 1 in the report, 1 in the PREVIEW, accepted

There are only 19 ProvAcc genera in Diptera in the PREVIEW.

No bracketed genera in the PREVIEW: either all brackets were removed or, more probably, bracketed genera did not pass to the final product. @gdower, could this be related to the issue of broken parent-child relationships, invalid and duplicated ids?

Anyway, experiment of 2022-11-14 did not change number of accepted species in SD@CoL.
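For reference, a sketch of the kind of pattern that picks out names with a genus in square brackets, run here offline over a plain list of name strings. The pattern and function name are illustrative; the exact regular-expression dialect of the Workbench RegEx Search is not assumed to be the same as Python's.

```python
import re

# A name that starts with a genus in square brackets, e.g. "[Asilus] cristatus".
BRACKETED_GENUS = re.compile(r"^\[([A-Z][a-z]+)\]\s")


def bracketed_genera(names):
    """Return the sorted, de-duplicated genera that appear in square brackets."""
    found = set()
    for name in names:
        match = BRACKETED_GENUS.match(name)
        if match:
            found.add(match.group(1))
    return sorted(found)


names = [
    "[Asilus] cristatus",
    "[Opomyza] vicininervis",
    "Lonchaea discrepans",
    "[Asilus] cristatus",
]
print(bracketed_genera(names))
# ['Asilus', 'Opomyza']
```

Comparing such a list from the source export with the genera present in the PREVIEW would show which bracketed genera were dropped on the way.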

yroskov commented 1 year ago

Experiment of 2022-11-16:

Plan: remove brackets from species names, give them ProvAcc status before the import in CLB. Imported 2022-11-16
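The plan amounts to a simple transformation of each record before import. A minimal sketch, assuming records are dictionaries with `scientificName` and `status` keys (both key names and the function are assumptions, not the actual conversion code):

```python
import re

# Leading genus in square brackets, e.g. "[Asilus] cristatus".
BRACKETED = re.compile(r"^\[([^\]]+)\]")


def unbracket(record):
    """Strip brackets from a leading bracketed genus and mark the record
    as provisionally accepted; other records are returned unchanged."""
    name = record["scientificName"]
    if not BRACKETED.match(name):
        return record
    cleaned = dict(record)
    cleaned["scientificName"] = BRACKETED.sub(r"\1", name, count=1)
    cleaned["status"] = "provisionally accepted"
    return cleaned


print(unbracket({"scientificName": "[Asilus] cristatus", "status": "accepted"}))
print(unbracket({"scientificName": "Lonchaea discrepans", "status": "accepted"}))
```

Only names with a bracketed genus change status; everything else keeps the status it had in the source.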

TASKS remain unchanged (i.e. resolved)

Synced 2022-11-16

Results in the PREVIEW 2022-11-18:

Previously bracketed genera checked:

Genus | RegEx report 2022-11-14 (as bracketed genus) | PREVIEW 2022-11-15 | PREVIEW 2022-11-18
Archalia | 1 | 0 | 1, accepted
Actia | 1 | 1 from SD, accepted | 1 from SD, accepted
Bibio | 4 | 2, prov acc | 5, of them 4 accepted, 1 prov acc
Ceroxys | 2 | 1, accepted | 3, of them 3 accepted
Dinera | 1 | 1, accepted | 2, of them 1 accepted, 1 prov acc
Lydella | 1 | 1, accepted | 2, of them 2 accepted
Voriella | 1 | 1, accepted | 2, of them 1 accepted, 1 prov acc

yroskov commented 1 year ago

Misspellings reported to Neal, 2022-12-02:

Diptera> Ceratopogonidaae vs Ceratopogonidae
Diptera> Cedratopogonidae
Diptera> ceratopogonidae
Diptera> Phoridase vs Phoridae
Diptera> Sphaerroceridae vs Sphaeroceridae
Diptera> Mycxetophilidae vs Mycetophilidae
Diptera> Dolichoposidae vs Dolichopodidae
Diptera> Limoniidae> Limoninae vs Limoniinae
Diptera> Liomniidae vs Limoniidae
Diptera> Tephritidae> Tryptetinae vs Trypetinae
Diptera> Cecidomyiidae> Porrocondylinae vs Porricondylinae
Diptera> Cecidomyiidae> Porriconylinae vs Porricondylinae
Diptera> Mycetophilidae> Leeinae vs Leiinae
Diptera> Cecidomyiidae> Cercidomyiinae vs Cecidomyiinae
Diptera> Chironomidsae vs Chironomidae
Diptera> Stratiomyidae> Stratiomyiinae vs Stratiomyinae
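Misspellings of this kind can be pre-screened automatically. The sketch below uses Python's standard `difflib` to list pairs of higher-rank names that are nearly identical; the 0.9 threshold and the function name are arbitrary choices for illustration, and this is not how the list above was produced.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        # Compare case-insensitively so "ceratopogonidae" pairs with "Ceratopogonidae".
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


families = ["Phoridae", "Phoridase", "Chironomidae", "Chironomidsae", "Tipulidae"]
print(near_duplicates(families))
# [('Chironomidae', 'Chironomidsae'), ('Phoridae', 'Phoridase')]
```

The output is only a list of candidates: legitimately different names that differ by a letter or two will also be flagged, so each pair still needs a manual check before reporting.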

yroskov commented 1 year ago

Systema Dipterorum 3.10, Sep 2022

TASKS 2022-12-02

image

Re-synced 2022-12-02

yroskov commented 1 year ago

2023-05-16: SD is in update4.2/revert3.10 process. (checklist reverted back to 3.10, metadata 4.2). Must not be synced until repair!

yroskov commented 1 year ago

Systema Dipterorum 4.2, May 2023, received 2023-05-13; imported to prod 2023-05-17

image

ISSUES

image

yroskov commented 1 year ago

SD 4.2, May 2023

TASKS

image

Not synced

yroskov commented 1 year ago

Systema Dipterorum 4.2.2, May 2023, received 2023-05-27; imported to prod 2023-05-30

image

TASKS

image

Resolved 2023-06-12:

image

Synced 2023-06-12 (without rank subgenus)

yroskov commented 1 year ago

2023-06-15: temporary names such as *FChironominae (start as *F) deleted as a node (“taxon”) in Assembly - Draft. All children attached to the next parent. Sync is not involved (i.e. such names will be back with next sync).