Open yroskov opened 3 years ago
ISSUES
Culex (Culcio0myia) cheni Dong, Wang & Lu, 2003 -> subgenus corrected as Culiciomyia in CoL
Microcephalops (') conspectus (Hardy, 1949) -> Microcephalops conspectus in CoL
Indetermined, 227 - names with "sp." blocked in CoL https://data.catalogueoflife.org/dataset/1101/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&issue=indetermined&limit=50&offset=0
Missing Genus, 3358 - all names blocked in CoL
Names Acerocnema macrocera (Meigen, 21826) - blocked & Acerocnema macrocera (Meigen, 1826)
TASKS
ACC-ACC species (different authors)
Resolved 2021-05-17
Synced with 4 blocked families in the assembly tree, 2021-05-17
4 blocked families need to be confirmed after the sync: confirmed, OK.
See https://github.com/CatalogueOfLife/data/issues/273
Both, species and its parent genus, are blocked in CoL. Systema Dipterorum re-synced 2021-06-14.
Reported by @olafbanki 2021-06-16:
On Diptera I have a question. I see quite some family names that closely resemble each other (Archisargidaae & Archisargidae) and where duplicate genera exist (e.g. Archirhagio). I attach a screen shot from catalogueoflife.org as example. Looks like there are some data quality issues. What is your take on this?
In CoL: Archisargidaae "taxon blocked" in assembly tree 2021-06-17. Synced.
Reported by @olafbanki 2021-06-16:
In addition the family Archisargidae is extinct, but it does not have a flag.
We agreed on the following action plan: (1) modify the script to take the available values from the Epoch field of Systema Dipterorum V3.1_2021-05-12 and fill in the start & end periods in CoL (where necessary) for 4K accepted species; (2) apply the flag “extinct” to species with non-empty values in the Epoch field.
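A minimal sketch of step (2) of that plan, assuming each species record is a plain dict with an `epoch` key. The record layout, the function name and the sample names are hypothetical and are not taken from the actual conversion script:

```python
def apply_extinct_flag(records):
    """Return copies of the records with an 'extinct' flag derived from 'epoch'.

    Assumption: a non-empty Epoch value means the species is extinct.
    """
    flagged = []
    for rec in records:
        # Treat missing, None and whitespace-only values as empty.
        epoch = (rec.get("epoch") or "").strip()
        flagged.append({**rec, "extinct": bool(epoch)})
    return flagged


# Hypothetical sample records, for illustration only.
sample = [
    {"name": "Hypothetical fossilis", "epoch": "Jurassic"},
    {"name": "Hypothetical vivens", "epoch": ""},
    {"name": "Hypothetical ignotus"},
]
flagged = apply_extinct_flag(sample)
```

Only the first sample record ends up flagged as extinct; the original records are left unmodified.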
[x] In the Tree "Chironomus mixtus Holmgren, 1869: 45. TL: Norway." as a family. BLOCKED.
[x] ISSUES
[x] TASKS
V3.1_2021-05-12 synced 2021-06-18
Preview 2021-06-18 https://preview.catalogueoflife.org/ looks good. New spp stats: 145933 extant & 3759 extinct spp.
ITIS offers global checklist for Culicidae family: https://github.com/CatalogueOfLife/testing/issues/8#issuecomment-872345001
Response from SD: keep Culicidae from SD.
A set of correctly assigned species binomials is placed under an incorrect genus in the classification (a false parent): https://github.com/gbif/checklistbank/issues/187 see also CatalogueOfLife/backend#1052
Example:
A simple re-sync did not fix the problem.
Two hypotheses (@gdower):
(1) Option "Union", which was used for the sector Diptera, may cause a problem. = No
(2) Name Trigonometopus (Culex) canus (as in a dataset) may cause a problem. = Yes
Generalised problem: the same subgenus name may appear in different genera in SD:
Experiment 1: do not use "Union" in assembly.
Steps to repair via re-assembly of sectors:
(1) delete all sectors in Diptera
(2) delete entire Diptera subtree
(3) establish order Diptera from SD
(4) block 4 families (which we'll take from CCW): Cylindrotomidae, Limoniidae, Pediciidae, Tipulidae
(5) add 4 CCW families as sectors in Diptera (skip superfamily Tipuloidea as a rank between family & order)
(6) sync SD and CCW
Bad news: the above steps did not fix the problem:
Experiment 2: fix "incorrect" name.
Steps to repair via complex decision over species binomial:
(1) complex decision Trigonometopus (Culex) canus --> Trigonometopus canus
(2) re-sync SD
Result successful: subgenus Culex is correctly placed in genus Culex
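For illustration, the effect of the complex decision in step (1) can be sketched as a string rewrite that drops the subgenus from a name. This is only a sketch: the function name and the pattern are assumptions, and it is not how CLB applies decisions internally.

```python
import re

# Matches "Genus (Subgenus) rest" and captures the genus and the remainder.
# Author strings in parentheses come after the epithet, so they are not affected.
SUBGENUS_RE = re.compile(r"^(\S+)\s+\([^)]+\)\s+(.*)$")


def drop_subgenus(name):
    """Return the name without its subgenus, or unchanged if it has none."""
    cleaned = name.strip()
    match = SUBGENUS_RE.match(cleaned)
    if not match:
        return cleaned
    return f"{match.group(1)} {match.group(2)}"


fixed = drop_subgenus("Trigonometopus (Culex) canus")
```

Names without a subgenus, such as Microcephalops conspectus (Hardy, 1949), pass through unchanged.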
Well, sync process is misinterpreting original placement of "homonymic" subgenera.
The source dataset may contain a mistake (i.e. an incorrect subgenus in a set of species from another genus), but it may also use homonymic subgenera on purpose. I would expect the software to generate a report on detected problems for GSD authors, but also to carry over the placement of a subgenus in a genus as it occurs in the original dataset.
Report for Identical Subgenus contains 743 names:
CoL cannot resolve all cases on our end.
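A rough sketch of how such "identical subgenus" cases could be listed from a plain list of names, assuming names follow the "Genus (Subgenus) epithet" pattern. The function name and the sample list are illustrative, and the actual CLB report may be produced differently:

```python
import re
from collections import defaultdict

# Captures the genus and subgenus from "Genus (Subgenus) epithet ...".
NAME_RE = re.compile(r"^(\S+)\s+\(([^)]+)\)\s+\S+")


def shared_subgenera(names):
    """Map each subgenus name to its genera when it occurs in more than one genus."""
    genera_by_subgenus = defaultdict(set)
    for name in names:
        match = NAME_RE.match(name.strip())
        if match:
            genus, subgenus = match.groups()
            genera_by_subgenus[subgenus].add(genus)
    return {
        subgenus: sorted(genera)
        for subgenus, genera in genera_by_subgenus.items()
        if len(genera) > 1
    }


names = [
    "Trigonometopus (Culex) canus",
    "Culex (Culex) pipiens",
    "Orthocladius (Orthocladius) mixtus",
]
conflicts = shared_subgenera(names)
```

In this sample only the subgenus Culex is reported, because it appears under both Culex and Trigonometopus.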
Another strange case of subgenus interpretation in the hierarchy: four species with assigned subspecies in the name have been placed in NotAssignedSubgenus node:
Can it be addressed in the ChecklistBank code, @mdoering? (@gdower)
Subgenus issues fixed in the code.
SD synced 2021-09-14.
2021-09-16: looks like technical problem is resolved and species placed in correct genera. Waiting for a new version of the checklist without "homonymic" subgenera in different genera from SD team.
Version 3.6 received 2022-02-14
Imported to DEV https://data.dev.catalogueoflife.org/dataset/1101/classification
Checks of the view 2022-02-18 (few notes)
In the source (SPECIES table):
Family | Full Name Line | Full Name Line Range | Full Species Line |
---|---|---|---|
Chironomidae | Chironomus mixtus Holmgren, 1869: 45. TL: Norway. Bear I. (HT F NRS). | Chironomus mixtus Holmgren, 1869: 45. TL: Norway. Bear I. (HT F NRS). | Orthocladius (Orthocladius) mixtus. (PA: PA) |
*)
CONCLUSION: set of species in these fam & gen have no parent families in the source file (blank values). CLB INTERPRETATION IS CORRECT
Checked few against the source (SPECIES):
Ceratopogoninae have 3 parent families, plus blank family with Serromyia errata:
Chironominae have 2 parent families, plus blank family with 3 spp Nandeva pudens, Parachironomus inageheus, Polypedilum (Polypedilum) xianjuensis :
Chironomiinae (is it different from the above? - check with Neal) have 1 parent family Chironomidae, plus blank family with species Yaeprimus balteatus:
Clitellariinae have 2 parent families, plus blank family with species Adoxomyia hasbenlii
Species with uncertain placement in the genus: the original genus goes in square brackets, as agreed with Neal. (Previously, such names had a question mark in the genus field, e.g. ? stercoreus.) How does CLB deal with such names?
I don't think it is a good idea to feed names with square bracket embraced genera into CLB to indicate uncertain placement. This is a very specific convention for SD only and not known to anyone nor the system itself.
We should try to change those names and rather follow the guidelines of ColDP, where we have discussed this problem and how to deal with it in a consistent way so both CLB and other users understand the data correctly.
Looking at the verbatim Taxon record of that example I see various problems:
col:genus is given as the classification (this is a Taxon, not Name record). If it is uncertain don't do that and better remove the field.
col:species is given as an epithet only, the classification expects a binomial. Remove the field.
col:provisional should be used to indicate the uncertain placement.
In the linked Name record I would suggest to simply remove the square brackets.
I am comfortable with presentation of an original genus in square brackets where a new placement in a genus is not resolved yet. If CLB allows search for names with square bracket, I'll be happy to mark these accepted names as Provisionally Accepted in the CoL. See: https://github.com/CatalogueOfLife/backend/issues/1112
Square brackets around genera are not something we support at this stage. They will be considered bad data and will likely have an impact down the line when we assemble COL, e.g. when we make sure to have a genus record for every accepted species. Don't be surprised if you find new genera with brackets in COL.
Version 3.6 (2022-02-14) imported to the PROD 2022-05-17
[x] Imported 177,088 spp
[x] Metadata: OK (ver. 3.6 Feb 2022, 2022-02-14)
[x] Sectors: OK Blocked families Cylindrotomidae, Limoniidae, Pediciidae, Tipulidae in Systema Dipterorum (taken from CCW) pre-synced 2022-05-17
ISSUES assessed 2022-05-17
TASKS as 2022-05-17
!Remember!
ACC=ACC sp (diff auth): all names with genus in square brackets = Prov Acc
? What to do with names with author strings without a year (they may have different synonyms - keep?)
Version 3.6 (2022-02-14), new crawl iteration imported to the PROD 2022-05-19
[x] Imported 169,487 spp (vs 177,088 in previously crawled version)
[x] Metadata: OK
[x] Sectors: families Cylindrotomidae, Limoniidae, Pediciidae, Tipulidae in Systema Dipterorum (taken from CCW) should be blocked again = FIXED
ISSUES assessed 2022-05-19 (many previous decisions remain in place)
Investigating bare names, 8,455 https://www.checklistbank.org/catalogue/3/dataset/1101/workbench?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&limit=100&offset=0&status=bare%20name
? ambigua Pankratova, 1950 = ok
? arcudae Botnariuc, 1956 = ok
? delicatula Botnariuc & Cure, 1956 = ok
However, many names have become "bare" for reasons that are not yet clear:
Mesembrinella dorsimacula Aldrich, 1922, it is (Available, Valid) Current Status
Komisca nanensis (Chaiwong, Sukontason & Sukontason, 2009), it is (Available, Valid) Changed Combination / Rank
Abago rohdendorfi Grunin, 1966 (Available, Invalid) Junior SECONDARY Homonym
TASKS as 2022-05-19
Resolved 2022-05-19:
NEW SEARCH OPTION IN Workbench @CLB: RegEx Search (Regular Expression Search)
Crawl iteration with pre-flagged "prov acc" names imported 2022-05-19 & 20. (The main problem: the Tasks page failed to display in CLB (spinning progress forever), and the Classification page either also failed or was too slow after multiple imports. = now resolved)
[x] Now crawler automatically flagged all species with genera in square brackets as "Prov Acc" (3,275 prov acc spp in this version).
[x] ~700 genera with square brackets were flagged as "Prov Acc" via decisions. (Steps: workbench - filter for acc genera: all are at the end of the list after re-sorting - set up 700 lines per page - applied bulk decision)
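As an illustration of the kind of pattern that picks out names with a genus in square brackets, here is a small sketch using Python's `re` syntax. The exact expression used in the CLB RegEx Search is not recorded in this thread, so the pattern and the helper name below are assumptions:

```python
import re

# A leading genus wrapped in square brackets, e.g. "[Asilus] cristatus".
BRACKETED_GENUS_RE = re.compile(r"^\[([A-Z][a-z]+)\]")


def bracketed_genus(name):
    """Return the genus found inside leading square brackets, or None."""
    match = BRACKETED_GENUS_RE.match(name.strip())
    return match.group(1) if match else None


names = ["[Asilus] cristatus", "Asilus crabroniformis", "[Ablabesmyia]"]
# Candidates for the "Prov Acc" status: species and genera with a bracketed genus.
prov_acc_candidates = [n for n in names if bracketed_genus(n)]
```

The same pattern covers both bracketed genera on their own and species placed in a bracketed genus.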
TASKS as 2022-05-20 (all previous decisions re-applied successfully, new decisions added)
Synced 2022-05-20
Checking Systema Dipterorum v 3.6 data in the portal, I have found that a lot of provisionally accepted species have been interpreted as synonyms of the parent family after a sync. https://github.com/CatalogueOfLife/backend/issues/1150
Systema Dipterorum 3.8, May 2022 received 2022-05-22; GO converted & imported to DEV for further tests 2022-05-25: https://www.dev.checklistbank.org/dataset/77757/imports
But I cannot find a single record with question mark in GO spreadsheet.
= see Issue Missing Genus, 3125
A question mark is generated when a species is missing the genus, e.g. abbreviata only has the species epithet: https://www.dev.checklistbank.org/dataset/77757/verbatim/300284
You can spot all those 3,000-odd names without a genus by their issue, which is a red one that should be taken seriously: https://www.dev.checklistbank.org/dataset/77757/verbatim?issue=missing%20genus
Also note that there are some other very serious ID problems, e.g. NameID Invalid: https://www.dev.checklistbank.org/dataset/77757/issues
Systema Dipterorum 3.8, May 2022 received 2022-05-22 (continue); imported to prod 2022-06-10
ISSUES assessed 2022-06-13
TASKS
Resolved 2022-06-13
Synced 2022-06-13
Reported problems by GBIF (2022-08-31):
Neal: I see no problem with this record. It is unplaced within Asilidae, so we have done as you instructed and put the original genus in square brackets in the Present genus field. The record says it was revised on 7 May 2022, so it should not have shown up as “Asilidae cristatus” in CoL (or GBIF). I would think it should have shown up as “[Asilus] cristatus”.
YR: probably, a problem with interpretation on our side: https://github.com/CatalogueOfLife/data/issues/455#issuecomment-1234451772
Neal: This species (vicininervis Macquart 1844) has been corrected to place Opomyza in square brackets in the Present genus field -- as it is an unplaced species in Tephritidae.
In the export file, ver. 3.8_2022-05-20, table species Family field Tephritidae - @gdower, no mistake in our interpretation, as I can see. It should disappear with a new update.
Neal: I can’t find this problem in SD. All our records with the name “Lonchaea” map to Lonchaeidae.
YR: Only one species Lonchaea discrepans placed in the family Tachinidae in the CoL.
In the export file, ver. 3.8_2022-05-20, table species, it is in the family Lonchaeidae: https://github.com/CatalogueOfLife/data/issues/457#issuecomment-1234529319 YR: I cannot understand yet, what caused misplacement in the family.
Neal: Another one that is correct in SD. All our Rhagio and Rhagioniinae records map to Rhagionidae. Not sure how this got into CoL as Tabanidae as there are 206 records affected!
YR: Very unclear case to me. In the export file, ver. 3.8_2022-05-20, table species, I did not find(!) genus Rhagio in the family Tabanidae.
However, there are 206 species in the CoL portal, where genus Rhagio is placed in the family Tabanidae: https://www.catalogueoflife.org/data/search?TAXON_ID=8YQML&rank=species&status=accepted&status=provisionally%20accepted
See more at https://github.com/CatalogueOfLife/data/issues/458#issuecomment-1234558730
Systema Dipterorum 3.10, Sep 2022, received 2022-09-30; imported to prod 2022-10-06
@gdower, perhaps it makes sense to drop non-basic ranks in the SD conversion. It may reduce noise caused by unresolved placements.
ISSUES assessed 2022-10-12
TASKS
Synced 2022-10-25
ISSUES assessed
Quite a lot of serious invalid ids and duplicate ids. Please investigate into the cause, that has potential for lots of problems.
@gdower, could you pls have a look at the "technical" issues among those highlighted by @mdoering: Id Not Unique, 135; Accepted Id Invalid, 725; Name Id Invalid, 8934
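A hedged sketch of the kind of pre-import check that would surface these three issues before the data reach CLB. The record layout (`id`, `accepted_id`, `name_id`) and the function name are assumptions made for illustration; they do not reflect the real SD export or the CLB validation code:

```python
from collections import Counter


def check_ids(taxa, name_ids):
    """Report duplicate taxon ids and references to ids that do not exist."""
    counts = Counter(t["id"] for t in taxa)
    taxon_ids = set(counts)
    return {
        # The same taxon id is used by more than one record.
        "id_not_unique": sorted(i for i, n in counts.items() if n > 1),
        # A synonym points to an accepted taxon id that is not in the data.
        "accepted_id_invalid": sorted(
            t["id"] for t in taxa
            if t.get("accepted_id") and t["accepted_id"] not in taxon_ids
        ),
        # A taxon points to a name id that is not in the names table.
        "name_id_invalid": sorted(
            t["id"] for t in taxa
            if t.get("name_id") and t["name_id"] not in name_ids
        ),
    }


# Hypothetical records, for illustration only.
taxa = [
    {"id": "t1", "name_id": "n1"},
    {"id": "t2", "name_id": "n2", "accepted_id": "t1"},
    {"id": "t2", "name_id": "n9"},
    {"id": "t3", "name_id": "n3", "accepted_id": "t404"},
]
report = check_ids(taxa, name_ids={"n1", "n2", "n3"})
```

Running a check like this on the export would narrow down whether the invalid ids originate in the source tables or in the conversion step.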
GO & YR 2022-11-14: decisions in RegEx Search might be set up, but not displayed in the interface.
Experiment of 2022-11-14:
It seems there are 715 genera in square brackets in total (500 per page):
1st page: [Ablabesmyia] - [Phoraea]
2nd page: [Phorbia] - [Zygoneura]
Prov Acc status applied to all 715 genera in square brackets. No progress shown. No decisions appear in the report. However, decisions are shown in Project-CoL-Decisions (see mode = update):
Synced 2 hours later, 2022-11-14
Bracketed genera checked in PREVIEW 2022-11-15:
Archalia - 1 in the RegEx report, 0 in the PREVIEW
Actia - 1 in the report, 1 from SD in the PREVIEW, accepted
Bibio - 4 in the report, 2 in the PREVIEW, both marked as prov acc
Ceroxys - 2 in the report, 1 in the PREVIEW, accepted
Dinera - 1 in the report, 1 in the PREVIEW, accepted
Lydella - 1 in the report, 1 in the PREVIEW, accepted
Voriella - 1 in the report, 1 in the PREVIEW, accepted
There are only 19 ProvAcc genera in Diptera in the PREVIEW.
No bracketed genera in the PREVIEW: either all brackets were removed or, more probably, bracketed genera did not pass to the final product. @gdower, could this be related to the issue of broken parent-child relationships, invalid and duplicated ids?
Anyway, experiment of 2022-11-14 did not change number of accepted species in SD@CoL.
Experiment of 2022-11-16:
Plan: remove brackets from species names, give them ProvAcc status before the import in CLB. Imported 2022-11-16
TASKS remain unchanged (i.e. resolved)
Synced 2022-11-16
Results in the PREVIEW 2022-11-18:
Previously bracketed genera checked:
Genus | RegEx report 2022-11-14 (as bracketed genus) | PREVIEW 2022-11-15 | PREVIEW 2022-11-18 |
---|---|---|---|
Archalia | 1 | 0 | 1, accepted |
Actia | 1 | 1 from SD, accepted | 1 from SD, accepted |
Bibio | 4 | 2, prov acc | 5, of them 4 accepted, 1 prov acc |
Ceroxys | 2 | 1, accepted | 3, of them 3 accepted |
Dinera | 1 | 1, accepted | 2, of them 1 accepted, 1 prov acc |
Lydella | 1 | 1, accepted | 2, of them 2 accepted |
Voriella | 1 | 1, accepted | 2, of them 1 accepted, 1 prov acc |
Misspellings reported to Neal, 2022-12-02:
Diptera> Ceratopogonidaae vs Ceratopogonidae
Diptera> Cedratopogonidae
Diptera> ceratopogonidae
Diptera> Phoridase vs Phoridae
Diptera> Sphaerroceridae vs Sphaeroceridae
Diptera> Mycxetophilidae vs Mycetophilidae
Diptera> Dolichoposidae vs Dolichopodidae
Diptera> Limoniidae> Limoninae vs Limoniinae
Diptera> Liomniidae vs Limoniidae
Diptera> Tephritidae> Tryptetinae vs Trypetinae
Diptera> Cecidomyiidae> Porrocondylinae vs Porricondylinae
Diptera> Cecidomyiidae> Porriconylinae vs Porricondylinae
Diptera> Mycetophilidae> Leeinae vs Leiinae
Diptera> Cecidomyiidae> Cercidomyiinae vs Cecidomyiinae
Diptera> Chironomidsae vs Chironomidae
Diptera> Stratiomyidae> Stratiomyiinae vs Stratiomyinae
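Misspellings like these could be pre-screened with a simple similarity check over the list of family and subfamily names. The sketch below uses Python's standard `difflib`; the function name and the 0.85 threshold are arbitrary assumptions, and any hits would still need manual review:

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.85):
    """Return pairs of distinct names whose similarity ratio reaches the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        # Compare case-insensitively so that "ceratopogonidae" style variants match too.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


families = ["Phoridae", "Phoridase", "Limoniidae", "Liomniidae", "Tipulidae"]
suspects = near_duplicates(families)
```

On this sample the check pairs Limoniidae with Liomniidae and Phoridae with Phoridase, and leaves Tipulidae alone.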
Systema Dipterorum 3.10, Sep 2022
TASKS 2022-12-02
Re-synced 2022-12-02
2023-05-16: SD is in update4.2/revert3.10 process. (checklist reverted back to 3.10, metadata 4.2). Must not be synced until repair!
Systema Dipterorum 4.2, May 2023, received 2023-05-13; imported to prod 2023-05-17
ISSUES
SD 4.2, May 2023
TASKS
Not synced
Systema Dipterorum 4.2.2, May 2023, received 2023-05-27; imported to prod 2023-05-30
TASKS
Broken decisions, 369: deleted all
[x] Genera with square brackets blocked
[x] Split subgenera - failed to resolve = the sector synced without rank subgenus
Resolved 2023-06-12:
Synced 2023-06-12 (without rank subgenus)
2023-06-15: temporary names such as *FChironominae (start as *F) deleted as a node (“taxon”) in Assembly - Draft. All children attached to the next parent. Sync is not involved (i.e. such names will be back with next sync).
Version 3.1 received 2021-05-12. Imported to prod: https://data.catalogueoflife.org/dataset/1101/about
(previous reports are in https://github.com/CatalogueOfLife/testing/issues/6)
[x] Metadata: updated
[x] Sector: order Diptera minus 4 families
As a result, the assembly tree looks like this: