CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

Towards 2024 Annual Checklist #264

Open yroskov opened 1 month ago

yroskov commented 1 month ago

Delivery date: June 2024

Re-synced: Alucitoidea, Collembola.org, Global Gracillariidae, ITIS, MOWD, Pterophoroidea, ReptileDB, Species Fungorum Plus, StaphBase, Taxapad Ichneumonoidea, TITAN, UCD, WCVP (fully updated now), WTaxa, ZOBODAT Vespoidea

yroskov commented 1 month ago

FROM TaxonWorks:

monthly started from January 2024:

monthly started from March 2024:

single update in a year (January):

~~SF Coleorrhyncha SF Embioptera SF Grylloblattodea SF Mantophasmatodea SF Zoraptera~~

LEPIDOPTERA:

OTHER:

=============================

=============================

Filling gaps:

Suborder Symphyta (Hymenoptera) https://github.com/CatalogueOfLife/data/issues/579
Class/order Diplura https://github.com/CatalogueOfLife/data/issues/577
Order †Permopsocida (Insecta) https://github.com/CatalogueOfLife/data/issues/578
Family Promecheilidae (Tenebrionoidea, Coleoptera) https://github.com/CatalogueOfLife/data/issues/580

yroskov commented 1 month ago

Duplicated taxa

TASKS, 2024-05-23: image

2024-06-05: image image

yroskov commented 1 month ago

@aoern, we are preparing the 2024 Annual Checklist in June. Would you be able to run your checks BEFORE the final release? I am going to close the first draft by 10-11th June. If I send you a link to the interim version in COLDP, would you be so kind as to run the checks?

aoern commented 1 month ago

Yes, of course!

Ari


yroskov commented 1 month ago

PREVIEW release started 2024-05-31, 12:54 pm (server time)
Finished as Annual Checklist 2024, id 297740, 2024-05-31, 2:16 pm
Deployed to the preview website 2024-05-31

CHECKS

yroskov commented 1 month ago

PREVIEW release started 2024-06-04, 4:21 pm (server time)
Finished as Annual Checklist 2024, id 298077, 2024-06-04, 5:46 pm
Deployed to the preview website 2024-06-04

See duplicated uninomial taxa (Systema Dipterorum & GLI need to be re-synced)

yroskov commented 1 month ago

PREVIEW release started 2024-06-05, 10:11 pm (server time)
Finished as Annual Checklist 2024, id 298097, 2024-06-05
Deployed to the preview website 2024-06-06

yroskov commented 1 month ago

PREVIEW release started 2024-06-06, 3:44 pm (server time)
Finished as Annual Checklist 2024, id 298177, 2024-06-06
Deployed to the preview website 2024-06-06

yroskov commented 1 month ago

PREVIEW release started 2024-06-06, 6:45 pm (server time)
Finished as Annual Checklist 2024, id 298184, 2024-06-06, 8:18 pm
Deployed to the preview website 2024-06-07

with RWC

yroskov commented 4 weeks ago

See
https://github.com/CatalogueOfLife/data/issues/668
https://github.com/CatalogueOfLife/data/issues/669#event-13080073794
https://github.com/CatalogueOfLife/data/issues/667#issuecomment-2155102700

2024-06-07, require re-sync:
Systema Dipterorum = re-synced 2024-06-07
ReptileDB (re-check TASKS, sectors, classification) = re-synced 2024-06-07
WCVP = re-synced 2024-06-07.
ATTENTION: merge sector detected: image

yroskov commented 4 weeks ago

CoL TASKS of 2024-06-07: image https://www.checklistbank.org/catalogue/3/tasks

@mdoering, that is very much wrong. (You'll understand the problem when you open ACC-ACC species (different authors) 36349 or ACC-ACC species (same authors) 47911.)

So far, 298184 is the best candidate for AC24.

mdoering commented 4 weeks ago

any idea where that is coming from? SD syncs?

yroskov commented 3 weeks ago

@mdoering, it happened after I applied some "polishing" to the AC24 draft and completed re-syncs of 3 GSDs:
Systema Dipterorum = re-synced 2024-06-07
ReptileDB (re-check TASKS, sectors, classification) = re-synced 2024-06-07
WCVP = re-synced 2024-06-07

As far as I can see, the main problem is caused by WCVP. I re-synced all global sectors one by one, because I saw a "merge sector" in the list of sectors. The results look awful: WCVP species duplicate World Plants species, but are placed in the wrong families. I believe this happened because of changed IDs (although, according to the metadata, I re-synced the same version of WCVP that was used before, and all sectors were shown as healthy).

Species statistics comparison (Preview 2024-06-06 (id 298184) = before the WCVP sync, i.e. as it should be):

| | Project3 | Preview 2024-06-06 (id 298184) |
|---|---|---|
| Tracheophyta | 415458 | 359898 |
| Liliopsida | 88508 | 81334 |
| Magnoliopsida | 311532 | 263146 |
| Ginkgoopsida | 333 | 333 |
| Pinopsida | 833 | 833 |
| Polypodiopsida | 12392 | 12392 |

mdoering commented 3 weeks ago

I have an idea of what goes wrong with WCVP. They do not have stable family identifiers and the same identifier can point to a different family next time. If we rematch all WCVP sectors and resync afterwards it should be fine again

yroskov commented 3 weeks ago

How to do rematch for WCVP?

mdoering commented 3 weeks ago

https://api.checklistbank.org/dataset/3/sector/1698

"subjectDatasetKey":2232, "subject":{ "id":"xS", "name":"Rubiaceae", "rank":"family", "status":"accepted", "broken":false, "label":"Rubiaceae", "labelHtml":"Rubiaceae", }

but xS points to Campanulaceae.
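The mismatch described here can be checked mechanically. The following is a minimal sketch, not ChecklistBank code: it compares the name stored on a sector's subject with the name that the same id resolves to in the source dataset. The function name `subject_mismatch` and the `current_taxon` record are hypothetical stand-ins; in practice the two records would come from the sector URL quoted above and from a taxon lookup in dataset 2232.

```python
def subject_mismatch(sector: dict, source_taxon: dict) -> bool:
    """Return True when the sector's stored subject name differs from the
    name its subject id currently resolves to in the source dataset."""
    subject = sector["subject"]
    if subject["id"] != source_taxon["id"]:
        raise ValueError("taxon record does not belong to this sector subject")
    return subject["name"] != source_taxon["name"]


# Sector subject as quoted above (trimmed to the fields used here).
sector_1698 = {
    "subjectDatasetKey": 2232,
    "subject": {"id": "xS", "name": "Rubiaceae", "rank": "family"},
}

# Hypothetical stand-in for what the id "xS" resolves to in the current import.
current_taxon = {"id": "xS", "name": "Campanulaceae", "rank": "family"}

print(subject_mismatch(sector_1698, current_taxon))  # True
```

A sector for which this returns True is syncing a different family than the one its label claims, which is consistent with species landing in the wrong families.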

mdoering commented 3 weeks ago

rematching via UI does not help, let me see how we can best address this one...

yroskov commented 3 weeks ago

For attention of @olafbanki & @mdoering:

In case Markus runs into problems cleaning Project3 of the unwanted WCVP records that were synced on 2024-06-07, the 2024-06-06 candidate for the 2024 Annual Checklist is ready in CLB with id 298184.

Its preview is here https://preview.catalogueoflife.org/data/metadata

(If the fixes are successful, the RWC import of 2024-06-10 should be re-synced = RWC SYNCED 2024-06-11).

mdoering commented 3 weeks ago

@yroskov I rematched all 111 WCVP sectors and mostly they updated fine:

{"broken":13,"updated":96,"unchanged":2,"total":111}

But there are 13 broken sectors now which I will try to solve manually through the UI.

image
mdoering commented 3 weeks ago

I simply matched again and resolved all sectors now:

{"broken":1,"updated":12,"unchanged":98,"total":111}

Not all sectors rematched cleanly the first time because we cannot have 2 sectors with the same subject: since the IDs were wrong, some ids were already (falsely) taken and the rematch failed. Doing it a second time removed that problem.
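Why a second pass helps can be shown with a toy model. This is a sketch under stated assumptions, not the ChecklistBank implementation: sectors are rematched one at a time, a subject id may be held by only one sector, and a sector whose new id is still held by another sector is reported as broken and keeps its stale id. The function `rematch_pass` and the ids `old-1` and `new-1` are invented for illustration; only `xS` and the two family names come from the thread.

```python
from collections import Counter


def rematch_pass(sectors: list[dict], current_ids: dict[str, str]) -> dict:
    """Rematch every sector once, in order, and return a status summary.

    sectors: dicts with "name" and "subject_id" (mutated in place).
    current_ids: family name -> id in the current source version.
    """
    summary = Counter(broken=0, updated=0, unchanged=0)
    for sector in sectors:
        target = current_ids[sector["name"]]
        if sector["subject_id"] == target:
            summary["unchanged"] += 1
        elif any(o is not sector and o["subject_id"] == target for o in sectors):
            # Another sector still holds the id: uniqueness blocks the update.
            summary["broken"] += 1
        else:
            sector["subject_id"] = target
            summary["updated"] += 1
    return {**summary, "total": len(sectors)}


sectors = [
    {"name": "Campanulaceae", "subject_id": "old-1"},  # should now hold "xS"
    {"name": "Rubiaceae", "subject_id": "xS"},         # stale: "xS" moved away
]
current_ids = {"Campanulaceae": "xS", "Rubiaceae": "new-1"}

first = rematch_pass(sectors, current_ids)
second = rematch_pass(sectors, current_ids)
print(first)   # {'broken': 1, 'updated': 1, 'unchanged': 0, 'total': 2}
print(second)  # {'broken': 0, 'updated': 1, 'unchanged': 1, 'total': 2}
```

In the first pass the Campanulaceae sector fails because the Rubiaceae sector still holds `xS`; once Rubiaceae has moved to its new id, the second pass succeeds. In this model a pair of sectors holding each other's ids would never resolve, so it only illustrates the chain-like case described in the comment.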

The single sector still reported as broken is the merge sector without any subject - I will see that we change the reporting to exclude those

mdoering commented 3 weeks ago

@yroskov I will trigger syncs now for all of WCVP to replace the bad data

yroskov commented 3 weeks ago

Thanks! Go ahead

yroskov commented 3 weeks ago

WCVP syncs completed. Tasks in Project 3:

image

@mdoering, TASKS in Project 3 look good after the cleaning. But the species count in Tracheophyta shows 2,818 extra species compared with the expected number in Preview 2024-06-06 (id 298184):

| | Preview 2024-06-06 (id 298184) | Project3 before cleaning | Project3 after cleaning | Difference |
|---|---|---|---|---|
| Tracheophyta | 359898 | 415458 | 362716 | +2,818 |
| Liliopsida | 81334 | 88508 | 83494 | +2,160 |
| Magnoliopsida | 263146 | 311532 | 263804 | +658 |
| Ginkgoopsida | 333 | 333 | 333 | = |
| Pinopsida | 833 | 833 | 833 | = |
| Polypodiopsida | 12392 | 12392 | 12392 | = |
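The differences in the comparison above can be recomputed from the three count columns. This is a small sketch using only the numbers quoted in this comment; the names `counts` and `deltas` are illustrative.

```python
# Species counts per taxon: (preview 298184, Project3 before cleaning, Project3 after cleaning)
counts = {
    "Tracheophyta": (359898, 415458, 362716),
    "Liliopsida": (81334, 88508, 83494),
    "Magnoliopsida": (263146, 311532, 263804),
    "Ginkgoopsida": (333, 333, 333),
    "Pinopsida": (833, 833, 833),
    "Polypodiopsida": (12392, 12392, 12392),
}

# Difference between the cleaned project and the reference preview.
deltas = {
    taxon: after - preview
    for taxon, (preview, _before, after) in counts.items()
}

for taxon, delta in deltas.items():
    print(f"{taxon}: {delta:+,}" if delta else f"{taxon}: =")
```

The class-level differences listed here (2,160 and 658) add up to the 2,818 reported for Tracheophyta, so the extra species are confined to Liliopsida and Magnoliopsida.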

On my end, I cannot find and check these unexpected extra species.

mdoering commented 3 weeks ago

Well, it is a newer version of WCVP. The 298184 preview used data synced from WCVP attempt 17-18, version 10.0 / 2022-10-27. The current one is version 13.0 / 2024-05-16.

The entire current WCVP has 365.790 species, the older version 10 (attempt 18) had 357.450 species.

yroskov commented 3 weeks ago

Thanks! That explains the difference.

mdoering commented 3 weeks ago

WCVP Tracheophyta species counts are

So WCVP species in COL have increased by 2.818 just as you have observed. In the entire WCVP dataset the increase was 8.340 species, but we only use parts of it.

mdoering commented 3 weeks ago

btw, the WCVP Fabaceae people have also updated their annual version some weeks ago. Do you foresee a problem to sync that one also?

image

new 2024 version available:

image

2023 version we use:

image
yroskov commented 3 weeks ago

WCVP-Fabaceae: ...Do you foresee a problem to sync that one also?

Unfortunately, yes. Two issues: Broken decisions: 4308 and nested WWW genera need to be re-done.

Let's see what we can do in the week of 17th June.

mdoering commented 3 weeks ago

update decisions for ambiguous synonyms are a lot of work. Maybe we should think about flagging them automatically if another synonym with the same name exists in a dataset. Then we could remove the entire status, have just synonyms and make our work a little bit simpler.
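The automatic flagging suggested here could look roughly like the sketch below. It is an illustration only: the record fields (`id`, `name`, `status`), the function `flag_ambiguous_synonyms` and the placeholder names are assumptions, not the ChecklistBank data model, and matching on the bare name string (without authorship) is one possible reading of "the same name".

```python
from collections import defaultdict


def flag_ambiguous_synonyms(records: list[dict]) -> dict[str, list[str]]:
    """Group synonym records by name and return the names that occur
    more than once, mapped to the ids of the records sharing them."""
    ids_by_name: dict[str, list[str]] = defaultdict(list)
    for rec in records:
        if rec["status"] == "synonym":
            ids_by_name[rec["name"]].append(rec["id"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Placeholder records; the names are made up and not real taxa.
records = [
    {"id": "s1", "name": "Aus bus", "status": "synonym"},
    {"id": "s2", "name": "Aus bus", "status": "synonym"},
    {"id": "s3", "name": "Gus hus", "status": "synonym"},
    {"id": "a1", "name": "Cus dus", "status": "accepted"},
]

print(flag_ambiguous_synonyms(records))  # {'Aus bus': ['s1', 's2']}
```

Only names shared by at least two synonym records are flagged; accepted names and unique synonyms are left alone.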

yroskov commented 3 weeks ago

...update decisions for ambiguous synonyms

This depends on data "quality" in the checklist imported into CLB: some GSDs have very simple cases which can be easily resolved automatically; others need investigation of what happened (a chresonym is only one case). It would be nice if we managed to classify such cases and work out protocols.

yroskov commented 2 weeks ago

PREVIEW release started 2024-06-17, 4:43 pm (server time) (First PREVIEW after fixes in WCVP sectors & re-sync)
Finished as Annual Checklist 2024, id 298597, 2024-06-17, 6:19 pm
Deployed to the preview website 2024-06-17

yroskov commented 2 weeks ago

Dear @aoern,

I believe we have now completed our first "proof" of the 2024 Annual Checklist (https://preview.catalogueoflife.org/).

Here is its CoLDP export: https://download.checklistbank.org/job/f0/f0fec354-bc73-4352-8f22-eaef1d246b4f.zip [700.6 MB]

Could you please do your routine checks? Do not dig too deep. We'll have a few days to fix only a real disaster (if there is one).

aoern commented 2 weeks ago

I am able to download it tomorrow and start checking.

Ari


yroskov commented 2 weeks ago

= FIXED 2024-06-21 (MD: To fix the problem I have rematched the entire project and there is no missing match any longer. Doing a new release now)

yroskov commented 2 weeks ago

New release of 2024-06-21 completed by Markus, id 298708
Deployed to the preview website 2024-06-21

CHECKS of 2024-06-21:

https://github.com/CatalogueOfLife/backend/issues/1333 also https://github.com/CatalogueOfLife/portal/issues/213

https://github.com/CatalogueOfLife/portal/issues/212

yroskov commented 1 week ago

PREVIEW release started by Markus 2024-06-24, 1:46 pm (server time) (After https://github.com/CatalogueOfLife/backend/issues/1333#issuecomment-2186627873)
Finished as Annual Checklist 2024, id 298863, 2024-06-24, 3:12 pm
Deployed to the preview website 2024-06-24

CHECKS of 2024-06-24:

yroskov commented 1 week ago

PREVIEW release started 2024-06-25, 3:45 pm (server time)
Finished as Annual Checklist 2024, id 298890, 2024-06-25, 5:11 pm
Deployed to the preview website 2024-06-25

yroskov commented 1 week ago

PREVIEW release started 2024-06-25, 6:30 pm (server time) (check list of GSDs)
Finished as Annual Checklist 2024, id 298894, 2024-06-25, 8:09 pm
Deployed to the preview website 2024-06-25

yroskov commented 1 week ago

PREVIEW release started 2024-06-25, 8:41 pm (server time)
Finished as Annual Checklist 2024, id 298904, 2024-06-25
Deployed to the preview website 2024-06-26

mdoering commented 1 week ago

@yroskov @gdower the UCD metadata contains 2 paragraphs of lists of contributors. That should really be in the contributors section; that's exactly what it's for:

List of Active Curators: Roger Burks (site designer) Newport Beach, CA, USA; Lucian Fusu, Al. I. Cuza University, Iasi, Romania; D. Christopher Darling, Royal Ontario Museum, Toronto, ON, Canada; John Heraty, University of California, Riverside, CA, USA; Petr Janšta, Charles University, Prague, Czech Republic; Mircea-Dan Mitroiu, Al. I. Cuza University, Iasi, Romania; Pâmella Machado Saguiah, Ciências Biológicas pela Universidade Federal do Espírito Santo, Brazil; Natalie Dale-Skey, The Museum of Natural History, London, United Kingdom; James B. Woolley, Texas A&M University, College Station, TX, USA TaxonWorks Development and Outreach Team: Matt Yoder, Dmitry Dmitriev, José Luis Pereira, Hernán Lucas Pereira, Deborah Paul.

The description also contains species numbers which quickly get out of date. I would propose to remove that sentence, as we already have more than 31.000 species, more than the number mentioned:

As of today, the UCD contains 27968 valid species (including some subspecies) and 2295 valid genera (including some subgenera).

yroskov commented 1 week ago

UCD metadata will stay as they are now, until the UCD group changes its mind and the text on the website.

yroskov commented 1 week ago

Dear @olafbanki, the dataset id 298904 of 2024-06-25 is now finalized as the 2024 Annual Checklist.

mdoering commented 1 week ago

UCD metadata will stay as they are now, until the UCD group changes its mind and the text on the website.

But it's not good practice the way it is

yroskov commented 1 week ago

I'll pass your concerns to UCD team ;) in July

olafbanki commented 1 week ago

The AC2024 has been published!