CatalogueOfLife / data

Repository for COL content

Fix all broken sectors before AC release #535

Open mdoering opened 1 year ago

mdoering commented 1 year ago

inspired by https://github.com/CatalogueOfLife/data/issues/534 I strongly think all broken sectors should be fixed before the AC release: https://www.checklistbank.org/catalogue/3/sector?broken=true&limit=100&offset=0

It would actually be a good procedure to make sure all sectors are fixed in every release.

yroskov commented 1 year ago

28 sectors are broken 2023-06-01:

  1. Alucitoidea superfamily: Alucitoidea order: Lepidoptera
  2. Brassicaceae family: Brassicaceae order: Brassicales
  3. Bryonames order: Incertae sedis subkingdom: Bryobiotina
  4. Gelechiidae family: Gelechiidae Gelechioidea
  5. IRMNG class: Bolidophyceae phylum: Ochrophyta
  6. IRMNG class: Chrysomerophyceae phylum: Ochrophyta
  7. IRMNG class: Chrysophyceae phylum: Ochrophyta
  8. IRMNG class: Dictyochophyceae phylum: Ochrophyta
  9. IRMNG class: Eustigmatophyceae phylum: Ochrophyta
  10. IRMNG class: Phaeophyceae phylum: Ochrophyta
  11. IRMNG class: Phaeothamniophyceae phylum: Ochrophyta
  12. IRMNG class: Picophagophyceae phylum: Ochrophyta
  13. IRMNG class: Pinguiophyceae phylum: Ochrophyta
  14. IRMNG class: Raphidophyceae phylum: Ochrophyta
  15. IRMNG class: Schizocladiophyceae phylum: Ochrophyta
  16. IRMNG class: Xanthophyceae phylum: Ochrophyta
  17. IRMNG order: Againococcidiida phylum: Miozoa
  18. Nepticuloidea superfamily: Nepticuloidea order: Lepidoptera
  19. PaleoBioDB class: Trilobita phylum: Arthropoda
  20. PaleoBioDB order: Ammonoidea class: Cephalopoda
  21. PaleoBioDB order: Belemnitida class: Cephalopoda
  22. Species Fungorum Plus phylum: Bigyra kingdom: Chromista
  23. Species Fungorum Plus phylum: Cercozoa kingdom: Chromista
  24. Species Fungorum Plus phylum: Oomycota kingdom: Chromista
  25. Trichomycetes class: Ichthyosporea phylum: Choanozoa
  26. Trichomycetes genus: Amoebosporus phylum: Choanozoa
  27. WoRMS brachyura section: Eubrachyura infraorder: Brachyura
  28. WoRMS brachyura section: Podotremata infraorder: Brachyura

@mdoering, I can understand the reasons why the IRMNG, PaleoBioDB, Species Fungorum Plus and WoRMS Brachyura sectors are broken (see the notes below). But why do sectors in Lepidoptera become broken so often (Alucitoidea, Gelechiidae & Nepticuloidea)? Why are sectors in Brassicaceae, Bryonames and Trichomycetes broken if nobody has touched these datasets since the last sync? Could you please investigate what caused the problem? Why is sector management in CLB still not stable?

IRMNG = CoL uses version Mar 2018 / 2018-03-20, not version 2023-05-19 / 2023-05-19 as it is in CLB now
PaleoBioDB = CoL uses version Feb 2018 / 2018-02-16, not version 2022-03-01 / 2022-03-01 as it is in CLB now
Species Fungorum Plus = CoL uses version Jan 2023 / 2023-01-17, where the taxa Bigyra, Cercozoa & Oomycota are not present (they were preserved from version Feb 2020 / 2020-02-14)
WoRMS Brachyura = we know of a problem with its import, and it was switched off deliberately

TonyRees commented 1 year ago

Not sure what "broken" means, but if there is anything I should look at from the IRMNG content aspect (not the harvesting/tech side) let me know...

yroskov commented 1 year ago

Tony, all is fine with IRMNG, except for the fact that CoL uses IRMNG data from Mar 2018, and CLB only has the latest version of 2023-05-19 available. The links are, obviously, broken.

yroskov commented 1 year ago

@mdoering & @thomasstjerne, I am bringing up the same problem again: sectors are shown as broken in the CoL project - Sectors report, but they appear as healthy in the Assembly.

[screenshots]

mdoering commented 1 year ago

@yroskov you can stick with old data by just not running any sync on a sector like the Chromista ones. But I would still recommend making sure that no release has broken sectors of any kind.

TonyRees commented 1 year ago

We try to continuously upgrade, and add to holdings in Chromista as well as elsewhere, so just sticking with old data to avoid technical issues would be a bit suboptimal from where I sit :)


mdoering commented 1 year ago

Both subject (the link to the source dataset) and target (link to the attachment point in the project itself) can be broken. Often the subject breaks because of ID changes.
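To make the two sides concrete, here is a minimal sketch in Python, assuming a much simplified sector record. The `Sector` class, its field names and the `broken_sides` helper are hypothetical illustrations and not the ChecklistBank data model or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Sector:
    """Hypothetical, simplified sector record (not the real ChecklistBank model)."""
    key: int
    subject_id: Optional[str]  # taxon id in the source dataset
    target_id: Optional[str]   # taxon id of the attachment point in the project


def broken_sides(sector: Sector, source_ids: Set[str], project_ids: Set[str]) -> List[str]:
    """Return which links of a sector no longer resolve: subject, target, both or none."""
    sides = []
    if sector.subject_id not in source_ids:
        sides.append("subject")
    if sector.target_id not in project_ids:
        sides.append("target")
    return sides


# The source still has its taxon, but the project id changed after a sync.
sector = Sector(key=1752, subject_id="5", target_id="old-uuid")
print(broken_sides(sector, source_ids={"5"}, project_ids={"new-uuid"}))  # ['target']
```

Reporting the side explicitly, rather than a single "broken" flag, shows at once whether the source dataset or the project attachment needs attention.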

With the Lepidoptera it is the target which is broken. That seems to happen because it is a nested sector inside the GLI, which is the source for the Lepidoptera order. Whenever GLI is synced (which happened 5 times in May), the identifier for Lepidoptera changes and the nested sectors have broken targets. BUT - at the end of every sync all nested "child" sectors should have their targets rematched. This is in the sync logs though:

Loaded 7 sectors targeting taxa from sector 3:1738
...
Child sector Sector{1752, datasetKey=3, mode=ATTACH, subjectDatasetKey=1199, subject=ACCEPTED SUPERFAMILY Pterophoroidea Kuznetzov & Stekolnikov, 1979 [5 parent=4]} cannot be rematched to synced sector Sector{1738, datasetKey=3, mode=ATTACH, subjectDatasetKey=55434, subject=ACCEPTED ORDER Lepidoptera [233256]} - lost ORDER Lepidoptera

Pterophoroidea is the only sector out of 7 that failed. I can't really say why that is. The child sectors as of today are these 10:

  id  | subject_dataset_key |  subject_name  |   target_name
------+---------------------+----------------+-----------------
 1761 |                1031 | Tineidae       | Tineoidea
 1746 |                1046 | Papilionidae   | Papilionoidea
 1745 |                1046 | Pieridae       | Papilionoidea
 1820 |                1049 | Gracillariidae | Gracillarioidea
 1753 |                1172 | Nepticuloidea  | Lepidoptera
 1752 |                1199 | Pterophoroidea | Lepidoptera
 1751 |                2207 | Alucitoidea    | Lepidoptera
 1750 |                2362 | Gelechiidae    | Gelechioidea
 1831 |               55353 | Sesiidae       | Cossoidea
 1812 |              219318 | Tortricidae    | Tortricoidea
(10 rows)

There are also many species estimates which could not be rematched and got lost/broken in the logs. For example:

SpeciesEstimate Estimate{115: 30 species in FAMILY Dicksoniaceae} from project 3 cannot be rematched to dataset 3 - lost FAMILY Dicksoniaceae

mdoering commented 1 year ago

@yroskov can you keep an eye on Donald's sectors and report to me when they break? Especially when you sync GLI, please report what happens to the child sectors!

mdoering commented 1 year ago

I believe this is for content reasons, not technical, that @yroskov prefers to stick with the old data?

> We try to continuously upgrade, and add to holdings in Chromista as well as elsewhere, so just sticking with old data to avoid technical issues would be a bit suboptimal from where I sit :)

yroskov commented 1 year ago

> @yroskov can you keep an eye on Donald's sectors and report to me when they break? Especially when you sync GLI, please report what happens to the child sectors!

@mdoering, I always report such findings in my GSD test reports. However, it would be more informative if you could find and investigate such cases in the logs of your software.

yroskov commented 1 year ago

> With the Lepidoptera it is the target which is broken. That seems to happen because it is a nested sector inside the GLI, which is the source for the Lepidoptera order.

@gdower, perhaps this is a subject for TaxonWorks and a pipeline script. Is the ID for Lepidoptera the same from one import to another, or different? More generally: how stable are taxon IDs in the export (or name IDs? or relationship IDs? - whatever is used in sector attachment links)? How do TaxonWorks protocols fit with the ChecklistBank protocols?

Perhaps a review by @dhobern of this problem, and of the interaction between the CoL management classification and GSDs at the level of sectors, would be helpful. What needs to be done to reach stability with sectors in the project? (Sectors should not break if the name and rank of the sector taxon remain the same in each GSD import.)

mdoering commented 1 year ago

There seems to be a misunderstanding about what "broken" actually means. The true link of a sector is via the id of the taxon. All other properties such as name, authorship, rank and direct parent are metadata used to reestablish the link, i.e. to do a rematch.

If the ids change on target or subject side, the sector is broken and needs to be rematched. This is also true for decisions and estimates. Every sync in a project changes all identifiers of its taxa as these are temporary UUIDs. The stable ids only get created upon release time. The rematching of the target side of sectors does happen automatically though for the nested child sectors, so unless there are exact duplicates there should be no broken nested sectors. If there are 2 Lepidoptera orders though the sector will remain broken after a sync.
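The rematch idea can be sketched as follows. This is a simplified illustration and not the ChecklistBank implementation: the `Taxon` class and the `rematch` function are hypothetical, and the sketch compares only name and rank, whereas the description above also mentions authorship and direct parent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Taxon:
    """Hypothetical, simplified taxon record."""
    id: str
    name: str
    rank: str


def rematch(name: str, rank: str, taxa: List[Taxon]) -> Optional[str]:
    """Return the id of the single taxon matching name and rank, else None.

    None means the link stays broken: either nothing matched, or the match
    was ambiguous (for example two Lepidoptera orders in the project).
    """
    candidates = [t for t in taxa if t.name == name and t.rank == rank]
    if len(candidates) == 1:
        return candidates[0].id
    return None


# After a sync the project taxa carry fresh temporary ids.
project = [
    Taxon("uuid-1", "Lepidoptera", "order"),
    Taxon("uuid-2", "Tineoidea", "superfamily"),
]
print(rematch("Lepidoptera", "order", project))  # uuid-1

# With an exact duplicate the match is ambiguous and the sector stays broken.
project.append(Taxon("uuid-3", "Lepidoptera", "order"))
print(rematch("Lepidoptera", "order", project))  # None
```

Refusing to pick one of several candidates keeps the sketch conservative: an ambiguous target is left for manual reassignment instead of being attached to an arbitrary duplicate.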

yroskov commented 1 year ago

OK. That is how CLB works now. As a user, I would like the CoL project - Sectors report and the Assembly tool to be kept in sync. If a sector is detected as broken, it should be flagged as broken in both places. (You see, both the Sectors report and the Assembly tool have a "Sync" button, i.e. identical functionality is implemented in both places. The health status should be identical too.)

yroskov commented 1 year ago

> The stable ids only get created upon release time

@gdower, let's take it as a note for further discussion on usability of CoL services for clients in TW.

gdower commented 1 year ago

@dhobern, are you deleting TaxonNames and Otus within TaxonWorks and re-importing them? That would result in the TaxonWorks IDs changing. They should otherwise be stable within TaxonWorks. Or are you working with a local development copy of TaxonWorks? That could also change the IDs if you frequently rebuild the project(s).

mdoering commented 1 year ago

Btw, WoRMS Xenoturbellida does not show any taxonomic coverage in the May release. Likely the broken sectors are causing this. Well, that particular WoRMS source was already empty in AC 2021: https://www.checklistbank.org/dataset/2328/sourcemetrics

mdoering commented 1 year ago

> @dhobern, are you deleting TaxonNames and Otus within TaxonWorks and re-importing them? That would result in the TaxonWorks IDs changing. They should otherwise be stable within TaxonWorks. Or are you working with a local development copy of TaxonWorks? That could also change the IDs if you frequently rebuild the project(s).

... the broken Lepidoptera target has nothing to do with source identifiers, it is about the project ids. When saying a sector is broken we should be explicit about which side is broken - the subject (source dataset) or the target (project attachment). In the case of Donald's lists it was always the target order.

yroskov commented 1 year ago

For @mdoering attention:

Four sectors were "successfully" fixed yesterday via the Rematch button, but are shown today as broken again:

yroskov commented 1 year ago

@mdoering & @dhobern (and @gdower), just to let you know:

after I synced GLI on 2023-06-02, the report on broken sectors in Lepidoptera includes 4 GSDs:

[screenshot]

All sectors were healthy before that sync.

@mdoering, it seems there is a systematic problem with sectors nested inside other sectors.

dhobern commented 1 year ago

Thanks, @mdoering and @gdower - I have not knowingly changed ids in TaxonWorks or in my own datasets. There are certainly cases where I delete duplicate records in TW but not in a way that I believe should cause the id to be reused. In my own datasets, I guess I could have made some mistake that changed ids, but nothing I can think of. If the issue is with the identifier for Lepidoptera itself, I've not consciously done anything with any of the dataset identifiers for the order.

I've been making things hard by uploading refreshed versions of these datasets several times each month. I can change that if it makes things easier or more stable.

mdoering commented 1 year ago

> after I synced GLI on 2023-06-02, the report on broken sectors in Lepidoptera includes 4 GSDs:

Thanks Yuri. I'll look into this, it is not expected behavior. But good to know the exact time and sync that the problem has appeared. It will allow for better debugging.

mdoering commented 1 year ago

The sync metrics actually already mention the problem as a warning: https://api.dev.checklistbank.org/dataset/3/sector/1738/sync/17

We just don't show that in the UI yet, which we should: https://github.com/CatalogueOfLife/checklistbank/issues/1244

mdoering commented 1 year ago

The problem comes from an authorship change of Lepidoptera. It used to be without authorship, but now uses Linnaeus, 1758. I have modified the software to allow changes from no authorship to some authorship and vice versa. And to use a more normalised authorship (case, whitespace, unicode and punctuation insensitive) for comparison.
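As a rough illustration of such a comparison, here is a sketch in Python. It is one reading of the description above and not the actual ChecklistBank code; the function names `normalise_authorship` and `authorship_compatible` are made up for this example.

```python
import unicodedata
from typing import Optional


def normalise_authorship(authorship: Optional[str]) -> str:
    """Strip accents, drop whitespace and punctuation, and fold case."""
    if not authorship:
        return ""
    # NFKD splits accented letters into a base letter plus combining marks;
    # the marks are not alphanumeric, so the filter below removes them.
    decomposed = unicodedata.normalize("NFKD", authorship)
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()


def authorship_compatible(a: Optional[str], b: Optional[str]) -> bool:
    """True if either side lacks authorship or both normalise to the same string."""
    na, nb = normalise_authorship(a), normalise_authorship(b)
    return not na or not nb or na == nb


print(authorship_compatible(None, "Linnaeus, 1758"))                # True
print(authorship_compatible("Linnaeus, 1758", "LINNAEUS  1758"))    # True
print(authorship_compatible("Linnaeus, 1758", "Fabricius, 1775"))   # False
```

Treating a missing authorship as compatible with any authorship covers the Lepidoptera case described here, where the order went from no authorship to Linnaeus, 1758.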

This basically means that watching out for broken sectors is needed all the more. If a sector is broken, it should be fixed by manually reassigning the target taxon. Please DO NOT DELETE and recreate such sectors.

[screenshot]

mdoering commented 1 year ago

I have manually reassigned 4 nested Lep sectors with broken targets and different authorship