Open mdoering opened 1 year ago
28 sectors are broken 2023-06-01:
@mdoering, I can understand why the IRMNG, PaleoBioDB, Species Fungorum Plus and WoRMS Brachyura sectors are broken (details below). But why do sectors in Lepidoptera break so often (Alucitoidea, Gelechiidae & Nepticuloidea)? Why are sectors in Brassicaceae, Bryonames and Trichomycetes broken if nobody has touched these datasets since the last sync? Could you please investigate what caused the problem? Why is sector management in CLB still not stable?
IRMNG: CoL uses version Mar 2018 / 2018-03-20, not version 2023-05-19 / 2023-05-19, which is what CLB has now.
PaleoBioDB: CoL uses version Feb 2018 / 2018-02-16, not version 2022-03-01 / 2022-03-01, which is what CLB has now.
Species Fungorum Plus: CoL uses version Jan 2023 / 2023-01-17, where the taxa Bigyra, Cercozoa & Oomycota are not present (they were preserved from version Feb 2020 / 2020-02-14).
WoRMS Brachyura: we know of a problem with its import, and it was switched off deliberately.
Not sure what "broken" means, but if there is anything I should look at from the IRMNG content aspect (not the harvesting/tech side) let me know...
Tony, everything is fine with IRMNG, except that CoL uses IRMNG data from Mar 2018, while CLB has only the latest version of 2023-05-19 available. The links are, obviously, broken.
@mdoering & @thomasstjerne, I bring up the same problem again: sectors are shown as broken in the CoL project - Sectors report, but they appear healthy in the Assembly.
@yroskov you can stick with old data by just not running any sync on a sector like the Chromista ones. But I would still recommend to make sure all releases have no broken sectors in any way.
We try to continuously upgrade, and add to holdings in Chromista as well as elsewhere, so just sticking with old data to avoid technical issues would be a bit suboptimal from where I sit :)
Both subject (the link to the source dataset) and target (link to the attachment point in the project itself) can be broken. Often the subject breaks because of ID changes.
With the Lepidoptera it is the target that is broken. That seems to happen because it is a nested sector inside the GLI, which is the source for the Lepidoptera order. Whenever GLI is synced (which happened 5 times in May), the identifier for Lepidoptera changes and the nested sectors have broken targets. BUT - at the end of every sync all nested "child" sectors should have their targets rematched. This is in the sync logs though:
Loaded 7 sectors targeting taxa from sector 3:1738
...
Child sector Sector{1752, datasetKey=3, mode=ATTACH, subjectDatasetKey=1199, subject=ACCEPTED SUPERFAMILY Pterophoroidea Kuznetzov & Stekolnikov, 1979 [5 parent=4]} cannot be rematched to synced sector Sector{1738, datasetKey=3, mode=ATTACH, subjectDatasetKey=55434, subject=ACCEPTED ORDER Lepidoptera [233256]} - lost ORDER Lepidoptera
Pterophoroidea is the only sector out of 7 that failed. Can't really say why that is. The child sectors as of today are these 10:
  id  | subject_dataset_key |  subject_name  |   target_name
------+---------------------+----------------+-----------------
 1761 |                1031 | Tineidae       | Tineoidea
 1746 |                1046 | Papilionidae   | Papilionoidea
 1745 |                1046 | Pieridae       | Papilionoidea
 1820 |                1049 | Gracillariidae | Gracillarioidea
 1753 |                1172 | Nepticuloidea  | Lepidoptera
 1752 |                1199 | Pterophoroidea | Lepidoptera
 1751 |                2207 | Alucitoidea    | Lepidoptera
 1750 |                2362 | Gelechiidae    | Gelechioidea
 1831 |               55353 | Sesiidae       | Cossoidea
 1812 |              219318 | Tortricidae    | Tortricoidea
(10 rows)
There are also many species estimates which could not be rematched and got lost/broken in the logs. For example:
SpeciesEstimate Estimate{115: 30 species in FAMILY Dicksoniaceae} from project 3 cannot be rematched to dataset 3 - lost FAMILY Dicksoniaceae
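For anyone watching the sync logs for this, here is a minimal Python sketch that pulls the lost sector or estimate out of such messages. The pattern is inferred only from the two log lines quoted in this thread, so the real log format may vary; `LOST` and `parse_lost` are made-up names, not part of ChecklistBank.

```python
import re

# Pattern inferred from the two log messages quoted in this thread (assumption).
LOST = re.compile(
    r"^(?P<kind>Child sector Sector|SpeciesEstimate Estimate)\{(?P<key>\d+)"
    r".* cannot be rematched to .* - lost (?P<rank>[A-Z]+) (?P<name>.+)$"
)


def parse_lost(line):
    """Return a dict describing the lost link, or None if the line does not match."""
    m = LOST.match(line.strip())
    if m is None:
        return None
    kind = "sector" if m["kind"].startswith("Child") else "estimate"
    return {"kind": kind, "key": int(m["key"]), "rank": m["rank"], "name": m["name"]}


example = (
    "SpeciesEstimate Estimate{115: 30 species in FAMILY Dicksoniaceae} "
    "from project 3 cannot be rematched to dataset 3 - lost FAMILY Dicksoniaceae"
)
lost = parse_lost(example)
```

Running `parse_lost` over every message of a sync attempt would give a quick list of which sectors and estimates need a manual rematch.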
@yroskov can you keep an eye on Donald's sectors and report to me when they break? Especially when you sync GLI please report what happens to the child sectors!
I believe this is for content reasons, not technical, that @yroskov prefers to stick with the old data?
We try to continuously upgrade, and add to holdings in Chromista as well as elsewhere, so just sticking with old data to avoid technical issues would be a bit suboptimal from where I sit :)
> @yroskov can you keep an eye on Donald's sectors and report to me when they break? Especially when you sync GLI please report what happens to the child sectors!
@mdoering, I am always reporting such findings in my GSD test reports. However, it would be more informative, if you find and investigate such cases in the logs of your software.
> With the Lepidoptera it is the target that is broken. That seems to happen because it is a nested sector inside the GLI, which is the source for the Lepidoptera order.
@gdower, perhaps this is a matter for TaxonWorks and the pipeline script. Is the ID for Lepidoptera the same from one import to another, or different? Generally: how stable are taxon IDs in the export (or name IDs, or relationship IDs, whatever is used in sector attachment links)? How do the TaxonWorks protocols fit the checklistbank protocols?
Perhaps @dhobern's review of this problem, and of the interaction between the CoL management classification and GSDs at the level of sectors, would be helpful. What needs to be done to reach stability with sectors in the project? (Sectors should not break if the name and rank of the sector taxon remain the same in each GSD import.)
There seems to be a misunderstanding about what broken actually means. The true link of a sector is via the id of the taxon. All other properties such as name, authorship, rank and direct parent are metadata used to reestablish the link, i.e. do a rematch.
If the ids change on target or subject side, the sector is broken and needs to be rematched. This is also true for decisions and estimates. Every sync in a project changes all identifiers of its taxa as these are temporary UUIDs. The stable ids only get created upon release time. The rematching of the target side of sectors does happen automatically though for the nested child sectors, so unless there are exact duplicates there should be no broken nested sectors. If there are 2 Lepidoptera orders though the sector will remain broken after a sync.
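To make the point about ids versus metadata concrete, here is a minimal Python sketch of a target rematch. It is an illustration under simplifying assumptions, not the ChecklistBank implementation: `Taxon`, `Sector`, `resync` and `rematch_target` are made-up names, and the sketch matches on name and rank only, whereas the real rematch also considers authorship and the direct parent.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    id: str    # temporary project id, regenerated by every sync
    name: str
    rank: str


@dataclass
class Sector:
    key: int
    target_id: str     # the actual link
    target_name: str   # metadata, used only to re-establish the link
    target_rank: str


def resync(taxa):
    """Simulate a sync: same taxa, but each one gets a fresh temporary UUID."""
    return [Taxon(str(uuid.uuid4()), t.name, t.rank) for t in taxa]


def rematch_target(sector, taxa):
    """Try to relink the sector target; return True if the sector is healthy."""
    if any(t.id == sector.target_id for t in taxa):
        return True  # the id still resolves, nothing to do
    candidates = [t for t in taxa
                  if t.name == sector.target_name and t.rank == sector.target_rank]
    if len(candidates) == 1:
        sector.target_id = candidates[0].id  # unique match: link restored
        return True
    return False  # no candidate, or ambiguous duplicates: sector stays broken
```

With a single Lepidoptera order in the project, `rematch_target` restores the link after every `resync`. With two Lepidoptera orders the candidate list has two entries, so the sector stays broken until someone reassigns the target by hand.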
OK, that is how CLB works now. As a user, I would like to see the CoL project - Sectors report and the Assembly tool synchronized. If a sector is detected as broken, it should be flagged as broken in both places. (You see, both the Sectors report and the Assembly tool have a "Sync" button, i.e. identical functionality is implemented in both places. The health status should be identical too.)
> The stable ids only get created upon release time
@gdower, let's take it as a note for further discussion on usability of CoL services for clients in TW.
@dhobern, are you deleting TaxonNames and Otus within TaxonWorks and re-importing them? That would result in the TaxonWorks IDs changing. They should otherwise be stable within TaxonWorks. Or are you working with a local development copy of TaxonWorks? That could also change the IDs if you frequently rebuild the project(s).
Btw, WoRMS Xenoturbellida does not show any taxonomic coverage in the May release. Likely the broken sectors are causing this. Well, that particular WoRMS source was already empty in AC 2021: https://www.checklistbank.org/dataset/2328/sourcemetrics
> @dhobern, are you deleting TaxonNames and Otus within TaxonWorks and re-importing them? That would result in the TaxonWorks IDs changing. They should otherwise be stable within TaxonWorks. Or are you working with a local development copy of TaxonWorks? That could also change the IDs if you frequently rebuild the project(s).
... the broken Lepidoptera target has nothing to do with source identifiers, it is about the project ids. When saying a sector is broken we should be explicit which side is broken - the subject (source dataset) or target (project attachment). In the case of Donald's lists it was always the target order.
For @mdoering's attention:
Four sectors were "successfully" fixed yesterday via the Rematch button, but are shown as broken again today:
@mdoering & @dhobern (and @gdower), just to let you know:
after I synced GLI on 2023-06-02, the report on broken sectors in Lepidoptera includes 4 GSDs:
All sectors were healthy before that sync.
@mdoering, it seems there is a systemic problem with nested sectors.
Thanks, @mdoering and @gdower - I have not knowingly changed ids in TaxonWorks or in my own datasets. There are certainly cases where I delete duplicate records in TW but not in a way that I believe should cause the id to be reused. In my own datasets, I guess I could have made some mistake that changed ids, but nothing I can think of. If the issue is with the identifier for Lepidoptera itself, I've not consciously done anything with any of the dataset identifiers for the order.
I've been making things hard by uploading refreshed versions of these datasets several times each month. I can change that if it makes things easier or more stable.
> after I synced GLI on 2023-06-02, the report on broken sectors in Lepidoptera includes 4 GSDs:
Thanks Yuri. I'll look into this, it is not expected behavior. But good to know the exact time and sync that the problem has appeared. It will allow for better debugging.
The sync metrics actually already mention the problem as a warning: https://api.dev.checklistbank.org/dataset/3/sector/1738/sync/17
we just don't show that in the UI yet which we should: https://github.com/CatalogueOfLife/checklistbank/issues/1244
The problem comes from an authorship change of Lepidoptera. It used to be without authorship, but now uses Linnaeus, 1758.
I have modified the software to allow changes from no authorship to some authorship and vice versa. And to use a more normalised authorship (case, whitespace, unicode and punctuation insensitive) for comparison.
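As an illustration of what such a comparison can look like, here is a minimal Python sketch. It is not the actual ChecklistBank code; `normalise_authorship` and `authorship_matches` are made-up names, and the exact normalisation rules used in the software may differ.

```python
import unicodedata


def normalise_authorship(authorship):
    """Reduce an authorship string to a case-, whitespace-, unicode- and
    punctuation-insensitive key."""
    if not authorship:
        return ""
    # Decompose accented characters and drop the combining marks (ö -> o).
    decomposed = unicodedata.normalize("NFKD", authorship)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Case-fold and keep only letters and digits (drops whitespace and punctuation).
    return "".join(c for c in no_marks.casefold() if c.isalnum())


def authorship_matches(a, b):
    """Treat a missing authorship on either side as compatible with anything."""
    na, nb = normalise_authorship(a), normalise_authorship(b)
    if not na or not nb:
        return True
    return na == nb
```

Under these rules `authorship_matches("", "Linnaeus, 1758")` is true, which is the Lepidoptera case from this issue, while two different non-empty authorships still count as a mismatch.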
This basically means watching out for broken sectors is much more needed. And if that's the case, it should be fixed with a manual reassignment of the target taxon. Please DO NOT DELETE and recreate such sectors.
I have manually reassigned 4 nested Lep sectors with broken targets and different authorship
inspired by https://github.com/CatalogueOfLife/data/issues/534 I strongly think all broken sectors should be fixed before the AC release: https://www.checklistbank.org/catalogue/3/sector?broken=true&limit=100&offset=0
It would actually be a good procedure to make sure all sectors are fixed in every release.