I do indeed see lots of broken sectors because the "subject_id" is wrong. Apart from WSC, all of these sources were imported recently, in February:
col=> select subject_dataset_key, d.alias, d.attempt, di.started, count(*)
      from sector s
      left join name_usage sub on sub.dataset_key = subject_dataset_key and subject_id = sub.id
      join dataset d on d.key = subject_dataset_key
      left join dataset_import di on di.dataset_key = d.key and di.attempt = d.attempt
      where s.dataset_key = 3 and sub.id is null
      group by 1, 2, 3, 4
      order by 3 desc
      limit 50;
subject_dataset_key | alias | attempt | started | count
---------------------+------------------------+---------+----------------------------+-------
1199 | Pterophoroidea | 96 | 2022-02-24 00:02:43.970783 | 1
2207 | Alucitoidea | 60 | 2022-02-03 05:22:42.831697 | 1
2144 | ITIS | 48 | 2022-02-15 19:49:01.329855 | 92
1049 | Global Gracillariidae | 32 | 2022-02-16 09:31:48.267905 | 1
1175 | WoRMS Ostracoda | 28 | 2022-02-15 18:25:05.652115 | 1
1179 | WoRMS Ceriantharia | 27 | 2022-02-15 18:24:32.621959 | 1
1059 | WoRMS Ophiuroidea | 26 | 2022-02-15 18:10:16.692917 | 1
1186 | WoRMS Ascidiacea | 26 | 2022-02-15 18:23:17.990865 | 1
1176 | WoRMS Actiniaria | 25 | 2022-02-15 18:24:39.99811 | 1
1178 | WoRMS Appendicularia | 25 | 2022-02-15 18:24:36.445693 | 1
1182 | WoRMS Loricifera | 25 | 2022-02-15 18:24:22.301926 | 1
1183 | WoRMS Pycnogonida | 25 | 2022-02-15 18:24:05.306712 | 1
1196 | WoRMS Scleractinia | 25 | 2022-02-15 18:17:45.157042 | 1
1152 | WoRMS Merostomata | 24 | 2022-02-15 18:26:11.140514 | 1
1154 | WoRMS Cephalochordata | 24 | 2022-02-15 18:26:01.445245 | 1
1194 | WoRMS Antipatharia | 24 | 2022-02-15 18:19:55.244132 | 1
1195 | WoRMS Corallimorpharia | 24 | 2022-02-15 18:19:52.431902 | 1
1197 | WoRMS Zoantharia | 24 | 2022-02-15 18:17:41.019743 | 1
1200 | WoRMS MilliBase | 24 | 2022-02-15 18:16:30.639108 | 3
1202 | WoRMS Amphipoda | 24 | 2022-02-15 18:14:52.17531 | 1
1107 | WoRMS Holothuroidea | 23 | 2022-02-15 18:30:25.682188 | 1
1124 | WoRMS Priapulida | 23 | 2022-02-15 18:28:44.996292 | 1
1128 | WoRMS Trematoda | 23 | 2022-02-15 18:27:26.237415 | 1
1150 | WoRMS Rhombozoa | 23 | 2022-02-15 18:26:13.345288 | 1
1153 | WoRMS Kinorhyncha | 23 | 2022-02-15 18:26:03.838704 | 1
1185 | WoRMS Thaliacea | 23 | 2022-02-15 18:23:58.641088 | 1
1191 | WoRMS Copepoda | 23 | 2022-02-15 18:22:09.251895 | 1
1193 | WoRMS Turbellarians | 23 | 2022-02-15 18:20:00.025736 | 2
1029 | WSC | 22 | 2020-07-31 20:13:23.419361 | 1
1058 | WoRMS Cumacea | 22 | 2022-02-15 18:09:58.309352 | 1
1088 | WoRMS Mystacocarida | 22 | 2022-02-15 18:35:52.388221 | 1
1100 | WoRMS Xenoturbellida | 22 | 2022-02-15 18:32:19.173297 | 1
1105 | WoRMS Leptostraca | 22 | 2022-02-15 18:32:08.129 | 1
1106 | WoRMS Echinoidea | 22 | 2022-02-15 18:30:58.070993 | 1
1109 | WoRMS Polycystina | 22 | 2022-02-15 18:30:11.308402 | 1
1126 | WoRMS Monogenea | 22 | 2022-02-15 18:27:52.791916 | 1
1129 | WoRMS Myxozoa | 22 | 2022-02-15 18:27:17.52793 | 1
1130 | WoRMS Mollusca | 22 | 2022-02-15 18:26:38.252602 | 1
1131 | WoRMS Octocorallia | 22 | 2022-02-15 18:26:22.302909 | 1
1086 | WoRMS Bochusacea | 21 | 2022-02-15 18:35:56.392291 | 1
1087 | WoRMS Brachypoda | 21 | 2022-02-15 18:35:54.431652 | 1
1092 | WoRMS Tantulocarida | 21 | 2022-02-15 18:33:40.598148 | 1
1093 | WoRMS Thermosbaenacea | 21 | 2022-02-15 18:33:38.249176 | 1
1095 | WoRMS Asteroidea | 21 | 2022-02-15 18:32:56.434689 | 1
1099 | WoRMS Oligochaeta | 21 | 2022-02-15 18:32:21.250006 | 1
1110 | WoRMS Tanaidacea | 21 | 2022-02-15 18:29:53.748289 | 1
1091 | WoRMS Remipedia | 20 | 2022-02-15 18:33:42.961627 | 1
1103 | WoRMS Strepsiptera | 20 | 2022-02-15 18:32:11.141882 | 1
1127 | WoRMS Cestoda | 20 | 2022-02-15 18:27:26.673489 | 1
1094 | WoRMS Isopoda | 19 | 2022-02-15 18:33:16.369696 | 1
...
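For anyone wanting to reproduce the check outside the production database: the query above is an anti-join, i.e. a LEFT JOIN from sector to name_usage that keeps only the rows where no usage was found. Below is a minimal sketch of that pattern using Python's built-in sqlite3 module. The reduced schema, the sample rows and the helper name broken_sectors are all assumptions made for illustration, not the real ChecklistBank schema or data.

```python
import sqlite3

# Hypothetical, heavily reduced schema: only the columns the anti-join needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE name_usage (dataset_key INTEGER, id TEXT);
    CREATE TABLE sector (id INTEGER, dataset_key INTEGER,
                         subject_dataset_key INTEGER, subject_id TEXT);
    -- Made-up rows: source 2207 holds usage '5', source 2144 holds usage 'a'.
    INSERT INTO name_usage VALUES (2207, '5'), (2144, 'a');
    -- Sector 1 points at an existing usage, sector 2 at a missing one.
    INSERT INTO sector VALUES (1, 3, 2207, '5'), (2, 3, 2144, 'zzz');
""")


def broken_sectors(conn, project_key):
    """Return (subject_dataset_key, count) for sectors whose subject usage is missing."""
    return conn.execute(
        """
        SELECT s.subject_dataset_key, count(*)
        FROM sector s
        LEFT JOIN name_usage sub
               ON sub.dataset_key = s.subject_dataset_key
              AND sub.id = s.subject_id
        WHERE s.dataset_key = ? AND sub.id IS NULL
        GROUP BY 1
        ORDER BY 2 DESC
        """,
        (project_key,),
    ).fetchall()


print(broken_sectors(conn, 3))
```

With the made-up rows, only the sector pointing at the non-existent usage 'zzz' in source 2144 is reported.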
Did the source identifiers change? Did you enable auto rematching?
No, IDs have been stable all the time. For example, Donald's Pterophoroidea:
dataset_key | id | created | subject_id
-------------+------+----------------------------+------------
3 | 1190 | 2021-01-13 17:40:18.741514 |
2237 | 488 | 2019-11-20 11:08:05.945814 | 5
2242 | 488 | 2019-11-20 11:08:05.945814 | 5
2274 | 6 | 2021-09-07 03:45:19.557241 | 5
2296 | 1190 | 2021-01-13 17:40:18.741514 | 5
2303 | 1190 | 2021-01-13 17:40:18.741514 | 5
2315 | 1190 | 2021-01-13 17:40:18.741514 | 5
2328 | 1190 | 2021-01-13 17:40:18.741514 | 5
2332 | 1190 | 2021-01-13 17:40:18.741514 | 5
2344 | 1190 | 2021-01-13 17:40:18.741514 | 5
2349 | 1190 | 2021-01-13 17:40:18.741514 | 5
2351 | 1190 | 2021-01-13 17:40:18.741514 | 5
2366 | 1190 | 2021-01-13 17:40:18.741514 | 5
2368 | 1190 | 2021-01-13 17:40:18.741514 | 5
2370 | 48 | 2022-01-20 04:20:57.678095 | 5
9803 | 1190 | 2021-01-13 17:40:18.741514 | 5
9804 | 1190 | 2021-01-13 17:40:18.741514 | 5
Maybe the matching does something wrong. I will look into this on Monday...
Found a bug in rematching decisions and sectors from projects after a new import of a source. But that still does not explain the entire problem. Rematching the broken Alucitoidea sector manually gives a warning:
Sector Sector{1189, datasetKey=3, mode=ATTACH, subjectDatasetKey=2207, subject=ACCEPTED SUPERFAMILY Alucitoidea Minet, 1986 [ parent=4]} from project 3 cannot be rematched to dataset 2207 - lost ACCEPTED SUPERFAMILY Alucitoidea Minet, 1986 [ parent=4]
That should not happen; there is a clear single matching record...
https://www.checklistbank.org/dataset/2207/taxon/5
{
"created": "2021-01-13T17:35:01.830428",
"createdBy": 102,
"modified": "2022-02-24T16:02:24.583883",
"modifiedBy": 102,
"datasetKey": 3,
"id": 1189,
"target": {
"id": "3f1cc7f0-ff9b-476b-8399-8b40a0f0d8c0",
"name": "Lepidoptera",
"rank": "order",
"broken": false,
"label": "Lepidoptera",
"labelHtml": "Lepidoptera"
},
"subjectDatasetKey": 2207,
"subject": {
"name": "Alucitoidea",
"authorship": "Minet, 1986",
"rank": "superfamily",
"code": "zoological",
"status": "accepted",
"parent": "4",
"broken": true,
"label": "Alucitoidea Minet, 1986",
"labelHtml": "Alucitoidea Minet, 1986"
},
"originalSubjectId": "5",
"mode": "attach",
"syncAttempt": 7,
"size": 480
}
Found it. The matching wrongly compared the parent property as if it were the parent name. Both are now allowed in matching.
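To illustrate the idea of the fix (not the actual ChecklistBank code, which is not shown in this thread), here is a minimal sketch, assuming the stored subject carries a single "parent" string that may hold either the parent's ID or the parent's name. The Usage class, the parent_matches function and the example parent name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    """Hypothetical, reduced view of a candidate name usage in the source."""
    id: str
    name: str
    parent_id: Optional[str] = None
    parent_name: Optional[str] = None


def parent_matches(subject_parent: Optional[str], candidate: Usage) -> bool:
    """Accept the stored parent value if it equals the candidate's parent ID or parent name."""
    if subject_parent is None:
        return True  # nothing stored on the sector, so nothing to compare
    return subject_parent in (candidate.parent_id, candidate.parent_name)


# The parent name below is made up; only parent "4" appears in the sector JSON.
candidate = Usage(id="5", name="Alucitoidea", parent_id="4", parent_name="Example parent")

print(parent_matches("4", candidate))               # stored value is the parent ID
print(parent_matches("Example parent", candidate))  # stored value is the parent name
print(parent_matches("99", candidate))              # neither ID nor name
```

Comparing only against the parent name would reject the first case, which is the kind of false "lost" subject reported in the warning above.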
@yroskov @gdower @thomasstjerne when creating a sector, maybe we should not always add all the subject information, so that matching is not too restrictive? If the author, parent or rank changes we will see broken sectors. But maybe it is fine to then reassign the sector manually, to be sure that those changes are fine. Removing the author or parent when creating a sector in the UI would still be something to consider...
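A rough sketch of the trade-off, assuming a matcher that only compares the fields actually stored on the sector subject. The function find_matches, the dict layout and the changed authorship value are hypothetical and only illustrate the proposal.

```python
def find_matches(subject: dict, candidates: list[dict]) -> list[dict]:
    """Return candidates that agree with every field that is set (not None) on the subject."""
    criteria = {key: value for key, value in subject.items() if value is not None}
    return [
        cand for cand in candidates
        if all(cand.get(key) == value for key, value in criteria.items())
    ]


# Made-up scenario: the source changed the authorship year after the sector was created.
candidates = [
    {"id": "5", "name": "Alucitoidea", "rank": "superfamily", "authorship": "Minet, 1987"},
]

# Subject stored with all details: the authorship change breaks the sector.
strict = {"name": "Alucitoidea", "rank": "superfamily", "authorship": "Minet, 1986"}
# Subject stored without authorship: the sector survives the change.
relaxed = {"name": "Alucitoidea", "rank": "superfamily", "authorship": None}

print(len(find_matches(strict, candidates)))
print(len(find_matches(relaxed, candidates)))
```

The relaxed subject keeps the sector alive, at the price of not noticing the authorship change, which is exactly the point to weigh up.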
2022-03-01:
RematchAllSectors by GSD:
92 in ITIS = FIXED
15 in WCSP = FIXED
8 in WWW = FIXED
7 in 3i Auchenorrhyncha = 1 broken superfamily Cicadoidea in infraorder Cicadomorpha remains
RematchAllSectors in the project: 15 sectors remain broken, of them:
13 in IRMNG = OK
1 in 3i Auchenorrhyncha (superfamily Cicadoidea in infraorder Cicadomorpha) = FIXED (rematched manually)
1 in Global Gracillariidae (family Gracillariidae in superfamily Gracillarioidea) = FIXED (rematched manually)
FIXED
@mdoering, do you know why this happens if I didn't touch these sectors?
Is there any way to prevent such events, at least for those GSDs which we do not update or touch?
Originally posted by @yroskov in https://github.com/CatalogueOfLife/testing/issues/184#issuecomment-1051325465