CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

Why does the CoL project in CLB have 220 broken sectors? #185

Closed yroskov closed 2 years ago

yroskov commented 2 years ago

@mdoering, do you know why this happens, given that I didn't touch these sectors?

Is there any way to prevent such an event, at least for those GSDs which we do not update or touch?

Originally posted by @yroskov in https://github.com/CatalogueOfLife/testing/issues/184#issuecomment-1051325465

mdoering commented 2 years ago

I do indeed see lots of broken sectors because the "subject_id" is wrong. Apart from WSC, all of these sources were imported recently, in February:

col=> select subject_dataset_key, d.alias, d.attempt, di.started, count(*)
      from sector s
      left join name_usage sub on sub.dataset_key=subject_dataset_key and subject_id=sub.id
      join dataset d on d.key=subject_dataset_key
      left join dataset_import di on di.dataset_key=d.key and di.attempt=d.attempt
      where s.dataset_key=3 and sub.id is null
      group by 1,2,3,4
      order by 3 desc
      limit 50;
 subject_dataset_key |         alias          | attempt |          started           | count 
---------------------+------------------------+---------+----------------------------+-------
                1199 | Pterophoroidea         |      96 | 2022-02-24 00:02:43.970783 |     1
                2207 | Alucitoidea            |      60 | 2022-02-03 05:22:42.831697 |     1
                2144 | ITIS                   |      48 | 2022-02-15 19:49:01.329855 |    92
                1049 | Global Gracillariidae  |      32 | 2022-02-16 09:31:48.267905 |     1
                1175 | WoRMS Ostracoda        |      28 | 2022-02-15 18:25:05.652115 |     1
                1179 | WoRMS Ceriantharia     |      27 | 2022-02-15 18:24:32.621959 |     1
                1059 | WoRMS Ophiuroidea      |      26 | 2022-02-15 18:10:16.692917 |     1
                1186 | WoRMS Ascidiacea       |      26 | 2022-02-15 18:23:17.990865 |     1
                1176 | WoRMS Actiniaria       |      25 | 2022-02-15 18:24:39.99811  |     1
                1178 | WoRMS Appendicularia   |      25 | 2022-02-15 18:24:36.445693 |     1
                1182 | WoRMS Loricifera       |      25 | 2022-02-15 18:24:22.301926 |     1
                1183 | WoRMS Pycnogonida      |      25 | 2022-02-15 18:24:05.306712 |     1
                1196 | WoRMS Scleractinia     |      25 | 2022-02-15 18:17:45.157042 |     1
                1152 | WoRMS Merostomata      |      24 | 2022-02-15 18:26:11.140514 |     1
                1154 | WoRMS Cephalochordata  |      24 | 2022-02-15 18:26:01.445245 |     1
                1194 | WoRMS Antipatharia     |      24 | 2022-02-15 18:19:55.244132 |     1
                1195 | WoRMS Corallimorpharia |      24 | 2022-02-15 18:19:52.431902 |     1
                1197 | WoRMS Zoantharia       |      24 | 2022-02-15 18:17:41.019743 |     1
                1200 | WoRMS MilliBase        |      24 | 2022-02-15 18:16:30.639108 |     3
                1202 | WoRMS Amphipoda        |      24 | 2022-02-15 18:14:52.17531  |     1
                1107 | WoRMS Holothuroidea    |      23 | 2022-02-15 18:30:25.682188 |     1
                1124 | WoRMS Priapulida       |      23 | 2022-02-15 18:28:44.996292 |     1
                1128 | WoRMS Trematoda        |      23 | 2022-02-15 18:27:26.237415 |     1
                1150 | WoRMS Rhombozoa        |      23 | 2022-02-15 18:26:13.345288 |     1
                1153 | WoRMS Kinorhyncha      |      23 | 2022-02-15 18:26:03.838704 |     1
                1185 | WoRMS Thaliacea        |      23 | 2022-02-15 18:23:58.641088 |     1
                1191 | WoRMS Copepoda         |      23 | 2022-02-15 18:22:09.251895 |     1
                1193 | WoRMS Turbellarians    |      23 | 2022-02-15 18:20:00.025736 |     2
                1029 | WSC                    |      22 | 2020-07-31 20:13:23.419361 |     1
                1058 | WoRMS Cumacea          |      22 | 2022-02-15 18:09:58.309352 |     1
                1088 | WoRMS Mystacocarida    |      22 | 2022-02-15 18:35:52.388221 |     1
                1100 | WoRMS Xenoturbellida   |      22 | 2022-02-15 18:32:19.173297 |     1
                1105 | WoRMS Leptostraca      |      22 | 2022-02-15 18:32:08.129    |     1
                1106 | WoRMS Echinoidea       |      22 | 2022-02-15 18:30:58.070993 |     1
                1109 | WoRMS Polycystina      |      22 | 2022-02-15 18:30:11.308402 |     1
                1126 | WoRMS Monogenea        |      22 | 2022-02-15 18:27:52.791916 |     1
                1129 | WoRMS Myxozoa          |      22 | 2022-02-15 18:27:17.52793  |     1
                1130 | WoRMS Mollusca         |      22 | 2022-02-15 18:26:38.252602 |     1
                1131 | WoRMS Octocorallia     |      22 | 2022-02-15 18:26:22.302909 |     1
                1086 | WoRMS Bochusacea       |      21 | 2022-02-15 18:35:56.392291 |     1
                1087 | WoRMS Brachypoda       |      21 | 2022-02-15 18:35:54.431652 |     1
                1092 | WoRMS Tantulocarida    |      21 | 2022-02-15 18:33:40.598148 |     1
                1093 | WoRMS Thermosbaenacea  |      21 | 2022-02-15 18:33:38.249176 |     1
                1095 | WoRMS Asteroidea       |      21 | 2022-02-15 18:32:56.434689 |     1
                1099 | WoRMS Oligochaeta      |      21 | 2022-02-15 18:32:21.250006 |     1
                1110 | WoRMS Tanaidacea       |      21 | 2022-02-15 18:29:53.748289 |     1
                1091 | WoRMS Remipedia        |      20 | 2022-02-15 18:33:42.961627 |     1
                1103 | WoRMS Strepsiptera     |      20 | 2022-02-15 18:32:11.141882 |     1
                1127 | WoRMS Cestoda          |      20 | 2022-02-15 18:27:26.673489 |     1
                1094 | WoRMS Isopoda          |      19 | 2022-02-15 18:33:16.369696 |     1
...
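The query above is an anti-join: it keeps only those sectors whose subject_id finds no name usage in the source dataset. A minimal sketch of the same idea, using Python's sqlite3 with a simplified, hypothetical two-table schema and toy rows (not the real ChecklistBank schema or data), might look like this:

```python
import sqlite3


def count_broken_sectors(conn, project_key):
    """Count, per source dataset, the sectors whose subject_id has no matching usage."""
    sql = """
        SELECT s.subject_dataset_key, count(*)
        FROM sector s
        LEFT JOIN name_usage sub
               ON sub.dataset_key = s.subject_dataset_key
              AND sub.id = s.subject_id
        WHERE s.dataset_key = ?
          AND sub.id IS NULL          -- no usage found: the sector is broken
        GROUP BY s.subject_dataset_key
        ORDER BY s.subject_dataset_key
    """
    return conn.execute(sql, (project_key,)).fetchall()


def demo():
    # Toy data only; table layout and rows are assumptions for illustration.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE name_usage (dataset_key INTEGER, id TEXT);
        CREATE TABLE sector (dataset_key INTEGER, id INTEGER,
                             subject_dataset_key INTEGER, subject_id TEXT);
        INSERT INTO name_usage VALUES (2207, '5'), (1199, '7');
        INSERT INTO sector VALUES (3, 1, 2207, '5');   -- subject resolves
        INSERT INTO sector VALUES (3, 2, 1199, 'x1');  -- dangling subject_id
        INSERT INTO sector VALUES (3, 3, 1199, 'x2');  -- dangling subject_id
    """)
    return count_broken_sectors(conn, 3)
```

With the toy rows, only the two sectors pointing at non-existent IDs in dataset 1199 are reported; the sector whose subject resolves is filtered out by the `IS NULL` condition.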

Did the source identifiers change? Did you enable auto rematching?

mdoering commented 2 years ago

No, IDs have been stable all the time. For example, Donald's Pterophoroidea:

 dataset_key |  id  |          created           | subject_id 
-------------+------+----------------------------+------------
           3 | 1190 | 2021-01-13 17:40:18.741514 | 
        2237 |  488 | 2019-11-20 11:08:05.945814 | 5
        2242 |  488 | 2019-11-20 11:08:05.945814 | 5
        2274 |    6 | 2021-09-07 03:45:19.557241 | 5
        2296 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2303 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2315 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2328 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2332 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2344 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2349 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2351 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2366 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2368 | 1190 | 2021-01-13 17:40:18.741514 | 5
        2370 |   48 | 2022-01-20 04:20:57.678095 | 5
        9803 | 1190 | 2021-01-13 17:40:18.741514 | 5
        9804 | 1190 | 2021-01-13 17:40:18.741514 | 5

Maybe the matching does something wrong. I will look into this on Monday...

mdoering commented 2 years ago

Found a bug in rematching decisions and sectors from projects after a new import of a source. But that still does not explain the entire problem. Rematching the broken Alucitoidea sector manually gives a warning:

Sector Sector{1189, datasetKey=3, mode=ATTACH, subjectDatasetKey=2207, subject=ACCEPTED SUPERFAMILY Alucitoidea Minet, 1986 [ parent=4]} from project 3 cannot be rematched to dataset 2207 - lost ACCEPTED SUPERFAMILY Alucitoidea Minet, 1986 [ parent=4]

That should not happen; there is a single, clearly matching record...

mdoering commented 2 years ago

https://www.checklistbank.org/dataset/2207/taxon/5

{
  "created": "2021-01-13T17:35:01.830428",
  "createdBy": 102,
  "modified": "2022-02-24T16:02:24.583883",
  "modifiedBy": 102,
  "datasetKey": 3,
  "id": 1189,
  "target": {
    "id": "3f1cc7f0-ff9b-476b-8399-8b40a0f0d8c0",
    "name": "Lepidoptera",
    "rank": "order",
    "broken": false,
    "label": "Lepidoptera",
    "labelHtml": "Lepidoptera"
  },
  "subjectDatasetKey": 2207,
  "subject": {
    "name": "Alucitoidea",
    "authorship": "Minet, 1986",
    "rank": "superfamily",
    "code": "zoological",
    "status": "accepted",
    "parent": "4",
    "broken": true,
    "label": "Alucitoidea Minet, 1986",
    "labelHtml": "Alucitoidea Minet, 1986"
  },
  "originalSubjectId": "5",
  "mode": "attach",
  "syncAttempt": 7,
  "size": 480
}

mdoering commented 2 years ago

Found it. The matching wrongly compared the parent property as if it were the parent name. Both are allowed in matching now.

@yroskov @gdower @thomasstjerne when creating a sector, maybe we should not always add all the subject information, so as not to be too restrictive? If the author, parent or rank changes, we will see broken sectors. But maybe it is fine to then reassign the sector manually, so that we can be sure those changes are fine. Removing the author or parent when creating a sector in the UI would still be something to consider...
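To make the two points above concrete, here is a rough sketch of a lenient subject match. It is not the actual ChecklistBank matching code: the `Usage` and `Subject` classes, the `matches` and `rematch` functions, and the parent name used in the usage example are all hypothetical. The sketch assumes the stored `parent` value may be either a parent ID or a parent name, and that any subject field left unset is simply not compared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    """A name usage in the source dataset (hypothetical, simplified)."""
    id: str
    name: str
    rank: str
    authorship: Optional[str] = None
    parent_id: Optional[str] = None
    parent_name: Optional[str] = None


@dataclass
class Subject:
    """Subject details stored with a sector; unset fields are not compared."""
    name: str
    rank: Optional[str] = None
    authorship: Optional[str] = None
    parent: Optional[str] = None  # assumed to hold a parent ID or a parent name


def matches(subject, usage):
    if subject.name != usage.name:
        return False
    if subject.rank is not None and subject.rank != usage.rank:
        return False
    if subject.authorship is not None and subject.authorship != usage.authorship:
        return False
    # Accept the stored parent as either the parent's ID or the parent's name.
    if subject.parent is not None and subject.parent not in (usage.parent_id, usage.parent_name):
        return False
    return True


def rematch(subject, usages):
    """Return the ID of the single matching usage, or None if none or several match."""
    candidates = [u for u in usages if matches(subject, u)]
    return candidates[0].id if len(candidates) == 1 else None
```

For example, with `Usage("5", "Alucitoidea", "superfamily", "Minet, 1986", parent_id="4", parent_name="Lepidoptera")` (the parent name here is an assumption), a subject with `parent="4"` rematches to `"5"`, while a subject whose stored authorship differs from the source would stay broken until it is reassigned manually. Storing fewer subject fields makes rematching more tolerant of source changes, at the cost of less certainty that the match is the intended one.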

yroskov commented 2 years ago

2022-03-01:

RematchAllSectors by GSD:
92 in ITIS = FIXED
15 in WCSP = FIXED
8 in WWW = FIXED
7 in 3i Auchenorrhyncha = 1 broken superfamily Cicadoidea in infraorder Cicadomorpha remains

RematchAllSectors in the project: 15 sectors remain broken, of them:
13 in IRMNG = OK
1 in 3i Auchenorrhyncha (superfamily Cicadoidea in infraorder Cicadomorpha) = FIXED (rematched manually)
1 in Global Gracillariidae (family Gracillariidae in superfamily Gracillarioidea) = FIXED (rematched manually)

FIXED