Closed by camiplata 5 months ago
@yroskov this is blocking the annual release!
The problem still comes from the accidental ITIS merge into the project a while ago. I had removed all new ITIS names and name_usages, but not the merged references.
And the problem is still much bigger than that: there are 23,626 secondary sources where ITIS is used to add the publishedIn nomenclatural reference, to fill in missing name authorship, or even to augment the classification with missing ranks!
source_dataset_key | type | count
--------------------+--------------+-------
2144 | AUTHORSHIP | 15865
2144 | PUBLISHED_IN | 7049
2144 | PARENT | 712
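The query behind this breakdown was not posted. A minimal sketch that should produce the same shape of result, assuming the counts were taken on release 296511 (the dataset key used in the example query in this thread) and that 2144 is the ITIS dataset, as the table suggests:

```sql
-- Sketch only: count secondary sources per source dataset and type.
-- Assumption: dataset_key 296511 is the release inspected in this thread.
SELECT source_dataset_key, type, count(*)
FROM verbatim_source_secondary
WHERE dataset_key = 296511
  AND source_dataset_key = 2144   -- ITIS, per the table above
GROUP BY source_dataset_key, type
ORDER BY count(*) DESC;
```

Dropping the `source_dataset_key` filter would show whether any other source dataset contributes secondary sources as well.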
I will try to reset the data so that authorship and published_in are empty again. I don't know how to fix the parent classification change. It might be best to sync the affected sectors again. I will try to get an overview of which ones these are; it could be a lot...
Here are 10 examples:
clb=> select * from verbatim_source_secondary where dataset_key=296511 limit 10;
id | dataset_key | type | source_id | source_dataset_key
-------+-------------+--------------+-----------+--------------------
7DZL | 296511 | PARENT | 12861 | 2144
QK4 | 296511 | PARENT | 12866 | 2144
79X | 296511 | PUBLISHED_IN | 11582 | 2144
85X | 296511 | PUBLISHED_IN | 955037 | 2144
853H3 | 296511 | PUBLISHED_IN | 977988 | 2144
CB73F | 296511 | AUTHORSHIP | 64636 | 2144
CBBXM | 296511 | AUTHORSHIP | 66003 | 2144
84KL9 | 296511 | PARENT | 67644 | 2144
84KR6 | 296511 | PARENT | 67665 | 2144
CCF9W | 296511 | AUTHORSHIP | 67772 | 2144
A PARENT case: Sciadophycus E.Y. Dawson, 1944, from IRMNG, now has the following classification:
unranked: Biota
kingdom: Plantae
phylum: Rhodophyta
subphylum: Eurhodophytina
class: Florideophyceae Cronquist
order: Rhodymeniales F. Schmitz
family: Rhodymeniaceae Harvey, 1849
genus: Sciadophycus E.Y. Dawson, 1944
The original IRMNG one is this:
unranked: Biota
kingdom: Plantae Haeckel, 1866
phylum: Rhodophyta
subphylum: Eurhodophytina
class: Florideophyceae Cronquist
order: Rhodymeniales F. Schmitz
family: Rhodymeniales incertae sedis
genus: Sciadophycus E.Y. Dawson, 1944
ITIS has:
kingdom: Plantae
subkingdom: Biliphyta
phylum: Rhodophyta
subphylum: Eurhodophytina
class: Florideophyceae
subclass: Rhodymeniophycidae
order: Rhodymeniales
family: Rhodymeniaceae
genus: Sciadophycus
You can see that IRMNG did not have a real family (only the placeholder Rhodymeniales incertae sedis), so the ITIS merge added Rhodymeniaceae. It is nice to see that the merge did a good job, but I have no idea how to revert that in the project at this stage, other than syncing IRMNG again, which might be more troublesome than keeping the hierarchy insertions.
You can see the version without the parent merge in the April release: https://www.checklistbank.org/dataset/294826/taxon/7DZL
There are 55 sectors affected by the PARENT merge, coming from 40 different source datasets:
SELECT s.subject_dataset_key, count(DISTINCT s.id) AS sectors, count(*) AS usages, d.alias
FROM verbatim_source_secondary v
  JOIN name_usage u ON u.dataset_key=v.dataset_key AND u.id=v.id
  LEFT JOIN sector s ON s.dataset_key=v.dataset_key AND s.id=u.sector_key
  LEFT JOIN dataset d ON d.key=s.subject_dataset_key
WHERE v.dataset_key=296511 AND v.type='PARENT'
GROUP BY 1,4
ORDER BY 3 DESC;
subject_dataset_key | sectors | usages | alias
---------------------+---------+--------+-----------------------
1101 | 1 | 408 | Systema Dipterorum
2073 | 2 | 48 | Species Fungorum Plus
170394 | 1 | 32 | Bryonames
1141 | 8 | 28 | World Plants
2304 | 1 | 27 | WCVP-Fabaceae
2232 | 4 | 21 | WCVP
1094 | 1 | 19 | WoRMS Isopoda
1193 | 1 | 15 | WoRMS Turbellarians
1095 | 1 | 14 | WoRMS Asteroidea
1042 | 1 | 7 | ChiloBase
1130 | 1 | 7 | WoRMS Mollusca
1044 | 1 | 6 | WoRMS Porifera
2007 | 3 | 6 | IRMNG
2302 | 1 | 6 | WoRMS Nemys
1204 | 2 | 5 | StaphBase
1090 | 1 | 5 | WoRMS Polychaeta
1055 | 1 | 5 | LDL Neuropterida
1059 | 1 | 4 | WoRMS Ophiuroidea
1144 | 1 | 4 | Lace Bugs Database
1107 | 1 | 4 | WoRMS Holothuroidea
2317 | 1 | 4 | 3i Auchenorrhyncha
1134 | 1 | 4 | SF Coreoidea
125101 | 1 | 4 | WOL
1146 | 1 | 3 | Carabcat
1081 | 1 | 3 | WoRMS Bryozoa
1200 | 1 | 3 | WoRMS MilliBase
1191 | 1 | 2 | WoRMS Copepoda
1032 | 1 | 2 | TITAN
1104 | 1 | 2 | Phoronida Database
1008 | 2 | 2 | ReptileDB
| 0 | 2 |
1196 | 1 | 1 | WoRMS Scleractinia
1183 | 1 | 1 | WoRMS Pycnogonida
1175 | 1 | 1 | WoRMS Ostracoda
2256 | 1 | 1 | WCO
1157 | 1 | 1 | WoRMS Foraminifera
1131 | 1 | 1 | WoRMS Octocorallia
1128 | 1 | 1 | WoRMS Trematoda
1110 | 1 | 1 | WoRMS Tanaidacea
1099 | 1 | 1 | WoRMS Oligochaeta
1202 | 1 | 1 | WoRMS Amphipoda
(41 rows)
By far the largest is Systema Dipterorum with more than half of the records. @yroskov maybe we can resync that and the few other larger ones?
I have reset the published_in references on the names:
UPDATE name n SET published_in_id=null
FROM verbatim_source_secondary v, name_usage u
WHERE n.dataset_key=3 AND
u.dataset_key=3 AND u.name_id=n.id AND
v.dataset_key=3 AND v.type='PUBLISHED_IN' AND v.id=u.id;
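Before a bulk reset like this, it can help to count how many names the join would touch and to run the statement inside a transaction. This is only a sketch: it reuses the join conditions of the UPDATE above, and the transaction wrapper is general PostgreSQL practice rather than something described in this thread. The optional ITIS filter is likewise an assumption; the original statement does not restrict by source dataset.

```sql
-- Dry run: how many distinct names would the PUBLISHED_IN reset touch?
SELECT count(DISTINCT n.id)
FROM name n
  JOIN name_usage u ON u.dataset_key = n.dataset_key AND u.name_id = n.id
  JOIN verbatim_source_secondary v ON v.dataset_key = u.dataset_key AND v.id = u.id
WHERE n.dataset_key = 3
  AND v.type = 'PUBLISHED_IN';
  -- Optional (assumption): AND v.source_dataset_key = 2144 to limit this to ITIS

-- Run the reset inside a transaction so it can still be rolled back.
BEGIN;
-- ... the UPDATE statement from above goes here ...
-- Compare the reported row count with the dry run, then:
COMMIT;  -- or ROLLBACK; if the count looks wrong
```

The same pattern applies to the authorship reset by switching the type to 'AUTHORSHIP'.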
The added authorship can also be removed, but identifiers differ between releases because a change in authorship forces an identifier change. Example, Lepidonotus dentatus:
https://www.checklistbank.org/dataset/296511/taxon/CB73F
https://www.checklistbank.org/dataset/294826/taxon/85337
I have just synced the small sector The World List of Cycads, which resolved the authorship and parent merges. Since the last sync of that source was in 2020, we were also missing some of the track record of the actual source; it seems we implemented that only shortly afterwards. It would really be good to resync all sectors that were last synced before 2021 or even 2022.
Authorship reset for 14,339 names:
UPDATE name n SET authorship=null,
basionym_authors='{}', basionym_ex_authors='{}', basionym_year=null,
combination_authors='{}', combination_ex_authors='{}', combination_year=null,
sanctioning_author=null
FROM verbatim_source_secondary v, name_usage u
WHERE n.dataset_key=3 AND
u.dataset_key=3 AND u.name_id=n.id AND
v.dataset_key=3 AND v.type='AUTHORSHIP' AND v.id=u.id;
Only 115 PARENT secondary sources left
Well, if you had taken my advice at the beginning to keep the Extended Catalogue as a separate project in CLB, we would not have these problems now.
Also, you now have a good illustration of the problem when the "old" version (i.e. one classification) does not match the new one in the merged sectors. Do we really have enough expertise to resolve conflicts?
> By far the largest is Systema Dipterorum with more than half of the records. @yroskov maybe we can resync that and the few other larger ones?
Yes, we can re-sync SD without problems now. The last sync was done on 2024-06-04 at 8:11 PM. @mdoering, let me know when CLB is ready for the re-sync and I'll kick it off.
Potential resyncing of other GSDs needs to be discussed on a case-by-case basis, taking @gdower's recommendations into account.
please give me 1.5h (18:30 CET) to deploy a new version, then you are welcome to sync or release as you like
OK
@mdoering, can I start syncs now?
(so far, no syncs on 2024-06-07 from my side)
yes
OK, I am going ahead with remaining re-syncs
see: https://www.catalogueoflife.org/data/taxon/D5LL