CatalogueOfLife / data

Repository for COL content

xrelease reference information is leaking to main COL #668

Closed: camiplata closed this issue 5 months ago

camiplata commented 5 months ago

see: https://www.catalogueoflife.org/data/taxon/D5LL

[Screenshot, 2024-06-06, 6:55:52 p.m.]

mdoering commented 5 months ago

@yroskov this is blocking the annual release!

mdoering commented 5 months ago

The problem still comes from the accidental ITIS merge into the project a while ago. I had removed all new ITIS names and name_usages, but not the merged references.

And the problem is still much bigger than that: there are 23,626 secondary sources where ITIS is used to add the publishedIn nomenclatural reference, to fill in missing name authorship, and even to augment the classification with missing ranks!

 source_dataset_key |     type     | count 
--------------------+--------------+-------
               2144 | AUTHORSHIP   | 15865
               2144 | PUBLISHED_IN |  7049
               2144 | PARENT       |   712
mdoering commented 5 months ago

I will try to reset the data so that authorship and published_in are empty again. I don't know how to fix the parent classification change. It might be best to sync the affected sectors again. I will try to get an overview of which these are; it could be a lot...

Here are 10 examples:

clb=> select * from verbatim_source_secondary where dataset_key=296511 limit 10;
  id   | dataset_key |     type     | source_id | source_dataset_key 
-------+-------------+--------------+-----------+--------------------
 7DZL  |      296511 | PARENT       | 12861     |               2144
 QK4   |      296511 | PARENT       | 12866     |               2144
 79X   |      296511 | PUBLISHED_IN | 11582     |               2144
 85X   |      296511 | PUBLISHED_IN | 955037    |               2144
 853H3 |      296511 | PUBLISHED_IN | 977988    |               2144
 CB73F |      296511 | AUTHORSHIP   | 64636     |               2144
 CBBXM |      296511 | AUTHORSHIP   | 66003     |               2144
 84KL9 |      296511 | PARENT       | 67644     |               2144
 84KR6 |      296511 | PARENT       | 67665     |               2144
 CCF9W |      296511 | AUTHORSHIP   | 67772     |               2144
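
The per-type counts reported earlier can be reproduced in miniature from these ten rows. The sketch below is only an illustration: it loads the rows into an in-memory SQLite table (not the real clb PostgreSQL database) and runs a grouped count of the kind that presumably produced the summary table. The helper name `count_by_type` is hypothetical.

```python
import sqlite3

# Toy copy of the ten example rows above; this is NOT the real clb database.
ROWS = [
    ("7DZL", 296511, "PARENT", "12861", 2144),
    ("QK4", 296511, "PARENT", "12866", 2144),
    ("79X", 296511, "PUBLISHED_IN", "11582", 2144),
    ("85X", 296511, "PUBLISHED_IN", "955037", 2144),
    ("853H3", 296511, "PUBLISHED_IN", "977988", 2144),
    ("CB73F", 296511, "AUTHORSHIP", "64636", 2144),
    ("CBBXM", 296511, "AUTHORSHIP", "66003", 2144),
    ("84KL9", 296511, "PARENT", "67644", 2144),
    ("84KR6", 296511, "PARENT", "67665", 2144),
    ("CCF9W", 296511, "AUTHORSHIP", "67772", 2144),
]


def count_by_type(rows, dataset_key):
    """Count secondary-source records per (source_dataset_key, type) for one dataset."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE verbatim_source_secondary ("
        "id TEXT, dataset_key INTEGER, type TEXT, "
        "source_id TEXT, source_dataset_key INTEGER)"
    )
    con.executemany(
        "INSERT INTO verbatim_source_secondary VALUES (?, ?, ?, ?, ?)", rows
    )
    result = con.execute(
        "SELECT source_dataset_key, type, count(*) AS n "
        "FROM verbatim_source_secondary "
        "WHERE dataset_key = ? "
        "GROUP BY source_dataset_key, type "
        "ORDER BY n DESC, type",
        (dataset_key,),
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in count_by_type(ROWS, 296511):
        print(row)
```

Against the real table the same `GROUP BY` would be run directly in psql; the Python wrapper is only there so the query can be tried without database access.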

CLB:

API:

mdoering commented 5 months ago

A PARENT case Sciadophycus E.Y. Dawson, 1944 from IRMNG has the following classification:

unranked: Biota
kingdom: Plantae
phylum: Rhodophyta
subphylum: Eurhodophytina
class: Florideophyceae Cronquist
order: Rhodymeniales F. Schmitz
family: Rhodymeniaceae Harvey, 1849
genus: Sciadophycus E.Y. Dawson, 1944

The original IRMNG one is this:

unranked: Biota
kingdom: Plantae Haeckel, 1866
phylum: Rhodophyta
subphylum: Eurhodophytina
class: Florideophyceae Cronquist
order: Rhodymeniales F. Schmitz
family: Rhodymeniales incertae sedis
genus: Sciadophycus E.Y. Dawson, 1944

ITIS has:

kingdom: Plantae
subkingdom: Biliphyta
phylum: Rhodophyta
subphylum: Eurhodophytina
class: Florideophyceae
subclass: Rhodymeniophycidae
order: Rhodymeniales
family: Rhodymeniaceae
genus: Sciadophycus

You can see IRMNG did not have a family (Rhodymeniales incertae sedis), which was then added by the ITIS merge as Rhodymeniaceae. Nice to see that the merge did a good job, but I have no idea how to revert that in the project at this stage, other than syncing IRMNG again, which might be more troublesome than keeping the hierarchy insertions.

You can see the version without the parent merge in the April release: https://www.checklistbank.org/dataset/294826/taxon/7DZL

mdoering commented 5 months ago

There are 55 sectors affected by the PARENT merge, coming from 40 different source datasets:

SELECT s.subject_dataset_key, count(DISTINCT s.id) AS sectors, count(*) AS usages, d.alias
 FROM verbatim_source_secondary v
  JOIN name_usage u ON u.dataset_key=v.dataset_key AND u.id=v.id
  LEFT JOIN sector s ON s.dataset_key=v.dataset_key AND s.id=u.sector_key
  LEFT JOIN dataset d ON d.key=s.subject_dataset_key
 WHERE v.dataset_key=296511 AND v.type='PARENT'
 GROUP BY 1,4
 ORDER BY 3 DESC;

 subject_dataset_key | sectors | usages |         alias         
---------------------+---------+--------+-----------------------
                1101 |       1 |    408 | Systema Dipterorum
                2073 |       2 |     48 | Species Fungorum Plus
              170394 |       1 |     32 | Bryonames
                1141 |       8 |     28 | World Plants
                2304 |       1 |     27 | WCVP-Fabaceae
                2232 |       4 |     21 | WCVP
                1094 |       1 |     19 | WoRMS Isopoda
                1193 |       1 |     15 | WoRMS Turbellarians
                1095 |       1 |     14 | WoRMS Asteroidea
                1042 |       1 |      7 | ChiloBase
                1130 |       1 |      7 | WoRMS Mollusca
                1044 |       1 |      6 | WoRMS Porifera
                2007 |       3 |      6 | IRMNG
                2302 |       1 |      6 | WoRMS Nemys
                1204 |       2 |      5 | StaphBase
                1090 |       1 |      5 | WoRMS Polychaeta
                1055 |       1 |      5 | LDL Neuropterida
                1059 |       1 |      4 | WoRMS Ophiuroidea
                1144 |       1 |      4 | Lace Bugs Database
                1107 |       1 |      4 | WoRMS Holothuroidea
                2317 |       1 |      4 | 3i Auchenorrhyncha
                1134 |       1 |      4 | SF Coreoidea
              125101 |       1 |      4 | WOL
                1146 |       1 |      3 | Carabcat
                1081 |       1 |      3 | WoRMS Bryozoa
                1200 |       1 |      3 | WoRMS MilliBase
                1191 |       1 |      2 | WoRMS Copepoda
                1032 |       1 |      2 | TITAN
                1104 |       1 |      2 | Phoronida Database
                1008 |       2 |      2 | ReptileDB
                     |       0 |      2 | 
                1196 |       1 |      1 | WoRMS Scleractinia
                1183 |       1 |      1 | WoRMS Pycnogonida
                1175 |       1 |      1 | WoRMS Ostracoda
                2256 |       1 |      1 | WCO
                1157 |       1 |      1 | WoRMS Foraminifera
                1131 |       1 |      1 | WoRMS Octocorallia
                1128 |       1 |      1 | WoRMS Trematoda
                1110 |       1 |      1 | WoRMS Tanaidacea
                1099 |       1 |      1 | WoRMS Oligochaeta
                1202 |       1 |      1 | WoRMS Amphipoda
(41 rows)
mdoering commented 5 months ago

By far the largest is Systema Dipterorum with more than half of the records. @yroskov maybe we can resync that and the few other larger ones?

mdoering commented 5 months ago

I have reset published_in_id on the affected names:

UPDATE name n SET published_in_id=null 
 FROM verbatim_source_secondary v, name_usage u  
 WHERE n.dataset_key=3 AND 
  u.dataset_key=3 AND u.name_id=n.id AND
  v.dataset_key=3 AND v.type='PUBLISHED_IN' AND v.id=u.id;
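
Before running an update like this, it can help to list which names it would touch. The following is a minimal sketch on toy data, not the statement that was actually executed: it uses in-memory SQLite instead of the clb PostgreSQL database, reduces the three tables to the columns the statement above refers to, and rewrites the `UPDATE ... FROM` join as an `IN` subquery so that it also runs on older SQLite versions. The ids (`n1`, `u1`, `ref1`, ...) and the function names are hypothetical.

```python
import sqlite3

# Usages in one dataset that carry a PUBLISHED_IN secondary source.
MATCHED_NAME_IDS = """
    SELECT DISTINCT u.name_id
    FROM name_usage u
    JOIN verbatim_source_secondary v
      ON v.dataset_key = u.dataset_key AND v.id = u.id
    WHERE u.dataset_key = :key AND v.type = 'PUBLISHED_IN'
"""


def reset_published_in(con, dataset_key):
    """Null published_in_id on matched names and return the ids that were matched."""
    params = {"key": dataset_key}
    # Dry run first: collect the name ids the update is going to touch.
    ids = sorted(row[0] for row in con.execute(MATCHED_NAME_IDS, params))
    con.execute(
        "UPDATE name SET published_in_id = NULL "
        f"WHERE dataset_key = :key AND id IN ({MATCHED_NAME_IDS})",
        params,
    )
    return ids


def demo():
    """Build toy tables (hypothetical rows), run the reset, return both results."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE name (dataset_key INTEGER, id TEXT, published_in_id TEXT);
        CREATE TABLE name_usage (dataset_key INTEGER, id TEXT, name_id TEXT);
        CREATE TABLE verbatim_source_secondary (dataset_key INTEGER, id TEXT, type TEXT);
        INSERT INTO name VALUES (3, 'n1', 'ref1'), (3, 'n2', 'ref2'), (3, 'n3', 'ref3');
        INSERT INTO name_usage VALUES (3, 'u1', 'n1'), (3, 'u2', 'n2'), (3, 'u3', 'n3');
        INSERT INTO verbatim_source_secondary VALUES
            (3, 'u1', 'PUBLISHED_IN'), (3, 'u2', 'AUTHORSHIP');
        """
    )
    reset = reset_published_in(con, 3)
    remaining = con.execute(
        "SELECT id, published_in_id FROM name ORDER BY id"
    ).fetchall()
    con.close()
    return reset, remaining


if __name__ == "__main__":
    print(demo())
```

Only the name whose usage has a PUBLISHED_IN secondary source loses its reference; names with other secondary-source types, or none at all, are left alone.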

The added authorship can also be removed, but identifiers differ between releases because a change in authorship forces an identifier change. Example: Lepidonotus dentatus:

https://www.checklistbank.org/dataset/296511/taxon/CB73F
https://www.checklistbank.org/dataset/294826/taxon/85337

mdoering commented 5 months ago

I have just synced the small sector The World List of Cycads, which resolved the authorship and parent merges. Since the last sync of that source was in 2020, we were also missing some track record of the actual source; it seems we implemented that only shortly afterwards. It would really be good to resync all sectors that were last synced before 2021 or even 2022.

mdoering commented 5 months ago

Authorship reset for 14,339 names:

UPDATE name n SET authorship=null, 
 basionym_authors='{}', basionym_ex_authors='{}', basionym_year=null, 
 combination_authors='{}', combination_ex_authors='{}', combination_year=null, 
 sanctioning_author=null

 FROM verbatim_source_secondary v, name_usage u  
 WHERE n.dataset_key=3 AND
  u.dataset_key=3 AND u.name_id=n.id AND
  v.dataset_key=3 AND v.type='AUTHORSHIP' AND v.id=u.id;
mdoering commented 5 months ago

Only 115 PARENT secondary sources left

yroskov commented 5 months ago

Well, if you had taken my advice at the beginning to keep the Extended Catalogue as a separate project in CLB, we would not have such problems now.

Also, you now have a good illustration of the problem that arises when the "old" version (i.e. one classification) does not match the new one in the merged sectors. Do we really have enough expertise to resolve conflicts?

yroskov commented 5 months ago

> By far the largest is Systema Dipterorum with more than half of the records. @yroskov maybe we can resync that and the few other larger ones?

Yes, we can re-sync SD without problem now. Last sync was done 2024-06-04 8:11 PM. @mdoering, let me know if CLB is ready for re-sync and I'll kick it off.

Potential resyncing of other GSDs needs to be discussed on a case-by-case basis, taking @gdower's recommendations into account.

mdoering commented 5 months ago

please give me 1.5h (18:30 CET) to deploy a new version, then you are welcome to sync or release as you like

yroskov commented 5 months ago

OK

yroskov commented 5 months ago

@mdoering, can I start syncs now?

(so far, no syncs of 2024-06-07 from my side)

mdoering commented 5 months ago

yes

yroskov commented 5 months ago

OK, I am going ahead with remaining re-syncs