gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Duplicate (by title + publisher) datasets #5415

Open MattBlissett opened 1 month ago

MattBlissett commented 1 month ago
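-- Find non-deleted datasets that share a title within one publishing organization.
SELECT COUNT(*), publishing_organization_key, STRING_AGG(key::text, ' '), title
FROM dataset
WHERE deleted IS NULL
GROUP BY title, publishing_organization_key
HAVING COUNT(*) > 1
-- Sort by the number of datasets per group, smallest first.
ORDER BY 1;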

dup-ds.csv

These are datasets where a publisher has used exactly the same title more than once. At least some of them seem to be errors, e.g. these 38 datasets from Plazi:
ecfc200a-4918-4f81-b4a2-bf0d6a3930c8 8cf36954-e5b6-443a-8b17-c771cbe8edc7 23ab49ae-a546-41d8-9103-47b81205fee1 bc627440-4efb-43ba-94d2-6cf4ae908537 587ffb8d-2207-4de0-99b5-ad6b1e2e81a0 b194cde8-3ceb-44a3-831c-4833f740bc74 ddfb132e-8b88-4ea9-aa0f-703c3434a5aa a26bc045-b81b-496f-a6fb-bb668ea6323f 0c3b3cb8-5377-4445-bb90-e43acebf6350 170fa09f-4c6e-47a0-a197-35cb81fd12fb 7edde4a2-bdfd-4e95-b177-dfce8c653a74 28f9a933-67af-409e-a515-fb4fce6c94d7 e2b3002c-76ec-414b-b90c-430bd67d085c bf0fb24b-3b06-49bf-b002-aa7ff0620aa0 5ebf46f1-a549-4f47-adfd-03e0cf17f3a3 7f20f6f9-f516-47ec-b390-d28a68c7e430 a2ab1982-ea5d-4ee8-84eb-878e0814bb89 35c69d85-6a70-408c-afa5-317267d6e859 4b12f1ab-9aa7-44cf-b058-8198e9f48010 188ee6b9-83d0-433d-ae5b-ea43cd80c5f7 178154c3-4bfe-4185-b720-819fbf98bdc2 fee5b6f4-a279-46b6-b823-b0495be92060 0f2a1c3c-a031-4d7c-8fbe-954d317dee58 8d9356c3-6dcb-4973-9b59-c1791eba8647 cb14e4d1-f59b-41c5-a222-4cd4f8d95e67 0e86e9dc-1187-4a5a-bdfa-df692a25792f 262a83e3-d765-4217-b253-97b261044780 79385eb2-d8f2-4217-8013-c36125d6ad4b f1f8e106-3fb9-48b9-85a7-3cc1dde3a7b6 97d5b6f4-225a-4649-856d-ef5e3bd4f822 15ee0ac2-491c-4646-a437-cd108f5cd285 3d9fc626-085d-4688-b061-28d9d73c667b 86a31bab-b2e8-47a7-83d8-11bb6089348e 63dd0088-0d27-4dbd-837a-54f3c428a7c3 8bf4e8a2-34b4-40cf-8ff1-fde6d9015228 b5dbe978-06f8-433f-99f9-ac689943dc29 b40ed2cc-a330-4acf-9390-bf3ad5c691cc 6ea88692-2336-4456-bfca-0255d8d6a804
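
For anyone without direct database access, the same check could probably be run client-side over dataset records that have already been fetched. The following is only a rough sketch of that idea, not how the list above was produced. It assumes each record is a dict with `key`, `title`, `publishingOrganizationKey` and `deleted` fields (names chosen for illustration; adjust them to whatever your export actually contains), and the sample records and the helper name `find_duplicate_titles` are hypothetical.

```python
from collections import defaultdict


def find_duplicate_titles(datasets):
    """Return groups of dataset keys that share a title within one publisher.

    Mirrors the SQL query in spirit: skip deleted datasets, group by
    (title, publishing organization), keep groups with more than one
    dataset, and sort by group size (smallest first).
    """
    groups = defaultdict(list)
    for ds in datasets:
        # Assumed field: "deleted" is None for live datasets.
        if ds.get("deleted") is not None:
            continue
        group_key = (ds["title"], ds["publishingOrganizationKey"])
        groups[group_key].append(ds["key"])

    duplicates = [(k, keys) for k, keys in groups.items() if len(keys) > 1]
    # Equivalent of ORDER BY 1 (the count column).
    return sorted(duplicates, key=lambda item: len(item[1]))


# Hypothetical sample records, for illustration only.
sample = [
    {"key": "a", "title": "T1", "publishingOrganizationKey": "org1", "deleted": None},
    {"key": "b", "title": "T1", "publishingOrganizationKey": "org1", "deleted": None},
    {"key": "c", "title": "T1", "publishingOrganizationKey": "org2", "deleted": None},
    {"key": "d", "title": "T2", "publishingOrganizationKey": "org1", "deleted": "2020-01-01"},
    {"key": "e", "title": "T2", "publishingOrganizationKey": "org1", "deleted": None},
]

for (title, org), keys in find_duplicate_titles(sample):
    print(len(keys), org, " ".join(keys), title)
```

Like the query, this compares titles by exact string equality, so titles that differ only in case or whitespace would not be grouped together.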