gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

#964 linking grscicoll occurrences #975

Closed marcos-lg closed 7 months ago

marcos-lg commented 8 months ago

This change requires reprocessing the datasets, but the biggest ones, such as eBird and Artportalen, can be skipped because they don't have material sample or material citation records. The cache doesn't have to be truncated.

This query returns the datasets that can be skipped:

SELECT occ.datasetkey, count(*) AS num_records
FROM occurrence occ
WHERE occ.basisofrecord NOT IN ('MATERIAL_SAMPLE', 'MATERIAL_CITATION')
GROUP BY occ.datasetkey
ORDER BY num_records DESC

That is approximately 23K datasets to process and 40K to exclude.
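
Note that the query above filters rows rather than datasets, so a dataset that mixes material records with other record types would still be listed. If a strict list is needed, one option is to group first and filter with a `HAVING` clause. Below is a minimal sketch of that idea, run against an in-memory SQLite table instead of the real occurrence table. The sample rows, the placeholder dataset keys, and the function name `skippable_datasets` are hypothetical, and the SQL may need adjusting for the dialect used in the warehouse.

```python
import sqlite3

# Hypothetical miniature of the occurrence table (placeholder keys, two columns only).
ROWS = [
    ("dataset-a", "HUMAN_OBSERVATION"),
    ("dataset-a", "HUMAN_OBSERVATION"),
    ("dataset-b", "PRESERVED_SPECIMEN"),
    ("dataset-b", "MATERIAL_SAMPLE"),      # mixed dataset: must NOT be skipped
    ("dataset-c", "MATERIAL_CITATION"),
]

# Keep only datasets with zero MATERIAL_SAMPLE / MATERIAL_CITATION records.
STRICT_SKIP_QUERY = """
SELECT occ.datasetkey, COUNT(*) AS num_records
FROM occurrence occ
GROUP BY occ.datasetkey
HAVING SUM(CASE WHEN occ.basisofrecord IN ('MATERIAL_SAMPLE', 'MATERIAL_CITATION')
                THEN 1 ELSE 0 END) = 0
ORDER BY num_records DESC
"""


def skippable_datasets(rows):
    """Load rows into an in-memory table and return (datasetkey, num_records) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE occurrence (datasetkey TEXT, basisofrecord TEXT)")
        conn.executemany("INSERT INTO occurrence VALUES (?, ?)", rows)
        return conn.execute(STRICT_SKIP_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(skippable_datasets(ROWS))
```

With the sample rows, only `dataset-a` is returned, because `dataset-b` and `dataset-c` each contain at least one material record.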

But if it's easier we can just skip the biggest ones:

DatasetKey                            Number of Records
4fa7b334-ce0d-4e88-aaae-2e0c138d049e         1277552378
38b4c89f-584c-41bb-bd8f-cd1def33e92f          101285331
8a863029-f435-446a-821e-275f4f641165           79550796
50c9509d-22c7-4a22-a47d-8c48425ef4a7           73797195
95db4db8-f762-11e1-a439-00145eb45e9a           31838699
b124e1e0-4755-430f-9eab-894f25a9b59c           30753936
75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d           20999334
906e6978-e292-4a8b-9c39-adf6bb0f3323           20401489
6ac3f774-d9fb-4796-b3e9-92bf6c81c084           14500313
721a99a4-71f4-4466-b346-83c367889238           14079367
0645ccdb-e001-4ab0-9729-51f1755e007e           13552798
67fabcac-a638-40a6-9bea-aeca8aced9f1           13269255
292a71df-588b-48fa-9ab5-29ae868ba88c           13066620
e7cbb0ed-04c6-44ce-ac86-ebe49f4efb28           12811851
14d5676a-2c54-4f94-9023-1e8dcd822aa0           12142287
740df67d-5663-41a2-9d12-33ec33876c47           11987804
4bf1cca8-832c-4891-9e17-7e7a65b7cc81           11635460
83fdfd3d-3a25-4705-9fbe-3db1d1892b13           10893982
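
If the simpler approach is taken, the list above can be treated as a skip list when choosing which datasets to reprocess. The sketch below only shows the filtering step; how reprocessing is actually triggered is not covered here. The names `SKIP_DATASET_KEYS` and `datasets_to_reprocess` are hypothetical, and only the first three keys from the list are included for brevity.

```python
# Keys copied from the first rows of the list above; extend with the remaining ones as needed.
SKIP_DATASET_KEYS = frozenset({
    "4fa7b334-ce0d-4e88-aaae-2e0c138d049e",
    "38b4c89f-584c-41bb-bd8f-cd1def33e92f",
    "8a863029-f435-446a-821e-275f4f641165",
})


def datasets_to_reprocess(all_keys, skip_keys=SKIP_DATASET_KEYS):
    """Return keys that are not in the skip list, in input order, without duplicates."""
    seen = set()
    result = []
    for key in all_keys:
        normalized = key.strip().lower()  # tolerate stray whitespace and uppercase UUIDs
        if normalized in skip_keys or normalized in seen:
            continue
        seen.add(normalized)
        result.append(normalized)
    return result


if __name__ == "__main__":
    candidates = [
        "4FA7B334-CE0D-4E88-AAAE-2E0C138D049E",  # in the skip list
        "50c9509d-22c7-4a22-a47d-8c48425ef4a7",  # not in this shortened skip set
    ]
    print(datasets_to_reprocess(candidates))
```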