#964 linking grscicoll occurrences

This requires reprocessing the datasets, but the biggest ones, such as eBird and Artportalen, can be skipped because they have no material sample or material citation records. The cache doesn't have to be truncated.

This query returns the datasets that can be skipped:

SELECT occ.datasetkey, COUNT(*) AS num_records
FROM occurrence occ
WHERE occ.basisofrecord NOT IN ('MATERIAL_SAMPLE', 'MATERIAL_CITATION')
GROUP BY occ.datasetkey
ORDER BY num_records DESC
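
The query above counts records whose basis of record is outside the two types, so a dataset that mixes both kinds would still be listed. If a stricter check is wanted, one possible variant keeps only datasets with zero material sample or material citation records. The sketch below tries that idea on a toy in-memory table. SQLite stands in for the real query engine, the dataset keys (`ds-observations`, `ds-mixed`, `ds-citations`) are hypothetical, and only the `occurrence` table name and its two columns come from the query above.

```python
import sqlite3

# Basis-of-record values named in the description above.
LINKED_TYPES = ("MATERIAL_SAMPLE", "MATERIAL_CITATION")


def skippable_datasets(conn):
    """Return (datasetkey, num_records) for datasets with no record of the linked types."""
    sql = """
        SELECT datasetkey, COUNT(*) AS num_records
        FROM occurrence
        GROUP BY datasetkey
        HAVING SUM(CASE WHEN basisofrecord IN (?, ?) THEN 1 ELSE 0 END) = 0
        ORDER BY num_records DESC, datasetkey
    """
    return conn.execute(sql, LINKED_TYPES).fetchall()


def demo():
    """Build a toy occurrence table (hypothetical data) and run the stricter query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE occurrence (datasetkey TEXT, basisofrecord TEXT)")
    rows = [
        ("ds-observations", "HUMAN_OBSERVATION"),
        ("ds-observations", "HUMAN_OBSERVATION"),
        ("ds-mixed", "HUMAN_OBSERVATION"),
        ("ds-mixed", "MATERIAL_SAMPLE"),
        ("ds-citations", "MATERIAL_CITATION"),
    ]
    conn.executemany("INSERT INTO occurrence VALUES (?, ?)", rows)
    return skippable_datasets(conn)


if __name__ == "__main__":
    print(demo())
```

In this toy data only `ds-observations` is returned. `ds-mixed` is excluded because one of its records is a material sample, even though it also has an observation record.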

That leaves approximately 23K datasets to process and 40K to exclude.

Alternatively, if it's easier, we can skip only the biggest ones:

DatasetKey	Number of Records
4fa7b334-ce0d-4e88-aaae-2e0c138d049e	1277552378
38b4c89f-584c-41bb-bd8f-cd1def33e92f	101285331
8a863029-f435-446a-821e-275f4f641165	79550796
50c9509d-22c7-4a22-a47d-8c48425ef4a7	73797195
95db4db8-f762-11e1-a439-00145eb45e9a	31838699
b124e1e0-4755-430f-9eab-894f25a9b59c	30753936
75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d	20999334
906e6978-e292-4a8b-9c39-adf6bb0f3323	20401489
6ac3f774-d9fb-4796-b3e9-92bf6c81c084	14500313
721a99a4-71f4-4466-b346-83c367889238	14079367
0645ccdb-e001-4ab0-9729-51f1755e007e	13552798
67fabcac-a638-40a6-9bea-aeca8aced9f1	13269255
292a71df-588b-48fa-9ab5-29ae868ba88c	13066620
e7cbb0ed-04c6-44ce-ac86-ebe49f4efb28	12811851
14d5676a-2c54-4f94-9023-1e8dcd822aa0	12142287
740df67d-5663-41a2-9d12-33ec33876c47	11987804
4bf1cca8-832c-4891-9e17-7e7a65b7cc81	11635460
83fdfd3d-3a25-4705-9fbe-3db1d1892b13	10893982
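
As a rough illustration of how the list above could be applied, the sketch below parses tab-separated `datasetkey<TAB>count` lines into a skip set and filters a list of candidate dataset keys against it. The function names `parse_skip_table` and `datasets_to_process` are hypothetical helpers, not part of the pipelines code, and the all-zero UUID is a made-up key used only to show a dataset that is not skipped.

```python
def parse_skip_table(text):
    """Parse 'datasetkey<TAB>count' lines into {key: count}, ignoring header and blank lines."""
    skip = {}
    for line in text.strip().splitlines():
        parts = line.split("\t")
        # Skip anything that is not exactly a key plus a numeric record count.
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue
        skip[parts[0].strip()] = int(parts[1])
    return skip


def datasets_to_process(all_keys, skip):
    """Return the keys that are not in the skip set, preserving input order."""
    return [key for key in all_keys if key not in skip]


# First two rows of the table above, with its header line.
SAMPLE = (
    "DatasetKey\tNumber of Records\n"
    "4fa7b334-ce0d-4e88-aaae-2e0c138d049e\t1277552378\n"
    "38b4c89f-584c-41bb-bd8f-cd1def33e92f\t101285331\n"
)

# Hypothetical key standing in for a dataset that still needs reprocessing.
HYPOTHETICAL_KEY = "00000000-0000-0000-0000-000000000000"

if __name__ == "__main__":
    skip = parse_skip_table(SAMPLE)
    candidates = ["4fa7b334-ce0d-4e88-aaae-2e0c138d049e", HYPOTHETICAL_KEY]
    print(datasets_to_process(candidates, skip))
```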

#964 linking grscicoll occurrences #975