This document aims to describe current best practices for dealing with primary occurrence data for sensitive species and provide guidance on how to make data as freely available as possible and as protected as necessary.
Dear all contributors,

Thanks for a wonderfully written document! I'm providing a few insights from the point of view of FinBIF (http://species.fi) that you might find interesting.
We have a data warehouse that holds about 40M occurrences from ~20 IT systems and ~400 datasets. We have implemented a "securing system" for sensitive species that has some features which may be quite new compared to other IT systems that do data generalization. Unfortunately, we have only written about this in Finnish, and even that documentation does not go into details.
First, some things that are similar to concepts found in your document and in many implementations:
Everything starts from our taxonomy. For each taxon concept we can define a secure level. We use the following grading: SPAM, HIGHEST, KM100, KM50, KM25, KM10, KM5, KM1, NONE (default). A working group of Finnish experts produced a list of ~200 taxon concepts that should be secured (there is a little bit of information about this in English at https://laji.fi/en/about/875). The level can be defined for 1) all occurrences of a taxon, 2) a time period (currently two fixed periods: summer and winter), 3) just breeding sites/nests, or 4) separately for occurrences located in nature protection areas. Depending on the level of coarsening, data other than spatial data is also removed; for example, at KM100 the time is coarsened to a decade, at KM10 the municipality information is removed, and so on. For Finnish occurrences we coarsen using our metric grid coordinate system; for foreign occurrences we use WGS84 degree rounding.
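To make the coordinate coarsening concrete, here is a minimal sketch; the class and method names are invented for illustration and this is not the actual Securer.java logic:

```java
// Minimal sketch of coordinate coarsening. Names are invented and do not
// match the actual Securer.java implementation.
public class CoarseningSketch {

    // Snap a metric coordinate (e.g. a national metric grid) down to the
    // grid cell corner for the given secure level, e.g. KM10 -> 10 000 m cells.
    static long coarsenMetric(long coordinate, int gridSizeMeters) {
        return (coordinate / gridSizeMeters) * gridSizeMeters;
    }

    // For foreign occurrences: round WGS84 degrees down to a fixed precision,
    // e.g. 1 decimal (~11 km in latitude) for a KM10-like level.
    static double coarsenWgs84(double degrees, int decimals) {
        double factor = Math.pow(10, decimals);
        return Math.floor(degrees * factor) / factor;
    }

    public static void main(String[] args) {
        System.out.println(coarsenMetric(6715301, 10_000)); // -> 6710000
        System.out.println(coarsenWgs84(60.1699, 1));       // -> 60.1
    }
}
```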
Data generalization can also occur for other reasons: 1) the user has asked to hide the place, 2) the user has asked to hide a person's name, 3) the dataset has a research embargo (for example 2-4 years): data from this period (by event date) is generalized, and once the embargo expires the data becomes public, or 4) data sources may provide a full version for the private side and a limited version for the public side of the data warehouse.
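A minimal sketch of how an embargo rule of this kind could be evaluated, assuming the embargo is measured from the event date; the names are assumptions, not FinBIF's actual code:

```java
import java.time.LocalDate;

// Sketch of an embargo rule: occurrences whose event date falls within the
// embargo window are generalized; once the embargo expires they become
// public. Names are illustrative only.
public class EmbargoSketch {

    static boolean underEmbargo(LocalDate eventDate, int embargoYears) {
        LocalDate embargoEnds = eventDate.plusYears(embargoYears);
        return LocalDate.now().isBefore(embargoEnds);
    }

    public static void main(String[] args) {
        // With a 3-year embargo, a record collected last year is still hidden;
        // one collected five years ago is already public.
        System.out.println(underEmbargo(LocalDate.now().minusYears(1), 3)); // true
        System.out.println(underEmbargo(LocalDate.now().minusYears(5), 3)); // false
    }
}
```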
Then some things that are not so common:

Our repository has two sides: private and public. The public side contains data where some information is generalized; the private side contains the full data. Government officials and similar parties can access the full data using a separate service. Researchers, or anyone interested in a full version of the data, can make a data request. We have an IT system for handling requests, where the requester fills in the reasoning for why they want the data. The owners of the datasets approve or decline the request. Once approved, a download becomes available.
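A sketch of that request flow, simplified to a single approval decision (in reality each dataset owner decides separately); all types and names here are invented:

```java
// Sketch of the data-request flow: a requester states their reasoning,
// the dataset owner approves or declines, and an approved request yields
// a download. Names are invented, not FinBIF's API.
public class DataRequestSketch {

    enum Status { PENDING, APPROVED, DECLINED }

    static final class DataRequest {
        final String requester;
        final String reasoning;
        Status status = Status.PENDING;

        DataRequest(String requester, String reasoning) {
            this.requester = requester;
            this.reasoning = reasoning;
        }

        // A download becomes available only after approval.
        boolean downloadAvailable() {
            return status == Status.APPROVED;
        }
    }

    public static void main(String[] args) {
        DataRequest r = new DataRequest("researcher@example.org",
                "Modelling habitat use of a red-listed raptor");
        System.out.println(r.downloadAvailable()); // false while PENDING
        r.status = Status.APPROVED;                // dataset owner approves
        System.out.println(r.downloadAvailable()); // true
    }
}
```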
We have structured documents that contain many occurrences, for example a document that contains all occurrences recorded during a bird/butterfly line-transect count, or a long list of species captured with a malaise trap. If this kind of document contains a sensitive species, it would be a shame to generalize all of its occurrences. Instead, we split out the occurrences that need generalization, leaving the rest of the document intact. The original document is marked as incomplete, indicating that something has been detached from it. The split-off occurrences are given newly generated identifiers (GUIDs), so that they cannot be joined with the original. The loading of the split document is also randomly delayed by 10-20 days, so that load dates cannot be used to join the original and split documents.
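The splitting step might look roughly like this sketch; the types and fields are invented and the real implementation (in Securer.java) differs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of document splitting: sensitive occurrences are detached into a
// new document with a freshly generated identifier, the original is flagged
// incomplete, and loading of the split document is delayed by a random
// 10-20 days. Types and fields are invented.
public class SplitSketch {

    record Occurrence(String taxonId, boolean sensitive) {}

    static class Document {
        final String guid;
        final List<Occurrence> occurrences;
        boolean incomplete = false;
        int loadDelayDays = 0;

        Document(String guid, List<Occurrence> occurrences) {
            this.guid = guid;
            this.occurrences = occurrences;
        }
    }

    static Document splitSensitive(Document original) {
        List<Occurrence> detached = new ArrayList<>();
        for (Occurrence o : original.occurrences) {
            if (o.sensitive()) detached.add(o);
        }
        original.occurrences.removeAll(detached);
        original.incomplete = !detached.isEmpty(); // "something was detached"

        // New GUID so the split document cannot be joined to the original;
        // random 10-20 day delay so load dates cannot be used to join them.
        Document split = new Document(UUID.randomUUID().toString(), detached);
        split.loadDelayDays = ThreadLocalRandom.current().nextInt(10, 21);
        return split;
    }

    public static void main(String[] args) {
        Document doc = new Document("original-guid", new ArrayList<>(List.of(
                new Occurrence("MX.1", false), new Occurrence("MX.2", true))));
        Document split = splitSensitive(doc);
        System.out.println(doc.incomplete + " " + split.loadDelayDays);
    }
}
```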
Splitting documents causes headaches for annotations. Annotations can be made on both the split and the original document (the original being available on the private side of the data warehouse). We must still be able to join annotations made on the private side to the public split document, and vice versa. This is accomplished quite simply by keeping track of the original ids and the generated split ids.
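Conceptually this is just a two-way identifier mapping maintained in the warehouse, as in the following sketch (names invented):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of joining annotations across the private original and the public
// split document by keeping a two-way map of identifiers. In practice this
// mapping would live in the data warehouse; names are invented.
public class SplitIdRegistry {

    private final Map<String, String> originalToSplit = new HashMap<>();
    private final Map<String, String> splitToOriginal = new HashMap<>();

    void register(String originalId, String splitId) {
        originalToSplit.put(originalId, splitId);
        splitToOriginal.put(splitId, originalId);
    }

    // An annotation made on the private side resolves to the public split id...
    String toPublicId(String originalId) {
        return originalToSplit.get(originalId);
    }

    // ...and an annotation on the public side resolves back to the original.
    String toPrivateId(String splitId) {
        return splitToOriginal.get(splitId);
    }

    public static void main(String[] args) {
        SplitIdRegistry reg = new SplitIdRegistry();
        reg.register("orig-123", "split-9a1f");
        System.out.println(reg.toPublicId("orig-123"));    // split-9a1f
        System.out.println(reg.toPrivateId("split-9a1f")); // orig-123
    }
}
```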
Strong consideration has to be given to what information can be revealed, so that it is not possible for a knowledgeable person to determine the original document of a split occurrence using person names, collection identifiers, etc. In rarely observed or remote areas there may be very few documents each year for a not-so-popular taxon group, leaving very few candidates; sometimes even just two: the split document and the original. This is a weak point in our system.

Source codes can be found here: https://bitbucket.org/luomus/laji-etl/src/master/WEB-INF/src/main/fi/laji/datawarehouse/etl/models/Securer.java
There is at least one feature in your document that I'll add to our TODO list: we currently remove fields without leaving placeholders. As you note in your document, it would be better to leave placeholders (localization is an issue, since we run a three-language service, but occurrence data is often in only one language anyway). I'll read the document again with more thought.
Cheers,
Esko Piirainen
Luomus / FinBIF