floswald / CASDr

R tools for work on CASD
https://floswald.github.io/CASDr/

CASDr : R Tools for the secure CASD Environment


CASD

CASD is the secure data platform for French administrative data. Since its inception in 2010, 321 publications have been registered.

CASD is a typical high-security data access system: researchers work on a remote server that they access with fingerprint and card authentication. The remote server is not connected to the internet, and data export from the server is subject to statistical disclosure controls. It's a challenging development environment.

Why Does The World Need This Package?

It does not necessarily have to be this package. But it could be part of a start.

What this Package Could Deliver

Imagine the following rosy scenario:

  1. You obtain money and security clearance via the Comité du secret statistique to work on some data on CASD. Suppose DADS - what else?
  2. You have of course no idea how to even load the damn thing. It's in SAS.
  3. Imagine you had a website like this one here, where a simple vignette would show in a worked example how to load this step by step?
  4. Imagine it would go further and introduce you to some of the more finicky details, tips and tricks needed to achieve a certain task.

Which Tasks?

CASD hosts a huge number of datasets, many of them in several versions that vary over time. So which tasks?

  1. Read a set of columns out of a certain SAS database.
  2. Compute the straight-line distance between two vectors of lat-lon coordinates, and perform other geospatial operations.
  3. General data cleaning of a certain database.
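As an illustration of the second task, here is a minimal base-R sketch of a distance helper. It uses the haversine (great-circle) formula as a stand-in for "straight line" distance on a sphere, which may or may not be what a given project needs. The function name `haversine_km` and the Earth radius default are assumptions made for this example, not part of the package.

```r
# Great-circle (haversine) distance in kilometres between paired coordinates.
# All coordinate arguments are numeric vectors in decimal degrees.
# `haversine_km` is a hypothetical helper name, not an existing CASDr function.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against tiny floating point overshoot above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Two origin-destination pairs: Paris to Lyon, and Paris to Marseille
d <- haversine_km(
  lat1 = c(48.8566, 48.8566), lon1 = c(2.3522, 2.3522),
  lat2 = c(45.7640, 43.2965), lon2 = c(4.8357, 5.3698)
)
print(round(d))
```

Because the function is vectorised, it can be applied directly to two columns of coordinates without a loop.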

But ... User-specific Requirements?

Each project has unique data cleaning requirements. A shared function could still accommodate them, for example a census loader with a call like

loadRP(year = 2015, ... )

where ... would hold keywords on which to subset the census.

The key difference is that all researchers could agree on the best version of this custom code and share it with others.
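One way such a function could look is sketched below, in base R. Everything here apart from the name `loadRP` and its `year` argument is an assumption: `read_rp_year()` is a hypothetical placeholder that returns fake data instead of reading the real census, and the column names are invented. The sketch treats each named argument passed through `...` as a set of values to keep for that column.

```r
# Hypothetical reader: stands in for whatever would read the census file for
# a given year on CASD. Here it just returns a tiny fake table.
read_rp_year <- function(year) {
  data.frame(
    year = year,
    dep  = c("75", "75", "69", "13"),  # invented column names and values
    age  = c(34L, 61L, 27L, 45L),
    stringsAsFactors = FALSE
  )
}

# Each named argument in `...` is a vector of values to keep for that column,
# e.g. loadRP(year = 2015, dep = "75").
loadRP <- function(year, ...) {
  filters <- list(...)
  rp <- read_rp_year(year)
  unknown <- setdiff(names(filters), names(rp))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  for (col in names(filters)) {
    rp <- rp[rp[[col]] %in% filters[[col]], , drop = FALSE]
  }
  rp
}

paris <- loadRP(year = 2015, dep = "75")
print(paris)
```

Failing loudly on an unknown column name is a deliberate choice in this sketch: a silently ignored keyword would return the full census and could go unnoticed.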

What does Tested mean?

We want to run automated unit tests on the code we use on CASD, as is done for this repository with GitHub Actions.

But CASD is an offline environment!

Correct:

  1. One would have to build fake datasets to test package functionality outside of CASD. Not a single byte of sensitive data needs to be exported from CASD.
  2. One would have to set up a way to import the R package into CASD at regular intervals. It's generally no problem to import code into one's user space.
  3. We would then run the same unit tests inside CASD and make changes to the code on GitHub. Rinse and repeat.
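The first and third steps above can be sketched in base R as follows. The fake dataset, its column names, and the function `mean_wage_by_year()` are all invented for illustration and do not reflect the schema of any real CASD dataset. In a real package the checks would more likely live in a test framework such as testthat; plain `stopifnot()` is used here only to keep the example dependency-free.

```r
# Build a small fake dataset that mimics the shape of the real one.
# Column names and value ranges are invented for this example.
make_fake_data <- function(n = 100, seed = 1) {
  set.seed(seed)  # fixed seed so the fake data is reproducible
  data.frame(
    id   = seq_len(n),
    wage = round(runif(n, min = 1000, max = 5000)),
    year = sample(2010:2015, n, replace = TRUE)
  )
}

# A hypothetical package function under test: mean wage by year.
mean_wage_by_year <- function(df) {
  aggregate(wage ~ year, data = df, FUN = mean)
}

# The same checks could run outside CASD on fake data and inside CASD
# after the package has been imported.
fake <- make_fake_data()
res <- mean_wage_by_year(fake)
stopifnot(
  is.data.frame(res),
  all(c("year", "wage") %in% names(res)),
  all(res$wage >= 1000 & res$wage <= 5000)
)
```

The checks only assert properties that must hold for any valid input (column names, value bounds), so they do not depend on the contents of the sensitive data.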

Is there any involvement of CASD?