Support synthetic data use

alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io

BSD 3-Clause "New" or "Revised" License


Support synthetic data use #2000

Open JimMadge opened 4 months ago

JimMadge commented 4 months ago

Development outside the TRE would be enhanced with access to synthetic data that mimics the structure of sensitive data.

Such synthetic data could be used to validate code without the need for code ingress. It would also help debug code as there would be no need to find a method for egress of error messages from the TRE.

What could we do in the way of:

- Providing or encouraging the use of synthetic data tools
- Developing tools for synthetic data use alongside the TRE

craddm commented 3 months ago

Would we need synthetic data, or simply dummy data? The latter is a far smaller ask.

We'd only be aiming for people to be able to test that their code runs - it's not necessary for the data to have comparable statistical qualities to the original.
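As a rough illustration of what "dummy data" could mean here, the sketch below fills a table with arbitrary values that match only the column names and types. The schema, column names, and helper functions are all hypothetical and are not part of the Data Safe Haven codebase.

```python
import random
import string

# Hypothetical schema: column name -> type label. Only the structure is
# assumed to come from the sensitive dataset; no values are copied.
SCHEMA = {"patient_id": "int", "age": "int", "postcode": "str", "score": "float"}


def dummy_value(kind, rng):
    """Return an arbitrary value of the requested type."""
    if kind == "int":
        return rng.randint(0, 100)
    if kind == "float":
        return rng.uniform(0.0, 1.0)
    if kind == "str":
        return "".join(rng.choice(string.ascii_uppercase) for _ in range(6))
    raise ValueError(f"unsupported column type: {kind}")


def make_dummy_rows(schema, n_rows, seed=0):
    """Build n_rows dicts whose keys and value types follow the schema."""
    rng = random.Random(seed)
    return [
        {name: dummy_value(kind, rng) for name, kind in schema.items()}
        for _ in range(n_rows)
    ]


rows = make_dummy_rows(SCHEMA, n_rows=5)
```

Data like this is enough to check that code runs end to end, but any results computed from it are meaningless.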

JimMadge commented 3 months ago

Good point. I think either synthetic or dummy data would benefit researchers.

Both should give a good indication of whether the code will run. Synthetic data has the extra advantage of producing more representative and interpretable results.
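To make the contrast concrete, here is a deliberately simple sketch of the synthetic side: it fits a normal distribution to one numeric column and samples new values from it. The input values are made up, the function names are hypothetical, and the approach is an assumption for illustration only. It ignores correlations between columns and offers no privacy guarantee, which real synthetic data tools have to address.

```python
import random
import statistics


def fit_marginal(values):
    """Summarise a numeric column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, n, seed=0):
    """Draw n values from a normal distribution fitted to the column."""
    rng = random.Random(seed)
    mu, sigma = fit_marginal(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up example values standing in for a sensitive column.
real_ages = [34, 45, 29, 61, 52, 38, 47]
synthetic_ages = sample_synthetic(real_ages, n=1000)
```

Unlike dummy data, the sampled column has roughly the same mean and spread as the original, so summary outputs look plausible.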

JimMadge commented 3 months ago

I don't think we have the capacity to invent the synthetic/dummy data tools ourselves.

However, we could think about whether we can:

- Integrate existing tools
- Provide a way to extract dummy data
- Do anything else to support people testing against synthetic/dummy data outside the TRE before running against sensitive data inside the TRE
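One possible shape for the "extract dummy data" option is sketched below: read a CSV inside the TRE and emit only column names and inferred types, which could then drive dummy data generation outside. This is an assumption about how it might work, not an existing feature, and the function names and sample columns are hypothetical. Column names can themselves be sensitive, so any such output would still need review before egress.

```python
import csv
import io


def infer_type(values):
    """Return 'int', 'float', or 'str' depending on what every value parses as."""
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "str"


def extract_schema(csv_file):
    """Map each column name to an inferred type, discarding the values."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return {name: infer_type(vals) for name, vals in columns.items()}


# In-memory stand-in for a sensitive file; the contents are made up.
sample = io.StringIO("patient_id,age,score,postcode\n1,34,0.5,AB1\n2,45,0.75,CD2\n")
schema = extract_schema(sample)
```

The resulting mapping contains no row values, only the structure needed to build matching dummy data.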