alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Support synthetic data use #2000

Open JimMadge opened 2 months ago

JimMadge commented 2 months ago

Development outside the TRE would be enhanced with access to synthetic data that mimics the structure of sensitive data.

Such synthetic data could be used to validate code without the need for code ingress. It would also help debug code as there would be no need to find a method for egress of error messages from the TRE.

What could we do in the way of,

craddm commented 1 month ago

Would we need synthetic data, or simply dummy data? The latter is a far smaller ask.

We'd only be aiming for people to be able to test that their code runs - it's not necessary for the data to have comparable statistical qualities to the original.
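As a rough sketch of how small that ask could be, dummy data can be generated from nothing more than column names and types. The schema, column names, value ranges, and function names below are all hypothetical and not taken from any real dataset or from the Data Safe Haven codebase.

```python
import random
import string
from datetime import date, timedelta

# Hypothetical schema: column name -> type label (invented for illustration).
SCHEMA = {
    "patient_id": "int",
    "admitted": "date",
    "ward": "str",
    "score": "float",
}


def dummy_value(kind, rng):
    """Return a random value of the requested type; no statistical realism intended."""
    if kind == "int":
        return rng.randint(0, 10_000)
    if kind == "float":
        return rng.uniform(0.0, 100.0)
    if kind == "str":
        return "".join(rng.choices(string.ascii_lowercase, k=8))
    if kind == "date":
        return date(2000, 1, 1) + timedelta(days=rng.randint(0, 9000))
    raise ValueError(f"unsupported column type: {kind}")


def make_dummy_rows(schema, n_rows, seed=0):
    """Build n_rows dictionaries whose keys and value types follow the schema."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    return [
        {name: dummy_value(kind, rng) for name, kind in schema.items()}
        for _ in range(n_rows)
    ]


rows = make_dummy_rows(SCHEMA, 5)
```

Rows like these only have the right shape and types, which is enough to check that code runs, but any results computed from them are meaningless.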

JimMadge commented 1 month ago

Good point. I think either synthetic or dummy data would benefit researchers.

Both should give a good indication of whether the code will run or not. Synthetic data would have the extra advantage of giving more representative/interpretable results.
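To illustrate the difference, here is a very simplified sketch in which each numeric column is sampled independently from a normal distribution using per-column means and standard deviations. The summary values, column names, and function name are hypothetical, and it assumes such summaries would be available outside the TRE at all. Real synthetic data tools would also need to handle correlations, categorical columns, and disclosure risk.

```python
import random
import statistics

# Hypothetical per-column summaries (mean, standard deviation); invented for illustration.
SUMMARY = {"age": (52.0, 18.0), "score": (3.4, 1.1)}


def synthetic_rows(summary, n_rows, seed=0):
    """Sample each column independently from a normal distribution with the given summary."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in summary.items()}
        for _ in range(n_rows)
    ]


rows = synthetic_rows(SUMMARY, 2000)

# The sample mean should sit close to the summary mean the data was drawn from.
age_mean = statistics.fmean(r["age"] for r in rows)
```

Unlike purely random dummy values, aggregate results computed on these rows land in a plausible range, which makes output easier to sanity-check before running against the real data.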

JimMadge commented 1 month ago

I don't think we have the capacity to invent the synthetic/dummy data tools ourselves.

However, we could think about whether we can,