bnowok / synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

Getting slower with every variable? #11

Closed cjvanlissa closed 4 years ago

cjvanlissa commented 5 years ago

It seems like syn() gets slower with every additional variable. In a data.frame with 305 variables (all integers in the 0-4 range), the first variables take about a second each to synthesize, and the last ones take about 15 minutes each. Any idea what causes this behavior?

gillian-raab commented 5 years ago

This is a well-known issue. I assume you are using the defaults for everything, which means CART models where each variable is predicted by everything that comes before it, and CART models can be slow with as many variables as that.
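A minimal base-R sketch of why later variables cost more under the defaults described above: the i-th variable is modelled on the i-1 variables before it, so the predictor matrix is lower triangular. The names `v1` to `v6` are hypothetical, and the matrix is built by hand here rather than taken from synthpop. Rows are the variables being predicted and columns are their predictors.

```r
# Hypothetical variable names, in default (column) order.
vars <- paste0("v", 1:6)

# Rows: variable being synthesised. Columns: candidate predictors.
pm <- matrix(0L, nrow = length(vars), ncol = length(vars),
             dimnames = list(vars, vars))

# Each variable is predicted by every variable that comes before it.
pm[lower.tri(pm)] <- 1L

# Number of predictors in each variable's model grows with its position.
n_predictors <- rowSums(pm)
print(pm)
print(n_predictors)
```

By the same counting, the last of 305 variables would be modelled on 304 predictors.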

Do you actually get to the end of the fit? Whether or not you do, there are several tactics you can use to get around this. The most obvious is to use a predictor matrix, which lets you select which variables predict which others. If you check out the synthpop web page www.synthpop.org.uk (a bit incomplete, I'm afraid), you will find a page with links to various papers and presentations we have written that might help. We are always interested to know what people use synthpop for. Do drop us a note to let us know how you are using it, and feel free to ask any other questions. Best, Gillian Raab gillian.raab@ed.ac.uk

cjvanlissa commented 5 years ago

So just to clarify - is it not the case that each variable is predicted by every other variable? Each variable is predicted only by the ones preceding it in the data.frame?

gillian-raab commented 5 years ago

Not quite. It is predicted by all variables that come before it in the visit.sequence. If you don't specify a visit.sequence, then it is the same as what you said. Changing the visit.sequence is another way of customising your synthesis. If it is important to maintain the relationships between a set of variables, then put them together near the start of the visit.sequence.
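To make the rule above concrete, here is a small base-R sketch that derives which variables predict which from a given visiting order. The helper `predictors_from_visit()` and the variable names are hypothetical and written only for illustration. This is not synthpop's internal code, just the rule as described in this thread.

```r
# Hypothetical helper: each variable is predicted by every variable
# visited before it. Rows are targets, columns are predictors.
predictors_from_visit <- function(vars, visit) {
  pm <- matrix(0L, length(vars), length(vars), dimnames = list(vars, vars))
  # The first visited variable has no predictors, so start at position 2.
  for (i in seq_along(visit)[-1]) {
    pm[visit[i], visit[seq_len(i - 1)]] <- 1L
  }
  pm
}

vars <- c("a", "b", "c", "d")

# No visit sequence specified: the column order is used.
pm_default <- predictors_from_visit(vars, visit = vars)

# Custom order: "d" goes first, so it has no predictors and predicts all others.
pm_custom <- predictors_from_visit(vars, visit = c("d", "b", "a", "c"))

print(pm_default)
print(pm_custom)
```

In the custom order, `a` is predicted by `d` and `b` only, even though `a` is the first column of the data.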

gillian-raab commented 5 years ago

PS do let me know what you are synthesising. G

cjvanlissa commented 5 years ago

Dear Gillian, I think I understand how the package operates now, but I'm not clear on why "the choice of explanatory variables is restricted by the synthesis sequence and variables that are not synthesised yet cannot be used in prediction models." It seems that, this way, structural relationships among variables are only fully preserved for the final variable in the visit.sequence?

gillian-raab commented 4 years ago

If you tried to use a variable that was not yet synthesised, it would not work at all. You are building up the synthetic data from conditional distributions. In each case you fit a model to the unsynthesised (original) data to get the parameters of the prediction model for the next variable. Then you make a synthetic version of the next variable in the synthetic data set by getting its predicted values from the variables synthesised already.

A simple example of why your suggestion will not work is the following. The first variable (v1) to be synthesised is usually just a bootstrap sample of the original data. Then you fit a model on the real data, with the second variable (v2) predicted from v1. This prediction model is then used to get v2 in the synthetic data by predicting it from the synthetic v1. If you tried to use a prediction from another variable in the original data, it would not work, because you would not have a version of it in the synthetic data, and the version in the original data would not line up with the synthetic data at all.
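The two-variable example above can be sketched in base R. This is an illustration under stated assumptions, not synthpop's implementation: the toy data are invented, and the "model" for v2 given v1 is simply the empirical conditional distribution in the original data, standing in for a CART fit.

```r
set.seed(1)
n <- 500

# Invented original data: v1 in 0..4, v2 within one step of v1.
orig <- data.frame(v1 = sample(0:4, n, replace = TRUE))
orig$v2 <- pmin(4L, pmax(0L, orig$v1 + sample(-1:1, n, replace = TRUE)))

# Step 1: the first variable is a bootstrap sample of the original v1.
synth <- data.frame(v1 = sample(orig$v1, n, replace = TRUE))

# Step 2: "fit" v2 | v1 on the ORIGINAL data (observed v2 values by v1).
cond <- split(orig$v2, orig$v1)

# Step 3: generate synthetic v2 from the SYNTHETIC v1 using that fit.
draw_v2 <- function(k) {
  pool <- cond[[k]]
  pool[sample.int(length(pool), 1)]
}
synth$v2 <- vapply(as.character(synth$v1), draw_v2, integer(1),
                   USE.NAMES = FALSE)

# The v1-v2 relationship carries over to the synthetic data.
print(table(orig$v1, orig$v2))
print(table(synth$v1, synth$v2))
```

Step 3 can only use columns that already exist in `synth`, which is why a variable that has not been synthesised yet cannot serve as a predictor.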

You also suggested that the only relationships that would be maintained between variables would be the ones defined by these models, that is, the relationships between the later variables and the earlier ones. But remember that a relationship between an earlier variable and a later one is maintained as well, because the same fit also makes the earlier variable in the synthetic data dependent on the later one.

ynren1020 commented 5 months ago

Hi Gillian, I have a similar question to the one @cjvanlissa posted here: the synthesis speed can be dramatically different if I change the visit.sequence. Is there a tip or trick I can use to speed up the synthesis? For example, should I put the variables with more factor levels later in the sequence? Should I put the variables with some missing values later too? In general, is there any strategy you recommend to speed up the synthesis process while keeping the utility of the synthetic data? Thank you, Yanan

gillian-raab commented 5 months ago

See below. Also check out a short paper that might help.

https://arxiv.org/pdf/1712.04078

> For example, should I put the variable with more factor levels at a later time?

Yes, that is a good idea, because it stops the models from getting too big. Another option for large, complex data sets is to restrict the variables used in prediction by editing the predictor matrix to exclude them. Start by running a model with m=0 to give an object I might call syn0, and then edit syn0$predictor.matrix to exclude some of the variables with many categories.
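A hedged sketch of that workflow, assuming the synthpop package is installed and using its bundled SD2011 example data. The choice of columns, the row subset, and the decision to exclude `marital` are purely illustrative, and the whole block is skipped if the package is not available.

```r
if (requireNamespace("synthpop", quietly = TRUE)) {
  library(synthpop)

  # Illustrative subset of the SD2011 example data shipped with synthpop.
  vars <- c("sex", "age", "edu", "marital", "income")
  ods <- SD2011[1:1000, vars]

  # m = 0 sets everything up without generating any synthetic data.
  syn0 <- syn(ods, m = 0)

  # Rows are the variables being predicted, columns are their predictors.
  pm <- syn0$predictor.matrix
  print(pm)

  # Stop 'marital' from being used as a predictor for any other variable.
  # It is still synthesised itself.
  pm[, "marital"] <- 0

  # Run the synthesis with the edited predictor matrix.
  sds <- syn(ods, predictor.matrix = pm, seed = 2024)
  print(sds$predictor.matrix)
}
```

Zeroing a column removes that variable as a predictor everywhere, and zeroing single cells gives finer control over individual models.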

> Should I put the variable with some missing values at a later time too?

No, that won't make any difference.

> In general, is there any strategy you recommend to use to speed up the synthesis process while keeping the synthetic data utility? Thank you, Yanan

Best Gillian
