bnowok / synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

Error when comparing data without numeric data #17

Closed joseph-allen closed 3 years ago

joseph-allen commented 3 years ago

If I run compare(syn_df, df)

I receive the following error: Error in hist.default(vardata, breaks = breaks[[i]], plot = FALSE) : 'x' must be numeric

Summary of my df

                        variable   class nmiss perctmiss ndistinct                                                            details
1                      age_group  factor     0      0.00         6                                                  See table in labs
2                            sex  factor     0      0.00         2                                                    'Female' 'Male'
3                   ethnic_group  factor     0      0.00         3                                 'Not answered' 'Not white' 'White'
4                sexual_identity  factor     0      0.00         3 'Heterosexual/straight' 'Not answered' 'Not heterosexual/straight'
5            relationship_status  factor    91      2.40         7                                                  See table in labs
6         opp_one_night_stand_ok  factor     9      0.24         6                                                  See table in labs
7        opp_sex_without_love_ok  factor    10      0.26         6                                                  See table in labs
8       opp_pressure_to_have_sex  factor     9      0.24         6                                                  See table in labs
9  opp_men_have_higher_sex_drive  factor     9      0.24         6                                                  See table in labs
10        opp_too_much_sex_media  factor     8      0.21         6                                                  See table in labs
11                     has_child logical     0      0.00         2                                                               

gillian-raab commented 3 years ago

Dear Joseph, Thanks for your email. I hope this reaches you. We will look into this (I'll take a first look later today). I'm copying in Beata, who knows the code better than I do. Your email suggests that it happens because all of your variables are factors or logical. Is that right? If so, it could help us to trace the error. Best wishes, Gillian

Gillian M Raab

Emeritus Professor, Edinburgh Napier University

Part-time Research Fellow

Administrative Data Research Centre - Scotland

Edinburgh

+44 7748 678 551


From: Joseph Allen @.> Sent: 13 June 2021 22:37 To: bnowok/synthpop @.> Cc: Subscribed @.***> Subject: [bnowok/synthpop] Error when comparing data without numeric data (#17)


The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.

joseph-allen commented 3 years ago

Yes, that is correct. I think the hist function used internally might not work with factors?

gillian-raab commented 3 years ago

That is not the case. We have used it extensively for factors. I thought perhaps it might not like ONLY factors being synthesised, but the example below shows this is not the case. I'd be happy to try to work out what is going on if you send me some more details, for example the code you used to create the synthetic data. Did any plots appear before you got the failure, and can you get any to appear by setting the vars parameter in compare? Also, when you get the failure, do you get anything useful from traceback?

I don't suppose you can send me the original data, but maybe a synthesised data set would work.

Best Gillian

#################### 4 variables from data with package ###############
tosyn <- SD2011[, 1:4]
head(tosyn)
syn1 <- syn(tosyn)
compare(syn1, tosyn)

##################### now 3 variables all factors
tosyn <- SD2011[, c(1, 3, 4)]
head(tosyn)
syn1 <- syn(tosyn)
compare(syn1, tosyn)



From: Joseph Allen @.> Sent: 14 June 2021 10:19 To: bnowok/synthpop @.> Cc: RAAB Gillian @.>; Comment @.> Subject: Re: [bnowok/synthpop] Error when comparing data without numeric data (#17)


bnowok commented 3 years ago

It's the logical variable that is causing the problem (no need to send us any data). I will look into this asap.
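The diagnosis above can be illustrated with base R alone. The following is a minimal sketch, not synthpop's actual code: the vector `x` is made up, and the idea that compare() sent the logical column to hist() is inferred from the error message in this thread rather than from the package source.

```r
# Base R only; `x` is a made-up logical vector standing in for has_child.
x <- c(TRUE, FALSE, TRUE, NA)

# Logical vectors are not numeric in R, so hist() refuses them.
is.numeric(x)

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    hist(x, plot = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# A categorical variable is summarised by counting its levels instead.
table(as.factor(x), useNA = "ifany")
```

Because a factor is tabulated rather than binned, converting the column with as.factor() avoids the hist() call in this sketch.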

joseph-allen commented 3 years ago

I've converted the logical column to a factor, though I don't know enough about R to be confident it doesn't consider a 2-factor column a logical type.

df$has_child=as.factor(df$has_child)
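On the worry that R might still treat a two-level factor as a logical type: it does not, and this can be checked directly. A small sketch with made-up values (the vector below is illustrative, not taken from the dataset):

```r
# Made-up values standing in for the has_child column.
has_child <- c(TRUE, FALSE, TRUE)
f <- as.factor(has_child)

class(has_child)  # "logical"
class(f)          # "factor"
is.logical(f)     # FALSE: a factor with two levels is still a factor
levels(f)         # "FALSE" "TRUE"
```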

Outputs from codebook confirm

##                         variable  class nmiss perctmiss ndistinct
## 1                      age_group factor     0      0.00         6
## 2                            sex factor     0      0.00         2
## 3                   ethnic_group factor     0      0.00         3
## 4                sexual_identity factor     0      0.00         3
## 5            relationship_status factor    91      2.40         7
## 6         opp_one_night_stand_ok factor     9      0.24         6
## 7        opp_sex_without_love_ok factor    10      0.26         6
## 8       opp_pressure_to_have_sex factor     9      0.24         6
## 9  opp_men_have_higher_sex_drive factor     9      0.24         6
## 10        opp_too_much_sex_media factor     8      0.21         6
## 11                     has_child factor     0      0.00         2

Then running either compare(syn_df, df, vars = "has_child", msel = 1:5) or compare(syn_df, df, msel = 1:5) results in the same error: Error in hist.default(vardata, breaks = breaks[[i]], plot = FALSE) : 'x' must be numeric

joseph-allen commented 3 years ago

Apologies @gillian-raab I missed your reply.

The dataset is available here -> https://github.com/UKDataServiceOpen/Synthetic-Data/blob/main/code-demo/NATSAL/natsal_3_teaching_open_with_personal.csv

I'm actually running a training series on synthetic data at the moment!

Plotting the individual variables except for has_child works.

Here is an RPubs link showing it works with one variable.

bnowok commented 3 years ago

It should work after conversion to factor. The code below didn't return an error:

library(readr)     # provides read_csv()
library(synthpop)  # provides syn() and compare()

df <- read_csv("natsal_3_teaching_open_with_personal.csv")
# Drop the columns not used in this example
drops <- c("first_name", "last_name", "email", "importance_religion", "age_at_first_child")
df <- df[, !(names(df) %in% drops)]
# Convert character columns to factors, then the logical column explicitly
df <- as.data.frame(unclass(df), stringsAsFactors = TRUE)
df$has_child <- as.factor(df$has_child)
syn_df <- syn(df, m = 5)
compare(syn_df, df, vars = "has_child", msel = 1:5)

joseph-allen commented 3 years ago

Yep, that's it. This is probably a sign of my lack of R knowledge rather than anything wrong with your library, but perhaps logical columns should be automatically converted to factors?
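For anyone on an older synthpop version who wants to apply this conversion to every logical column at once, one possible user-side workaround is sketched below. It uses base R only; the data frame is a made-up stand-in for the real data, and this is not how the package itself handles logical variables.

```r
# Made-up data frame with one factor column and one logical column.
df <- data.frame(
  sex = factor(c("Female", "Male", "Female")),
  has_child = c(TRUE, FALSE, TRUE)
)

# Find the logical columns and convert only those to factors.
is_lgl <- vapply(df, is.logical, logical(1))
df[is_lgl] <- lapply(df[is_lgl], as.factor)

str(df)
```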

joseph-allen commented 3 years ago

I forgot to say, this fixed my issue!

bnowok commented 3 years ago

Great! You shouldn't need to convert logical variables to factors though. I will deal with this issue. Thanks for reporting the problem.

gillian-raab commented 3 years ago

Well spotted, Beata. Joseph, we'd be interested to know more about your training course.



From: bnowok @.> Sent: 15 June 2021 10:55 To: bnowok/synthpop @.> Cc: RAAB Gillian @.>; Mention @.> Subject: Re: [bnowok/synthpop] Error when comparing data without numeric data (#17)


joseph-allen commented 3 years ago

Details are here -> https://ukdataservice.ac.uk/news-and-events/eventsitem/?id=5780

Feel free to e-mail me joseph.allen@manchester.ac.uk if you want to chat more

bnowok commented 3 years ago

In the new version of synthpop (1.7-0, available on GitHub) you do not have to convert logical variables to factors.