Closed joseph-allen closed 3 years ago
Dear Joseph, Thanks for your email. I hope this reaches you. We will look into this (I'll take a first look later today). I'm copying in Beata, who knows the code better than I do. Your email suggests that it happens because you have no variables that are not factors or logical. Is that right? It could help us to trace the error if it is. Best wishes, Gillian
Gillian M Raab
Emeritus Professor, Edinburgh Napier University
Part-time Research Fellow
Administrative Data Research Centre - Scotland
Edinburgh
+44 7748 678 551
From: Joseph Allen @.> Sent: 13 June 2021 22:37 To: bnowok/synthpop @.> Cc: Subscribed @.***> Subject: [bnowok/synthpop] Error when comparing data without numeric data (#17)
This email was sent to you by someone outside the University. You should only click on links or attachments if you are certain that the email is genuine and the content is safe.
If I run compare(syn_df, df)
I receive the following error: Error in hist.default(vardata, breaks = breaks[[i]], plot = FALSE) : 'x' must be numeric
Summary of my df
   variable                      class   nmiss perctmiss ndistinct details
1  age_group                     factor      0      0.00         6 See table in labs
2  sex                           factor      0      0.00         2 'Female' 'Male'
3  ethnic_group                  factor      0      0.00         3 'Not answered' 'Not white' 'White'
4  sexual_identity               factor      0      0.00         3 'Heterosexual/straight' 'Not answered' 'Not heterosexual/straight'
5  relationship_status           factor     91      2.40         7 See table in labs
6  opp_one_night_stand_ok        factor      9      0.24         6 See table in labs
7  opp_sex_without_love_ok       factor     10      0.26         6 See table in labs
8  opp_pressure_to_have_sex      factor      9      0.24         6 See table in labs
9  opp_men_have_higher_sex_drive factor      9      0.24         6 See table in labs
10 opp_too_much_sex_media        factor      8      0.21         6 See table in labs
11 has_child                     logical     0      0.00         2
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/bnowok/synthpop/issues/17, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AE3HB7HYVEXNJMCFFJEW2NDTSUQKDANCNFSM46UDVCRA.
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
Yes that is correct, I think the .hist function within might not work with factors?
That is not the case. We have used it extensively for factors. I thought perhaps it might not like ONLY factors being synthesised, but the example below shows this is not the case. I'd be happy to try to work out what is going on if you send me some more details, for example the code you used to create the synthetic data. Did any plots appear before you got the failure, and can you get any to appear by setting the vars parameter in compare? Also, when you get the failure, do you get anything useful from traceback?
I don't suppose you can send me the original data, but maybe a synthesised data set would work.
Best Gillian
#################### 4 variables from data with package ###############
tosyn <- SD2011[,1:4]
head(tosyn)
syn1 <- syn(tosyn)
compare(syn1, tosyn)

##################### now 3 variables all factors
tosyn <- SD2011[,c(1,3,4)]
head(tosyn)
syn1 <- syn(tosyn)
compare(syn1, tosyn)
It's the logical variable that is causing the problem (no need to send us any data). I will look into this asap.
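The error message itself comes from base R: `hist()` rejects any input for which `is.numeric()` is false, and that includes logical vectors. A minimal sketch using base R only (the toy vector is made up, and how `compare()` ends up passing the column to `hist()` is not shown in this thread):

```r
# Toy logical vector standing in for a column such as has_child (made-up values)
x <- c(TRUE, FALSE, TRUE)

# Logical vectors are not numeric as far as is.numeric() is concerned
is.numeric(x)  # FALSE

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    hist(x, plot = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message matches the one reported above, which is consistent with the logical column being the trigger.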
I've converted the logical column to a factor, though I don't know enough about R to be confident it doesn't consider a 2-factor column a logical type.
df$has_child=as.factor(df$has_child)
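On the question of whether R might still treat a two-level factor as logical: it does not. A small base R check, using a made-up vector rather than the NATSAL data:

```r
# Made-up logical values standing in for df$has_child
has_child <- c(TRUE, FALSE, TRUE, TRUE)

# Same conversion as in the line above
f <- as.factor(has_child)

class(f)       # "factor"
levels(f)      # "FALSE" "TRUE"
is.logical(f)  # FALSE: the result is a factor, not a logical vector
```

The number of levels plays no part in this; a factor with two levels is still stored and reported as a factor.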
Outputs from codebook confirm
## variable class nmiss perctmiss ndistinct
## 1 age_group factor 0 0.00 6
## 2 sex factor 0 0.00 2
## 3 ethnic_group factor 0 0.00 3
## 4 sexual_identity factor 0 0.00 3
## 5 relationship_status factor 91 2.40 7
## 6 opp_one_night_stand_ok factor 9 0.24 6
## 7 opp_sex_without_love_ok factor 10 0.26 6
## 8 opp_pressure_to_have_sex factor 9 0.24 6
## 9 opp_men_have_higher_sex_drive factor 9 0.24 6
## 10 opp_too_much_sex_media factor 8 0.21 6
## 11 has_child factor 0 0.00 2
Then running compare(syn_df, df, vars = "has_child", msel = 1:5)
or compare(syn_df, df, msel = 1:5)
result in the same error: Error in hist.default(vardata, breaks = breaks[[i]], plot = FALSE) : 'x' must be numeric
Apologies @gillian-raab I missed your reply.
The dataset is available here -> https://github.com/UKDataServiceOpen/Synthetic-Data/blob/main/code-demo/NATSAL/natsal_3_teaching_open_with_personal.csv
I'm actually running a training series on synthetic data at the moment!
Plotting the individual variables except for has_child works.
Here is an RPubs link showing it works with one variable.
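Checking variables one at a time can be automated. The sketch below is only an illustration: `find_failing_vars` is a hypothetical helper (not part of synthpop), and the demo uses a made-up data frame with base R's `hist()` standing in for the call that fails.

```r
# Hypothetical helper: call f() on each variable name and keep the error
# messages of the ones that fail. Not part of synthpop.
find_failing_vars <- function(vars, f) {
  res <- vapply(vars, function(v) {
    tryCatch(
      {
        f(v)
        NA_character_  # NA marks a call that succeeded
      },
      error = function(e) conditionMessage(e)
    )
  }, character(1))
  res[!is.na(res)]
}

# Demo on made-up data: hist() accepts the numeric column, rejects the logical one
toy <- data.frame(age = c(30, 41, 25), has_child = c(TRUE, FALSE, TRUE))
failing <- find_failing_vars(names(toy), function(v) hist(toy[[v]], plot = FALSE))
print(failing)

# With the objects from this thread, the call would presumably look like:
# find_failing_vars(names(df), function(v) compare(syn_df, df, vars = v))
```

The result is a named character vector, so the names give the offending variables and the values give the corresponding error messages.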
It should work after conversion to factor. The below code didn't return an error:
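library(readr)     # read_csv()
library(synthpop)  # syn(), compare()

df <- read_csv("natsal_3_teaching_open_with_personal.csv")
# Drop columns that are not used in the synthesis
drops <- c("first_name", "last_name", "email", "importance_religion", "age_at_first_child")
df <- df[ , !(names(df) %in% drops)]
# Convert character columns to factors
df <- as.data.frame(unclass(df), stringsAsFactors = TRUE)
# Workaround: convert the logical column to a factor
df$has_child <- as.factor(df$has_child)
syn_df <- syn(df, m = 5)
compare(syn_df, df, vars = "has_child", msel = 1:5)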
Yep, that's it. This is probably more a sign of my lack of R knowledge than anything wrong with your library, but perhaps logical columns should be automatically converted to factors?
I forgot to say, this fixed my issue!
Great! You shouldn't need to convert logical variables to factors though. I will deal with this issue. Thanks for reporting the problem.
Well spotted, Beata. Joseph, we'd be interested to know more about your training course.
Details are here -> https://ukdataservice.ac.uk/news-and-events/eventsitem/?id=5780
Feel free to e-mail me joseph.allen@manchester.ac.uk if you want to chat more
In the new version of synthpop (1.7-0, available on GitHub), you do not have to convert logical variables to factors.
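For anyone still on an older version, the manual conversion discussed in this thread can be applied to every logical column at once. This is a user-side sketch, not how synthpop handles logical variables internally; `logicals_to_factors` is a hypothetical helper name and the data frame is made up.

```r
# Hypothetical helper: convert every logical column of a data frame to a factor
# and leave all other columns untouched.
logicals_to_factors <- function(data) {
  data[] <- lapply(data, function(col) {
    if (is.logical(col)) as.factor(col) else col
  })
  data
}

# Made-up example data
toy <- data.frame(
  age       = c(30, 41, 25),
  sex       = factor(c("Female", "Male", "Female")),
  has_child = c(TRUE, FALSE, TRUE)
)

converted <- logicals_to_factors(toy)
print(sapply(converted, class))
```

Running the helper on the data frame before calling `syn()` and `compare()` would have the same effect as the single `as.factor()` line used earlier in the thread.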