Open dannyparsons opened 7 years ago
I've now updated this (#4006) after discussion with @rdstern. I've slightly changed how this works. The new parameter is now called `bStrict` instead of `bOnlyExcludeOppositeType`, with the default being `False`.
The main thing you need to know is that:
- If the type is `numeric` and `bStrict = False` (the default), then factor and character columns are excluded (and all others are included).
- If the type is `factor` and `bStrict = False` (the default), then factor and logical columns are included (and all others excluded).
So this means by default more columns will be included. If we want it to strictly only include a type, then set `bStrict = True`.
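
The two rules above can be sketched as a small lookup. This is only an illustrative Python model, not the R-Instat implementation: the function name `included_types`, the `strict` argument (standing in for `bStrict`) and the `ALL_TYPES` list are assumptions made up for the example.

```python
# Hypothetical model of the receiver rule described above; not R-Instat code.

# Assumed, illustrative list of column types (only types named in this thread).
ALL_TYPES = ["numeric", "factor", "character", "logical", "Date"]


def included_types(data_type, all_types, strict=False):
    """Return the set of column types a receiver would accept.

    `strict` plays the role of bStrict in the description above.
    """
    if strict:
        # Strict mode: only the requested type is allowed.
        return {data_type}
    if data_type == "numeric":
        # Non-strict numeric: exclude factor and character, keep the rest.
        return set(all_types) - {"factor", "character"}
    if data_type == "factor":
        # Non-strict factor: include factor and logical, exclude the rest.
        return {"factor", "logical"}
    # For other types (e.g. Date) the flag has no effect.
    return {data_type}


if __name__ == "__main__":
    for requested in ("numeric", "factor", "Date"):
        print(requested, sorted(included_types(requested, ALL_TYPES)))
```

Under these assumptions, a `numeric` receiver accepts `numeric`, `logical` and `Date` columns by default, and only `numeric` columns when `strict=True`.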
You don't need to do anything different for other types like `Date` etc. because `bStrict` currently has no effect on them. I hope few dialogs will have to change, but it would be useful for everyone to test dialogs with different column types and see how the commands behave with them. Here is a small example data frame I prepared, which might be useful for this. It has 7 columns of 7 different types.
@africanmathsinitiative/developers We have discussed this issue before: when should receivers allow columns like `logical`, `Date` and other non-standard data types to be entered?
For example, some of the frequency table dialogs have factor receivers. But a logical column is like a factor with two levels: `TRUE` and `FALSE`. And it's sometimes sensible to have a `logical` column as one of your table's factors.
We also use `Date` columns. A `Date` column is not `numeric`, but I can add and subtract a number from it, take the `min`, `max`, `mean` of it, etc. Similarly, `logical` columns are internally stored in R as 0 and 1, so you can do almost all the same operations on them as you do on `numeric` columns.
Therefore, it can be very frustrating for users who want to use these different types of columns (which we produce) but find them excluded from receivers: when we set the type as `numeric` or `factor`, `logical` and `Date` (and everything else) are always excluded, even when they might not need to be.
This has become more urgent with the procurement data work because these data sets have `logical` and `Date` columns and we need to be able to analyse them sensibly.
So what I have just done (https://github.com/africanmathsinitiative/R-Instat/pull/4003) is change how we set data types for a receiver. There is now an optional parameter to `SetDataType` and `SetIncludedDataTypes`, called `bOnlyExcludeOppositeType`. The default is `True`, and when it is `True`, if for example the data type is set to `numeric`, then instead of only including `numeric` columns, it will exclude `character` and `factor` columns. And the reverse for `character` and `factor` (exclude `numeric`). This means that by default, setting the type to `numeric` or `factor` will now allow `logical` and `Date` (and other) column types.
This change doesn't affect setting to other types, so setting the type to `Date` still just includes `Date`s because I think that is still what we want. Similarly, if you set multiple included types, like `factor` and `Date`, it will just include those, because it's not clear what to exclude.
When `bOnlyExcludeOppositeType = False` it does what used to be done, and only includes that type. We still need this option: for example, the Levels/Labels dialog shouldn't include logical columns because it only works on factor columns. I have already corrected this for the Factor menu dialogs but there may be others.
And so this has introduced some instability, because there are likely other dialogs which will now give errors when using these other types.
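
The `bOnlyExcludeOppositeType` rule described above, including the cases where it has no effect, can be modelled in a few lines of Python. This is an illustrative sketch only, not the R-Instat code: the function name `included_types_original`, the `only_exclude_opposite` argument and the `ALL_TYPES` list are assumptions invented for the example.

```python
# Hypothetical model of the bOnlyExcludeOppositeType rule; not R-Instat code.

# Assumed, illustrative list of column types (only types named in this thread).
ALL_TYPES = ["numeric", "factor", "character", "logical", "Date"]


def included_types_original(data_types, all_types, only_exclude_opposite=True):
    """Return the set of column types a receiver would accept.

    `data_types` is the list of types set on the receiver and
    `only_exclude_opposite` plays the role of bOnlyExcludeOppositeType.
    """
    if only_exclude_opposite and len(data_types) == 1:
        requested = data_types[0]
        if requested == "numeric":
            # Exclude the "opposite" types of numeric.
            return set(all_types) - {"character", "factor"}
        if requested in ("character", "factor"):
            # The reverse case: exclude numeric only.
            return set(all_types) - {"numeric"}
    # Other single types (e.g. Date), multiple types, or the flag set to
    # False: include exactly the requested types.
    return set(data_types)


if __name__ == "__main__":
    print(sorted(included_types_original(["numeric"], ALL_TYPES)))
    print(sorted(included_types_original(["factor", "Date"], ALL_TYPES)))
    print(sorted(included_types_original(["factor"], ALL_TYPES, False)))
```

With the flag switched off, a `factor` receiver accepts only `factor` columns, which is the behaviour the Levels/Labels dialog needs.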
I would really like everyone to test this out on all our dialogs which set specific data types so it quickly becomes stable again. If a command only works with a specific data type then we should change it back to only allowing this.
And some might be less obvious and need discussion. For example, the Canonical Correlations command (`cancor`) works with `logical` columns, but not `Date` columns. So should this be set back to only allowing strictly `numeric` columns? Or keep the new setting so that `logical` is allowed, but so are others which give an error?
It would be good for us to decide what our rules are in these cases: whether we go for more caution, to prevent errors, or more flexibility, to allow more columns to be used. I think we sort of wanted to go for a bit more flexibility. And we do want users to be aware of the column types they have and what is and isn't sensible, especially when they have unusual types. If there are sensible errors when you use an unusual type, are we happy with allowing that?