IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Receiver data types - when to allow non standard column types #4004

Open dannyparsons opened 7 years ago

dannyparsons commented 7 years ago

@africanmathsinitiative/developers We have discussed this issue before: when should receivers allow logical, Date and other non-standard column types to be entered?

For example, some of the frequency table dialogs have factor receivers. But a logical column is like a factor with two levels: TRUE and FALSE. And it's sensible to sometimes have a logical column as one of your table's factors.

We also use Date columns. A Date is not numeric, but I can add a number to it or subtract one from it, and take the min, max, mean etc. of a Date column. Similarly, logical columns are internally stored in R as 0 and 1, so you can do almost all the same operations on them as you do on numeric columns.

It can therefore be very frustrating for users who want to use these column types (which we produce ourselves). They are often excluded from receivers, because when we set the type to numeric or factor, logical and Date (and everything else) are always excluded, even when they need not be.

This has become more urgent with the procurement data work because these data sets have logical and Date columns and we need to be able to analyse them sensibly.

So what I have just done (https://github.com/africanmathsinitiative/R-Instat/pull/4003) is change how we set data types for a receiver. SetDataType and SetIncludedDataTypes now take an optional parameter, bOnlyExcludeOppositeType. The default is True. When it is True and the data type is set to numeric, for example, the receiver no longer includes only numeric columns; it instead excludes character and factor columns. The reverse applies to character and factor (numeric is excluded). This means that, by default, setting the type to numeric or factor now allows logical and Date (and other) column types.

This change doesn't affect setting to other types, so setting the type to Date still includes only Dates, because I think that is still what we want. Similarly, if you set multiple included types, like factor and Date, it will include just those, because it's not clear what to exclude.

When bOnlyExcludeOppositeType = False it does what it used to do and only includes that type. We still need this option: for example, the Levels/Labels dialog shouldn't include logical columns because it only works on factor columns. I have already corrected this for the Factor menu dialogs but there may be others.
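To make the rules above concrete, here is a minimal sketch in Python rather than the VB.NET of the actual receivers. The function name `included_columns`, the type labels and the example columns are all hypothetical; only the behaviour (exclude the opposite type for a single numeric, character or factor setting, otherwise include exactly the listed types) is taken from the description above.

```python
# Hypothetical model of the receiver type filter described above.
# The type labels and the function name are illustrative assumptions,
# not the names used in the R-Instat code base.
OPPOSITES = {
    "numeric": {"character", "factor"},
    "character": {"numeric"},
    "factor": {"numeric"},
}


def included_columns(columns, included_types, only_exclude_opposite_type=True):
    """Return the names of the columns a receiver would accept.

    columns: dict mapping column name -> type label.
    included_types: list of type labels set on the receiver.
    """
    wanted = list(included_types)
    single = len(wanted) == 1 and wanted[0] in OPPOSITES
    if only_exclude_opposite_type and single:
        # Lenient mode: keep everything except the "opposite" types.
        excluded = OPPOSITES[wanted[0]]
        return [name for name, t in columns.items() if t not in excluded]
    # Other types, multiple types, or strict mode: keep exact matches only.
    return [name for name, t in columns.items() if t in wanted]


# Made-up example columns, one per type.
columns = {
    "rainfall": "numeric",
    "village": "factor",
    "notes": "character",
    "awarded": "logical",
    "signed_on": "Date",
}

print(included_columns(columns, ["numeric"]))
print(included_columns(columns, ["factor"]))
print(included_columns(columns, ["Date"]))
print(included_columns(columns, ["factor", "Date"]))
print(included_columns(columns, ["factor"], only_exclude_opposite_type=False))
```

Under these assumptions a numeric receiver accepts the logical and Date columns as well, a Date receiver accepts only the Date column, and passing `only_exclude_opposite_type=False` restores the old exact-match behaviour that a dialog like Levels/Labels needs.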

And so this has introduced some instability because there are likely other dialogs which will now give errors when using these other types.

I would really like everyone to test this out on all our dialogs which set specific data types so it quickly becomes stable again. If a command only works with a specific data type then we should change it back to only allowing this.

And some might be less obvious and need discussion. For example, the Canonical Correlations command (cancor) works with logical columns, but not Date columns. So should this be set back to only allowing strictly numeric columns? Or keep the new setting so that logical is allowed, but so are others which give an error?

It would be good for us to decide what our rules are in these cases: do we go for more caution, to prevent errors, or more flexibility, to allow more columns to be used? I think we sort of wanted to go for a bit more flexibility. And we do want users to be aware of the column types they have and what is and isn't sensible, especially when they have unusual types. If an unusual type produces a sensible error message, are we happy with allowing that?

dannyparsons commented 7 years ago

I've now updated this (#4006) after discussion with @rdstern. I've slightly changed how this works. The new parameter is now called bStrict instead of bOnlyExcludeOppositeType, with the default being False.

The main thing you need to know is that:

So this means by default more columns will be included. If we want it to strictly only include a type, then set bStrict = True.
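As a rough illustration of the inverted default, the Python sketch below checks a single column against a receiver. It assumes that non-strict mode keeps the exclusion rule from the opening comment, which may not match #4006 exactly since the behaviour was "slightly changed"; `passes_receiver` and the type labels are hypothetical names, not the VB.NET API.

```python
# Hypothetical per-column check; an assumption-laden sketch, not the
# actual R-Instat implementation.
OPPOSITE_TYPES = {
    "numeric": {"character", "factor"},
    "character": {"numeric"},
    "factor": {"numeric"},
}


def passes_receiver(column_type, receiver_type, strict=False):
    """Return True if a column of column_type may enter the receiver."""
    if strict or receiver_type not in OPPOSITE_TYPES:
        # Strict mode, or a type such as Date that strictness does not
        # affect: only an exact match is accepted.
        return column_type == receiver_type
    # Default (non-strict): accept anything that is not the opposite type.
    return column_type not in OPPOSITE_TYPES[receiver_type]


# A frequency-table style factor receiver would accept a logical column
# by default, while a strict factor receiver would not.
print(passes_receiver("logical", "factor"))
print(passes_receiver("logical", "factor", strict=True))
print(passes_receiver("logical", "Date"))
```

The point of the design is that dialogs opt in to strictness only where the underlying command genuinely needs one type; everything else gets the more permissive default.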

You don't need to do anything different for other types like Date etc., because bStrict currently has no effect on them. I hope few dialogs will have to change, but it would be useful for everyone to test dialogs with different column types and see how the commands behave with them. Here is a small example data frame I prepared, which might be useful for this. It has 7 columns of 7 different types.