can we still encode Range values in samples?

ANRGenstar / genstar

Generation of Synthetic Populations Library


can we still encode Range values in samples? #49

Open samthiriot opened 6 years ago

samthiriot commented 6 years ago

Before the huge refactoring, when we created Range attributes it was possible to give both a list of codes ("1", "2", ...) and the corresponding textual counterparts ("less than 10m", "11 to 16", ...). Now we construct these ranges from the textual version only. This works well for reading aggregate stats from CSV files, where we expect all the columns to explicitly contain "less than 10m"; but in sample files, the values are often encoded as "1", "2", ...

is it still possible to deal with that?

Tks !

chapuisk commented 6 years ago

Hey, I am not quite sure I understand the problem. If the attribute is encoded as "1", "2", ... you can go for an int attribute or, if it is not an integer value per se, for an ordered value attribute. If this is just another way to encode a range attribute, then use a mapped attribute with a record mapper, where you can define a mapping like: {1 : less than 10; 2 : 11 to 16 ...}. This option forces you to define two attributes: the referent range attribute and the mapped "int to range" (or "ordered to range") attribute. Hope this helps you overcome your issue.
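To make the record-mapping idea concrete, here is a minimal, language-agnostic sketch in Python. It is not the Genstar API: `AGE_CODE_TO_RANGE` and `decode_sample_column` are hypothetical names, and the mapping only reuses the two example entries quoted above.

```python
# Hypothetical illustration of a code -> range mapping; not Genstar code.
AGE_CODE_TO_RANGE = {
    "1": "less than 10",
    "2": "11 to 16",
}


def decode_sample_column(codes, mapping):
    """Translate the encoded values of a sample column into range labels."""
    # Fail early on codes that have no mapped range instead of guessing.
    unknown = sorted(set(codes) - set(mapping))
    if unknown:
        raise ValueError(f"codes without a mapped range: {unknown}")
    return [mapping[code] for code in codes]


decoded = decode_sample_column(["1", "2", "1"], AGE_CODE_TO_RANGE)
```

Here the dictionary plays the role of the mapped attribute, and its values play the role of the referent range attribute.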

samthiriot commented 6 years ago

thanks ! I think you understood my question ^^ The cases are, as you say:

range: never stored in samples as "0 to 10" or "11 to 15" but as 0 or 1
boolean: stored as 0 or 1 and not FALSE or TRUE

Only for integers and doubles do we have a direct correspondence between the encoded value and its textual counterpart. It's a bit weird to always create a mapped attribute, no? I mean, isn't it part of the semantics of a "Value" to be either encoded or to have a literal version? For instance, in the INSEE dictionary, they always provide: <code of the variable>;<label of the variable>;<code of the modality (value)><label of the value>

samthiriot commented 6 years ago

Thinking about it: typically, when writing a value to a generated sample, one would like to be able to write the encoded value too, not (always) the long version. In that case we need to be able to retrieve the short (encoded) version of a value; I'm not sure how to do that using a mapped attribute.

samthiriot commented 6 years ago

(I'll think about it, no worry & thanks)

chapuisk commented 6 years ago

Thinking about it, I saw one good reason not to encode various codes for one attribute. In many cases, "simple" data encodings like {"1", "2", ...} are used for several different attributes: e.g. booleans are 1 and 2; ranges are 1, 2 ... x, and so on. Hence there can be confusion in translation: which "1" code relates to which "complex" encoding? The only way to solve the problem is to bind modalities or codes. In that case, if you use the mapped version of the attribute, you can choose between the simple code and the complex one (using DemographicAttribute#findMappedAttributeValues(IValue)).
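The ambiguity can be illustrated with a small Python sketch in which each attribute owns its own code table, so the same code "1" resolves differently depending on the attribute, and a label can be turned back into its short code when writing a sample. This is only an illustration under assumptions: `CodedAttribute` is a hypothetical helper, not a Genstar class, and which boolean value gets code 1 or 2 is chosen arbitrarily here.

```python
class CodedAttribute:
    """Hypothetical helper binding the codes of ONE attribute to its labels."""

    def __init__(self, name, code_to_label):
        self.name = name
        self._code_to_label = dict(code_to_label)
        # Reverse index, used to write the short code back into a sample.
        self._label_to_code = {
            label: code for code, label in self._code_to_label.items()
        }
        if len(self._label_to_code) != len(self._code_to_label):
            raise ValueError(f"{name}: two codes share the same label")

    def label(self, code):
        """Return the long ("complex") value for an encoded value."""
        return self._code_to_label[code]

    def code(self, label):
        """Return the short (encoded) value for a long value."""
        return self._label_to_code[label]


# The same codes are reused by two attributes without any clash.
age = CodedAttribute("age", {"1": "less than 10", "2": "11 to 16"})
employed = CodedAttribute("employed", {"1": "TRUE", "2": "FALSE"})  # assumed order
```

With this binding, `age.label("1")` and `employed.label("1")` give different results, and `age.code("11 to 16")` recovers the short code for output.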