Open romaintailhurat opened 4 years ago
My understanding of columns was wrong: it is in fact not metadata but data that is provided, the difference with rows being that you can provide the name of the column.
😞
We still need a solution for providing information about columns...
I think the best solution would be to keep the simple { type, resolve } data structure, but to return a more complex object from resolve(), ideally of the shape { dataStructure, dataPoints }, with dataPoints referring to the dataforge.DataFrame.
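A hedged sketch of this proposal (all names here are assumptions, not actual VTL-Tools code; a plain array of records stands in for a dataforge.DataFrame):

```javascript
// Keep the simple { type, resolve } wrapper, but have resolve()
// return both the structure (column metadata) and the data points.
function makeDatasetVariable(dataStructure, dataPoints) {
  return {
    type: "DATASET", // assumed; the real code would use a Parser constant
    resolve: function (bindings) {
      return { dataStructure: dataStructure, dataPoints: dataPoints };
    },
  };
}

const variable = makeDatasetVariable(
  { age: { type: "INTEGER", role: "MEASURE" } },
  [{ age: 42 }, { age: 27 }]
);
const resolved = variable.resolve({});
console.log(resolved.dataStructure.age.role); // "MEASURE"
console.log(resolved.dataPoints.length);      // 2
```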
One last idea: how about extending the dataforge class in order to add a metadata member?
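That extension idea might look roughly like the following sketch (a stand-in base class keeps the example self-contained; in practice the base would be dataforge.DataFrame, and all names here are assumptions):

```javascript
// Stand-in for dataforge.DataFrame, just to make the sketch runnable.
class BaseFrame {
  constructor(rows) {
    this.rows = rows;
  }
}

// Subclass carrying a metadata member so column-level information
// travels with the data itself.
class MetadataFrame extends BaseFrame {
  constructor(rows, metadata) {
    super(rows);
    this.metadata = metadata; // e.g. column types and roles
  }
}

const frame = new MetadataFrame(
  [{ n: 1 }, { n: 2 }],
  { n: { type: "INTEGER", role: "MEASURE" } }
);
console.log(frame.metadata.n.type); // "INTEGER"
```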
Hey @romaintailhurat, good input, thanks.
I am writing down a few banalities just to make sure we both understand the problem in the same way.
This internal structure is used inside the runtime to declare types. Each new variable gets a new function calling the resolve of other functions, effectively forming a call tree. This works fine for scalars:
{
    type: Parser.INTEGER,
    resolve: function(bindings) { return otherVarRef.resolve(bindings); }
}
When it comes to datasets, we need the types, roles and names of the columns before the call to resolve is done. If we put the types, roles and names inside the return value of the resolve function, we will either need to call it several times during parsing (not really an issue, but it could degrade performance or lead to bugs) or save a reference to the result of resolve, which could lead to the same problems I mentioned above.
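A hypothetical illustration of that concern (names and shapes are assumptions): if column metadata lives only in resolve()'s return value, the parser has to call resolve() early just to learn the columns, and evaluation then calls it again.

```javascript
let resolveCalls = 0;
const datasetVar = {
  type: "DATASET",
  resolve: function (bindings) {
    resolveCalls += 1;
    return { columns: { age: "INTEGER" }, rows: [{ age: 42 }] };
  },
};

// During parsing, we need the column names and types:
const columns = Object.keys(datasetVar.resolve({}).columns);

// During evaluation, resolve() runs again:
const result = datasetVar.resolve({});

console.log(columns);      // ["age"]
console.log(resolveCalls); // 2
```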
This is an approach we used at SSB because of the streaming nature of our implementation, so I might suffer from tunnel vision. Could you elaborate on why you thought of having the dataStructure inside the return value of the resolve function and not as a field within the internal representation?
Thank you @hadrienk
After our last discussion, you suggested continuing with the hackathon implementation and adding a columns member to the object returned when resolving the Variable.
I'm working on dataset arithmetic operations with our internal representation using dataforge.Dataframe.
The current arithmetic test suite is using the following structure:
(see https://github.com/InseeFr/VTL-Tools/blob/master/src/engine/tests/interpretors/arithmetic.spec.js#L14)
The idea here is to store the type and role of each column, for example:
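The actual structure lives in the linked test file; a hedged sketch of the idea (column names and values are illustrative) might be:

```javascript
// Column descriptors keyed by column name, each carrying a type
// and a role (identifier vs. measure).
const dataStructure = {
  country:    { type: "STRING", role: "IDENTIFIER" },
  population: { type: "NUMBER", role: "MEASURE" },
};

// Summing two dataframes then means applying the operator only to
// columns having a NUMBER type and a MEASURE role:
const measureColumns = Object.entries(dataStructure)
  .filter(([, col]) => col.type === "NUMBER" && col.role === "MEASURE")
  .map(([name]) => name);

console.log(measureColumns); // ["population"]
```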
This structure helps when doing arithmetic operations (e.g. when summing two dataframes, we want to apply the operator only to columns having a NUMBER type and a MEASURE role).
On the other hand, our current implementation of a dataset is simpler:
(see https://github.com/InseeFr/VTL-Tools/blob/master/src/engine/visitors/Variable.js#L52)
In this case, we return a generic shape for a variable, only providing the type and the resolve function. It then seems impractical to add members to this structure depending on the type of the parsed variable (e.g. adding a columns field for a dataset). But if we keep this simple structure, how do we pass column types and roles at runtime?
Here is a third way: using the "metadata" information, dataforge style.
Indeed, the dataforge Dataframe can be provided with information about the columns, for example:
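A hedged sketch of that style: data-forge accepts a config object naming the columns when constructing a DataFrame from raw rows (the column names and values here are illustrative, and the package usage is shown in comments to keep the sketch self-contained).

```javascript
// Config object in the shape data-forge's DataFrame constructor accepts.
const config = {
  columnNames: ["id", "value"],
  rows: [
    [1, 10],
    [2, 20],
  ],
};

// With the data-forge package available, this would become:
//   const { DataFrame } = require("data-forge");
//   const df = new DataFrame(config);
//   df.getColumnNames(); // ["id", "value"]
console.log(config.columnNames); // ["id", "value"]
```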
It looks like the cleanest solution. Further parsing of the dataset object provided by clients will be necessary, though, in order to create this columns object. What do you think of this idea?