Columns information for a dataset implementation

romaintailhurat commented 4 years ago

I'm working on datasets arithmetic operations with our internal representation using dataforge.Dataframe.

The current arithmetic test suite is using the following structure:

{
    type,
    columns,
    resolve
}

(see https://github.com/InseeFr/VTL-Tools/blob/master/src/engine/tests/interpretors/arithmetic.spec.js#L14)

The idea here is to store type and role of a column, for example

{
    Id_1: { type: VtlParser.STRING, role: VtlParser.DIMENSION },
[...]
}

This structure helps when doing arithmetic operations (eg. when summing two dataframes, we want to only apply the operator to columns having a NUMBER type and a MEASURE role)

On the other side, our current implementation of a dataset is simpler:

{
    type,
    resolve
}

(see https://github.com/InseeFr/VTL-Tools/blob/master/src/engine/visitors/Variable.js#L52)

In this case, we return a generic shape for a variable, only providing the type and the resolve function. It seems unpractical then to add members to this structure depending on the type of the parsed variable (eg adding a columns field for a dataset).

But if we keep this simple structure, how to pass columns types and roles at runtime ?

Here is a third way: using the "metadata" information dataforge style.

Indeed, the dataforge Dataframe can be provided with informations about the columns, for example:

const df = new dataForge.DataFrame({
  columnNames: ["Col1","Col2"],
  columns: {
    Col1: [{type: "NUMBER", role: "MEASURE"}]
  },
  rows: [
    [1, 'hello'],
    [5, 'computer'],
    [10, 'good day']
  ]
});

It looks like the cleaner solution. Further parsing the dataset object provided by clients will be necessary though in order to create this columns object.

Do you validate this idea ?

romaintailhurat commented 4 years ago

My understanding of columns was bad: it's in fact not metadata but data that is provided, the difference with rows being that you can provide the name of the column.

😞

We still need a solution for providing information about columns...

romaintailhurat commented 4 years ago

I think the best solution would be to keep the simple

type,
resolve

datastructure but to return a more complex thing with resolve(), ideally the same

dataStructure, 
dataPoints

With dataPoints referring to the dataforge.DataFrame.

romaintailhurat commented 4 years ago

A last one: how about extending the dataforge class in order to add a metadata member ?

hadrienk commented 4 years ago

Hey @romaintailhurat, good inputs thanks.

I am writing down a few banalities just to make sure we both understand the problem in the same way.

This internal structure is used inside the runtime to declare types. Each new variable gets a new function calling the resolve of other functions effectively making a tree call. This works fine for scalars.

{
   type: Parser.INTEGER, 
   resolve: function(bindings) { return otherVarRef.resolve(bindings); }
}

When it comes to dataset, we need the types / roles & names of the columns before the call to resolve is done. If we put the types / roles & names inside the return value of the resolve function, we will either need to call it several times during parsing (not really an issue but it could degrade prefs or lead to bugs) or save a reference to the result of resolve which could lead to the same problems I mentioned above.

This is an approach we used at SSB because the streaming nature of our implementation so I might suffer from tunnel vision. Could you elaborate on why you thought of having the dataStructure inside the return value of the resolve function and not as a field within the internal representation?

romaintailhurat commented 4 years ago

Thank you @hadrienk

After our last discussion, you suggest to continue with the hackathon implementation and add a columns member to the object returned when resolving the Variable.

See https://github.com/InseeFr/VTL-Tools/blob/master/src/engine/tests/interpretors/arithmetic.spec.js#L16

InseeFr / Trevas-TS

Columns information for a dataset implementation #53