frictionlessdata / tableschema-go

A Go library for working with Table Schema.
MIT License

Infer a Table Schema descriptor from a supplied sample of data #18

Closed danielfireman closed 7 years ago

danielfireman commented 7 years ago

Python code: https://github.com/frictionlessdata/tableschema-py/blob/master/tableschema/infer.py

danielfireman commented 7 years ago

@pwalsh

I didn't want to just port the code, so I checked the code (JS and Python), the issues (e.g. tableschema-py/issues/105), and the gitter discussions around infer. After that, I am not entirely sure what the actual scope of infer is within the Go library v1.0. In other words, is it critical to have the exact same behavior -- and maybe the exact same problems -- or is it OK to have simpler but more consistent behavior (by consistent I mean consistent with other cases in the target language)?

That said, I would like to base the first Go implementation of infer on json.Unmarshal, which is well known amongst Go developers. More specifically:

To unmarshal JSON into an interface value, Unmarshal stores one of
these in the interface value:

bool, for JSON booleans
float64, for JSON numbers
string, for JSON strings
[]interface{}, for JSON arrays
map[string]interface{}, for JSON objects
nil for JSON null

That would give users a pretty nice bootstrap Schema. It can also be implemented quite efficiently and reuses some pretty solid code. From there, users could refine the types very easily.
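For illustration, a minimal sketch of what that json.Unmarshal-based bootstrap could look like (the jsonType helper is hypothetical; only json.Unmarshal and its documented decoding rules are from the standard library):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonType reports the Table Schema-style type name that json.Unmarshal's
// decoding rules would assign to a single JSON value.
func jsonType(raw string) (string, error) {
	var v interface{}
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return "", err
	}
	switch v.(type) {
	case bool:
		return "boolean", nil
	case float64:
		return "number", nil
	case string:
		return "string", nil
	case []interface{}:
		return "array", nil
	case map[string]interface{}:
		return "object", nil
	case nil:
		return "null", nil
	}
	return "", fmt.Errorf("unexpected type for %q", raw)
}

func main() {
	for _, s := range []string{"true", "42.5", `"hi"`, "[1,2]", `{"a":1}`, "null"} {
		t, _ := jsonType(s)
		fmt.Println(s, "->", t)
	}
}
```

The type switch covers exactly the six cases from the encoding/json documentation, so the mapping stays one-to-one with Unmarshal's behavior.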

What do you think?

pwalsh commented 7 years ago

hey @danielfireman ! @roll will be best placed to help you with this.

roll commented 7 years ago

@danielfireman Sorry for the late answer. TBH I can't be very helpful on internals. I think infer could still work consistently between different implementations because it's a kind of pure mapping from a text file (csv, or values from this csv) to a text file (json, Table Schema).

danielfireman commented 7 years ago

Sorry @roll .. but could you please clarify?

Is it OK if I follow the widely known rules of json.Unmarshal for infer?

boolean, for JSON booleans
float64, for JSON numbers
string, for JSON strings
array, for JSON arrays
object, for JSON objects
nil for JSON null

I am asking because in the gitter channel you asked contributors to think about it. I am suggesting we simplify the rules so that, at least in Go, we have a solid implementation for this slightly simpler version.

I am also curious about the usage/use cases of infer. That would help to think/suggest improvements.

danielfireman commented 7 years ago

Another project that was an inspiration for my suggestion, with very powerful features:

pwalsh commented 7 years ago

@danielfireman

A use case is where a user has a plain CSV file, and we use infer to generate a Table Schema for it, without requiring the user to know anything about Table Schema. An example of where we do this is in the user interface of OpenSpending (which does a bunch of other stuff too, beyond basic schema inference).

About the rules of json.Unmarshal: it is a hard question for us to give you a definitive answer on from the outside. There are several steps in the infer process, all relevant even if we put aside bugs that result from the very naive "algorithm" in place at present:

  1. Get a sample of data (per column)
  2. Attempt to cast the samples to a number of possible types + formats
  3. Choose the best matching type + format, based on some type of specificity
  4. Return the inferred type + format to the user.
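The steps above can be sketched in Go roughly as follows (guessType and inferColumn are illustrative names, not the library's actual API; formats are ignored for simplicity):

```go
package main

import (
	"fmt"
	"strconv"
)

// guessType attempts the casts in decreasing order of specificity:
// integer first, then number, falling back to string.
func guessType(value string) string {
	if _, err := strconv.ParseInt(value, 10, 64); err == nil {
		return "integer"
	}
	if _, err := strconv.ParseFloat(value, 64); err == nil {
		return "number"
	}
	return "string"
}

// inferColumn takes a sample of cell values for one column (step 1),
// casts each sample value (step 2), picks the type with the most
// occurrences (step 3), and returns it (step 4).
func inferColumn(sample []string) string {
	counts := map[string]int{}
	for _, v := range sample {
		counts[guessType(v)]++
	}
	best, bestCount := "string", 0
	for t, c := range counts {
		if c > bestCount {
			best, bestCount = t, c
		}
	}
	return best
}

func main() {
	fmt.Println(inferColumn([]string{"1", "2", "3"}))     // integer
	fmt.Println(inferColumn([]string{"1.5", "2.5", "x"})) // number
}
```

The "specificity" from step 3 is encoded here simply by the order of cast attempts; a fuller implementation would also try booleans, dates, and the other Table Schema types.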

For a subset of types in Table Schema, I assume json.Unmarshal will work. But it will not help you with, for example, geojson. Also, it will not help you with the format of the type. You can safely ignore that (format) at this stage, as I do not think format guessing is actually implemented anywhere.

My advice to you is to not overthink the infer method too much - get a version that works in the spirit of the current implementations, and then explicitly outline the pros and cons of the approach in an issue for future consideration.

danielfireman commented 7 years ago

Thanks for the answer, @pwalsh !

danielfireman commented 7 years ago

I am going to aim for the following conversion rules for the first version (note that below the mapping is from JSON types to Frictionless Data types, not Go types):

boolean, for JSON booleans
float64, for JSON numbers
string, for JSON strings
array, for JSON arrays
object, for JSON objects
nil for JSON null

danielfireman commented 7 years ago

Just had a chat with @pwalsh and we agreed on implementing:

We've got an open question at the end of the chat: whether the Infer function should return invalid schemas.

Looking at the Python implementation, the Infer function chooses the type with the most occurrences. That might return an invalid schema, e.g. if 95% of the cells in a certain column are numbers and 5% are strings. To check that, the user would need to call the Validate function on the returned schema.

I am going to follow the Python implementation for now.

roll commented 7 years ago

Looking at the Python implementation, the Infer function chooses the type with the most occurrences. That might return an invalid schema, e.g. if 95% of the cells in a certain column are numbers and 5% are strings. To check that, the user would need to call the Validate function on the returned schema.

tableschema.validate validates only a Table Schema descriptor against the Table Schema profile (jsonschema). That's the case across all our implementations for now. So please take it into account - because implementing it that way is much easier :smile:

So when talking about schema validity, we're talking only about checking the schema, not the data. So the output of infer can't be invalid. It could just be non-optimal for a given data sample, but that's pretty OK. I would say improving infer is an incremental process.

Consider a real-life scenario of working with a dataset - https://github.com/frictionlessdata/tableschema-js#table - the user creates a table model for existing data, infers a schema, updates it, and saves it. So infer here is more of an initial step towards getting a perfect schema, but of course it should still generate a valid Table Schema descriptor. And then the user can tweak it.

Sorry for the initially short comment - I had internet problems. So, expanding on it: it's OK for initial implementations not to be too precise on inferring, but any implementation should be able to pass a test like this:

id,name,joining
1,Alex,2015-05-06
2,John,2014-03-07

infer

"fields": [
  {"name": "id", "type": "integer", "format": "default"},
  {"name": "name", "type": "string", "format": "default"},
  {"name": "joining", "type": "date", "format": "default"}
]

So what I meant by "infer is a kind of pure mapping from text file to text file" is that we have text input and should produce text output. So it should be easier for statically typed languages than dealing with native types. At least I hope so :)

So, related to the Gitter discussion with @georgeslabreche, what I meant is that the current infer implementation is really simple - https://github.com/frictionlessdata/tableschema-js/blob/master/src/schema.js#L170-L197 - a high-level algorithm of about 20 lines of code. But it's not optimal because it does too many type casts and doesn't take a row-by-row approach.

Internally it works pretty simply:

What I've suggested is that the high-level algorithm (2) could be improved (see the link to the code above). It's not critical, but if you're writing infer from scratch, why not? It should be very easy.

danielfireman commented 7 years ago

Thanks for the awesome explanation, @roll !

So when talking about schema validity, we're talking only about checking the schema, not the data. So the output of infer can't be invalid. It could just be non-optimal for a given data sample, but that's pretty OK. I would say improving infer is an incremental process.

Got it. I meant that using CastRow on that data would return errors. But that's the user perspective I was trying to capture.

Just for the record: another way to deal with this problem is to use the wider type. For instance, if we have 3 integers and one float in a column, we could use number as the field type (instead of integer) and somehow warn the user. That way, the user could cast rows without errors, but could still refine the schema manually later.

Consider a real-life scenario of working with a dataset - https://github.com/frictionlessdata/tableschema-js#table - the user creates a table model for existing data, infers a schema, updates it, and saves it. So infer here is more of an initial step towards getting a perfect schema, but of course it should still generate a valid Table Schema descriptor. And then the user can tweak it.

Thanks for pointing me to the example. Great stuff! infer being an initial step totally matches my expectations.

If we change the example by adding the column I mentioned above, traversing the table using keyed=true (or CastRow) before changing the schema would error out (i.e. if there are 3 integers and 1 float in a column). But as far as I understood, that is the point where you either refine the schema or change the data.

Finally, thanks a lot for the explanations about infer!

roll commented 7 years ago

Just for the record: another way to deal with this problem is to use the wider type. For instance, if we have 3 integers and one float in a column, we could use float as the field type and somehow warn the user. In that way, the user could cast rows without errors, but could still refine the schema manually later.

So that's exactly the case (the high-level algorithm) where I've suggested implementers get creative :) And later we could reuse the best approach across existing implementations.

danielfireman commented 7 years ago

Perfect. I am going to give this a shot.

danielfireman commented 7 years ago

The proposed method seems to be better in performance. Benchmark results:

BenchmarkInferSmall-4                    20000    113778 ns/op
BenchmarkInferMedium-4                     100  10257832 ns/op
BenchmarkInferBig-4                         20  86271503 ns/op
BenchmarkInferImplicitCastingSmall-4     30000     42894 ns/op
BenchmarkInferImplicitCastingMedium-4      500   2939785 ns/op
BenchmarkInferImplicitCastingBig-4          50  32592837 ns/op
PASS
ok  github.com/frictionlessdata/tableschema-go/schema  12.000s

danielfireman commented 7 years ago

Need to improve documentation. Leaving this issue open.