frictionlessdata / tableschema-go

A Go library for working with Table Schema.
MIT License
46 stars 10 forks source link

Number being inferred as boolean #71

Closed russmack closed 5 years ago

russmack commented 5 years ago

I have .csv file with ~15000 rows. One of the fields is decimal. The first ~10000 values of said field are either 0 or 1; the remaining are decimal eg 0.67.

Infer suggests the field is boolean rather than number.

The issue can be fixed by changing schema/infer.go#41 from: orderedTypes = []string{BooleanType, YearType, IntegerType, GeoPointType, NumberType, YearMonthType, DateType, DateTimeType, TimeType, DurationType, ArrayType, ObjectType}

to: orderedTypes = []string{NumberType, BooleanType, YearType, IntegerType, GeoPointType, YearMonthType, DateType, DateTimeType, TimeType, DurationType, ArrayType, ObjectType}

I haven't submitted a PR for this, because the Types ordered from narrower to wider suggests I need to think about this more, or get your input.

danielfireman commented 5 years ago

That happened even when you specified the limit to -1? If yes, this a bug in our implementation. It should be number if it scans the whole file.

The type priority order was that to be in sync minimally in par with other implementations (e.g. tableschema-py). Due to that, I wouldn't simply change this priority order, but think of something custom based on the inferConfig you introduced. Let's only go in that direction if necessary.

Is that still a problem after your pr?

russmack commented 5 years ago

I've had a chance to look properly. When a field has sloppy data eg a decimal field containing 0 instead of 0.0 the highest count wins. So, eg, using a file such as:

ADecimal 1.2 0 0 0 5.2

Infers the type as boolean. Options:

The latter would need good tests.

russmack commented 5 years ago

InferImplicitCasting infers the mixed bool/number field as number, which is good, but if there are also missing values then it infers string.

danielfireman commented 5 years ago

Does that mean a column in which all rolls are filled with null values? If yes, I believe the best way is to remove this column before inference (because there is nothing to infer, actually). Or you could use the option described in the next paragraph.

Another option is you to specify the schema.missingValues after the schema is inferred (and the schema should behave properly). Please take a look at Schema inference and configuration section of the Readme.md.

Finally, we could add a new option to infer which behaves like schema missing value (please search for Missing Values here for a definition and click here for its implementation in schema.go). That will need a lot more thought, though.

russmack commented 5 years ago

Hi Daniel, in my working fork I've added some demo test examples, as well as an InferWithPrecedence function to illustrate. I've been sidetracked so I haven't run it over much data yet. Edit: forgot to post the link: https://github.com/frictionlessdata/tableschema-go/compare/master...russmack:infer-with-precedence

danielfireman commented 5 years ago

Hi Russ!

Cool stuff! I liked the precedence idea. I took a look at the code and I believe it could be very much simplified using the InferOpt, right? For instance, we don't actually need the InferWithPrecedence method. What do you think?

danielfireman commented 5 years ago

Hi @russmack

Thanks for sending https://github.com/frictionlessdata/tableschema-go/pull/74 ! I took a look and thought of another proposal. What do you think of https://github.com/frictionlessdata/tableschema-go/pull/75 ?

There is an example/test that exercises exactly the case of 0 being treated as a number. I used the same slice of types you mentioned in https://github.com/frictionlessdata/tableschema-go/issues/71#issue-417074167

russmack commented 5 years ago

Hi @danielfireman , that is more flexible, bound to be more useful.

danielfireman commented 5 years ago

Great! Rolled out https://github.com/frictionlessdata/tableschema-go/releases/tag/v1.3 with this new feature.