Closed russmack closed 5 years ago
That happened even when you specified the limit to -1? If yes, this a bug in our implementation. It should be number if it scans the whole file.
The type priority order was that to be in sync minimally in par with other implementations (e.g. tableschema-py). Due to that, I wouldn't simply change this priority order, but think of something custom based on the inferConfig you introduced. Let's only go in that direction if necessary.
Is that still a problem after your pr?
I've had a chance to look properly.
When a field has sloppy data eg a decimal field containing 0
instead of 0.0
the highest count wins.
So, eg, using a file such as:
ADecimal 1.2 0 0 0 5.2
Infers the type as boolean. Options:
typePrecedence = []string{DurationType, DateTimeType, DateType, TimeType, YearMonthType, YearType, NumberType, IntegerType, BooleanType, GeoPointType, ArrayType, ObjectType}
The latter would need good tests.
InferImplicitCasting infers the mixed bool/number field as number, which is good, but if there are also missing values then it infers string.
Does that mean a column in which all rolls are filled with null values? If yes, I believe the best way is to remove this column before inference (because there is nothing to infer, actually). Or you could use the option described in the next paragraph.
Another option is you to specify the schema.missingValues after the schema is inferred (and the schema should behave properly). Please take a look at Schema inference and configuration section of the Readme.md.
Finally, we could add a new option to infer which behaves like schema missing value (please search for Missing Values here for a definition and click here for its implementation in schema.go). That will need a lot more thought, though.
Hi Daniel, in my working fork I've added some demo test examples, as well as an InferWithPrecedence function to illustrate. I've been sidetracked so I haven't run it over much data yet. Edit: forgot to post the link: https://github.com/frictionlessdata/tableschema-go/compare/master...russmack:infer-with-precedence
Hi Russ!
Cool stuff! I liked the precedence idea. I took a look at the code and I believe it could be very much simplified using the InferOpt, right? For instance, we don't actually need the InferWithPrecedence method. What do you think?
Hi @russmack
Thanks for sending https://github.com/frictionlessdata/tableschema-go/pull/74 ! I took a look and thought of another proposal. What do you think of https://github.com/frictionlessdata/tableschema-go/pull/75 ?
There is an example/test that exercises exactly the case of 0 being treated as a number. I used the same slice of types you mentioned in https://github.com/frictionlessdata/tableschema-go/issues/71#issue-417074167
Hi @danielfireman , that is more flexible, bound to be more useful.
Great! Rolled out https://github.com/frictionlessdata/tableschema-go/releases/tag/v1.3 with this new feature.
I have .csv file with ~15000 rows. One of the fields is decimal. The first ~10000 values of said field are either 0 or 1; the remaining are decimal eg 0.67.
Infer suggests the field is boolean rather than number.
The issue can be fixed by changing schema/infer.go#41 from:
orderedTypes = []string{BooleanType, YearType, IntegerType, GeoPointType, NumberType, YearMonthType, DateType, DateTimeType, TimeType, DurationType, ArrayType, ObjectType}
to:
orderedTypes = []string{NumberType, BooleanType, YearType, IntegerType, GeoPointType, YearMonthType, DateType, DateTimeType, TimeType, DurationType, ArrayType, ObjectType}
I haven't submitted a PR for this, because the
Types ordered from narrower to wider
suggests I need to think about this more, or get your input.