JuliaData / DBFTables.jl

Read and write DBF (dBase) tabular data in Julia

Handle Integers correctly #20

Open rafaqz opened 1 year ago

rafaqz commented 1 year ago

From the dbase spec

I | Long | 4 bytes.        Leftmost bit used to indicate sign, 0 negative.

So a zero sign bit means a negative number, i.e. this is not a regular Julia/C Int32 at all:

julia> bitstring(Int32(-1))
"11111111111111111111111111111111"

visr commented 1 year ago

How odd. So the sign is reversed, and also it is 32 bits instead of 64:

https://github.com/JuliaData/DBFTables.jl/blob/6b4ef1ab5843225a0e0fae04abbc3bbb44fcac44/src/DBFTables.jl#L57-L58

Are integers in practice mostly encoded as type N without decimals, so that this hasn't really been an issue before? Regardless, it would be good to fix.

rafaqz commented 1 year ago

Yeah this is a pretty weird format.

I think yes, mostly .dbf files use string numbers like 'N' for everything, so we just haven't noticed yet. We'll need to find some test data that has the column types missing from the current tests.
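
One way to check whether a candidate test file actually contains an I column is to list the type codes declared in its header. The sketch below is a hypothetical helper, not part of DBFTables.jl. It assumes the dBase III style layout: a 32-byte file header, followed by 32-byte field descriptors (11-byte name, then a 1-byte type code), terminated by 0x0D.

```julia
# Hypothetical helper, not part of DBFTables.jl.
# Assumes the dBase III style header layout described above.
function field_types(io::IO)
    seek(io, 32)                             # skip the 32-byte file header
    types = Char[]
    while !eof(io)
        first_byte = read(io, UInt8)         # first byte of the field name
        first_byte == 0x0d && break          # 0x0D terminates the descriptor array
        skip(io, 10)                         # rest of the 11-byte field name
        push!(types, Char(read(io, UInt8)))  # type code, e.g. 'C', 'N', 'I'
        skip(io, 20)                         # remainder of the 32-byte descriptor
    end
    return types
end
```

Something like `open(field_types, path)` on each candidate file (where `path` is whatever file is being inspected) would then show whether it declares any I fields.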

rafaqz commented 1 year ago

According to this, I is only used by Visual FoxPro anyway:

http://www.independent-software.com/dbase-dbf-dbt-file-format.html

~~And Microsoft ODBC doesn't use them at all https://learn.microsoft.com/en-us/sql/odbc/microsoft/dbase-data-types?view=sql-server-ver16 Maybe we can just not handle I at all.~~

Hmm, maybe it's not so clear what ODBC uses. The dBase 7 spec seems to be what this package was built from? But probably most files are III or V?

visr commented 1 year ago

dc0cafb5e712807a7460847bdbc5ddb5e423fa8c mentions dBase III+ / xBase. Most of that code is still the same as far as dBase support goes. Later I used the references under https://github.com/JuliaData/DBFTables.jl#format-description-resources, of which the .dk site mentions:

Note that this structure is valid for Xbase - and dBASE v. III - 5. Later versions of dBASE has a different layout, like dBASE 7

So I wouldn't say this package is based on v7, but older versions, that seem to be more commonly used with shapefiles.

rafaqz commented 1 year ago

Ok, so it's dBase III with some types mixed in from later versions and FoxPro.

This python package has another breakdown of the versions: https://github.com/ethanfurman/dbf/tree/master/dbf

Maybe for correctness and simplicity we should only support dBase III?

I still don't understand the I type. It seems to be meant as a sign-magnitude long int where a zero sign bit means negative, but most packages in other languages seem to just read it in as a regular two's-complement 32-bit integer. If everyone does it wrong then we're fine, right??
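
To make the difference concrete, here is a minimal sketch of the two readings for the same 4 bytes. The function names are hypothetical and not part of DBFTables.jl, little-endian byte order is an assumption, and which reading matches real files is exactly what is unclear here.

```julia
# Hypothetical helpers, not part of DBFTables.jl.
# Assemble 4 bytes into a UInt32, assuming little-endian storage.
le_uint32(b) = UInt32(b[1]) | (UInt32(b[2]) << 8) |
               (UInt32(b[3]) << 16) | (UInt32(b[4]) << 24)

# Reading 1: a regular two's-complement Int32.
decode_twos_complement(b) = reinterpret(Int32, le_uint32(b))

# Reading 2: sign-magnitude, where a leftmost bit of 0 means negative.
function decode_sign_magnitude(b)
    u = le_uint32(b)
    magnitude = Int32(u & 0x7fffffff)   # lower 31 bits
    return (u >> 31) == 1 ? magnitude : -magnitude
end

minus_one = UInt8[0xff, 0xff, 0xff, 0xff]   # Int32(-1) in two's complement
one       = UInt8[0x01, 0x00, 0x00, 0x00]   # Int32(1) in two's complement

decode_twos_complement(minus_one)   # -1
decode_sign_magnitude(minus_one)    # 2147483647
decode_twos_complement(one)         # 1
decode_sign_magnitude(one)          # -1
```

So if a file was written by a tool that uses plain two's complement, the sign-magnitude reading would flip the sign of small positive values and badly mangle negative ones.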

visr commented 1 year ago

Ok, so it's dBase III with some types mixed in from later versions and FoxPro.

That is basically what people call xBase, it seems. This is what Wikipedia says:

xBase is a name applied to clones of the dBase, typically dBASE III+–V. Most xBase programs either use the format directly or uses a derived format with custom extensions.

So far my approach hasn't been to implement a spec, but to add what is needed based on real-world data.

rafaqz commented 1 year ago

Yes, that blog post linked above for the .NET version seems to say the same thing: the spec for early versions is unclear. It's hilarious how widely this is used in GIS given there is no concrete spec.