JuliaData / DBFTables.jl

Read and write DBF (dBase) tabular data in Julia

Return correct type for column if no missing value is detected #16

Closed juliohm closed 3 years ago

juliohm commented 3 years ago

Can we refactor this code so that the type is inferred correctly when no missing values are present?

https://github.com/JuliaData/DBFTables.jl/blob/98520b9a8819c99f065dad12f5c17b4d40054e4c/src/DBFTables.jl#L295

Always returning a Union{T,Missing} is too rigid.

visr commented 3 years ago

The schema is currently computed from the header only, so at that point we don't yet know whether the table contains missing values. As far as I know, the only way to find out is to scan all of the data, which doesn't seem like a very attractive option. How does this issue affect you?

With the test DBF, the schema indeed reports every column type as possibly missing:

julia> dbf = DBFTables.Table("test.dbf")
DBFTables.Table with 7 rows and 6 columns
Tables.Schema:
 :CHAR     Union{Missing, String}
 :DATE     Union{Missing, String}
 :BOOL     Union{Missing, Bool}
 :FLOAT    Union{Missing, Float64}
 :NUMERIC  Union{Missing, Float64}
 :INTEGER  Union{Missing, Int64}

In this case all columns except DATE have missing values:

julia> DataFrame(dbf)
7×6 DataFrame
 Row │ CHAR     DATE      BOOL     FLOAT            NUMERIC          INTEGER
     │ String?  String    Bool?    Float64?         Float64?         Int64?
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Bob      19900102    false       10.21            11.21              100
   2 │ John     20010203     true      100.99            12.21              101
   3 │ Bill     20100304    false        0.0             13.21              102
   4 │ missing  19700101  missing        0.0              0.0                 0
   5 │ missing  19700101     true  missing                1.11111e9  2222222222
   6 │ missing  19700101     true        3.33333e9  missing          4444444444
   7 │ missing  19700101     true        5.55556e9        6.66667e9     missing

However, when a column is actually accessed, the copied vector is correctly narrowed:

julia> dbf.DATE
7-element Vector{String}:
 "19900102"
 "20010203"
 "20100304"
 "19700101"
 "19700101"
 "19700101"
 "19700101"
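
For illustration only, here is a minimal sketch in plain Julia of what this kind of narrowing can look like. It is not taken from DBFTables.jl, and the helper name `narrow` is hypothetical. It assumes the column is already in memory and uses only Base functions (`ismissing`, `nonmissingtype`, `convert`):

```julia
# Hypothetical helper, not part of DBFTables.jl: narrow a column's element
# type only when a full scan finds no `missing` values.
function narrow(col::AbstractVector)
    # This scan over all values is the cost that a header-only schema avoids.
    any(ismissing, col) && return col
    # No missing values found: copy into a vector of the non-missing type.
    return convert(Vector{nonmissingtype(eltype(col))}, col)
end

dates = Union{Missing,String}["19900102", "20010203"]
chars = Union{Missing,String}["Bob", missing]

eltype(narrow(dates))  # String
eltype(narrow(chars))  # Union{Missing, String}
```

The sketch shows why the schema cannot be narrowed up front without reading the data: deciding between the two element types requires checking every value.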

juliohm commented 3 years ago

I got it. If the type is correctly inferred when the column is accessed directly, that shouldn't be an issue. Thanks for the quick reply.