Closed by juliohm 3 years ago
The Schema is currently calculated from the header only, so at that point we don't yet know whether missings are present in the table. As far as I know, the only way to find out is to go over all the data, which doesn't seem like a very attractive option. How does this issue affect you?
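To make the cost concrete, here is a minimal sketch in plain Julia. `scan_for_missing` is a hypothetical helper, not part of DBFTables.jl. It stops at the first missing value, but a column without any missings can only be confirmed clean after every element has been visited.

```julia
# Hypothetical helper (not a DBFTables.jl function): report whether a column
# contains a missing value and how many elements had to be visited to decide.
function scan_for_missing(col)
    for (i, x) in enumerate(col)
        # A missing value settles the question early.
        ismissing(x) && return (found = true, visited = i)
    end
    # No missing found: the whole column had to be read to be sure.
    return (found = false, visited = length(col))
end

with_missing = Union{Missing,Int}[100, missing, 102]
without_missing = Union{Missing,Int}[100, 101, 102]

scan_for_missing(with_missing)
scan_for_missing(without_missing)
```

For a table where most columns are complete, this amounts to a full pass over the data before the schema can be reported.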
Using the test DBF, it indeed shows all types as possibly missing:
julia> dbf = DBFTables.Table("test.dbf")
DBFTables.Table with 7 rows and 6 columns
Tables.Schema:
:CHAR Union{Missing, String}
:DATE Union{Missing, String}
:BOOL Union{Missing, Bool}
:FLOAT Union{Missing, Float64}
:NUMERIC Union{Missing, Float64}
:INTEGER Union{Missing, Int64}
In this case all columns except DATE have missing values:
julia> DataFrame(dbf)
7×6 DataFrame
 Row │ CHAR     DATE      BOOL     FLOAT      NUMERIC    INTEGER
     │ String?  String    Bool?    Float64?   Float64?   Int64?
─────┼───────────────────────────────────────────────────────────────
   1 │ Bob      19900102  false    10.21      11.21      100
   2 │ John     20010203  true     100.99     12.21      101
   3 │ Bill     20100304  false    0.0        13.21      102
   4 │ missing  19700101  missing  0.0        0.0        0
   5 │ missing  19700101  true     missing    1.11111e9  2222222222
   6 │ missing  19700101  true     3.33333e9  missing    4444444444
   7 │ missing  19700101  true     5.55556e9  6.66667e9  missing
However, when you actually access a column, the copied vector is correctly narrowed:
julia> dbf.DATE
7-element Vector{String}:
"19900102"
"20010203"
"20100304"
"19700101"
"19700101"
"19700101"
"19700101"
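The narrowing shown above can be reproduced in plain Julia. This is only a sketch of the idea, not necessarily how DBFTables.jl does it internally. `narrow` is a hypothetical helper name, while `nonmissingtype` is the standard Base function that strips Missing from a type.

```julia
# Hypothetical helper (an assumption, not the DBFTables.jl implementation):
# copy a column into a vector without Missing in its element type when no
# missing value is actually present; otherwise return the column unchanged.
function narrow(col::AbstractVector)
    T = nonmissingtype(eltype(col))
    return any(ismissing, col) ? col : convert(Vector{T}, col)
end

# Columns as a header-only schema would declare them: Missing is allowed.
date = Union{Missing,String}["19900102", "20010203", "19700101"]
bool = Union{Missing,Bool}[false, true, missing]

eltype(narrow(date))  # String
eltype(narrow(bool))  # Union{Missing, Bool}
```

The return type depends on the data, so the helper is not type-stable. That is the usual trade-off of narrowing at access time.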
I got it. If the type is correctly inferred when the column is accessed directly, that shouldn't be an issue. Thanks for the quick reply.
Can we refactor this code so that the type is inferred correctly when no missing values are present?
https://github.com/JuliaData/DBFTables.jl/blob/98520b9a8819c99f065dad12f5c17b4d40054e4c/src/DBFTables.jl#L295
Always returning a Union{T,Missing} is too rigid.
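A rough sketch of what a data-aware schema could look like, using only Base Julia and a NamedTuple of vectors as a stand-in for the table's columns. `narrowed_eltype` is a hypothetical name, and the sketch assumes the columns are already materialized, so it still pays for the full scan discussed earlier in the thread.

```julia
# Stand-in for a table: a NamedTuple of columns with the element types a
# header-only schema would declare (every column may contain Missing).
cols = (
    DATE = Union{Missing,String}["19900102", "19700101"],
    BOOL = Union{Missing,Bool}[false, missing],
)

# Hypothetical helper: the element type a column could report after a full
# scan. Missing is dropped only when no missing value is present.
narrowed_eltype(col) = any(ismissing, col) ? eltype(col) : nonmissingtype(eltype(col))

# map over a NamedTuple keeps the column names.
narrowed = map(narrowed_eltype, cols)

narrowed.DATE  # String
narrowed.BOOL  # Union{Missing, Bool}
```

Whether this belongs in the schema or only in column access is a design choice: the schema stays cheap if it is derived from the header, and becomes exact only at the cost of reading the data.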