Open sylvaticus opened 7 years ago
Note that the problem only affects the first column, i.e. this works:
a = Union{String,DataArrays.NAtype}["aa","bb","aa","bb",NA,NA]
b = Union{String,DataArrays.NAtype}["c","c","d","d","c","d"]
v = Float64[1.0,2.0,3.0,4.0,5.0,6.0]
t = IndexedTables.Table(Columns(a=b,b=a),v)
There is some odd behavior with NA values:
using DataArrays, JuliaDB
a = @data([NA,1,2,3])
t1 = table(a, names=[:a])
t1[1] # MethodError
rows(t1) # Error displaying
But it works fine if the column name is not defined:
t2 = table(a)
t2[1]
rows(t2)
This is because of:
julia> eltype(DataArrays.DataArray{Int64, 1})
Int64
It should really be Union{Int64, DataArrays.NAtype}
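For comparison, here is a minimal sketch of what an "honest" element type looks like. It assumes Julia 1.3 or later, where `missing` and `nonmissingtype` live in Base. That is newer than the versions discussed in this thread, so it illustrates the idea rather than the DataArrays behavior itself.

```julia
# Assumption: Julia 1.3+, where `missing` and `nonmissingtype` are in Base.
a = [missing, 1, 2, 3]

# The element type includes Missing, so generic code (such as a table
# constructor) can tell that values may be absent.
T = eltype(a)
println(T)

# Stripping the Missing part recovers the underlying value type.
println(nonmissingtype(T))
```

An array type whose `eltype` reports only `Int64` while it can return an NA gives downstream code no way to allocate storage for the missing case.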
...
Some options here:
convert the columns into a DataValueArray, see https://github.com/davidanthoff/DataValueArrays.jl#constructors (this type is now exposed by DataValues.jl, so using DataValues should help)
construct the table manually using columns.
I do wish this just worked.
So NA will be deprecated from Julia; I guess I should learn to deal with the new null type.
Since missing is used by DataFrames 0.11 and will be in Base, I guess it's better to use Missings instead of DataValues.
Converting the columns into a Union{Missing, T} seems to be working. I will try this approach.
Thanks for your help!
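A minimal sketch of the conversion mentioned above, assuming Julia 1.x with `missing` in Base. The helper name `to_missing` and the empty-string sentinel are hypothetical, chosen only for illustration; they are not part of JuliaDB, IndexedTables, or Missings.

```julia
# Assumption: Julia 1.x. `to_missing` is a hypothetical helper, not a package API.
# It replaces a sentinel value with `missing` and widens the element type
# to Union{Missing, T}, so the column reports its missingness honestly.
function to_missing(col::AbstractVector{T}, sentinel::T) where {T}
    return Union{Missing, T}[x == sentinel ? missing : x for x in col]
end

# Example column in which "" stands in for "no region".
region = ["FR", "UK", "FR", "UK", "", ""]
col = to_missing(region, "")

println(eltype(col))
println(count(ismissing, col))
```

The typed comprehension fixes the element type up front, so the result is a `Vector{Union{Missing, String}}` even when the input happens to contain no sentinel values.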
Yeah, that will work too. As long as the array type doesn't lie about its element type, we're in business. We will be switching JuliaDB over to missing only in the 0.7-compatible release once 0.7 alpha is released. This is because performance would suffer on 0.6.
Just a minor comment here. The issue with eltype for DataArrays has actually been fixed in version 0.7.0, but unfortunately it will probably be difficult to install that version because a lot of upper bounds have been added to other packages.
@YongHee-Kim You might be able to see why DataArrays isn't updated by executing Pkg.update("DataArrays").
@andreasnoack Thank you for the tip! It seems ExcelReaders is holding back the DataArrays update.
I've created an issue on ExcelReaders.
This should work smoothly by just loading IterableTables.jl and then doing something like this:
using DataFrames, IndexedTables, IterableTables
df = DataFrame(
param = ["price","price","price","price","waterContent","waterContent"],
item = ["banana","banana","apple","apple","banana", "apple"],
region = ["FR","UK","FR","UK","",""],
value = [3.2,2.9,1.2,0.8,0.2,0.8]
)
df[5,:region] = NA
df[6,:region] = NA
it = table(df)
IterableTables.jl will handle all the various different representations of missing data that float around. In general when things get converted via that route I just respect what a given container sees as its default representation for missing data, and then convert accordingly.
Only caveat is that I haven't merged the support for DataFrames.jl v0.11 yet, but that should happen relatively soon (and then it will work with both the old and new DataFrames at the same time).
Hello, I am trying to convert a DataFrame with NA values to an IndexedTable, e.g.:
I did try to construct the Table from vectors where there are NA values, but the constructor fails:
However, if I add NA values once the IndexedTable is created, it works great. So, in spite of possible performance problems, I thought of first creating an empty IndexedTable and then "appending" to it, but it seems that when you construct an empty IndexedTable, this object misbehaves:
So, which is the preferred way to deal with IndexedTables when some dimensions may contain NA values?