JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License

Opening Pytables/Pandas file #92

Open houshuang opened 10 years ago

houshuang commented 10 years ago

I am very interested in opening files created with Pandas/Pytables in Julia. I didn't see it mentioned anywhere that it was not supposed to work, so I tried. I can open the HDF5 file and see the contents, and read in a Pandas series, which works great.

However, when I try to read in the main table, I get the following:

julia> read(a["db"])
ERROR: no method hdf5_type_id(Type{FixedArray{Float64,(DimSize{17},)}})
 in read at /home/stian/.julia/v0.3/HDF5/src/plain.jl:1240
 in read at /home/stian/.julia/v0.3/HDF5/src/plain.jl:1060
 in read at /home/stian/.julia/v0.3/HDF5/src/plain.jl:1048
 in read at /home/stian/.julia/v0.3/HDF5/src/datafile.jl:45

The main table does indeed have 17 columns of floats. Ideally it would be possible to read these into a DataFrame... Am I doing something wrong? Is this supposed to work, but there's a bug? Or is it not currently implemented (in which case I might play around with trying to get it to work)?

Thanks!

timholy commented 10 years ago

To my knowledge this hasn't been tried, but the goal is to get to the point where we can read any HDF5 file.

From where the error is occurring, this might be an easy fix or it could take some digging. This is a pretty long message (sorry), but most is background and the strategy (near the end) should be pretty simple.

Here's my suspicion of what's happening. It's reading an HDF5 Compound data type, perhaps corresponding to a row of the DataFrame. Compound types correspond to Julia immutables or C structs. In this case, one of the fields inside that compound data type is an array of 17 Float64s. In HDF5 parlance this is called an H5T_ARRAY type; these differ from more commonly-used arrays by having a fixed size (17 in this case).

Now some HDF5.jl background. Since Julia doesn't have a fixed-size array type, a FixedArray is just a "dummy type" internal to the HDF5 module that encapsulates the information about how the object should be represented. If you search for FixedArray in plain.jl, you'll find that when read, they normally get loaded into a regular array. However, since in this case it is just one field of an H5T_COMPOUND type, that won't work; you'll need to read it in either as one field of an immutable or just as a set of bytes in a plain buffer.

HDF5.jl's support for H5T_COMPOUND objects is on the rudimentary side, but that may not be a bad thing here. What will happen is that your information will be returned as little more than an opaque buffer (an HDF5Compound object), but you could reinterpret it as an array of whatever immutable type you want, and from there convert to a DataFrame.
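To make the reinterpretation idea concrete, here is a rough sketch using only base Julia. The `Row` struct, its field layout, and the byte buffer `buf` are made up for illustration; they stand in for whatever the file actually contains, and a real compound type may carry padding or a different member order, so treat this as an assumption-laden example rather than what HDF5.jl does internally.

```julia
# Hypothetical row layout: one Int64 index followed by two Float64 values.
# This mimics a compound type with a fixed-size array member.
struct Row
    index::Int64
    values::NTuple{2,Float64}
end

# Stand-in for the raw bytes of a compound dataset (three packed rows).
rows_in = [Row(i, (i + 0.5, 2.0i)) for i in 1:3]
buf = collect(reinterpret(UInt8, rows_in))

# View the bytes as rows again, then split the members into columns.
rows = reinterpret(Row, buf)
index = [r.index for r in rows]                 # one entry per row
vals = [r.values[j] for r in rows, j in 1:2]    # rows × 2 matrix of Float64
```

From `index` and `vals` it should be straightforward to build whatever table type you prefer.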

I'd guess that a great (and fairly easy) first step would be simply to define that missing version of hdf5_type_id. It's essentially the inverse of hdf5array, going from the Julia type to declaring an H5T_ARRAY with the proper information in it.

tbenst commented 4 years ago

Found this issue via Google.

julia> h5 = h5open("example.h5","r")
julia> x = read(h5["/pandas/frame_df"])
Dict{String,Any} with 3 entries:
  "meta"     => Dict{String,Any}("values_block_2"=>Dict{String,Any}("meta"=>Dict{String,Any}("_i_table"=>Dict{String,Any}("values"=>Dict{String,Any}("mbounds"=>String[],"abounds"=>String[],"mr…
  "_i_table" => Dict{String,Any}("index"=>Dict{String,Any}("mbounds"=>[512, 1536, 2560, 3584, 4608, 5632, 6656, 7680, 8704, 9728  …  178688, 179712, 180736, 181760, 182784, 183808, 184832, 185…
  "table"    => HDF5.HDF5Compound{4}[HDF5Compound{4}((0, [0.0, 0.0], [0, 0], Int8[0]), ("index", "values_block_0", "values_block_1", "values_block_2"), (Int64, FixedArray{Float64,(2,)}, FixedA…
julia> x = read(h5["/pandas/frame_df/table"])
346507-element Array{HDF5.HDF5Compound{4},1}:
 HDF5.HDF5Compound{4}((0, [0.0, 0.0], [0, 0], Int8[0]), ("index", "values_block_0", "values_block_1", "values_block_2"), (Int64, HDF5.FixedArray{Float64,(2,)}, HDF5.FixedArray{Int64,(2,)}, HDF5.FixedArray{Int8,(1,)}))
[...]
julia> df = DataFrame(x);
julia> first(df,1)
1×3 DataFrame
│ Row │ data                         │ membername                                                      │ membertype                                                                       │
│     │ Tuple…                       │ NTuple{4,String}                                                │ NTuple{4,DataType}                                                               │
├─────┼──────────────────────────────┼─────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ (0, [0.0, 0.0], [0, 0], [0]) │ ("index", "values_block_0", "values_block_1", "values_block_2") │ (Int64, FixedArray{Float64,(2,)}, FixedArray{Int64,(2,)}, FixedArray{Int8,(1,)}) │
julia> names(attrs(h5["/pandas/frame_df"]))
16-element Array{String,1}:
 "CLASS"         
 "TITLE"         
 "VERSION"       
 "data_columns"  
 "encoding"      
 "errors"        
 "index_cols"    
 "info"          
 "levels"        
 "metadata"      
 "nan_rep"       
 "non_index_axes"
 "pandas_type"   
 "pandas_version"
 "table_type"    
 "values_cols"

It seems the situation has improved! Can read everything at least. Should this now be a feature request on DataFrames.jl?
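In the meantime, here is a minimal sketch of how the records could be split into named columns by hand. It relies only on the `data` and `membername` fields visible in the output above. `FakeCompound` and `columns` are hypothetical names introduced so the example runs without the original file; `FakeCompound` is not the HDF5.jl type.

```julia
# Stand-in mirroring the fields seen in the output above (data, membername).
# It is NOT the HDF5.jl type; it only lets this sketch run on its own.
struct FakeCompound{N}
    data::NTuple{N,Any}
    membername::NTuple{N,String}
end

membernames = ("index", "values_block_0", "values_block_1", "values_block_2")
x = [FakeCompound{4}((i, [0.0, 1.0], [0, i], Int8[0]), membernames) for i in 0:2]

# Build one vector per member, keyed by the member name.
function columns(rows)
    colnames = first(rows).membername
    cols = [[r.data[j] for r in rows] for j in eachindex(colnames)]
    return NamedTuple{Symbol.(colnames)}(Tuple(cols))
end

cols = columns(x)
cols.index   # the "index" member of every record, as one vector
```

Since the `DataFrame` constructor accepts a NamedTuple of vectors, `DataFrame(cols)` would then give one column per member instead of the `data`/`membername`/`membertype` columns shown above. The fixed-size array members stay as per-row vectors here; spreading them into separate columns according to the pandas attributes is left out.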

musm commented 3 years ago

Without the test files it's hard to know what is working and what isn't. Are we missing something on the HDF5.jl side here, or can I close this?