Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

Can't read file from pandas library #53

Open chuongmep opened 5 months ago

chuongmep commented 5 months ago

When I try to read the file with pandas in Python, it returns nothing. Is this related to the HDF5 schema version?

Thank you

import numpy as np
import pandas as pd
#%pip install tables -U
import warnings
import os
import time
from tables import NaturalNameWarning
warnings.filterwarnings('ignore', category=NaturalNameWarning)
filePath = r"file.h5"
store = pd.HDFStore(filePath)
store.open()
group = store.groups()
group

This is the C# test code:

    [Test]
    public void TestSaveHdf()
    {
        var file = new H5File()
        {
            ["my-group"] = new H5Group()
            {
                ["numerical-dataset"] = new double[] { 2.0, 3.1, 4.2 },
                ["string-dataset"] = new string[] { "One", "Two", "Three" },
                Attributes = new()
                {
                    ["numerical-attribute"] = new double[] { 2.0, 3.1, 4.2 },
                    ["string-attribute"] = new string[] { "One", "Two", "Three" }
                }
            }
        };
        file.Write("file.h5");
    }

    [Test]
    public void TestReadHdf()
    {
        // root group
        var file = H5File.OpenRead("file.h5");

        // sub group
        var group = file.Group("my-group");

        // dataset
        var dataset = group.Dataset("numerical-dataset");
        var datasetData = dataset.Read<double[]>();
        foreach (var item in datasetData)
        {
            Console.WriteLine(item);
        }
    }

Apollo3zehn commented 5 months ago

Thanks for your issue report. Maybe pandas does not support the newest HDF5 file layout. I will check this evening :-)

Apollo3zehn commented 5 months ago

Apart from that, I think pandas uses a different HDF5 dataset layout (i.e. a multidimensional dataset to represent a dataframe). You may have more luck with h5py.
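To make the layout difference concrete, here is a minimal sketch of the idea only. It shows how a dataframe can be viewed as one 2-D block of values plus two sets of axis labels, in contrast to the separate 1-D datasets written by the C# test above. The column names and values are made up for illustration, and the sketch does not reproduce pandas' actual on-disk layout.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe (names and values are assumptions, not from the issue).
df = pd.DataFrame(np.arange(6.0).reshape(2, 3), columns=["a", "b", "c"])

# Conceptually, a dataframe is one multidimensional block of cell values ...
values = df.to_numpy()

# ... plus the labels for each axis.
column_labels = list(df.columns)
row_labels = list(df.index)

print(values.shape)
print(column_labels, row_labels)
```

A file that only contains independent 1-D datasets, such as "numerical-dataset" and "string-dataset", carries none of this block-plus-labels structure, which is one reason a dataframe reader may not recognize it.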

chuongmep commented 5 months ago

Thank you for your help. Does that mean the problem is a different HDF5 version? I'm just confused because I'm writing the data from C# and reading it from Python.

chuongmep commented 5 months ago

To provide more information: with h5py it works well, apart from a small problem with string encoding.

# read hdf5 file
filePath =r"file3.h5"
import h5py
f = h5py.File(filePath, 'r')
list(f.keys())
# get dataset inside group
dataset = f['Category']
# get member inside dataset
list(dataset.keys())
member = dataset['table']
# show dataframe 
import pandas as pd
import numpy as np
arr = np.array(member)
print(arr.dtype)
# build a DataFrame; the string fields are still bytes at this point
df = pd.DataFrame(arr)
df
Id  Name       Address
1   b'Hoang'   b'Hanoi'
2   b'Chuong'  b'Hanoi'
3   b'Huy'     b'Hanoi'
4   b'Hieu'    b'Hanoi'
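The b'...' prefixes in the output mean the strings arrive as Python bytes rather than str. One possible way to handle this on the pandas side is to decode every bytes column after the DataFrame has been built. The sketch below assumes the strings are UTF-8 encoded; the helper name decode_bytes_columns is hypothetical, and the small DataFrame is a stand-in for the one built from the h5py dataset.

```python
import pandas as pd

# Stand-in for the DataFrame built from the h5py dataset above.
df = pd.DataFrame(
    {
        "Id": [1, 2],
        "Name": [b"Hoang", b"Chuong"],
        "Address": [b"Hanoi", b"Hanoi"],
    }
)


def decode_bytes_columns(frame, encoding="utf-8"):
    """Return a copy in which all-bytes columns are decoded to str.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    out = frame.copy()
    for col in out.columns:
        # Only touch columns whose values are all bytes objects.
        if out[col].map(lambda v: isinstance(v, bytes)).all():
            out[col] = out[col].map(lambda v: v.decode(encoding))
    return out


decoded = decode_bytes_columns(df)
print(decoded)
```

Numeric columns such as Id are left untouched because none of their values are bytes.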

C# code:

    [Test]
    public void TestSaveDatatableToHdf()
    {
        string group = "Category";
        DataTable dataTable = new DataTable();
        dataTable.Columns.Add("Id", typeof(int));
        dataTable.Columns.Add("Name", typeof(string));
        dataTable.Columns.Add("Address", typeof(string));
        dataTable.Rows.Add(1, "Hoang", "Hanoi");
        dataTable.Rows.Add(2, "Chuong", "Hanoi");
        dataTable.Rows.Add(3, "Huy", "Hanoi");
        dataTable.Rows.Add(4, "Hieu", "Hanoi");

        // Convert DataTable to array of TableRow
        TableRow[] array = dataTable.AsEnumerable()
            .Select(row => new TableRow
            {
                Id = row.Field<int>("Id"),
                Name = row.Field<string>("Name"),
                Address = row.Field<string>("Address")
            })
            .ToArray();

        // Add to HDF using the compound data type
        var file = new H5File()
        {
            [group] = new H5Group()
            {
                ["table"] = array,
            }
        };

        file.Write("file3.h5");
    }

struct TableRow
{
    public int Id;
    public string Name;
    public string Address;
}

Apollo3zehn commented 5 months ago

This is the line where Pandas calls into pytables:

https://github.com/pandas-dev/pandas/blob/84aca21d06574b72c5c1da976dd76f7024336e20/pandas/io/pytables.py#L1501

I debugged until that line of code and could see that pytables found the group named my-group.

And here pytables filters out all groups that do not have the pandas_type attribute set: https://github.com/pandas-dev/pandas/blob/84aca21d06574b72c5c1da976dd76f7024336e20/pandas/io/pytables.py#L1505
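The effect of that filtering step can be pictured with a simplified, hypothetical model. This is not the code behind the link, which may apply further conditions. Groups are modeled here as plain dicts, and the value "frame" for pandas_type is only an illustrative placeholder.

```python
# Simplified model of the filter described above (NOT pandas' actual code):
# each group is a dict holding its path and its HDF5 attributes.
def pandas_visible_groups(groups):
    """Keep only the groups that carry a non-empty pandas_type attribute."""
    return [g["name"] for g in groups if g["attrs"].get("pandas_type")]


groups = [
    # Group as written by the C# test: ordinary attributes, no pandas_type.
    {"name": "/my-group", "attrs": {"numerical-attribute": [2.0, 3.1, 4.2]}},
    # Group as pandas itself would write it (attribute value is illustrative).
    {"name": "/test", "attrs": {"pandas_type": "frame"}},
]

print(pandas_visible_groups(groups))  # prints ['/test']
```

Under this model, a file that only contains "/my-group" yields an empty list, which matches the "returns nothing" behavior reported at the top of this issue.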

When you create a very simple pandas HDF5 file like this

import numpy as np
import pandas as pd
hdf = pd.HDFStore('hdf_file.h5')
df = pd.DataFrame(np.random.rand(5,3))
hdf.put('test', df)

and then open that file in e.g. HDFView, you will see how Pandas stores the data internally (and also the attribute pandas_type mentioned above):

[Screenshot: the pandas-generated file opened in HDFView]

So I think you need to create an HDF5 file with the group/attribute/dataset structure that Pandas expects.

chuongmep commented 5 months ago

@Apollo3zehn, do you have a C# example that can help with that?

Apollo3zehn commented 5 months ago

@chuongmep I do not have any examples for creating Pandas-compatible HDF5 files via PureHDF. But as shown in my previous post, you can easily create a test HDF5 file using Pandas and look at what Pandas expects to be present in that file. It does not look too difficult to mimic that format with PureHDF.

Alternatively you could read the data produced by PureHDF into Python via h5py and then convert that dataset to a Pandas dataframe (I think this is what you did here: https://github.com/Apollo3zehn/PureHDF/issues/53#issuecomment-1893850100).

So to summarize: No, unfortunately, there is no "Pandas compatible" mode yet, but I will add it to my todo list.

chuongmep commented 5 months ago

Thank you for your help, that will be useful for me!