DataFrame support - Githubissues

gicmo commented 7 years ago

Table like data-structure with each individual column having its own name, data-type, unit.

gicmo commented 7 years ago

Proof of Concept based on variants can be found here.

jgrewe commented 7 years ago

I'm loving it! This opens a lot of possibilities (like a link to pandas in python) and solves some issues that are inconvenient at the moment. At the same time, implementing this new feature and it's integration into the current data model may require changes that would make it a version 2.

s0nskar commented 7 years ago

I asked about DataFrame project in the forum and got redirect to here. When I first read about the project, I thought it is going to be implemented using numpy C bindings. But anyways the implementation mentioned above by @gicmo seems better as it also deals with the h5 file type. (Correct me if I'm wrong)

It would be very helpful if someone can provide me with details of how it is going to be used, link to issues that are inconvenient at the moment as mentioned by @jgrewe and any other resources for better understanding of the project. Thanks

jgrewe commented 7 years ago

Hi @s0nskar, yes the project is in the first place a C++ project. @gicmo implemented a proof of concept how one could actually store data-frame-like data in hdf5, respectively the nix format. If you need more background then the pandas documentation (http://pandas.pydata.org/pandas-docs/stable) might be a good starting point. It is not the aim to re-implement pandas but rather to implement the table- or spread-sheet-like storing of arbitrary data in nix using the hdf5 backend. If you are already familiar with pandas, the hdf documentation about compound types is a valuable resource.

A good starting point would be to checkout the gist referenced in @gicmo's comment get it running and start thinking how you would like to work with table like datasets and how an API should look like.

s0nskar commented 7 years ago

hey @gicmo, still unable to run the script, can me help me with all those compiler flags you mentioned on IRC.

jgrewe commented 7 years ago

Hi @s0nskar, actually I do not understand the problem. From the make output you posted on pastebin, you can see that the playground has been built successfully.

[100%] Built target playground

so there should now be an executable named playground in your build folder.

try:

./playground

It may be confusing that the main.cpp is compiled into playground, but that's what's happening.

s0nskar commented 7 years ago

Finally it worked !! And there was a missing break statement in script, I fixed that in my fork

s0nskar commented 7 years ago

Hey, everyone! I've been reading docs for HDF5 and NIX for quite a few time. And familiarized myself with nix codebase and data models. I've few doubts regarding this project which I want to discuss.

what's gonna be the scope of PoC. I mean what is the expected result of this project? Project description says "develop a proof-of-concept for such DataFrames in NIX and its python bindings(so pandas DataFrames can be read and written to NIX files)". What I get from it that we have to implement DataFrame class just like other Data models providing an interface class, implContainer class, and other required files but only with basic API's like create, delete and append and to develop python bindings such that pandas DataFrame can be taken as input by nixpy and can be written to NIX file. Or is that mean that I have to build something like gicmo created like whole code in single file thing.
how we gonna implement DataFrame. I saw the solution offered by @gicmo that create compound datatype and then write it to a dataset. But while reading through HDF5 docs, i found HDF5 table api which seem perfect for this as It handles the mess of manipulating tables like data by proving basic apis like create, read, append, delete and it also keeps the code clean. So any suggestion on which to use or any other ways would be helpful. And should I start preparing a basic PoC using table API?
One more, please bear with me. What is specific need of DataFrame. Using a DataFrame is better than creating multiple DataArrays for storing data. I will be easy to manipulate and visualize the data by the user instead of dealing with multiple dataarray. Is there any other significant use that I am missing.

Thanks

jgrewe commented 7 years ago

Hi @s0nskar,

as you correctly stated the data frames (or tables) are a more convenient way of storing related values of different type in a compact way. You can also imagine it as an excel spreadsheet stored in a hdf5 file. In our scientific contexts we often deal with parameter sets that are hard (in the sense of inconvenient) to persist as individual dataArrays. Regarding the scope of the project: The primary aim should be to indeed implement the support for such tables in the nix API. As I see it, the tricky part is to create the compound data type, to read/write/interpret the data and offering it to the user in a convenient way. The basic implementation offered by @gicmo is part of the solution. tbh, I need to read the hdf5 table api, but from a quick scan I understand that one still has to define the compound type... A PoC based on the TableAPI would, of course be fine. So far, nix builds on HDF5 version 1.8.18.

Supporting reading in pandas DataFrames to nix and exporting nix-dfs to pandas is an add-on. Since Python is highly relevant for data analysis, anything that can be achieved on that end would be great.

jgrewe commented 7 years ago

@s0nskar, one thing that crossed my mind looking at the examples. We do not know what the table structure would be. Further the example codes do not use variable length strings.... The solution we are aiming for is a little more dynamic ;)

s0nskar commented 7 years ago

Thanks, @jgrewe for your insight. I reanalyzed both solutions, and now I understand why @gicmo implementation would be better in our context. At the moment I'm working what API's to be provided for DataFrame by looking at other data models implementation (and a little reference from pandas API too). Will update soon.

s0nskar commented 7 years ago

A brief description of the implementation of DataFrame. (Feedback Required) We've to create these files:

include/nix/base/IDataFrame.hpp -> Provides a Abstract/Base class IDataFrame for DataFrame that inherits from different classes too.
include/nix/DataFrame.hpp -> Implementation of DataFrame class, declaration of all the member functions.
src/DataFrame.cpp -> Which contains definations of various menber functions.
src/Block.cpp -> Add functions defination and declaration related to DataFrame in Block class too.
BaseTestDataFrame.cpp and BaseTestDataFrame.hpp for basic tests (is that required for PoC?)

It's still not clear to me that i should create all these files or develop it as a script (standalone) :confused: Anyway i'm going with the described one

Frontend api it adds. Block:

createDataFrame -> Create DataFrame from name, and header.
hasDataFrame -> Check a DataFrame exists in Block
deleteDataFrame -> Delete certain DataFrame from Block (Also adding declaring them in Block class defination)

DataFrame: (Functions name are self-explanatory)

writeRow
readRow
readCol
writeRows
readRows
readCols
dataType(s)
unit(s)
other required methods like output stream and other stuff like that.

s0nskar commented 7 years ago

^ ping @jgrewe @achilleas-k

jgrewe commented 7 years ago

^pong @s0nskar you are not forgotten, have some patience

Hell0Kitty commented 7 years ago

Hello guys,

I am also a GSoC candidate student. I have been following this issue, and I wanna discuss my own idea of the project with your guys. (ps: I am not a native English speaker, so please kindly ask me question if I make things unclear.) Hope to get feedback from you :)

Firstly, I wanna Thank s0nkar, since the question he asked is really helpful and enlightening, which makes me think deeper of the problem.

However, I do not agree with some of the implementations you mention above. I think using inheritance here is somehow unclear. Since in inheritance, the type of data members in the base type can't be changed. Then whatever types you choose for the base class data member, children class will inherit it. Of course you can create new data members in the children type, but then these inherited from base type will become unused and inefficient. There may be a way to arrange a inheritance structure nicely, but there is a more straight forward "C++" solution.

My Implementation Idea:

Using Template Class for Column Object : C++ has one overwhelming feature is that it supports templates. Then we don't need to bother with the inheritance structure. We can directly plug in the idea of template here: I will build a Column_Object template class here:

template \<typename T> class Column_Object{ std::string title; // name of column std::vector column;
unsigned row_num; // count number of existing rows; unsigned colomn_num; // position of this column; };

With this Column_Object, we can easily delete/add columns, add/delete elements of rows. (which supports adding/deleting of rows, explained later)

The Column_Objects are organized in a multi-map: map_of_row, with their titles as key value.
Support Simple Row Deletion and Addition The need of simple add/delete rows is very import in scientific concept ! When adding a row, we just traversal the map_of_row and inert element in one by one.

Hell0Kitty commented 7 years ago

I am very passionate about data structure and algorithm. The resources in previous discussion make me learned a lot about pandas and htf5 :) Thank you guys very much.

s0nskar commented 7 years ago

Hi @Hell0Kitty, I just took a quick overview of your implementation. And I think the idea of adding Column_Object is nice, But Isn't it's same as creating multiple one-dimensional arrays and storing them in a multimap (Correct me if I am wrong). If it's what I said, then it will create a problem in visualizing data with HDF Viewer, which I think is kind of important in our context.

UPD: Well this raises one more question for my implementation that how it is going to handle Add/Delete Column which is facilitated nicely by @Hell0Kitty implementation. What happening in the current implementation is we are creating a one-dimensional DataSet which stores compound DataType. And memory blocks assigned for each row is contagious. So supporting Addition and Deletion is not gonna be that easy. The only solution comes to my mind that on Add/Delete column operation, we can create a new a new DataFrame exactly same as before but with new memory block or deleted memory block. But it looks a lot expensive, and I'm keen to know a better solution for this.

Hell0Kitty commented 7 years ago

Sorry, can you specify the problem a little bit more :)

Hell0Kitty commented 7 years ago

@s0nskar I some how don't know why "multiple one-dimensional arrays and stored in a multi-map" can't work with HDF Viewer? Could you please explain it for me? I think I have some backup plans, but let's specify the problem first ! :)

s0nskar commented 7 years ago

Well if you download this you can see first one is how it is gonna open if we store data in Compound data type DataFrame and the other folder named column_.. is how data will store in your implementation. So now if there are like 100 columns than it is gonna be quite complex visualizing data, though it's completely my opinion. You should wait until a mentor comments.

Hell0Kitty commented 7 years ago

Thanks ~ I will try it

jgrewe commented 7 years ago

@s0nskar: you are on the right track. The proposed changes would be needed to directly integrate the DataFrame directly into NIX. For such an integration proper testing would be a requirement. In fact I suggest to write tests right from the start. It helps to formalise your expectations and reassures when you need to make changes. Regarding the API for the DataFrame you are kind of scratching the surface. How, for example, do you want the user to handle the table header, or interact with reading/writing of slices? For your proposal I suggest to elaborate on one of such problems and show how you would solve it with a real piece of code.

@Hell0Kitty: Interesting idea which might work out for an in-memory system but did you consider that the NIX library does actually not store information in memory? Almost the whole lib works file-attached. Further, the templating alone would not solve the problem. You would still need to map the types to a compound type. Actually, I am not sure, if I correctly understood how you want to persist your Objects in the HDF5 file. When this is done, as sOnskar suggests, as individual datasets then this is definitely not the way to go. I suggest you familiarise yourself with the code base a bit more and maybe play around with the POC linked at the top of this thread.

s0nskar commented 7 years ago

Thanks, @jgrewe. By handling the table header, I think you mean providing API for changing column name and units. For that we can provide an API like

bool editHeader(vector<vector<string>> edited_hdr)

where,
edited_hdr = {
    {"int64", "mV"},
    {"units", "V"}
} // Same as constructor header but lacking the DataType attribute.

We can add more attached API's, like for editing individual columns. But IMO the mentioned one will be good for starting.

For reading/writing of slices I'm already providing

vector<vector<Variant>> readRows(init_row_index, final_row_index)
vector<vector<Varient>> readCols(vector<string> &columns_list)

wirteRows can be a bit tricky because we might need to overwrtie some rows and insert some rows depending on the indices provided and I can provide combine rows and columns too(I am not sure what you meant in above statement)

vector<vector<Variant>> readRowsCols(init_row_index, final_row_index, vector<string> &columns_list)

I suggest to elaborate on one of such problems and show how you would solve it with a real piece of code.

And for proposal, i can add readRows/writeRows implementation extending the above PoC. Let me know if that's sufficient.

jgrewe commented 7 years ago

@s0nskar Do you actually sleep at some time of the day? ;) Yes it is a start but remember, if accepted, you would have three months of coding time... Try to imagine, what you, as a user, would like to be able to do. Try also to show us in the proposal (not in this thread) how you would solve it on the C++ level. The best way would be to have small pieces of real running code that sketch your approach and which you can submit with your proposal. Edit: It is not expected that you have everything solved at the time of the proposal but we need to know that you will be able to solve the task and have the skills that are needed.

s0nskar commented 7 years ago

@jgrewe Yes I do sleep, that's what college lectures are for. :stuck_out_tongue_winking_eye: Just kidding actually it's my exam time, so I am pulling all nighter for few days. And yes I will add few code snippets for few methods. I will also add you and other mentors to my proposal repo once the I complete the first draft for review (probably in 2-3 days. will add code snippets and other stuff after exams)

if accepted, you would have three months of coding time... Try to imagine, what you, as a user, would like to be able to do

Providing better and more API's, of course, will make more convenience to the user and i'd love to add more API's but isn't the purpose of the project is providing proof of concept. 'Cause I think the task of providing support for NIX data frame and providing its python wrapper is already huge task. We can add more API's in optional section. But it's just my own opinion, Any views?

gicmo commented 6 years ago

Done with PR #708

G-Node / nix

DataFrame support #656