datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1
169 stars 84 forks

Using `pandas` data structures in core implementation #70

Closed: eywalker closed this issue 9 years ago

eywalker commented 9 years ago

Some handling of data requires us to perform operations like project and join on a fetched data structure together with a data structure passed in by the user. Our current go-to data structure is the numpy record array, which provides no such methods (at least by default). On the Matlab side, join is also not available by default but is provided by DataJoint as `dj.struct.join`. We could do the same for record arrays, but the pandas package already provides the `DataFrame` object with all such methods implemented. pandas is a fairly standard data analysis package in Python numerical computing, so it may not be a bad idea for us to use pandas and its data structures (`DataFrame` and perhaps `Series`) directly in our implementation.
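A minimal sketch of the use case described above, assuming the fetched result arrives as a numpy record array (the data and field names here are made up for illustration): once wrapped in a `DataFrame`, join and projection come for free.

```python
import numpy as np
import pandas as pd

# Stand-in for a fetched result: a numpy record array, which by itself
# has no join/project methods.
fetched = np.array([(1, 0.5), (2, 0.7)],
                   dtype=[('trial_id', 'i4'), ('score', 'f8')])

# User-supplied data keyed on the same attribute.
user_df = pd.DataFrame({'trial_id': [1, 2], 'label': ['a', 'b']})

# pandas provides join (merge) and projection (column selection) out of the box.
joined = pd.DataFrame(fetched).merge(user_df, on='trial_id')
projected = joined[['trial_id', 'label']]
```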

dimitri-yatsenko commented 9 years ago

please include examples of query results and how they can be used.

dimitri-yatsenko commented 9 years ago

I think datajoint should interface nicely with pandas but not depend on it excessively.

fabiansinz commented 9 years ago

Agreed, although I don't know exactly what heavy dependence would mean. If we use pandas at all, datajoint will depend on it. But if you mean we shouldn't do things with pandas that MySQL or datajoint can do just as well, I agree.

dimitri-yatsenko commented 9 years ago

agreed. DataJoint's internal workings will not depend on pandas, but the results of queries should be easily converted to pandas for further transformations. Insert commands may accept pandas data structures as well.

eywalker commented 9 years ago

It would make sense to implement pandas-related features defensively. This would mean:

  1. Check whether the pandas package exists (probably within `__init__.py`) and set a datajoint package-wide flag.
  2. In methods like `insert`, check whether the passed-in data structure is pandas data (e.g. `DataFrame` or `Series`) and handle it accordingly. Unfortunately, pandas `DataFrame` iteration doesn't work the same way as numpy record array iteration, so we would have to make explicit checks.

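The two steps above could be sketched roughly as follows; `rows_for_insert` is a hypothetical helper, not an existing datajoint function, and the flag name is illustrative:

```python
# Step 1: detect pandas at import time and set a package-wide flag.
try:
    import pandas as pd
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False

def rows_for_insert(data):
    """Step 2 (illustrative): normalize input into an iterable of row dicts.

    DataFrame iteration differs from record-array iteration, so pandas
    input gets an explicit branch.
    """
    if HAS_PANDAS and isinstance(data, pd.DataFrame):
        # itertuples yields namedtuples; _asdict gives attribute -> value.
        return (row._asdict() for row in data.itertuples(index=False))
    return iter(data)
```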
dimitri-yatsenko commented 9 years ago

pandas does not appear to be strictly necessary, and the conversion is straightforward. I think we should just document how to convert fetched numpy arrays into pandas dataframes.
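For instance, the documented conversion could be a one-liner; the record array below stands in for the output of a fetch (field names are made up):

```python
import numpy as np
import pandas as pd

# Stand-in for a fetched numpy record array.
recs = np.array([(1, 'm001'), (2, 'm002')],
                dtype=[('session', 'i4'), ('mouse', 'U8')])

# pandas reads column names straight from the structured dtype.
df = pd.DataFrame(recs)
```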

dimitri-yatsenko commented 9 years ago

I closed this because I think these approaches need to be addressed individually. For example, if we need to insert a pandas dataframe, a new issue should be submitted with the specific problem. We will not provide broad integration with pandas as suggested by the title of this issue.

fabiansinz commented 9 years ago

Inserting a pandas dataframe is easy: `rel.insert(df.iterrows())`. I think we are good with pandas: it is easy to insert from and fetch into dataframes, so it makes more sense not to depend on an additional library.
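One caveat on the pattern above: `df.iterrows()` yields `(index, Series)` pairs, so an insert method consuming it would have to unpack the pair. A hedged sketch of getting plain row dicts out of a dataframe (`rel.insert` itself is not shown; the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'subject_id': [1, 2], 'weight': [22.5, 24.0]})

# Each row Series converts to a dict of attribute -> value, which is a
# natural row format to pass to an insert.
rows = [row.to_dict() for _, row in df.iterrows()]
```

Note that `iterrows` upcasts mixed-type rows to a common dtype, so integer columns may come back as floats; `itertuples` avoids that.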