Closed eywalker closed 9 years ago
please include examples of query results and how they can be used.
I think datajoint should interface nicely with pandas but not depend on it excessively.
Agreed, although I don't know what heavy dependence would mean. If we use pandas, datajoint will depend on it. However, if you mean we shouldn't do stuff with pandas, which mysql or datajoint can do just as well, I agree.
Agreed. DataJoint's internal workings will not depend on pandas, but the results of queries should be easily converted to pandas for further transformations. Insert commands may accept pandas data structures as well.
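To make the "easily converted" part concrete, here is a minimal sketch of turning a fetched result into a DataFrame. It assumes the fetch returns a numpy structured (record) array; the `fetched` array and its field names are made up for illustration and do not come from any real schema.

```python
import numpy as np
import pandas as pd

# Stand-in for the record array a fetch would return (hypothetical fields).
fetched = np.array(
    [(1, "mouse", 21.5), (2, "rat", 310.0)],
    dtype=[("animal_id", "i4"), ("species", "U10"), ("weight", "f8")],
)

# Field names of the structured dtype become the DataFrame columns.
df = pd.DataFrame.from_records(fetched)

# Further transformations then happen on the pandas side.
heavy = df[df["weight"] > 100]
```

Nothing here requires DataJoint itself to import pandas; the conversion happens entirely in user code.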
It would make sense to implement `pandas` related features defensively. This would mean checking whether the `pandas` package actually exists or not (probably within `__init__.py`) and setting a `datajoint` package wide flag. In `insert`, check if the passed in data structure is `pandas` data (e.g. `DataFrame` or `Series`) and handle accordingly. Unfortunately `pandas` `DataFrame` iteration doesn't work the same way as `numpy` record array, so we would have to make explicit checks.
It does not appear that pandas is strictly necessary and the interface is straightforward. I think we should just document how to convert fetched numpy arrays into pandas dataframes.
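The defensive approach described above could look roughly like the sketch below. The flag name `HAS_PANDAS` and the helper `rows_for_insert` are hypothetical names chosen for this example, not part of DataJoint. Treating a `Series` as a single row is also an assumption.

```python
import numpy as np

# Optional dependency: record whether pandas is importable.
try:
    import pandas as pd
    HAS_PANDAS = True
except ImportError:
    pd = None
    HAS_PANDAS = False


def rows_for_insert(data):
    """Normalize supported inputs to a list of plain tuples (hypothetical helper)."""
    if HAS_PANDAS and isinstance(data, pd.DataFrame):
        # Iterating a DataFrame directly yields column labels, so ask for rows.
        return list(data.itertuples(index=False, name=None))
    if HAS_PANDAS and isinstance(data, pd.Series):
        # Assumption: a Series represents one row.
        return [tuple(data.tolist())]
    if isinstance(data, np.ndarray) and data.dtype.names:
        # A structured array converts to a list of tuples, one per record.
        return data.tolist()
    return [tuple(row) for row in data]
```

The explicit `isinstance` checks are needed because iterating a `DataFrame` walks over column labels, while iterating a record array walks over rows.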
I closed this because I think these approaches need to be addressed individually. For example, if we need to insert a pandas dataframe, a new issue should be submitted with the specific problem. We will not provide broad integration with pandas as suggested by the title of this issue.
Inserting a pandas dataframe is easy: `rel.insert(df.iterrows())`. I think we are good with pandas. It is easy to insert and fetch data from and into dataframes. Then it makes more sense to not depend on an additional library.
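For anyone preparing a dataframe for insertion, this sketch shows what the different row views of a DataFrame actually yield. It only inspects shapes and does not call DataJoint; whether `insert` accepts each form depends on the DataJoint version, so treat that part as an open assumption. The column names are made up.

```python
import pandas as pd

df = pd.DataFrame({"animal_id": [1, 2], "species": ["mouse", "rat"]})

# iterrows() yields (index, Series) pairs rather than bare rows.
first_index, first_row = next(df.iterrows())

# One dict per row, keyed by column name.
row_dicts = df.to_dict("records")

# A numpy record array with one field per column, index dropped.
rec = df.to_records(index=False)
```

The record array from `to_records(index=False)` has the same general shape as a fetched result, which makes it a natural candidate for round-tripping.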
Some handling of data requires us to perform operations like `project` and `join` on the fetched data structure with a data structure passed in by the user. It appears that our current go-to data structure is the `numpy record array`, with no such methods available (at least by default). On the Matlab side, `join` is also not available by default but is provided by `datajoint` as `dj.struct.join`. We could do this for `record array`, but the `pandas` package already provides a `DataFrame` object with all such methods implemented. I believe that `pandas` is a pretty standard data analysis package in Python numerical computing, so maybe it wouldn't be a bad idea for us to use `pandas` and its data structures (`DataFrame` and perhaps `Series`) directly in our implementation.
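As a rough illustration of the `join` and `project` operations mentioned above, here is a sketch using pandas on two record arrays. Both arrays and their field names are invented for the example. `merge` and column selection stand in for join and project, and are not DataJoint's own operators.

```python
import numpy as np
import pandas as pd

# Hypothetical fetched result and user-supplied data sharing a key field.
fetched = np.array(
    [(1, "mouse"), (2, "rat")],
    dtype=[("animal_id", "i4"), ("species", "U10")],
)
user = np.array(
    [(1, 21.5), (2, 310.0), (3, 18.0)],
    dtype=[("animal_id", "i4"), ("weight", "f8")],
)

left = pd.DataFrame.from_records(fetched)
right = pd.DataFrame.from_records(user)

# Join on the shared key; rows without a match on both sides are dropped.
joined = left.merge(right, on="animal_id", how="inner")

# Project onto a subset of columns.
projected = joined[["animal_id", "weight"]]

# Convert back to a record array if the rest of the code expects one.
back = projected.to_records(index=False)
```

The round trip through `from_records` and `to_records` keeps pandas at the edges, so code that expects record arrays does not need to change.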