BlueBrain / libsonata

A python and C++ interface to the SONATA format
https://libsonata.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

{Edge,Node}Population should have a .to_pandas method. #140

Open matz-e opened 3 years ago

matz-e commented 3 years ago

See title. For better usability, SONATA™ should be able to return a subset of a population as a Pandas dataframe for easier manipulation. Ideal usage from my side:

import libsonata as so
pop = so.EdgeStorage("foo.h").open_population("bar")
df = pop.to_pandas(so.Selection([(123, 666)]))
stuff = df[(df.source_node_id > 313) & (df.axonal_delay < 3)]

(paraphrasing a bit)

mgeplf commented 3 years ago

This sort of functionality is part of SNAP. I'd prefer to avoid having pandas as a requirement of python libsonata, because it's a heavy dependency.

matz-e commented 3 years ago

Sure, it is a heavy dependency, but we already depend on numpy, which itself is heavy:

Input spec
--------------------------------
 -   py-pandas

Concretized
--------------------------------
[+]  py-pandas@1.1.4%gcc@9.3.0 arch=linux-rhel7-x86_64
[^]      ^py-bottleneck@1.2.1%gcc@9.3.0 arch=linux-rhel7-x86_64
[^]          ^py-numpy@1.19.4%gcc@9.3.0+blas+lapack arch=linux-rhel7-x86_64
[^]              ^intel-mkl@2018.3.222%gcc@9.3.0~ilp64+shared threads=none arch=linux-rhel7-x86_64
[^]              ^py-cython@0.29.21%gcc@9.3.0 arch=linux-rhel7-x86_64
[^]                  ^py-setuptools@50.3.2%gcc@9.3.0 arch=linux-rhel7-x86_64
[^]                      ^python@3.8.3%gcc@9.3.0+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87 arch=linux-rhel7-x86_64
[^]      ^py-numexpr@2.7.0%gcc@9.3.0 arch=linux-rhel7-x86_64
[^]      ^py-python-dateutil@2.8.0%gcc@9.3.0 arch=linux-rhel7-x86_64
[^]          ^py-setuptools-scm@4.1.2%gcc@9.3.0~toml arch=linux-rhel7-x86_64
[^]          ^py-six@1.14.0%gcc@9.3.0 arch=linux-rhel7-x86_64
[^]      ^py-pytz@2020.1%gcc@9.3.0 arch=linux-rhel7-x86_64

There is not much more in the dependency tree that isn't numpy.

Seems like the counter-argument is to depend on something that is heavier and pulls in a bunch of morphology dependencies. To me, it seems more like the API augmentations of SNAP should be migrated here…

mgeplf commented 3 years ago

numexpr/dateutil/pytz/etc. are quite a bit more than just numpy (spack concretization is deceptive: pip install numpy only installs numpy; pandas installs more).
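One way to check this claim on a given machine is to ask Python for the requirements each installed distribution declares. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `declared_requirements` is made up for illustration, and the output depends on which versions are installed.

```python
from importlib import metadata


def declared_requirements(package):
    """Return the requirement strings declared by an installed distribution.

    Returns an empty list if the distribution declares none, and None if
    the distribution is not installed at all.
    """
    try:
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return None


# Compare what the two packages under discussion declare.
for name in ("numpy", "pandas"):
    print(name, declared_requirements(name))
```

Note that this lists only direct, declared requirements (including optional extras); it does not resolve the full transitive tree the way spack or pip would.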

libsonata is supposed to be very low-level, with very few dependencies; the productivity stuff goes in SNAP.

matz-e commented 3 years ago

I disagree: compared to numpy, these additional dependencies don't seem all that heavy. Having to work with SNAP instead seems a little like saying we should use Qt for comfortable XML reading in C++.

mgeplf commented 3 years ago

Put another way, numpy is a required dependency in that it's the compact way to return numeric data in python. It would be hard/impossible to not use numpy, which is why it fits with the minimalist purpose of the library. The improvement you're describing is an ergonomic/convenience one, which should be handled by higher-level libraries (i.e. SNAP).

The idea is that this is safe to use by anything (e.g. neurodamus-py), with the minimal set of requirements.

What is your use case?

matz-e commented 3 years ago

My use case is to bulk load SONATA into Pandas to pass through to Spark. If I look into a file manually, I would also use this to compare between SONATA, Parquet, and binary data… so having some .to_df that returns something with columns ['source_node_id', 'target_node_id', 'delay', 'conductivity'…] would be very nice and still pretty basic.
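A minimal sketch of how such a helper could live outside libsonata, assuming the edge population exposes `attribute_names`, `get_attribute(name, selection)`, `source_nodes(selection)` and `target_nodes(selection)` as in the Python bindings. The function name `edges_to_columns` is hypothetical and not part of libsonata. It returns plain column data, so pandas is only needed by the caller.

```python
def edges_to_columns(population, selection):
    """Collect edge data for `selection` as a dict of column name -> values.

    `population` is assumed to behave like a libsonata EdgePopulation.
    This helper is hypothetical; it is not part of libsonata.
    """
    columns = {
        "source_node_id": population.source_nodes(selection),
        "target_node_id": population.target_nodes(selection),
    }
    # Sort attribute names so the column order is stable between runs.
    for name in sorted(population.attribute_names):
        columns[name] = population.get_attribute(name, selection)
    return columns


# Usage (requires libsonata and pandas; the file and population names are
# the placeholders from the issue text):
#   import libsonata
#   import pandas as pd
#   pop = libsonata.EdgeStorage("foo.h").open_population("bar")
#   df = pd.DataFrame(edges_to_columns(pop, libsonata.Selection([(123, 666)])))
```

Keeping the conversion to a DataFrame at the call site means the helper itself adds no dependency beyond what libsonata already returns.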

mgeplf commented 3 years ago

Since you have to implement it for your use case, we should be able to take a look at it, and then make a decision.

alkino commented 2 years ago

For example, the DataFrame in report_reader.hpp is ready to be loaded into pandas. Is that a solution for you, @matz-e?

There is no dependency on pandas inside libsonata, but the output data is pandas-oriented.
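As a rough illustration of "pandas-oriented without depending on pandas", the sketch below maps a report frame to the keyword arguments that `pandas.DataFrame` accepts. It assumes the frame exposes `ids`, `times` and `data`, with `data` laid out as one row per time step and one column per id; the helper name `report_frame_kwargs` and the stand-in values are hypothetical.

```python
from types import SimpleNamespace


def report_frame_kwargs(frame):
    """Map a report frame to keyword arguments for pandas.DataFrame.

    Assumes `frame.data` has one row per entry in `frame.times` and one
    column per entry in `frame.ids`. Hypothetical helper, not libsonata API.
    """
    return {"data": frame.data, "index": frame.times, "columns": frame.ids}


# Stand-in for a frame returned by a report reader (values are made up).
frame = SimpleNamespace(
    ids=[10, 11],
    times=[0.0, 0.1, 0.2],
    data=[[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]],
)
kwargs = report_frame_kwargs(frame)

# With pandas installed, the caller would then do:
#   import pandas as pd
#   df = pd.DataFrame(**kwargs)
```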

matz-e commented 2 years ago

Can I add columns to it?