Open matz-e opened 3 years ago
This sort of functionality is part of SNAP. I'd prefer to avoid having pandas as a requirement of python libsonata, because it's a heavy dependency.
Sure, it is a heavy dependency, but we already depend on numpy, which itself is heavy:
Input spec
--------------------------------
- py-pandas
Concretized
--------------------------------
[+] py-pandas@1.1.4%gcc@9.3.0 arch=linux-rhel7-x86_64
[^] ^py-bottleneck@1.2.1%gcc@9.3.0 arch=linux-rhel7-x86_64
[^] ^py-numpy@1.19.4%gcc@9.3.0+blas+lapack arch=linux-rhel7-x86_64
[^] ^intel-mkl@2018.3.222%gcc@9.3.0~ilp64+shared threads=none arch=linux-rhel7-x86_64
[^] ^py-cython@0.29.21%gcc@9.3.0 arch=linux-rhel7-x86_64
[^] ^py-setuptools@50.3.2%gcc@9.3.0 arch=linux-rhel7-x86_64
[^] ^python@3.8.3%gcc@9.3.0+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87 arch=linux-rhel7-x86_64
[^] ^py-numexpr@2.7.0%gcc@9.3.0 arch=linux-rhel7-x86_64
[^] ^py-python-dateutil@2.8.0%gcc@9.3.0 arch=linux-rhel7-x86_64
[^] ^py-setuptools-scm@4.1.2%gcc@9.3.0~toml arch=linux-rhel7-x86_64
[^] ^py-six@1.14.0%gcc@9.3.0 arch=linux-rhel7-x86_64
[^] ^py-pytz@2020.1%gcc@9.3.0 arch=linux-rhel7-x86_64
There is not that much more in the dependency tree that isn't numpy…
Seems like the counter-argument is to depend on something that is heavier, and pulls in a bunch of morphology dependencies. To me, it seems more like the API augmentations of snap should be migrated here…
numexpr/dateutil/pytz/etc. are quite a bit more than just numpy (spack concretization is deceptive: pip install numpy only installs numpy; pandas installs more).
libsonata is supposed to be very low-level, with very few dependencies; the productivity stuff goes in SNAP.
I disagree: compared to numpy, these additional dependencies don't seem all that heavy. Having to work with SNAP instead seems a little like saying we should use Qt for comfortable XML reading in C++.
Put another way, numpy is a required dependency in that it's the compact way to return numeric data in Python. It would be hard or impossible not to use numpy, which is why it fits with the minimalist purpose of the library. The improvement you're describing is an ergonomic/convenience one, which should be handled by higher-level libraries (i.e. SNAP).
The idea is that this is safe to use by anything (e.g. neurodamus-py), with the minimal set of requirements.
What is your use case?
My use case is to bulk load SONATA into Pandas to pass through to Spark. If I look into a file manually, I would also use this to compare between SONATA, Parquet, and binary data… so having some .to_df that returns something with columns ['source_node_id', 'target_node_id', 'delay', 'conductivity'…] would be very nice and still pretty basic.
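A minimal sketch of what such a helper could look like on the caller's side. It assumes an edge-population object exposing `select_all()`, `attribute_names`, `get_attribute(name, selection)`, `source_nodes(selection)` and `target_nodes(selection)`; these names are taken on assumption from libsonata's Python bindings and should be checked against the installed version. `edge_columns` and `to_df` are hypothetical helpers, not part of libsonata. pandas is imported lazily, so only `to_df` actually needs it.

```python
def edge_columns(population):
    """Collect edge data as a column-name -> values mapping.

    `population` is assumed to behave like a libsonata edge population;
    the method names used here are assumptions, not a confirmed API.
    """
    selection = population.select_all()
    columns = {
        "source_node_id": population.source_nodes(selection),
        "target_node_id": population.target_nodes(selection),
    }
    # Append every remaining attribute (e.g. delay) in a stable order.
    for name in sorted(population.attribute_names):
        columns[name] = population.get_attribute(name, selection)
    return columns


def to_df(population):
    """Hypothetical convenience wrapper: build a pandas DataFrame."""
    import pandas as pd  # lazy import: edge_columns() works without pandas

    return pd.DataFrame(edge_columns(population))
```

Keeping the column collection separate from the DataFrame construction means the same mapping could also be handed to other consumers without pulling pandas in.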
Since you have to implement it for your use case, we should be able to take a look at it, and then make a decision.
For example, the DataFrame in report_reader.hpp is ready to load into pandas. Would that be a solution for you, @matz-e?
There is no dependency on pandas inside libsonata, but the output data is pandas-oriented.
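A sketch of that hand-off, assuming the frame returned by the report reader exposes `data` (rows are time steps, columns are elements), `ids` (one label per column) and `times` (one label per row). Those attribute names are an assumption here and should be verified against the bindings; `frame_to_pandas_kwargs` and `frame_to_df` are hypothetical helpers.

```python
def frame_to_pandas_kwargs(frame):
    """Map a report frame onto pandas.DataFrame keyword arguments.

    Assumes `frame.data` is 2-D (time steps x elements), `frame.ids`
    labels the columns and `frame.times` labels the rows.
    """
    return {"data": frame.data, "columns": frame.ids, "index": frame.times}


def frame_to_df(frame):
    """Build the DataFrame on the caller's side; the reader stays pandas-free."""
    import pandas as pd  # lazy import: only this function needs pandas

    return pd.DataFrame(**frame_to_pandas_kwargs(frame))
```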
Can I add columns to it?
See title. For better usability, SONATA™ should provide functionality to provide a subset of the populations as Pandas dataframes for easier manipulation. Ideal usage from my side (paraphrasing a bit):