Closed bnaul closed 7 years ago
As far as I know, joblib uses pickle to serialize. That is not portable. We could use npy or npz, with a thin wrapper for unpacking back into a pandas array.
Would it make sense to provide a utility function to extract all channels of a certain type from a DataFrame?
Yes, joblib does use pickle under the hood.
On Wednesday, March 22, 2017, Stefan van der Walt notifications@github.com wrote:
> Would it make sense to provide a utility function to extract all channels of a certain type from a DataFrame?
What would the wrapper entail? .npy works for the actual numeric data but then do we need a separate file that stores the metadata (row/column names)?
@stefanv re: extracting different channels: yeah, this is hard if we use tuples as column names, i.e.
```
       (mean, 0)  (mean, 1)     meta1     meta2
file0        2.0   0.180734  0.548427  0.187956
file1        2.5   0.196072  1.171789  0.174803
file2        3.0   0.558093  0.265003  0.109842
file3        2.6   0.280734  0.448427  0.287956
file4        2.5   0.496072  1.371789  0.274803
file5        3.0   0.258093  0.465003  0.409842
```
If we use a MultiIndex instead, this is easy:
```
feature  mean                meta1     meta2
channel     0         1
file0     2.0  0.180734  0.548427  0.187956
file1     2.5  0.196072  1.171789  0.174803
file2     3.0  0.558093  0.265003  0.109842
file3     2.6  0.280734  0.448427  0.287956
file4     2.5  0.496072  1.371789  0.274803
file5     3.0  0.258093  0.465003  0.409842
```
but then the columns that don't have a channel associated with them are harder to access than they should be. I'm on the fence about this; what do y'all think?
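For concreteness, here is a small pandas sketch of the MultiIndex variant, using the file names and values from the example above (the empty-string channel label for metadata columns is just one possible convention, not anything in the PR):

```python
import pandas as pd

# Columns carry a (feature, channel) MultiIndex; metadata columns get an
# empty channel label since they have no channel of their own.
columns = pd.MultiIndex.from_tuples(
    [("mean", 0), ("mean", 1), ("meta1", ""), ("meta2", "")],
    names=["feature", "channel"],
)
df = pd.DataFrame(
    [[2.0, 0.180734, 0.548427, 0.187956],
     [2.5, 0.196072, 1.171789, 0.174803]],
    index=["file0", "file1"],
    columns=columns,
)

# Extracting every channel of one feature is a single lookup...
means = df["mean"]          # DataFrame with channel columns 0 and 1

# ...but a channel-less column needs its empty channel level spelled out:
meta1 = df[("meta1", "")]   # Series
```

Treating every column as channel 0 instead would avoid the asymmetry, at the cost of an extra index level on metadata lookups.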
npz can store multiple arrays in one file (i.e., list of column names + data)
It may be preferable to have a consistent format for accessing columns with and without channel data, i.e., treat single-channel features as multi-channel with channel index 0.
Hm npz would be nice, maybe I'll write a little wrapper around that.
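A minimal sketch of what such a wrapper could look like (function names are made up; it assumes plain string row/column labels, so tuple columns would need flattening first, and all labels round-trip as strings):

```python
import numpy as np
import pandas as pd

def save_frame(path, df):
    """Store a DataFrame as .npz: numeric block plus label arrays, no pickle."""
    np.savez(path,
             values=df.to_numpy(),
             # Cast labels to fixed-width unicode so no object arrays
             # (which would require pickle) end up in the archive.
             index=np.asarray(df.index, dtype=str),
             columns=np.asarray(df.columns, dtype=str))

def load_frame(path):
    """Rebuild the DataFrame; allow_pickle=False guarantees portability."""
    with np.load(path, allow_pickle=False) as npz:
        return pd.DataFrame(npz["values"],
                            index=npz["index"],
                            columns=npz["columns"])
```

Since everything is cast to `str`, non-string labels (e.g. the `(feature, channel)` tuples) would not survive a round trip without an extra encoding step.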
I can see both (MultiIndex or tuples) being annoying; perhaps `featurize` could take a flag that chooses which format is used? It's also not hard for the user to convert manually to whichever they prefer, but supporting both formats explicitly might be nicer.
Probably best for us to choose one way to do it (even if sub-optimal in all use cases), otherwise confusion is bound to arise.
Hello @bnaul! Thanks for updating the PR.
`cesium/data_management.py`, following are the PEP8 issues:
- Line 92:31: E127 continuation line over-indented for visual indent

`cesium/featurize.py`:
- Line 63:81: E501 line too long (82 > 80 characters)
- Line 105:81: E501 line too long (81 > 80 characters)

`cesium/tests/fixtures.py`:
- Line 44:81: E501 line too long (92 > 80 characters)

`cesium/tests/test_data_management.py`:
- Line 21:5: E731 do not assign a lambda expression, use a def
- Line 59:9: E128 continuation line under-indented for visual indent
- Line 65:13: E128 continuation line under-indented for visual indent
- Line 68:9: E128 continuation line under-indented for visual indent

`cesium/tests/test_featurize.py`:
- Line 173:1: E303 too many blank lines (3)

`cesium/time_series.py`:
- Line 236:81: E501 line too long (82 > 80 characters)

`examples/plot_EEG_Example.py`:
- Line 34:46: E261 at least two spaces before inline comment
- Line 34:47: E262 inline comment should start with '# '
- Line 34:81: E501 line too long (87 > 80 characters)
- Line 82:81: E501 line too long (86 > 80 characters)
- Line 149:69: E226 missing whitespace around arithmetic operator
- Line 164:81: E501 line too long (87 > 80 characters)
- Line 201:11: E126 continuation line over-indented for hanging indent
- Line 204:11: E126 continuation line over-indented for hanging indent
- Line 207:11: E126 continuation line over-indented for hanging indent
Merging #247 into master will decrease coverage by 0.93%. The diff coverage is 93.53%.
```diff
@@            Coverage Diff            @@
##           master     #247     +/-  ##
==========================================
- Coverage    92.3%   91.37%   -0.94%
==========================================
  Files          45       37       -8
  Lines        2197     1831     -366
  Branches      276      229      -47
==========================================
- Hits         2028     1673     -355
+ Misses        113      111       -2
+ Partials       56       47       -9
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| cesium/tests/test_util.py | 100% <ø> (ø) | :arrow_up: |
| cesium/custom_exceptions.py | 60% <ø> (ø) | :arrow_up: |
| cesium/version.py | 100% <100%> (ø) | :arrow_up: |
| cesium/tests/fixtures.py | 97.56% <100%> (-0.22%) | :arrow_down: |
| cesium/tests/test_featurize.py | 100% <100%> (+0.64%) | :arrow_up: |
| cesium/tests/test_time_series.py | 100% <100%> (ø) | :arrow_up: |
| cesium/tests/test_data_management.py | 100% <100%> (ø) | :arrow_up: |
| cesium/util.py | 85.71% <100%> (-1.52%) | :arrow_down: |
| cesium/time_series.py | 80% <84.84%> (-3.34%) | :arrow_down: |
| cesium/featurize.py | 90.19% <85.71%> (-4.71%) | :arrow_down: |
| ... and 2 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6480db...141fd6e. Read the comment docs.
@bnaul I now configured the pep complainer to use 80, not 79 :)
All green (except `cesium_web`)! Ready for y'all to take a proper look if you like 🤓
This looks great to me! But: please note that no-one has yet fixed the dependent build server, so you definitely cannot trust that result :)
(To clarify, it currently always builds against the latest installable cesium library.)
Great how much simpler this makes things! :+1: from me.
- Features are now stored in a `pd.DataFrame` instead of an `xr.Dataset`; for multi-channel features, column labels are a `(feature, channel)` tuple.
- Removed the `build_model`/`predict` wrapper functionality.
- Haven't changed the `featurize` API yet.

Some open questions:
- Should we just use `sklearn` directly for model building/prediction?
- How much of the `Featureset` class do we still want to include?
- Is `.impute` redundant w/ `sklearn`'s imputation functionality?
- Should we use `joblib` instead of `netCDF4`? We already use it for models on the front end.
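On the `.impute` question: for reference, a sketch of the sklearn imputation it may overlap with (`SimpleImputer` is the name in scikit-learn ≥ 0.20; older versions had `preprocessing.Imputer` instead):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column-wise fill of missing values, which is roughly what a
# Featureset-level .impute method would otherwise reimplement.
X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)  # the NaN becomes the column mean, 4.0
```

One difference to weigh: the sklearn transformer works on bare arrays and drops the DataFrame's row/column labels, so a thin wrapper might still be worth keeping.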