Closed bnaul closed 7 years ago
As far as I know, joblib uses pickle to serialize. That is not portable. We could use npy or npz, with a thin wrapper for unpacking back into a pandas array.
Would it make sense to provide a utility function to extract all channels of a certain type from a DataFrame?
Yes, joblib does use pickle under the hood.
On Wednesday, March 22, 2017, Stefan van der Walt notifications@github.com wrote:
> Would it make sense to provide a utility function to extract all channels of a certain type from a DataFrame?
What would the wrapper entail? .npy works for the actual numeric data but then do we need a separate file that stores the metadata (row/column names)?
@stefanv re: extracting different channels: yeah, this is hard if we use tuples as column names, i.e.
```
       (mean, 0)  (mean, 1)     meta1     meta2
file0        2.0   0.180734  0.548427  0.187956
file1        2.5   0.196072  1.171789  0.174803
file2        3.0   0.558093  0.265003  0.109842
file3        2.6   0.280734  0.448427  0.287956
file4        2.5   0.496072  1.371789  0.274803
file5        3.0   0.258093  0.465003  0.409842
```
If we use a MultiIndex instead, this is easy:
```
feature  mean                meta1     meta2
channel     0         1
file0     2.0  0.180734  0.548427  0.187956
file1     2.5  0.196072  1.171789  0.174803
file2     3.0  0.558093  0.265003  0.109842
file3     2.6  0.280734  0.448427  0.287956
file4     2.5  0.496072  1.371789  0.274803
file5     3.0  0.258093  0.465003  0.409842
```
but then the columns that don't have a channel associated with them are harder to access than they should be. I'm on the fence about this; what do y'all think?
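For concreteness, here is a small pandas sketch of the MultiIndex variant, using the file names and values from the example above (the empty-string channel label for metadata columns is just one possible convention, not anything in the PR):

```python
import pandas as pd

# Columns carry a (feature, channel) MultiIndex; metadata columns get an
# empty channel label since they have no channel of their own.
columns = pd.MultiIndex.from_tuples(
    [("mean", 0), ("mean", 1), ("meta1", ""), ("meta2", "")],
    names=["feature", "channel"],
)
df = pd.DataFrame(
    [[2.0, 0.180734, 0.548427, 0.187956],
     [2.5, 0.196072, 1.171789, 0.174803]],
    index=["file0", "file1"],
    columns=columns,
)

# Extracting every channel of one feature is a single lookup...
means = df["mean"]          # DataFrame with channel columns 0 and 1

# ...but a channel-less column needs its empty channel level spelled out:
meta1 = df[("meta1", "")]   # Series
```

Treating every column as channel 0 instead would avoid the asymmetry, at the cost of an extra index level on metadata lookups.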
npz can store multiple arrays in one file (i.e., list of column names + data)
It may be preferable to have a consistent format for accessing columns with and without channel data, i.e., treat single-channel features as multi-channel with channel index 0.
Hm npz would be nice, maybe I'll write a little wrapper around that.
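A minimal sketch of what such a wrapper could look like (function names are made up; it assumes plain string row/column labels, so tuple columns would need flattening first, and all labels round-trip as strings):

```python
import numpy as np
import pandas as pd

def save_frame(path, df):
    """Store a DataFrame as .npz: numeric block plus label arrays, no pickle."""
    np.savez(path,
             values=df.to_numpy(),
             # Cast labels to fixed-width unicode so no object arrays
             # (which would require pickle) end up in the archive.
             index=np.asarray(df.index, dtype=str),
             columns=np.asarray(df.columns, dtype=str))

def load_frame(path):
    """Rebuild the DataFrame; allow_pickle=False guarantees portability."""
    with np.load(path, allow_pickle=False) as npz:
        return pd.DataFrame(npz["values"],
                            index=npz["index"],
                            columns=npz["columns"])
```

Since everything is cast to `str`, non-string labels (e.g. the `(feature, channel)` tuples) would not survive a round trip without an extra encoding step.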
I can see both (MultiIndex or tuples) being annoying; perhaps `featurize` could take a flag that chooses which format is used? It's also not hard for the user to convert manually to whichever they prefer, but supporting both formats explicitly might be nicer.
Probably best for us to choose one way to do it (even if sub-optimal in all use cases), otherwise confusion is bound to arise.
Hello @bnaul! Thanks for updating the PR.
`cesium/data_management.py`, following are the PEP8 issues:
- Line 92:31: E127 continuation line over-indented for visual indent

`cesium/featurize.py`:
- Line 63:81: E501 line too long (82 > 80 characters)
- Line 105:81: E501 line too long (81 > 80 characters)

`cesium/tests/fixtures.py`:
- Line 44:81: E501 line too long (92 > 80 characters)

`cesium/tests/test_data_management.py`:
- Line 21:5: E731 do not assign a lambda expression, use a def
- Line 59:9: E128 continuation line under-indented for visual indent
- Line 65:13: E128 continuation line under-indented for visual indent
- Line 68:9: E128 continuation line under-indented for visual indent

`cesium/tests/test_featurize.py`:
- Line 173:1: E303 too many blank lines (3)

`cesium/time_series.py`:
- Line 236:81: E501 line too long (82 > 80 characters)

`examples/plot_EEG_Example.py`:
- Line 34:46: E261 at least two spaces before inline comment
- Line 34:47: E262 inline comment should start with '# '
- Line 34:81: E501 line too long (87 > 80 characters)
- Line 82:81: E501 line too long (86 > 80 characters)
- Line 149:69: E226 missing whitespace around arithmetic operator
- Line 164:81: E501 line too long (87 > 80 characters)
- Line 201:11: E126 continuation line over-indented for hanging indent
- Line 204:11: E126 continuation line over-indented for hanging indent
- Line 207:11: E126 continuation line over-indented for hanging indent
Merging #247 into master will decrease coverage by 0.93%. The diff coverage is 93.53%.
```diff
@@            Coverage Diff            @@
##           master     #247     +/-  ##
==========================================
- Coverage    92.3%   91.37%   -0.94%
==========================================
  Files          45       37       -8
  Lines        2197     1831     -366
  Branches      276      229      -47
==========================================
- Hits         2028     1673     -355
+ Misses        113      111       -2
+ Partials       56       47       -9
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| cesium/tests/test_util.py | 100% <ø> (ø) | :arrow_up: |
| cesium/custom_exceptions.py | 60% <ø> (ø) | :arrow_up: |
| cesium/version.py | 100% <100%> (ø) | :arrow_up: |
| cesium/tests/fixtures.py | 97.56% <100%> (-0.22%) | :arrow_down: |
| cesium/tests/test_featurize.py | 100% <100%> (+0.64%) | :arrow_up: |
| cesium/tests/test_time_series.py | 100% <100%> (ø) | :arrow_up: |
| cesium/tests/test_data_management.py | 100% <100%> (ø) | :arrow_up: |
| cesium/util.py | 85.71% <100%> (-1.52%) | :arrow_down: |
| cesium/time_series.py | 80% <84.84%> (-3.34%) | :arrow_down: |
| cesium/featurize.py | 90.19% <85.71%> (-4.71%) | :arrow_down: |
| ... and 2 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6480db...141fd6e. Read the comment docs.
@bnaul I now configured the pep complainer to use 80, not 79 :)
All green (except `cesium_web`)! Ready for y'all to take a proper look if you like 🤓
This looks great to me! But: please note that no-one has yet fixed the dependent build server, so you definitely cannot trust that result :)
(To clarify, it currently always builds against the latest installable cesium library.)
Great how much simpler this makes things! :+1: from me.
- Features are now stored in a `pd.DataFrame` instead of an `xr.Dataset`; for multi-channel features, column labels are a `(feature, channel)` tuple.
- Removed the `build_model`/`predict` wrapper functionality.
- Haven't changed the `featurize` API yet.

Some open questions:
- Should we just use `sklearn` directly for model building/prediction?
- How much of the `Featureset` class do we still want to include?
- Is `.impute` redundant w/ `sklearn`'s imputation functionality?
- Should we use `joblib` instead of `netCDF4`? We already use it for models on the front end.
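On the `.impute` question: for reference, a sketch of the sklearn imputation it may overlap with (`SimpleImputer` is the name in scikit-learn ≥ 0.20; older versions had `preprocessing.Imputer` instead):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column-wise fill of missing values, which is roughly what a
# Featureset-level .impute method would otherwise reimplement.
X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)  # the NaN becomes the column mean, 4.0
```

One difference to weigh: the sklearn transformer works on bare arrays and drops the DataFrame's row/column labels, so a thin wrapper might still be worth keeping.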