cesium-ml / cesium

Machine Learning Time-Series Platform
Other
670 stars 101 forks source link

API refactor #247

Closed bnaul closed 7 years ago

bnaul commented 7 years ago

Some open questions:

stefanv commented 7 years ago

As far as I know, joblib uses pickles to serialize. This is not transportable. We could use npy or npz, with a thin wrapper for handling unpacking back into a pandas array.

stefanv commented 7 years ago

Would it make sense to provide a utility function to extract all channels of a certain type from a DataFrame?

acrellin commented 7 years ago

Yes, joblib does use pickle under the hood.

On Wednesday, March 22, 2017, Stefan van der Walt notifications@github.com wrote:

Would it make sense to provide a utility function to extract all channels of a certain type from a DataFrame?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cesium-ml/cesium/pull/247#issuecomment-288448220, or mute the thread https://github.com/notifications/unsubscribe-auth/AG5TTW5BkmwSJNZ9lARF6_JxVZhPs6akks5roUYigaJpZM4MkvfB .

bnaul commented 7 years ago

What would the wrapper entail? .npy works for the actual numeric data but then do we need a separate file that stores the metadata (row/column names)?

bnaul commented 7 years ago

@stefanv re: extracting different channels: yeah, this is hard if we use tuples as column names, i.e.

       (mean, 0)  (mean, 1)     meta1     meta2
file0        2.0   0.180734  0.548427  0.187956
file1        2.5   0.196072  1.171789  0.174803
file2        3.0   0.558093  0.265003  0.109842
file3        2.6   0.280734  0.448427  0.287956
file4        2.5   0.496072  1.371789  0.274803
file5        3.0   0.258093  0.465003  0.409842

If we use a multi-index instead this is easy:

feature mean               meta1     meta2
channel    0         1
file0    2.0  0.180734  0.548427  0.187956
file1    2.5  0.196072  1.171789  0.174803
file2    3.0  0.558093  0.265003  0.109842
file3    2.6  0.280734  0.448427  0.287956
file4    2.5  0.496072  1.371789  0.274803
file5    3.0  0.258093  0.465003  0.409842

but then the columns that don't have a channel associated with them are harder to access than they should be. I'm on the fence about this, what do y'all think?

stefanv commented 7 years ago

npz can store multiple arrays in one file (i.e., list of column names + data)

It may be preferable to have a consistent format for accessing columns with/without channel data, i.e. pretend that single channel is multi-channel with index 0

bnaul commented 7 years ago

Hm npz would be nice, maybe I'll write a little wrapper around that.

I can see both (multiindex or tuples) being annoying, perhaps featurize can have a flag that chooses which format is used? It's also not hard for the user to make the conversion manually to whatever they prefer but supporting both formats explicitly might be nicer.

stefanv commented 7 years ago

Probably best for us to choose one way to do it (even if sub-optimal in all use cases), otherwise confusion is bound to arise.

pep8speaks commented 7 years ago

Hello @bnaul! Thanks for updating the PR.

Line 92:31: E127 continuation line over-indented for visual indent

Line 63:81: E501 line too long (82 > 80 characters) Line 105:81: E501 line too long (81 > 80 characters)

Line 44:81: E501 line too long (92 > 80 characters)

Line 21:5: E731 do not assign a lambda expression, use a def Line 59:9: E128 continuation line under-indented for visual indent Line 65:13: E128 continuation line under-indented for visual indent Line 68:9: E128 continuation line under-indented for visual indent

Line 173:1: E303 too many blank lines (3)

Line 236:81: E501 line too long (82 > 80 characters)

Line 34:46: E261 at least two spaces before inline comment Line 34:47: E262 inline comment should start with '# ' Line 34:81: E501 line too long (87 > 80 characters) Line 82:81: E501 line too long (86 > 80 characters) Line 149:69: E226 missing whitespace around arithmetic operator Line 164:81: E501 line too long (87 > 80 characters) Line 201:11: E126 continuation line over-indented for hanging indent Line 204:11: E126 continuation line over-indented for hanging indent Line 207:11: E126 continuation line over-indented for hanging indent

Comment last updated on March 30, 2017 at 18:25 Hours UTC
codecov-io commented 7 years ago

Codecov Report

Merging #247 into master will decrease coverage by 0.93%. The diff coverage is 93.53%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
- Coverage    92.3%   91.37%   -0.94%     
==========================================
  Files          45       37       -8     
  Lines        2197     1831     -366     
  Branches      276      229      -47     
==========================================
- Hits         2028     1673     -355     
+ Misses        113      111       -2     
+ Partials       56       47       -9
Impacted Files Coverage Δ
cesium/tests/test_util.py 100% <ø> (ø) :arrow_up:
cesium/custom_exceptions.py 60% <ø> (ø) :arrow_up:
cesium/version.py 100% <100%> (ø) :arrow_up:
cesium/tests/fixtures.py 97.56% <100%> (-0.22%) :arrow_down:
cesium/tests/test_featurize.py 100% <100%> (+0.64%) :arrow_up:
cesium/tests/test_time_series.py 100% <100%> (ø) :arrow_up:
cesium/tests/test_data_management.py 100% <100%> (ø) :arrow_up:
cesium/util.py 85.71% <100%> (-1.52%) :arrow_down:
cesium/time_series.py 80% <84.84%> (-3.34%) :arrow_down:
cesium/featurize.py 90.19% <85.71%> (-4.71%) :arrow_down:
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update f6480db...141fd6e. Read the comment docs.

stefanv commented 7 years ago

@bnaul I now configured the pep complainer to use 80, not 79 :)

bnaul commented 7 years ago

All green (except cesium_web)! Ready for y'all to take a proper look if you like 🤓

stefanv commented 7 years ago

This looks great to me! But: please note that no-one has yet fixed the dependent build server, so you definitely cannot trust that result :)

(To clarify, it currently always builds against the latest installable cesium library.)

acrellin commented 7 years ago

Great how much simpler this makes things! :+1: from me.