Support reading data from DataStreams

JuliaStats / TimeSeries.jl

Time series toolkit for Julia

Support reading data from DataStreams #292

Open milktrader opened 7 years ago

femtotrader commented 7 years ago

This enhancement request (supporting DataStreams.jl) was initially submitted by @nalimilan: https://github.com/JuliaStats/TimeSeries.jl/issues/290#issuecomment-254007499

Pinging @quinnj: maybe you can help with this?

Code to convert a DataFrame to a TimeArray and back can be found here: https://github.com/femtotrader/TimeSeriesIO.jl/blob/master/src/TimeSeriesIO.jl

It could help to build a TimeArray.Sink.

A TimeArray.Source (to convert from TimeArray to DataStream) would also be a nice feature to have.

If @milktrader doesn't want to add additional dependencies to TimeSeries.jl, this code can be part of TimeSeriesIO.jl.

milktrader commented 7 years ago

Yes, I think this important functionality belongs in a separate package. Some other possible names ...

TimeSeriesTools (this might be too general)
TimeSeriesStreams

TimeSeriesIO is actually not bad for a package name either.

nalimilan commented 7 years ago

The point of the DataStreams framework is that you wouldn't have to depend on DataFrames, just on DataStreams.jl, and you'd get support for streaming from/to any source, like DataFrame, CSV, databases, etc.

milktrader commented 7 years ago

Why not have DataStreams.jl support TimeSeries, like it supports DataFrames?

DataFrames does not support DataStreams.jl

femtotrader commented 7 years ago

I still have some difficulty understanding the functional differences between DataStreams.jl and IterableTables.jl.

Maybe @davidanthoff and @quinnj can help clarify.

davidanthoff commented 7 years ago

In terms of goals the two packages are super similar. IterableTables.jl emerged out of the design of Query.jl, where the design of IterableTables.jl (namely iterators of NamedTuples.jl) forms the core of the most common backend.

In terms of design, the main difference currently is that IterableTables.jl only has one way of streaming data, namely row by row (where each row is a named tuple). DataStreams.jl offers two different options: you can either stream field by field or column by column.
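To make the three access patterns concrete, here is a minimal, language-agnostic sketch in plain Python. It is not the API of IterableTables.jl or DataStreams.jl; the table, the `Row` type, and the `stream_*` function names are all hypothetical and exist only to illustrate the shapes of the data that a consumer receives in each style.

```python
from typing import Any, Dict, Iterator, List, NamedTuple

# Hypothetical toy table stored as columns; not taken from either package.
TABLE: Dict[str, List[Any]] = {
    "timestamp": ["2016-01-01", "2016-01-02", "2016-01-03"],
    "close": [101.5, 102.0, 100.8],
}


class Row(NamedTuple):
    timestamp: str
    close: float


def stream_rows(table: Dict[str, List[Any]]) -> Iterator[Row]:
    """Row by row: each item is one named tuple."""
    for values in zip(table["timestamp"], table["close"]):
        yield Row(*values)


def stream_field(table: Dict[str, List[Any]], row: int, col: str) -> Any:
    """Field by field: the consumer asks for a single cell at a time."""
    return table[col][row]


def stream_column(table: Dict[str, List[Any]], col: str) -> List[Any]:
    """Column by column: the consumer receives a whole column at once."""
    return list(table[col])


rows = list(stream_rows(TABLE))
fields = [stream_field(TABLE, r, c) for r in range(3) for c in TABLE]
columns = {c: stream_column(TABLE, c) for c in TABLE}
```

In this sketch the row style hands over one complete record per step, the field style hands over one value per call, and the column style hands over one full vector per call.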

There are more sinks and sources for IterableTables.jl currently (more than a dozen as of right now). In particular, if you implement the IterableTables.jl interface, you get automatic interop with the DataStreams.jl sources and sinks via their field-based streaming (but not with the column-by-column streaming). One other difference is in the details of the integration with Query.jl: while you can query a DataStreams.jl source, you should generally get a smoother experience if you query an IterableTables.jl source, because there are fewer wrapper steps involved. The same holds if you materialize a query into some tabular structure.
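As a rough illustration of how a row iterator can be exposed through a field-based interface, here is a small Python sketch. It is not how either Julia package is implemented; `FieldAdapter`, `get_field`, and `row_source` are hypothetical names. One plausible reading (an assumption, not something stated in this thread) is that field access fits a row iterator well because only the current row has to be held, whereas serving whole columns would require buffering every row first.

```python
from collections import namedtuple
from typing import Any, Iterable, Iterator

Row = namedtuple("Row", ["timestamp", "close"])


def row_source() -> Iterator[Row]:
    """Hypothetical source that yields one named tuple per row."""
    yield Row("2016-01-01", 101.5)
    yield Row("2016-01-02", 102.0)


class FieldAdapter:
    """Hypothetical wrapper serving single fields from a row iterator.

    Only the current row is kept in memory, so access is forward-only.
    """

    def __init__(self, rows: Iterable[Any]) -> None:
        self._rows = iter(rows)
        self._current = None
        self._index = -1

    def get_field(self, row: int, col: str) -> Any:
        if row < self._index:
            raise ValueError("forward-only access: earlier rows are gone")
        # Advance the underlying iterator until the requested row is current.
        while self._index < row:
            self._current = next(self._rows)
            self._index += 1
        return getattr(self._current, col)


adapter = FieldAdapter(row_source())
first_close = adapter.get_field(0, "close")
second_timestamp = adapter.get_field(1, "timestamp")
```

Asking the adapter for row 0 again after reading row 1 raises `ValueError`, which shows the forward-only trade-off of this particular design.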

There are also some user API differences that should be fairly obvious if you just look at the examples of how to use the two packages.

I don't think we have ever done a performance comparison between the two approaches.