JuliaData / JuliaDB.jl

Parallel analytical database in pure Julia
http://juliadb.org/
Other
766 stars 61 forks source link

Unable to Index Out-of-Core Table #231

Open usmcamp0811 opened 5 years ago

usmcamp0811 commented 5 years ago

I have ~8GB of BCI data that I just fed into JuliaDB and now I can't seem to get a look at it. I'm still very new to Julia so hopefully I'm not doing anything too stupid.

Julia Version 1.0.2 (2018-11-08)

Load CSV's into JuliaDB

using JuliaDB

loadtable(glob("*.csv", "/home/beboobop/Datasets/bci_data/train/csv/"), output="/home/beboobop/Datasets/bci_data/train/juliadb", chunks=10;);

Load JuliaDB data

tbl = load("/home/beboobop/Datasets/bci_data/train/juliadb");

Now I try to index the data to just look at it.

tbl[2]

Returns

MethodError: no method matching getindex(::JuliaDB.DNextTable{NamedTuple{(:Time, :Fp1, :Fp2, :AF7, :AF3, :AF4, :AF8, :F7, :F5, :F3, :F1, :Fz, :F2, :F4, :F6, :F8, :FT7, :FC5, :FC3, :FC1, :FCz, :FC2, :FC4, :FC6, :FT8, :T7, :C5, :C3, :C1, :Cz, :C2, :C4, :C6, :T8, :TP7, :CP5, :CP3, :CP1, :CPz, :CP2, :CP4, :CP6, :TP8, :P7, :P5, :P3, :P1, :Pz, :P2, :P4, :P6, :P8, :PO7, :POz, :P08, :O1, :O2, :EOG, :FeedBackEvent),Tuple{Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Int64}},Tuple{Int64}}, ::Int64)

Stacktrace:
 [1] top-level scope at In[21]:1
joshday commented 5 years ago

For distributed tables, getindex (which gets called when you write tbl[2]) is not supported.

I think we do need a better way of taking a glance at a distributed table.

piever commented 5 years ago

Would it be possible to define iteration on a distributed table? This way all the tools that only use iteration (like for example TableView) would work on a distributed table.

usmcamp0811 commented 5 years ago

I don't know if this is feasible but what I am expecting to happen or would like to see is this Distributed Table act similar to a Dask DataFrame or Dask Array in Python.

russellromney commented 5 years ago

Any update on this? I love JuliaDB but can't use it for most things because of this.

Not having row indexing renders this pretty much unusable for a lot of Pandas-type operations on large data - I was super, super excited to find this as a better alternative to Pandas and Dask for speedy, rowwise operations...until I couldn't do rowwise operations.

piever commented 5 years ago

Not having row indexing renders this pretty much unusable for a lot of Pandas-type operations on large data - I was super, super excited to find this as a better alternative to Pandas and Dask for speedy, rowwise operations...until I couldn't do rowwise operations.

Note that, even though indexing is not supported, you can do row-wise operations using map or filter or using JuliaDBMeta macros.

Example:

julia> iris5 = table(iris, chunks = 5);

# row-wise operation parallelize out of the box
julia> @where iris5 :SepalLength == 4.9 && :Species == "setosa"
Distributed Table with 4 rows in 2 chunks:
SepalLength  SepalWidth  PetalLength  PetalWidth  Species
──────────────────────────────────────────────────────────
4.9          3.0         1.4          0.2         "setosa"
4.9          3.1         1.5          0.1         "setosa"
4.9          3.1         1.5          0.2         "setosa"
4.9          3.6         1.4          0.1         "setosa"

# call @applychunked to run function on all chunk: here I take
# first 5 entries of each chunk
julia> @applychunked iris5 _[1:5]
Distributed Table with 25 rows in 5 chunks:
SepalLength  SepalWidth  PetalLength  PetalWidth  Species
──────────────────────────────────────────────────────────────
5.1          3.5         1.4          0.2         "setosa"
4.9          3.0         1.4          0.2         "setosa"
4.7          3.2         1.3          0.2         "setosa"
4.6          3.1         1.5          0.2         "setosa"
5.0          3.6         1.4          0.2         "setosa"
4.8          3.1         1.6          0.2         "setosa"
5.4          3.4         1.5          0.4         "setosa"
5.2          4.1         1.5          0.1         "setosa"
5.5          4.2         1.4          0.2         "setosa"
4.9          3.1         1.5          0.2         "setosa"
5.0          2.0         3.5          1.0         "versicolor"
5.9          3.0         4.2          1.5         "versicolor"
6.0          2.2         4.0          1.0         "versicolor"
6.1          2.9         4.7          1.4         "versicolor"
5.6          2.9         3.6          1.3         "versicolor"
5.5          2.6         4.4          1.2         "versicolor"
6.1          3.0         4.6          1.4         "versicolor"
5.8          2.6         4.0          1.2         "versicolor"
5.0          2.3         3.3          1.0         "versicolor"
5.6          2.7         4.2          1.3         "versicolor"
6.9          3.2         5.7          2.3         "virginica"
5.6          2.8         4.9          2.0         "virginica"
7.7          2.8         6.7          2.0         "virginica"
6.3          2.7         4.9          1.8         "virginica"
6.7          3.3         5.7          2.1         "virginica"
russellromney commented 5 years ago

@piever I love usingJuliaDBMeta as well - I should have said index-dependent row-wise operations like select(t,:col1)[4] = 16, like what I'm used to with df.loc[1,"col1"] = 16. The row-wise operations, especially with your macros, are excellent.

This comes up in constructing new columns based on more complex conditions, like:

dfA - a table of marketing responses
dbB - a table of sent marketing items

for all rows in dfA where customer ID responded,
    dfBtemp = dfB rows that match that customer and where the contact date is before the response date

    for each marketing response, 
        attribute that response to a sent marketing item with dfB.at[i,"response"] = 1
        add the index of the marketing item to a list of successful responses

JuliaDB IMO is better than Pandas for these kinds of things because of piping and begin etc. but the lack of indexing holds me back from taking advantage of the excellent DIndexedTable in these scenarios. I'd love to see it happen in the future.

...that is unless I'm missing a major piece of how JuliaDB/JuliaDBMeta work, I'd be eager to learn more as I'm pretty new to it.