Open usmcamp0811 opened 5 years ago
For distributed tables, getindex
(which gets called when you write tbl[2]
) is not supported.
I think we do need a better way of taking a glance at a distributed table.
Would it be possible to define iteration on a distributed table? This way all the tools that only use iteration (like for example TableView) would work on a distributed table.
I don't know if this is feasible but what I am expecting to happen or would like to see is this Distributed Table act similar to a Dask DataFrame or Dask Array in Python.
Any update on this? I love JuliaDB but can't use it for most things because of this.
Not having row indexing renders this pretty much unusable for a lot of Pandas-type operations on large data - I was super, super excited to find this as a better alternative to Pandas and Dask for speedy, rowwise operations...until I couldn't do rowwise operations.
Not having row indexing renders this pretty much unusable for a lot of Pandas-type operations on large data - I was super, super excited to find this as a better alternative to Pandas and Dask for speedy, rowwise operations...until I couldn't do rowwise operations.
Note that, even though indexing is not supported, you can do row-wise operations using map
or filter
or using JuliaDBMeta macros.
Example:
julia> iris5 = table(iris, chunks = 5);
# row-wise operation parallelize out of the box
julia> @where iris5 :SepalLength == 4.9 && :Species == "setosa"
Distributed Table with 4 rows in 2 chunks:
SepalLength SepalWidth PetalLength PetalWidth Species
──────────────────────────────────────────────────────────
4.9 3.0 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
4.9 3.1 1.5 0.2 "setosa"
4.9 3.6 1.4 0.1 "setosa"
# call @applychunked to run function on all chunk: here I take
# first 5 entries of each chunk
julia> @applychunked iris5 _[1:5]
Distributed Table with 25 rows in 5 chunks:
SepalLength SepalWidth PetalLength PetalWidth Species
──────────────────────────────────────────────────────────────
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
4.8 3.1 1.6 0.2 "setosa"
5.4 3.4 1.5 0.4 "setosa"
5.2 4.1 1.5 0.1 "setosa"
5.5 4.2 1.4 0.2 "setosa"
4.9 3.1 1.5 0.2 "setosa"
5.0 2.0 3.5 1.0 "versicolor"
5.9 3.0 4.2 1.5 "versicolor"
6.0 2.2 4.0 1.0 "versicolor"
6.1 2.9 4.7 1.4 "versicolor"
5.6 2.9 3.6 1.3 "versicolor"
5.5 2.6 4.4 1.2 "versicolor"
6.1 3.0 4.6 1.4 "versicolor"
5.8 2.6 4.0 1.2 "versicolor"
5.0 2.3 3.3 1.0 "versicolor"
5.6 2.7 4.2 1.3 "versicolor"
6.9 3.2 5.7 2.3 "virginica"
5.6 2.8 4.9 2.0 "virginica"
7.7 2.8 6.7 2.0 "virginica"
6.3 2.7 4.9 1.8 "virginica"
6.7 3.3 5.7 2.1 "virginica"
@piever I love usingJuliaDBMeta as well - I should have said index-dependent row-wise operations like select(t,:col1)[4] = 16
, like what I'm used to with df.loc[1,"col1"] = 16
. The row-wise operations, especially with your macros, are excellent.
This comes up in constructing new columns based on more complex conditions, like:
dfA - a table of marketing responses
dbB - a table of sent marketing items
for all rows in dfA where customer ID responded,
dfBtemp = dfB rows that match that customer and where the contact date is before the response date
for each marketing response,
attribute that response to a sent marketing item with dfB.at[i,"response"] = 1
add the index of the marketing item to a list of successful responses
JuliaDB IMO is better than Pandas for these kinds of things because of piping and begin
etc. but the lack of indexing holds me back from taking advantage of the excellent DIndexedTable in these scenarios. I'd love to see it happen in the future.
...that is unless I'm missing a major piece of how JuliaDB/JuliaDBMeta work, I'd be eager to learn more as I'm pretty new to it.
I have ~8GB of BCI data that I just fed into JuliaDB and now I can't seem to get a look at it. I'm still very new to Julia so hopefully I'm not doing anything too stupid.
Julia Version 1.0.2 (2018-11-08)
Load CSV's into JuliaDB
Load JuliaDB data
tbl = load("/home/beboobop/Datasets/bci_data/train/juliadb");
Now I try to index the data to just look at it.
tbl[2]
Returns