Open info-rchitect opened 7 years ago
Can you please show what you mean? Source + result of the operation?
We have a #concat method for vectors, but it also requires an index. You can fake it with something like this:
a = Daru::Vector.new(['a','b','c']).to_df
b = Daru::Vector.new(['d','e']).to_df
(a.concat b)[nil]
I reread your question and may have jumped too quickly. I assumed by stacking you meant concatenation, but it sounds like you're actually talking about transposing? Either way, an example of input and expected output would help.
Hi,
So here is how I implemented it in my child Dataset class which inherits from Daru::DataFrame.
non_stacked_df = add_dataset(:smokestack, {
  unit:  [1, 2, 3, 4, 5, 6, 7, 8],
  oper:  [:ws1] * 8,
  testA: [0, 0, 0, 0, 1, 1, 1, 1],
  testB: [0, 1, 0, 1, 0, 1, 0, 1],
  testC: [1, 0, 1, 0, 1, 0, 1, 0]
})
Produces this:
[3] pry(main)> non_stacked_df
=> #<Nemawashi::Analytics::Dataset: smokestack (8x5)>
unit oper testA testB testC
0 1 ws1 0 0 1
1 2 ws1 0 1 0
2 3 ws1 0 0 1
3 4 ws1 0 1 0
4 5 ws1 1 0 1
5 6 ws1 1 1 0
6 7 ws1 1 0 1
7 8 ws1 1 1 0
[4] pry(main)> non_stacked_df.shape
=> [8, 5]
Calling my 'stack' method produces this:
stacked_df = non_stacked_df.stack(/test/)
[1] pry(main)> stacked_df
=> #<Nemawashi::Analytics::Dataset: stack (24x4)>
unit oper label value
0 1 ws1 testA 0
1 2 ws1 testA 0
2 3 ws1 testA 0
3 4 ws1 testA 0
4 5 ws1 testA 1
5 6 ws1 testA 1
6 7 ws1 testA 1
7 8 ws1 testA 1
8 1 ws1 testB 0
9 2 ws1 testB 1
10 3 ws1 testB 0
11 4 ws1 testB 1
12 5 ws1 testB 0
13 6 ws1 testB 1
14 7 ws1 testB 0
... ... ... ... ...
[2] pry(main)> stacked_df.shape
=> [24, 4]
So it is essentially a way to transpose some columns while grouping by others. Here is what I do in my stack method (assume the Regexp passed as the argument selected the 'stack_columns', and that the method defaults to using all other vectors as 'group_columns'):
group_column_copies = []
stack_columns.size.times do
  group_column_copies << self.filter_vectors { |v| group_columns.include?(v.name) }
end
stacked_dataset = group_column_copies.inject(&:concat)
label_data = stack_columns.map { |c| Array.new(self.nrows, c) }.flatten
value_data = stack_columns.map { |c| self[c].to_a }.flatten
stacked_dataset[:label] = Daru::Vector.new(label_data)
stacked_dataset[:value] = Daru::Vector.new(value_data)
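For reference, the same transform can be sketched in plain Ruby (no Daru dependency) as a single pass over rows, which avoids building and concatenating a DataFrame copy per stacked column. Here `table` is just a hash of column name to array, and `stack`/`stack_columns`/`group_columns` are illustrative names, not Daru API:

```ruby
# Plain-Ruby sketch of the stack/unpivot transform (no Daru dependency).
# table: Hash mapping column name => Array of values, all the same length.
def stack(table, stack_columns, group_columns)
  nrows = table[group_columns.first].size
  out = Hash.new { |h, k| h[k] = [] }
  stack_columns.each do |col|
    nrows.times do |i|
      group_columns.each { |g| out[g] << table[g][i] } # repeat the group keys
      out[:label] << col                               # which column this value came from
      out[:value] << table[col][i]                     # the value itself
    end
  end
  out
end

table = {
  unit:  [1, 2, 3, 4],
  testA: [0, 0, 1, 1],
  testB: [0, 1, 0, 1]
}
stacked = stack(table, [:testA, :testB], [:unit])
# stacked[:unit]  => [1, 2, 3, 4, 1, 2, 3, 4]
# stacked[:label] => [:testA, :testA, :testA, :testA, :testB, :testB, :testB, :testB]
# stacked[:value] => [0, 0, 1, 1, 0, 1, 0, 1]
```

The result has `nrows * stack_columns.size` rows and `group_columns.size + 2` columns, matching the 8x5 → 24x4 shape change shown above.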
Unfortunately it is very slow, so any suggestions to speed it up would be appreciated. If people see the value, I could do a PR (after getting it sped up).
What use cases does this method have? Can it be made generic enough to be useful to a larger audience of data engineers?
The general use case is to create many layers of metadata within a single column, which can then be used to create variability charts. JMP has a dedicated menu option for this table transform, if that lends any credibility to the enhancement. I would be glad to do a PR if someone can at least check the code I posted for any suggestions on speeding it up. Below is an example of a JMP variability chart that has 8 unique values for the attribute 'ceramic sheet'. Many times the data may come in as 8 columns titled 'ceramic sheet N' where N is 1..8. The stack operation would put the data in the format needed to create the variability chart.
What is the consensus on this feature?
My vote is that it is a useful operation. Since we already have a "pivot" operation, can we call this one "unpivot"?
@info-rchitect yes, I'm also for implementing this issue. Sorry, I've been on vacation so I couldn't read your comments properly until now.
Regarding the code that you've posted: using the where clause in place of filter_vectors is much faster. You can then work with the BoolArray, and generate the actual vector by passing the BoolArray to the [] method. Can you try using that and see if there's a speedup?
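The suggestion above is Daru-specific, but the underlying idea (compute a boolean mask once, then index with it, instead of re-running a filter block per copy) can be sketched in plain Ruby. The names here are illustrative, not Daru API:

```ruby
# Illustrative sketch: compute a boolean mask once, then reuse it
# to select from several columns, instead of filtering each copy.
values = [10, 20, 30, 40, 50]
mask   = values.map { |v| v > 25 }   # the mask is computed a single time

selected = values.zip(mask).filter_map { |v, keep| v if keep }
# selected => [30, 40, 50]

# The same mask can be reused across other columns at no extra cost:
units = [:a, :b, :c, :d, :e]
selected_units = units.zip(mask).filter_map { |u, keep| u if keep }
# selected_units => [:c, :d, :e]
```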
Yes, will do, thanks very much. The PR has been delayed while I get my system admins to install some gems. Stay tuned...
@v0dro the where method seems to work on specified vectors, whereas I would like to filter the dataset based on the name of each vector. Could you show me an example?
Can you connect them with and? It can be something like:
df.where(df[:a].eq(3).and(df[:b].eq(2)))
I'm thinking this would be better (@info-rchitect, your use case in particular) if the default were that indexes were grouped by and columns transposed. So unit and oper would need to be indexes in your case.
Is there a built-in method for 'stacking' vectors? Stacking being taking N columns and turning them into rows with a label column and a value column. I have implemented a quick and dirty version, but it is not optimized and has been running since last evening! So I was hoping someone might have done this type of manipulation in Daru before. The terminology comes from JMP btw. thx!