SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Stacking vectors? #360

Open info-rchitect opened 7 years ago

info-rchitect commented 7 years ago

Is there a built-in method for 'stacking' vectors? By stacking I mean taking N columns and turning them into rows with a label column and a value column. I have implemented a quick and dirty version, but it is not optimized and has been running since last evening! So I was hoping someone might have done this type of manipulation in Daru before. The terminology comes from JMP, btw. Thx!
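For clarity, the transform being described can be sketched gem-free in plain Ruby on a column-wise hash (the name `stack_wide` and the `:label`/`:value` keys are illustrative, not daru API): the stacked columns are melted into one label column and one value column, while the grouping columns are repeated once per stacked column.

```ruby
# Illustrative sketch of JMP-style "stacking" (a.k.a. melt/unpivot) on a
# column-wise hash. `stack_keys` names the columns to melt; all other
# columns are treated as grouping columns and repeated.
def stack_wide(columns, stack_keys, label: :label, value: :value)
  group_keys = columns.keys - stack_keys
  nrows = columns.values.first.size
  out = Hash.new { |h, k| h[k] = [] }
  stack_keys.each do |key|
    nrows.times do |row|
      group_keys.each { |g| out[g] << columns[g][row] }
      out[label] << key               # which source column this row came from
      out[value] << columns[key][row] # that column's value for this row
    end
  end
  out
end

wide = { unit: [1, 2], testA: [0, 1], testB: [1, 0] }
stack_wide(wide, [:testA, :testB])
# => { unit: [1, 2, 1, 2],
#      label: [:testA, :testA, :testB, :testB],
#      value: [0, 1, 1, 0] }
```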

zverok commented 7 years ago

Can you please show what you mean? Source + result of the operation?

gnilrets commented 7 years ago

We have a #concat method for vectors, but it also requires an index. You can fake it with something like this:

a = Daru::Vector.new(['a','b','c']).to_df
b = Daru::Vector.new(['d','e']).to_df
(a.concat b)[nil]

gnilrets commented 7 years ago

Reread your question and I may have jumped too quickly. I assumed that by stacking you meant concatenation, but it sounds like you're actually talking about transposing? Either way, an example of input and expected output would help.

info-rchitect commented 7 years ago

Hi,

So here is how I implemented it in my child Dataset class which inherits from Daru::DataFrame.

non_stacked_df = add_dataset(:smokestack, {
  unit: [1, 2, 3, 4, 5, 6, 7, 8],
  oper: [:ws1] * 8,
  testA: [0, 0, 0, 0, 1, 1, 1, 1],
  testB: [0, 1, 0, 1, 0, 1, 0, 1],
  testC: [1, 0, 1, 0, 1, 0, 1, 0]
})

Produces this:

[3] pry(main)> non_stacked_df
=> #<Nemawashi::Analytics::Dataset: smokestack (8x5)>
        unit  oper testA testB testC
     0     1   ws1     0     0     1
     1     2   ws1     0     1     0
     2     3   ws1     0     0     1
     3     4   ws1     0     1     0
     4     5   ws1     1     0     1
     5     6   ws1     1     1     0
     6     7   ws1     1     0     1
     7     8   ws1     1     1     0
[4] pry(main)> non_stacked_df.shape
=> [8, 5]

Calling my 'stack' method produces this:

stacked_df = non_stacked_df.stack(/test/)
[1] pry(main)> stacked_df
=> #<Nemawashi::Analytics::Dataset: stack (24x4)>
        unit  oper label value
     0     1   ws1 testA     0
     1     2   ws1 testA     0
     2     3   ws1 testA     0
     3     4   ws1 testA     0
     4     5   ws1 testA     1
     5     6   ws1 testA     1
     6     7   ws1 testA     1
     7     8   ws1 testA     1
     8     1   ws1 testB     0
     9     2   ws1 testB     1
    10     3   ws1 testB     0
    11     4   ws1 testB     1
    12     5   ws1 testB     0
    13     6   ws1 testB     1
    14     7   ws1 testB     0
   ...   ...   ...   ...   ...
[2] pry(main)> stacked_df.shape
=> [24, 4]

So it is essentially a way to transpose some columns while grouping by others. Here is what I do in my stack method (assume the Regexp passed as the argument selected the stack_columns, and the method defaults to using all other vectors as the group_columns):

# Duplicate the grouping columns once per stacked column,
# then concatenate the copies end to end.
group_column_copies = []
stack_columns.size.times do
  group_column_copies << self.filter_vectors { |v| group_columns.include? v.name }
end
stacked_dataset = group_column_copies.inject(&:concat)

# Build the label column (each stacked column's name, repeated nrows times)
# and the value column (the stacked columns' data, laid end to end).
label_data = stack_columns.flat_map { |c| Array.new(nrows, c) }
value_data = stack_columns.flat_map { |c| self[c].to_a }
stacked_dataset[:label] = Daru::Vector.new(label_data)
stacked_dataset[:value] = Daru::Vector.new(value_data)

Unfortunately it is very slow, so any suggestions for speeding it up would be appreciated. If people see the value, I could do a PR (after getting it sped up).
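One possible speedup (a sketch, under the assumption that the repeated `filter_vectors` + `concat` copies are the bottleneck): replicate the grouping columns' raw arrays in plain Ruby, then assemble the result in a single constructor call instead of concatenating DataFrame copies. The replication step itself is just:

```ruby
# Replicate each grouping column's raw data k times (k = the number of
# stacked columns) without going through the DataFrame machinery.
# `group_data` here is illustrative sample data.
group_data = { unit: [1, 2, 3], oper: [:ws1] * 3 }
k = 3 # e.g. testA, testB, testC

replicated = group_data.transform_values { |col| col * k }
replicated[:unit] # => [1, 2, 3, 1, 2, 3, 1, 2, 3]
```

With label_data and value_data built as in the posted snippet, the whole stacked frame could then be created once with `Daru::DataFrame.new(replicated.merge(label: label_data, value: value_data))`. This is untested against daru, so treat it as a starting point rather than a drop-in replacement.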

v0dro commented 7 years ago

What use cases does this method have? Can it be made generic enough to be useful to a larger audience of data engineers?

info-rchitect commented 7 years ago

The general use case is to create many layers of metadata within a single column, which can then be used to create variability charts. JMP has a dedicated menu option for this table transform, if that lends any credibility to the enhancement. I would be glad to do a PR if someone can at least check the code I posted for suggestions on speeding it up. Below is an example of a JMP variability chart that has 8 unique values for the attribute 'ceramic sheet'. Many times the data may come in as 8 columns titled 'ceramic sheet N' where N is 1..8. The stack operation would put the data in the format needed to create the variability chart.

[image: JMP variability chart]

info-rchitect commented 7 years ago

What is the consensus on this feature?

gnilrets commented 7 years ago

My vote is that it is a useful operation. Since we already have a "pivot" operation, can we call this one "unpivot"?

v0dro commented 7 years ago

@info-rchitect yes, I'm also for implementing this issue. Sorry, I've been on vacation so couldn't read your comments properly before.

Regarding the code that you've posted, using the where clause in place of filter_vectors is much faster. You can then work with the BoolArray, and generate the actual vector by passing the BoolArray to the [] method. Can you try using that and see if there's a speedup?

info-rchitect commented 7 years ago

Yes, will do, thanks very much. The PR has been delayed while I get my system admins to install some gems. Stay tuned...

info-rchitect commented 7 years ago

@v0dro the where method seems to work on specified vectors, whereas I would like to filter the dataset based on the name of each vector. Could you show me an example?

v0dro commented 7 years ago

Can you connect them with and?

It can be something like:

df.where(df[:a].eq(3).and(df[:b].eq(2)))

baarkerlounger commented 6 years ago

I'm thinking this would be better (for @info-rchitect's use case in particular) if the default were that index columns are grouped by and the remaining columns transposed. So unit and oper would need to be indexes in your case.