SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Possible to use where clause with regex or string search? #416

Closed baarkerlounger closed 6 years ago

baarkerlounger commented 6 years ago
df = Daru::DataFrame.new({y1: ["action|thriller", "comedy", "Drama"], y2: [9, 10, 3]}, 
     index: ["a", "b", "c"])

Looking to do something along the lines of df.where(df[:y1].contains('action'))

Pandas has df['y1'].str.contains('action')

My current workaround looks like:

Daru::Core::Query::BoolArray.new(df[:y1].map{ |e| e.to_s.include?("action") })

zverok commented 6 years ago

You can use the more Ruby-idiomatic filter(:row):

df.filter(:row) { |r| r[:y1].include?('action') }
# => #<Daru::DataFrame(1x2)>
#                    y1         y2
#          a action|thr          9 

The idea of bool arrays was borrowed from pandas, but it will probably be retired in the future in favor of more idiomatic ways.

(Though I should say that filter is currently slower.)
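
On the regex part of the question: the block passed to filter(:row) is ordinary Ruby, so a pattern match should be usable wherever include? is. The following is a minimal sketch in plain Ruby with no Daru dependency. The `rows` array of hashes is a hypothetical stand-in for the data frame above, and the case-insensitive pattern is an assumption about what should count as a match.

```ruby
# Hypothetical stand-in for the data frame rows: plain hashes, not Daru objects.
rows = [
  { y1: "action|thriller", y2: 9 },
  { y1: "comedy",          y2: 10 },
  { y1: "Drama",           y2: 3 }
]

# String#match? accepts a Regexp; the /i flag makes the match case-insensitive.
action_rows = rows.select { |r| r[:y1].match?(/action/i) }
drama_rows  = rows.select { |r| r[:y1].match?(/drama/i) }

p action_rows
p drama_rows
```

Inside df.filter(:row) the same predicate would presumably read r[:y1].match?(/action/i), given that r[:y1] already responds to include? in the example above.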

baarkerlounger commented 6 years ago

The bool array has the advantage of being easy to &/| with other filtering/slicing methods. Neither version seems particularly nice compared to the Pandas way though.
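
To illustrate what combining masks with & and | amounts to, here is a rough sketch using plain Ruby arrays of booleans rather than Daru's BoolArray. The variable names and the `>= 9` threshold are made up for illustration; only the column values come from the example data frame.

```ruby
# Column values copied from the example data frame; everything else is illustrative.
genres = ["action|thriller", "comedy", "Drama"]
scores = [9, 10, 3]

# One boolean per row for each condition.
has_action = genres.map { |g| g.include?("action") }
high_score = scores.map { |s| s >= 9 }

# Element-wise AND / OR of the two masks.
both   = has_action.zip(high_score).map { |a, b| a & b }
either = has_action.zip(high_score).map { |a, b| a | b }

# Keep the rows whose mask entry is true.
matching = genres.zip(both).select { |_, keep| keep }.map(&:first)

p both
p either
p matching
```

Each mask can be built and named separately, then combined later, which is the convenience being described here.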

zverok commented 6 years ago

Hm. Can you please show how pandas is more powerful? To me it looks like plain Ruby blocks can do everything (they can just be pretty slow at it).

baarkerlounger commented 6 years ago

I'm not saying it's necessarily more powerful, but I think it's nicer to work with (and performs better). Compare:

Pandas: filtered = df[(df['y1'] == 'Movie') & (df['y2'].str.contains('Action'))]

Daru: filtered = df.where(df[:y1].eq('Movie')).filter(:row) { |r| r[:y2].include?('Action')}

To me the first example is more readable when my filter has 2 conditions.

zverok commented 6 years ago
filtered = df.filter(:row) { |r| r[:y1] == 'Movie' && r[:y2].include?('Action')}

...just like you'd filter your usual arrays.
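
For comparison, the plain-array version of that two-condition filter might look like the sketch below. The rows are hypothetical: the column names follow the example above, but the values are invented for illustration.

```ruby
# Hypothetical rows; y1 and y2 follow the example, the values are made up.
rows = [
  { y1: "Movie",  y2: "Action|Thriller" },
  { y1: "Movie",  y2: "Comedy" },
  { y1: "Series", y2: "Action" }
]

# Both conditions live in one block, joined with &&.
filtered = rows.select { |r| r[:y1] == "Movie" && r[:y2].include?("Action") }

p filtered
```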