SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Implement a pandas.get_dummies equivalent for daru #474

Open willianveiga opened 5 years ago

willianveiga commented 5 years ago

Please implement a method like pandas.get_dummies for daru.

Consider the following DataFrame:

color,dog
brown,1
black and white,0
brown,1
...

Our get_dummies implementation should output something like:

color_brown,color_black_and_white,dog
1,0,1
0,1,0
1,0,1
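
As a rough illustration of the requested behaviour, here is a plain-Ruby sketch that works on a hash of columns rather than on a Daru::DataFrame. It is not daru's API: the method name `get_dummies`, its `only:` parameter, and the hash-of-arrays table are all assumptions made for this example.

```ruby
# Hypothetical helper, not part of daru: one-hot encodes the listed columns
# of a table represented as { column_name => array_of_values }.
def get_dummies(columns, only:)
  columns.each_with_object({}) do |(name, values), result|
    # Columns that were not selected are copied through unchanged.
    unless only.include?(name)
      result[name] = values
      next
    end

    # One 0/1 column per distinct value, in order of first appearance.
    values.uniq.each do |category|
      # "black and white" becomes "color_black_and_white".
      dummy_name = "#{name}_#{category.to_s.gsub(/\s+/, '_')}"
      result[dummy_name] = values.map { |v| v == category ? 1 : 0 }
    end
  end
end

table = {
  'color' => ['brown', 'black and white', 'brown'],
  'dog'   => [1, 0, 1]
}

dummies = get_dummies(table, only: ['color'])
dummies.each { |name, values| puts "#{name}: #{values.join(',')}" }
```

For the three sample rows above this yields the columns `color_brown`, `color_black_and_white` and `dog`, matching the expected output. A real implementation would presumably take and return a Daru::DataFrame instead of a hash.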

PetalsOnWind commented 5 years ago

I am new here. Can I give this a try?

v0dro commented 5 years ago

Sure. Let us know if you run into difficulties.

janpeterka commented 4 years ago

Hey @PetalsOnWind, are you still working on this?

I used the rumale gem to do something like this; here is my code, in case it helps. It expects the input vector to contain only integer values, so non-numerical values first need to be converted to integers (one integer per unique value).

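    require 'daru'
    require 'rumale'

    # One-hot encodes an integer-valued vector of a Daru::DataFrame in place.
    def one_hot_encode_vector(data_frame, vector_name:, delete: false, name: nil)
      vector_name = vector_name.to_sym
      encoder = Rumale::Preprocessing::OneHotEncoder.new
      labels = Numo::Int32[data_frame[vector_name].to_a].flatten
      one_hot_vectors = encoder.fit_transform(labels)

      # Fall back to the source vector's name when no prefix is given
      # (plain Ruby instead of ActiveSupport's `present?`).
      name = vector_name.to_s if name.nil? || name.to_s.empty?

      transposed_one_hot_vectors = one_hot_vectors.to_a.transpose

      # Add one new vector per distinct value, in sorted order.
      data_frame[vector_name].sort.uniq.to_a.each_with_index do |value, i|
        encoded_name = "#{name}_encoded_#{value}".to_sym
        data_frame[encoded_name] = transposed_one_hot_vectors[i]
      end

      # Delete the original vector. Reusing `vector_name` inside the loop
      # would delete the last encoded vector instead.
      data_frame.delete_vector(vector_name) if delete

      data_frame
    end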