clbustos / statsample

A suite for basic and advanced statistics on Ruby.
http://github.com/clbustos/statsample
BSD 3-Clause "New" or "Revised" License
402 stars 96 forks

trouble with Statsample::Bivariate#correlation_matrix #17

Open akchan opened 10 years ago

akchan commented 10 years ago

Hi, I'm having trouble using statsample to run PCA on a large data set. Does anyone have a good idea?

I want to run a PCA on very large data (3000 variables, 50 samples), so I wrote this code.

require 'statsample'
require 'pry'

# Read the whitespace-separated file and drop its first line.
data_raw = IO.readlines('data1.txt').map{|v| v.split }[1..-1]

# Build one scale vector per variable, keyed by the name in the first column.
hash_tmp = {}
data_raw[1..3000].each do |ary|
  hash_tmp[ary[0]] = ary[1..-1].map(&:to_i).to_scale
end
ds = hash_tmp.to_dataset
puts "Input data done!"

cor_matrix=Statsample::Bivariate.correlation_matrix(ds)
puts "cor_matrix was prepared."

pca=Statsample::Factor::PCA.new(cor_matrix)
binding.pry

But Ruby on my Mac never prints "cor_matrix was prepared.". To find the cause, I wrote some more code.

# Reopen the module to find where the bottleneck is
module Statsample
  module Bivariate
    class << self
      def covariance_matrix_optimized(ds)
        x=ds.to_gsl
        n=x.row_size
        m=x.column_size
        puts "calculating means..."
        means=((1/n.to_f)*GSL::Matrix.ones(1,n)*x).row(0)
        puts "centering matrix..."
        centered=x-(GSL::Matrix.ones(n,m)*GSL::Matrix.diag(means))
        puts "calculating covariance matrix..."
        ss=centered.transpose*centered
        puts "calculating n..."
        s=((1/(n-1).to_f))*ss
        puts "done!"              # <= This line is executed
        s
      end

      def correlation_matrix(ds)
        vars,cases=ds.fields.size,ds.cases
        if !ds.has_missing_data? and Statsample.has_gsl? and prediction_optimized(vars,cases) < prediction_pairwise(vars,cases)
          binding.pry
          cm=correlation_matrix_optimized(ds)
          binding.pry             # <= This line is never reached. :(
        else
          cm=correlation_matrix_pairwise(ds)
        end
        binding.pry
        cm.extend(Statsample::CovariateMatrix)
        binding.pry
        cm.fields=ds.fields
        binding.pry
        cm
      end
    end
  end
end

With this, Ruby runs as far as printing "done!" but never returns from the Statsample::Bivariate#covariance_matrix_optimized method. I have never seen a Ruby method that doesn't return.
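For a sense of scale, here is a rough sketch of how large the result is for the sizes mentioned above. It only does arithmetic on the 3000 variables and 50 samples. The 8 bytes per entry assumes the matrix is stored as double-precision floats; that the slowdown is related to the size of this result is a guess, not something the output above confirms.

```ruby
variables = 3000
samples   = 50

# A correlation (or covariance) matrix has one row and one column per variable.
matrix_entries = variables * variables            # 9_000_000

# Number of distinct variable pairs, excluding the diagonal.
distinct_pairs = variables * (variables - 1) / 2  # 4_498_500

# Assumption: 8 bytes per entry (double-precision float).
bytes_as_doubles = matrix_entries * 8             # 72_000_000

puts "input values:   #{variables * samples}"
puts "matrix entries: #{matrix_entries}"
puts "distinct pairs: #{distinct_pairs}"
puts "bytes (double): #{bytes_as_doubles}"
```

So the input is small (150,000 values), but anything that touches every entry of the result, such as converting it to another matrix type, has to handle nine million elements.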

If someone knows how to solve this problem or how to investigate the cause more deeply, please tell me.
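One possible way to narrow this down further, as a minimal sketch: wrap each stage in a small timing helper so that the last "start" line without a matching "finished" line points at the step that hangs. The helper name `timed_step` is hypothetical and not part of statsample; it uses only Ruby's standard library, and the usage line runs a stand-in computation instead of the real statsample calls.

```ruby
# Hypothetical helper (not part of statsample): runs the block, prints when it
# starts and how long it took, and returns the block's result.
def timed_step(label, out: $stdout)
  out.puts "#{label}: start"
  out.flush # make sure the line is visible even if the next step hangs
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  out.puts format('%s: finished in %.3f s', label, elapsed)
  out.flush
  result
end

# Usage with a stand-in computation instead of a real statsample call.
total = timed_step('sum of squares') { (1..1_000).sum { |i| i * i } }
```

In the code above, the same wrapper could go around each call inside correlation_matrix, for example `cm = timed_step('optimized') { correlation_matrix_optimized(ds) }`, and around each line of that method in turn. Running it first on a smaller subset of variables and then on larger ones would also show how the time grows before trying all 3000.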