albahnsen / CostSensitiveClassification

CostSensitiveClassification Library in Python
BSD 3-Clause "New" or "Revised" License

Getting AttributeError: 'bool' object has no attribute 'astype' for CostSensitiveLogisticRegression() #4

Closed dhruvghulati-zz closed 8 years ago

dhruvghulati-zz commented 8 years ago

I have code like:

    costClassifier = CostSensitiveLogisticRegression()
    costClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
    y_open_pred_test_cslr = costClassifier.predict(test_data_features)

Here train_data_features is a bag-of-words matrix for 15,000 sentences, train_property_labels holds the categorical label of each sentence, and open_cost_mat_train is a cost matrix. They look like this, respectively:

   [[0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
     ..., 
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]
    [0 0 0 ..., 0 0 0]] 

 [u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', u'/location/statistical_region/net_migration', .....]

 [[ 0.36303512  0.          0.          0.        ]
  [ 0.24472353  0.          0.          0.        ]
  [ 0.18386408  0.          0.          0.        ]
  ..., 
  [ 0.00650667  0.          0.          0.        ]
  [ 0.06445714  0.          0.          0.        ]
  [ 0.05        0.          0.          0.        ]] 

However, my stack trace is:

    /Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py:76: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
      y_true = (y_true == 1).astype(np.float)
    Traceback (most recent call last):
      File "/Users/dhruv/Documents/university/ClaimDetection/src/main/costSensitiveClassifier.py", line 272, in <module>
        openCostClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 237, in fit
        res.fit()
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 165, in fit
        self.cost_ = self._fitness_function()
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 151, in _fitness_function
        for i in range(n_jobs))
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
        self.dispatch(function, args, kwargs)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
        self.results = func(*args, **kwargs)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 35, in _fitness_function_parallel
        return fitness_function(pop, *fargs).tolist()
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 96, in _logistic_cost_loss
        out[i] = _logistic_cost_loss_i(w[i], X, y, cost_mat, alpha)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 53, in _logistic_cost_loss_i
        out = cost_loss(y, y_prob, cost_mat) / n_samples
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py", line 76, in cost_loss
        y_true = (y_true == 1).astype(np.float)
    AttributeError: 'bool' object has no attribute 'astype'

albahnsen commented 8 years ago

Could you send me a few examples of train_data_features, train_property_labels, and open_cost_mat_train so that I can replicate the error? Ten or so lines would be enough.

dhruvghulati-zz commented 8 years ago

So train_data_features may be:

[ [ 0, 0, 0 , 0 , 2, 0, 1, 0],
   [ 0, 0, 0 , 0 , 1, 0, 0, 0],
   [ 1, 0, 0 , 1 , 4, 0, 3, 0],
   [ 0, 2, 0 , 0 , 0, 0, 8, 0],
   [ 0, 0, 0 , 0 , 0, 3, 0, 0]]

This is a numpy array representing the bag of words for some sentences. Note that each number is of type <type 'numpy.int64'>.

Then train_property_labels is a list of unicode labels, one for each row of the above. For the sake of argument:

 [u'A', u'B', u'A', u'C', u'A']

And the open_cost_mat_train is:

    [[ 0.36303512  0.          0.          0.        ]
     [ 0.24472353  0.          0.          0.        ]
     [ 0.18386408  0.          0.          0.        ]
     [ 0.00650667  0.          0.          0.        ]
     [ 0.06445714  0.          0.          0.        ]]

This is also a numpy array, and each value is of type <type 'numpy.float64'>.

Note, I will be changing the C_FN to be half the C_FP but I am not sure this is the issue.

Note: I checked the type of train_property_labels and changed it from a list to an array, and now get this error:

      File "/Users/dhruv/Documents/university/ClaimDetection/src/main/costSensitiveClassifier.py", line 272, in <module>
        openCostClassifier.fit(train_data_features, train_property_labels, open_cost_mat_train)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 237, in fit
        res.fit()
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 165, in fit
        self.cost_ = self._fitness_function()
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 151, in _fitness_function
        for i in range(n_jobs))
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
        self.dispatch(function, args, kwargs)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
        self.results = func(*args, **kwargs)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/pyea/models/ga.py", line 35, in _fitness_function_parallel
        return fitness_function(pop, *fargs).tolist()
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 96, in _logistic_cost_loss
        out[i] = _logistic_cost_loss_i(w[i], X, y, cost_mat, alpha)
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/models/regression.py", line 53, in _logistic_cost_loss_i
        out = cost_loss(y, y_prob, cost_mat) / n_samples
      File "/Users/dhruv/anaconda/lib/python2.7/site-packages/costcla/metrics/costs.py", line 79, in cost_loss
        cost = y_true * ((1 - y_pred) * cost_mat[:, 1] + y_pred * cost_mat[:, 2])
    ValueError: operands could not be broadcast together with shapes (0,) (15000,)

albahnsen commented 8 years ago

@dhruvghulati Unfortunately, costcla is so far only built for binary classification problems and assumes the labels are 0 and 1. This may be the problem.
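
For anyone hitting the same error, here is a minimal sketch of why the comparison in the traceback yields a plain bool for a list of string labels, and of one way to turn such labels into the 0/1 form described above. It uses the five example labels posted in this thread. The helper name binarize_labels and the choice of u'A' as the positive class are assumptions made for illustration; neither is part of costcla.

```python
import numpy as np

# Example labels as posted earlier in this thread: a plain Python list.
labels = [u'A', u'B', u'A', u'C', u'A']

# Comparing a list with an int gives one Python bool, not an
# element-wise result, so there is no .astype method to call on it.
comparison = (labels == 1)
print(type(comparison), hasattr(comparison, 'astype'))


def binarize_labels(labels, positive_label):
    """Hypothetical helper (not part of costcla).

    Returns an integer array with 1 where the label equals
    positive_label and 0 everywhere else.
    """
    labels = np.asarray(labels)
    return (labels == positive_label).astype(int)


# Treat u'A' as the positive class (an arbitrary choice for this sketch).
y_binary = binarize_labels(labels, u'A')
print(y_binary)  # [1 0 1 0 1]
```

An array like y_binary could then be passed in place of train_property_labels. A problem with more than two classes would need one such binary problem per class, which is outside what this sketch covers.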

dhruvghulati-zz commented 8 years ago

Understood, OK thanks a lot for pointing this out.