RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Convert String on Sample Data Into Integer #190

Closed halimkun closed 1 month ago

halimkun commented 3 years ago

Hello, I have a few questions.

I have data that looks like this (cat disease data, where each disease is identified by its symptoms). Because there is too much data, I can only show part of it:

object(Rubix\ML\Datasets\Labeled)[5]
  protected array 'labels' => 
    array (size=4)
      0 => string 'panleukopenia' (length=13)
      1 => string 'scabies' (length=7)
      2 => string 'enteritis' (length=9)
      3 => string 'fcv' (length=3)
  protected array 'samples' => 
    array (size=4)
      0 => 
        array (size=17)
          0 => string 'k1' (length=2)
          1 => string 'ya' (length=2)
          2 => string 'ya' (length=2)
          3 => string 'ya' (length=2)
          4 => string 'ya' (length=2)
          5 => string '' (length=0)
          6 => string '' (length=0)
          7 => string '' (length=0)
          8 => string '' (length=0)
          9 => string '' (length=0)
          10 => string '' (length=0)
          11 => string '' (length=0)
          12 => string '' (length=0)
          13 => string '' (length=0)
          14 => string '' (length=0)
          15 => string '' (length=0)
          16 => string '' (length=0)
      1 => 
        array (size=17)
          0 => string 'k90' (length=3)
          1 => string '' (length=0)
          2 => string '' (length=0)
          3 => string '' (length=0)
          4 => string '' (length=0)
          5 => string '' (length=0)
          6 => string '' (length=0)
          7 => string '' (length=0)
          8 => string '' (length=0)
          9 => string '' (length=0)
          10 => string '' (length=0)
          11 => string '' (length=0)
          12 => string 'ya' (length=2)
          13 => string '' (length=0)
          14 => string '' (length=0)
          15 => string '' (length=0)
          16 => string '' (length=0)
      2 => 
        array (size=17)
          0 => string 'k224' (length=4)
          1 => string '' (length=0)
          2 => string '' (length=0)
          3 => string '' (length=0)
          4 => string '' (length=0)
          5 => string '' (length=0)
          6 => string '' (length=0)
          7 => string 'ya' (length=2)
          8 => string '' (length=0)
          9 => string '' (length=0)
          10 => string '' (length=0)
          11 => string '' (length=0)
          12 => string '' (length=0)
          13 => string '' (length=0)
          14 => string '' (length=0)
          15 => string '' (length=0)
          16 => string '' (length=0)
      3 => 
        array (size=17)
          0 => string 'k235' (length=4)
          1 => string '' (length=0)
          2 => string '' (length=0)
          3 => string '' (length=0)
          4 => string '' (length=0)
          5 => string '' (length=0)
          6 => string '' (length=0)
          7 => string '' (length=0)
          8 => string '' (length=0)
          9 => string '' (length=0)
          10 => string 'ya' (length=2)
          11 => string '' (length=0)
          12 => string '' (length=0)
          13 => string '' (length=0)
          14 => string 'ya' (length=2)
          15 => string 'ya' (length=2)
          16 => string '' (length=0)

As you can see, a value is left empty wherever the symptom is not present.

My question is how to convert "ya" (yes) and the empty values to numbers.

I tried NumericStringConverter(), but nothing happens (the data stays the same). With OneHotEncoder(), extra features are added to every sample: for example, sample 0, which originally had 17 features, now has 29.

Below is the data after calling apply() with OneHotEncoder():

object(Rubix\ML\Datasets\Labeled)[5]
  protected array 'labels' => 
    array (size=4)
      0 => string 'panleukopenia' (length=13)
      1 => string 'scabies' (length=7)
      2 => string 'enteritis' (length=9)
      3 => string 'fcv' (length=3)
  protected array 'samples' => 
    array (size=4)
      0 => 
        array (size=29)
          0 => int 1
          1 => int 0
          2 => int 0
          3 => int 0
          4 => int 1
          5 => int 0
          6 => int 1
          7 => int 0
          8 => int 1
          9 => int 0
          10 => int 1
          11 => int 0
          12 => int 1
          13 => int 1
          14 => int 1
          15 => int 0
          16 => int 1
          17 => int 1
          18 => int 1
          19 => int 0
          20 => int 1
          21 => int 1
          22 => int 0
          23 => int 1
          24 => int 1
          25 => int 0
          26 => int 1
          27 => int 0
          28 => int 1
      1 => 
        array (size=29)
          0 => int 0
          1 => int 1
          2 => int 0
          3 => int 0
          4 => int 0
          5 => int 1
          6 => int 0
          7 => int 1
          8 => int 0
          9 => int 1
          10 => int 0
          11 => int 1
          12 => int 1
          13 => int 1
          14 => int 1
          15 => int 0
          16 => int 1
          17 => int 1
          18 => int 1
          19 => int 0
          20 => int 1
          21 => int 0
          22 => int 1
          23 => int 1
          24 => int 1
          25 => int 0
          26 => int 1
          27 => int 0
          28 => int 1
      2 => 
        array (size=29)
          0 => int 0
          1 => int 0
          2 => int 1
          3 => int 0
          4 => int 0
          5 => int 1
          6 => int 0
          7 => int 1
          8 => int 0
          9 => int 1
          10 => int 0
          11 => int 1
          12 => int 1
          13 => int 1
          14 => int 0
          15 => int 1
          16 => int 1
          17 => int 1
          18 => int 1
          19 => int 0
          20 => int 1
          21 => int 1
          22 => int 0
          23 => int 1
          24 => int 1
          25 => int 0
          26 => int 1
          27 => int 0
          28 => int 1
      3 => 
        array (size=29)
          0 => int 0
          1 => int 0
          2 => int 0
          3 => int 1
          4 => int 0
          5 => int 1
          6 => int 0
          7 => int 1
          8 => int 0
          9 => int 1
          10 => int 0
          11 => int 1
          12 => int 1
          13 => int 1
          14 => int 1
          15 => int 0
          16 => int 1
          17 => int 1
          18 => int 0
          19 => int 1
          20 => int 1
          21 => int 1
          22 => int 0
          23 => int 1
          24 => int 0
          25 => int 1
          26 => int 0
          27 => int 1
          28 => int 1

I don't think that is a problem in itself. However, when I try to predict on new data:

$dataTesting = [
["KN1", "", "", "", "", "", "", "", "", "", "", "", "Ya", "", "", "", ""],
["KN2", "", "", "", "", "", "", "", "", "", "", "", "Ya", "Ya", "", "", ""],
];

$testData = new Unlabeled($dataTesting);
$testData->apply(new OneHotEncoder);

$pred = $knn->predict($testData);

var_dump($pred);

This generates the following error: Fatal error: Uncaught Rubix\ML\Exceptions\IncorrectDatasetDimensionality: Dataset must contain samples with exactly 29 dimensions, 19 given
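
The 29 and 19 in that message can be reproduced by hand. A one-hot encoding produces one output column per distinct value in each input column, so the output width depends on the data the encoder was fitted on. The sketch below is plain PHP and does not call Rubix ML; `makeSample()` and `oneHotWidth()` are hypothetical helpers written for this illustration. It assumes that applying a fresh `OneHotEncoder` to the test set fits it on the test set alone, which is consistent with the error above.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: build a 17-column sample with $yes at the given offsets.
function makeSample(string $id, array $yesOffsets, string $yes = 'ya'): array
{
    $sample = array_fill(0, 17, '');
    $sample[0] = $id;

    foreach ($yesOffsets as $offset) {
        $sample[$offset] = $yes;
    }

    return $sample;
}

// Hypothetical helper: one-hot width = sum of distinct values over all columns.
function oneHotWidth(array $samples): int
{
    $width = 0;
    $columns = count($samples[0]);

    for ($column = 0; $column < $columns; $column++) {
        $width += count(array_unique(array_column($samples, $column)));
    }

    return $width;
}

// The four training samples shown in the dump above.
$train = [
    makeSample('k1', [1, 2, 3, 4]),
    makeSample('k90', [12]),
    makeSample('k224', [7]),
    makeSample('k235', [10, 14, 15]),
];

// The two test samples from $dataTesting (note the capitalised "Ya").
$test = [
    makeSample('KN1', [12], 'Ya'),
    makeSample('KN2', [12, 13], 'Ya'),
];

echo oneHotWidth($train), PHP_EOL; // 29
echo oneHotWidth($test), PHP_EOL;  // 19
```

The ID column alone contributes 4 columns for the training set and 2 for the test set, and a symptom column that is empty in every sample contributes only 1. The two encodings therefore cannot line up unless the same fitted encoding is used for both datasets.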

halimkun commented 3 years ago

This is my data from the .csv file:

cat,anorexia,muntah,lemah,kurang respon,dehidrasi,demam,diare,hipersevalis,radang telinga,batuk,hidung meler,gatal,telinga keropeng,pilek,bersin2,mata berair,disease
k1,ya,ya,ya,ya,,,,,,,,,,,,,panleukopenia
k90,,,,,,,,,,,,ya,,,,,scabies
k224,,,,,,,ya,,,,,,,,,,enteritis
k235,,,,,,,,,,ya,,,,ya,ya,,fcv

andrewdalpino commented 3 years ago

Hey @halimkun it sounds like you might need a custom transformation. You can create a separate category to represent the absence of a symptom (such as "no") and impute it wherever there's an empty feature value. Since yes and no form a binary representation, you can use the integers 1 and 0 as their numeric representation. That said, why do you need the categories represented numerically?

https://docs.rubixml.com/latest/preprocessing.html#custom-transformations
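
A minimal sketch of the 1/0 mapping described above, written as plain PHP applied to the raw arrays before they are wrapped in a dataset object. `encodeSymptoms()` is a hypothetical helper, not part of Rubix ML. Dropping the first column (the cat ID) and comparing case-insensitively are assumptions made for this example, since the training data uses "ya" and the test data uses "Ya".

```php
<?php
declare(strict_types=1);

// Hypothetical helper: map "ya" to 1 and every other value to 0.
function encodeSymptoms(array $samples): array
{
    $encoded = [];

    foreach ($samples as $sample) {
        // Assumption: column 0 is an identifier such as "k1" and is not a symptom.
        $symptoms = array_slice($sample, 1);

        $encoded[] = array_map(
            static fn ($value): int => strtolower(trim((string) $value)) === 'ya' ? 1 : 0,
            $symptoms
        );
    }

    return $encoded;
}

$train = [
    ['k1', 'ya', 'ya', 'ya', 'ya', '', '', '', '', '', '', '', '', '', '', '', ''],
    ['k90', '', '', '', '', '', '', '', '', '', '', '', 'ya', '', '', '', ''],
];

$test = [
    ['KN1', '', '', '', '', '', '', '', '', '', '', '', 'Ya', '', '', '', ''],
];

$encodedTrain = encodeSymptoms($train);
$encodedTest = encodeSymptoms($test);

// Both sets now have the same fixed width of 16 symptom columns.
echo count($encodedTrain[0]), ' ', count($encodedTest[0]), PHP_EOL;
```

Because the mapping does not depend on which values happen to occur in a given dataset, training and test samples always end up with the same number of features. The encoded arrays could then be passed to `Labeled` and `Unlabeled` in place of the raw strings.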