dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[jvm-package] Creating a single test instance at run time #1286

Closed mhnamaki closed 8 years ago

mhnamaki commented 8 years ago

Hi, thanks for sharing this useful software. I'm trying to use XGBoost for a structured prediction problem in which I have to generate my test instances at run time. That's why I'm using the public DMatrix(float[] data, int nrow, int ncol) constructor. However, when I create test instances one by one with this constructor, the number of misclassifications is much higher than when I have all of them in a file and use public DMatrix(String dataPath).

So maybe I'm feeding the constructor incorrectly. I mean, if I have three features and an output label, how should I fill the "data" array? What should "ncol" be?

DMatrix testInstance = new DMatrix(testingInstance, 1, (3 or 4?)); Should testingInstance contain the output labels? If so, do the output labels have to be valid to get a correct prediction?! Which index holds the output label: the first or the last in each row?

Thank you in advance. --Mohammad

mhnamaki commented 8 years ago

Hi, It seems that there is a bug in the program.

I've written a simple program that trains and tests over the same dataset with just 6 examples (a short version of the breast-cancer dataset).

When I create the DMatrix using the filePath approach, it has no training errors in its predictions. However, when I create the DMatrix using the (data, nrow, ncol) approach, it has 3 training errors.

Please help me with this issue. I've attached the simple source code and the shortest possible dataset that still shows the problem.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;

public class XGBoostStaticTest {

public static void main(String[] args) throws Exception {

    int numberOfFeatures = 10;

    String trainTestPath = "breast-cancer-short.txt";

    // load file from text file to learn
    DMatrix trainMat = new DMatrix(trainTestPath);

    HashMap<String, Object> paramsCls = new HashMap<String, Object>();
    paramsCls.put("eta", 1.0);
    paramsCls.put("booster", "gbtree");
    paramsCls.put("silent", 1);
    paramsCls.put("objective", "multi:softmax");
    paramsCls.put("num_class", "5");
    paramsCls.put("eval_metric", "mlogloss");
    paramsCls.put("max_depth", numberOfFeatures);

    HashMap<String, DMatrix> watches = new HashMap<String, DMatrix>();
    watches.put("train", trainMat);

    // set round
    int round = 3;

    // train a boost model
    Booster booster = XGBoost.train(trainMat, paramsCls, round, watches, null, null);

    {
        // init DMatrix using filePath approach
        DMatrix testMat = new DMatrix(trainTestPath);

        // count the misclassifications
        int trainErrCnt = 0;

        // count the instances to be tested?
        int testInstancesCnt = 0;

        // getting labels from DMatrix directly (from lib-svm formatted
        // file)
        float[] labels = testMat.getLabel();

        // predict the whole of the test file
        float[][] predicts = booster.predict(testMat);

        for (int i = 0; i < predicts.length; i++) {
            if (((int) labels[i]) != ((int) predicts[i][0])) {
                trainErrCnt++;
            }
            testInstancesCnt++;
            System.out.println(labels[i] + ", " + predicts[i][0]);
        }

        System.out.println("trainErr: " + trainErrCnt);
        System.out.println("testCnt: " + testInstancesCnt);
    }

    // testing with create instances one by one from the same file.
    // for some of instances it generates different value than the filePath
    // approach!
    {
        FileInputStream fis = new FileInputStream(trainTestPath);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis));

        int trainErrCnt = 0;
        int testingInstancesCnt = 0;
        String line = null;
        while ((line = br.readLine()) != null) {

            // splitting lib-svm formatted file
            String[] splittedLine = line.split(" ");

            // initializing a data array to keep the features
            float[] testingInstance = new float[1 * numberOfFeatures];

            for (int f = 0; f < (splittedLine.length - 1); f++) {
                // the first item in splitted line is output label so we
                // start from one....
                testingInstance[f] = Float.parseFloat(splittedLine[f + 1].split(":")[1]);
            }

            // creating a DMatrix with one row and the numberOfFeatures
            // columns
            DMatrix testInstance = new DMatrix(testingInstance, 1, numberOfFeatures);

            // predict just one instance
            float[][] predicts = booster.predict(testInstance);

            // the first splitted item in each row is output label
            float label = Float.parseFloat(splittedLine[0]);

            if (((int) label) != ((int) predicts[0][0])) {
                trainErrCnt++;
            }

            testingInstancesCnt++;
            System.out.println(label + ", " + predicts[0][0]);

        }
        System.out.println("trainErr: " + trainErrCnt);
        System.out.println("testCnt: " + testingInstancesCnt);
        br.close();
    }
}

}
```

Attachments: XGBoostStaticTest.txt, breast-cancer-short.txt

CodingCat commented 8 years ago

I didn't understand your code for parsing the file...it seems that you specify nrow as 1?

You can get the meaning of nrow and ncol from https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/test/java/ml/dmlc/xgboost4j/java/DMatrixTest.java#L95

Basically, nrow == number of instances, ncol == number of features.
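
For illustration, a minimal sketch of a dense DMatrix built that way (the feature values and labels here are made up; only the layout matters): the array is row-major with length nrow * ncol, and labels are attached separately with setLabel() rather than packed into the data array.

```java
import ml.dmlc.xgboost4j.java.DMatrix;

// Inside a method that declares "throws XGBoostError" (or a try/catch).
// Two instances, three features each: data.length == nrow * ncol == 6.
float[] data = new float[] {
    1f, 2f, 3f,   // instance 0
    4f, 5f, 6f    // instance 1
};
DMatrix dmat = new DMatrix(data, 2, 3);
// Labels live outside the data array.
dmat.setLabel(new float[] {0f, 1f});
```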

mhnamaki commented 8 years ago

Hi, There are two approaches:

  1. Reading all the test instances from the same training file. This works perfectly.
  2. Creating test instances one by one and calling predict() for each of them. That's why, in my code, nrow is 1 and ncol is the number of features, which is 10 in this example.

The parsing code just splits the lib-svm formatted file into the output label and its corresponding features, line by line. It fills the "testingInstance" data array with all the features, without the output label, then initializes the DMatrix with one row and the number of features as the number of columns. It then calls predict() on that DMatrix and compares the actual label with the predicted label. In a lib-svm file, the output label is the first item on each line.

Maybe this code doesn't make sense to you since I create the instances one by one from a static file. However, in my original use case I don't have such static files, since instances are generated based on the state of the problem at a given time. This code is simplified to show that there is a problem with defining this kind of DMatrix. Please help me with that. Thanks

mhnamaki commented 8 years ago

Hi, I think I've found the problem with the dense matrix constructor! If I change all of the zeros in my data array to something like 0.000001, it works perfectly, i.e. it gives me the same results as the static test file. I think the problem is the default missing value, which is 0.0f. In my case I don't have any missing values, but my dense matrix does contain some zero-valued features, and they seem to be getting confused with missing values.

In earlier versions of XGBoost, users could set the default missing value themselves.
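
For illustration, the workaround described above amounts to a small change in the parsing loop from the earlier snippet (EPS is a made-up name; any tiny non-zero constant behaves the same):

```java
// Replace exact zeros with a tiny epsilon so they are not confused with the
// default missing value of 0.0f. This is only a workaround; see the next
// comment for the proper fix.
final float EPS = 0.000001f;
for (int f = 0; f < (splittedLine.length - 1); f++) {
    float value = Float.parseFloat(splittedLine[f + 1].split(":")[1]);
    testingInstance[f] = (value == 0.0f) ? EPS : value;
}
```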

CodingCat commented 8 years ago

I see... I just merged a pending PR for setting the missing value in jvm-packages. Now you can use https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/DMatrix.java#L128 to build a DMatrix.
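
For example, assuming the linked overload takes the missing-value indicator as a trailing float argument, the single-instance matrix from the earlier snippet could be built roughly like this, so zero-valued features are no longer treated as missing:

```java
// Sketch only: tell XGBoost that NaN (rather than 0.0f) marks a missing cell,
// so genuine zeros in the dense array are kept as real feature values.
DMatrix testInstance = new DMatrix(testingInstance, 1, numberOfFeatures, Float.NaN);
```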

mhnamaki commented 8 years ago

Thank you very much.