dmlc / treelite

Universal model exchange and serialization format for decision tree forests
https://treelite.readthedocs.io/en/latest/
Apache License 2.0

[core] Errors when predicting online from the XGBoost model with the Java API #163

Closed sunnyDX closed 4 years ago

sunnyDX commented 4 years ago

Here is the GDB information: gdb /opt/taobao/java/bin/java --core=core-616-java-3489-1584598903

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/taobao/java/bin/java -server -Xms2g -Xmx2g -Xmn1g -XX:MetaspaceSize=256m -'.
Program terminated with signal 6, Aborted.
#0  0x00007f9f7cf27277 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install ali-jdk-8.4.8-1574344.alios7.x86_64
(gdb) bt
#0  0x00007f9f7cf27277 in raise () from /lib64/libc.so.6
#1  0x00007f9f7cf28968 in abort () from /lib64/libc.so.6
#2  0x00007f9f7c81ffd5 in os::abort(bool) () from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#3  0x00007f9f7c9ddce3 in VMError::report_and_die() () from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#4  0x00007f9f7c825c32 in JVM_handle_linux_signal () from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#5  0x00007f9f7c81bd13 in signalHandler(int, siginfo*, void*) () from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007f9e0cbba448 in unsigned long (anonymous namespace)::PredictBatch_(treelite::CSRBatch const*, bool, unsigned long, unsigned long, void*, unsigned long, unsigned long, unsigned long, float*) [clone .isra.217] ()
   from /tmp/libtreelite4j9026278753296637105.so
#8  0x00007f9e0cbc6f71 in unsigned long treelite::Predictor::PredictBatchBase_(treelite::CSRBatch const*, int, bool, float*) () from /tmp/libtreelite4j9026278753296637105.so
#9  0x00007f9e0cbcbb75 in TreelitePredictorPredictBatch () from /tmp/libtreelite4j9026278753296637105.so
#10 0x00007f9e0cbb4caa in Java_ml_dmlc_treelite4j_TreeliteJNI_TreelitePredictorPredictBatch ()
   from /tmp/libtreelite4j9026278753296637105.so
#11 0x00007f9f6733f79f in ?? ()
#12 0x0000000000000000 in ?? ()

hcho3 commented 4 years ago

Can you post an example script?

sunnyDX commented 4 years ago

> Can you post an example script?

Strangely, the core dump only happens occasionally. Here is the GDB information from the core dump:

(gdb) where
#0  0x00007f0d8f248277 in raise () from /lib64/libc.so.6
#1  0x00007f0d8f249968 in abort () from /lib64/libc.so.6
#2  0x00007f0d8eb40fd5 in os::abort(bool) () from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#3  0x00007f0d8ecfece3 in VMError::report_and_die() () from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#4  0x00007f0d8eb46c32 in JVM_handle_linux_signal () from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#5  0x00007f0d8eb3cd13 in signalHandler(int, siginfo*, void*) ()
   from /opt/taobao/install/ajdk-8.4.8-b211/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007f0c187a5bd6 in (anonymous namespace)::PredLoop<(anonymous namespace)::PredictBatch_(const BatchType*, bool, size_t, size_t, treelite::Predictor::PredFuncHandle, size_t, size_t, size_t, float*) [with BatchType = treelite::CSRBatch; size_t = long unsigned int; treelite::Predictor::PredFuncHandle = void*]::__lambda2>(const treelite::CSRBatch*, size_t, size_t, size_t, float *, (anonymous namespace)::__lambda2) (batch=0x7f0b78027620, num_feature=30001, rbegin=0, rend=24, out_pred=0x7f0b78027650, func=...)
    at /home/admin/treelite/runtime/native/src/predictor.cc:118
#8  0x00007f0c187a4893 in (anonymous namespace)::PredictBatch_ (batch=0x7f0b78027620, pred_margin=false,
    num_feature=30001, num_output_group=1, pred_func_handle=0x7f0b4c93e730 <predict>, rbegin=0, rend=24,
    expected_query_result_size=24, out_pred=0x7f0b78027650) at /home/admin/treelite/runtime/native/src/predictor.cc:197
#9  0x00007f0c187ace47 in treelite::Predictor::PredictBatchBase_ (this=0x7f0ca4023a90, batch=0x7f0b78027620,
    verbose=1, pred_margin=false, out_result=0x7f0b78027650) at /home/admin/treelite/runtime/native/src/predictor.cc:459
#10 0x00007f0c187a419e in treelite::Predictor::PredictBatch (this=0x7f0ca4023a90, batch=0x7f0b78027620, verbose=1,
    pred_margin=false, out_result=0x7f0b78027650) at /home/admin/treelite/runtime/native/src/predictor.cc:495
#11 0x00007f0c187a1bab in TreelitePredictorPredictBatch (handle=0x7f0ca4023a90, batch=0x7f0b78027620, batch_sparse=1, verbose=1,
    pred_margin=0, out_result=0x7f0b78027650, out_result_size=0x7f0b4c582260)
    at /home/admin/treelite/runtime/native/src/c_api/c_api_runtime.cc:111
#12 0x00007f0c1879ef42 in Java_ml_dmlc_treelite4j_java_TreeliteJNI_TreelitePredictorPredictBatch (jenv=0x7f0b6c002210,
    jcls=0x7f0b4c5822d8, jhandle=139692267944592, jbatch=139687234795040, jbatch_sparse=1 '\001', jverbose=1 '\001',
    jpred_margin=0 '\000', jout_result=0x7f0b4c5822d0, jout_result_size=0x7f0b4c582300)
    at /home/admin/treelite/runtime/java/treelite4j/src/native/treelite4j.cpp:180
#13 0x00007f0d7b330cdf in ?? ()
#14 0x0000000000000000 in ?? ()

The crash at predictor.cc:118 corresponds to this code in my copy of the runtime:

 96 template <typename PredFunc>
 97 inline size_t PredLoop(const treelite::CSRBatch* batch, size_t num_feature,
 98                        size_t rbegin, size_t rend,
 99                        float* out_pred, PredFunc func) {
100   CHECK_LE(batch->num_col, num_feature);
101   std::vector<TreelitePredictorEntry> inst(
102     std::max(batch->num_col, num_feature), {-1});
103   CHECK(rbegin < rend && rend <= batch->num_row);
104   CHECK(sizeof(size_t) < sizeof(int64_t)
105     || (rbegin <= static_cast<size_t>(std::numeric_limits<int64_t>::max())
106       && rend <= static_cast<size_t>(std::numeric_limits<int64_t>::max())));
107   const int64_t rbegin_ = static_cast<int64_t>(rbegin);
108   const int64_t rend_ = static_cast<int64_t>(rend);
109   const size_t num_col = batch->num_col;
110   const float* data = batch->data;
111   const uint32_t* col_ind = batch->col_ind;
112   const size_t* row_ptr = batch->row_ptr;
113   size_t total_output_size = 0;
114   for (int64_t rid = rbegin_; rid < rend_; ++rid) {
115     const size_t ibegin = row_ptr[rid];
116     const size_t iend = row_ptr[rid + 1];
117     for (size_t i = ibegin; i < iend; ++i) {
118       inst[col_ind[i]].fvalue = data[i];
119     }
120     total_output_size += func(rid, &inst[0], out_pred);
121     for (size_t i = ibegin; i < iend; ++i) {
122       inst[col_ind[i]].missing = -1;
123     }
124   }
125   return total_output_size;
126 }
127
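For context, PredLoop walks a standard CSR (compressed sparse row) layout: row_ptr marks where each row's entries start and end, and col_ind/data hold the feature indices and values. Line 118 writes inst[col_ind[i]].fvalue = data[i], which indexes far outside inst whenever col_ind[i] is not a valid feature index. Below is a minimal Java sketch of how per-row sparse features map onto those three arrays; the class names (SparseRow, CsrArrays) are hypothetical, and this is not treelite's BatchBuilder implementation, only an illustration of the layout the native loop consumes.

import java.util.List;

class CsrSketch {

    // One sparse row: parallel arrays of feature indices and values
    // (the same shape of data a DataPoint carries).
    static final class SparseRow {
        final int[] indices;
        final float[] values;
        SparseRow(int[] indices, float[] values) {
            this.indices = indices;
            this.values = values;
        }
    }

    // The three CSR arrays that the native PredLoop walks.
    static final class CsrArrays {
        long[] rowPtr;  // rowPtr[r] .. rowPtr[r + 1] delimit row r's entries
        int[] colInd;   // feature index of each entry
        float[] data;   // feature value of each entry
    }

    static CsrArrays build(List<SparseRow> rows) {
        int nnz = 0;
        for (SparseRow row : rows) {
            nnz += row.indices.length;
        }
        CsrArrays csr = new CsrArrays();
        csr.rowPtr = new long[rows.size() + 1];
        csr.colInd = new int[nnz];
        csr.data = new float[nnz];
        int pos = 0;
        for (int r = 0; r < rows.size(); ++r) {
            SparseRow row = rows.get(r);
            System.arraycopy(row.indices, 0, csr.colInd, pos, row.indices.length);
            System.arraycopy(row.values, 0, csr.data, pos, row.values.length);
            pos += row.indices.length;
            csr.rowPtr[r + 1] = pos;
        }
        return csr;
    }
}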

hcho3 commented 4 years ago

I have no idea why this is happening. Can you upload your Java application here so that I can try running it?

sunnyDX commented 4 years ago

> I have no idea why this is happening. Can you upload your Java application here so that I can try running it?

This core-dump problem has been bothering me for a long time. It only occurs when handling lots of requests at the same time, so I'm afraid you may not be able to reproduce it. Here is the core code:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import com.gaode.idccmnpredict.pb.FeatureOuterClass;
import com.gaode.idccmnpredict.predict.dto.PredictionResult;
import com.google.common.primitives.Floats;
import com.google.common.primitives.Ints;
import lombok.extern.slf4j.Slf4j;
import ml.dmlc.treelite4j.DataPoint;
import ml.dmlc.treelite4j.java.BatchBuilder;
import ml.dmlc.treelite4j.java.Predictor;
import ml.dmlc.treelite4j.java.SparseBatch;
import org.apache.commons.lang.ArrayUtils;

@Slf4j
public class XgbModelHandler1 {

private Predictor predictor;

protected void loadFromPath(String modelPath) throws Exception {
    predictor = new Predictor(modelPath, 1, true);
}

protected void scoreOptimusObject(FeatureCacheItem userFeature, Map<String, Object> adOfflineFeature,Map<String, FeatureCacheItem> adRealTimeFeature, PredictionResult out) {
    int itemNums = adOfflineFeature.size();
    /*
     * record invalid sample id: the sample without any feature
     */
    Set<String> invalidIdx = new HashSet<>();
    /*
     * generate the datapoint instance
     */
    List<DataPoint> dataPointList =  getDataPoint(invalidIdx,userFeature,adOfflineFeature,adRealTimeFeature);
    /*
     * predict
     */
    float[][] results = new float[itemNums][1];
    if (dataPointList.size() != 0) {
        try {
            SparseBatch batch = BatchBuilder.CreateSparseBatch(dataPointList.iterator());
            results = predictor.predict(batch, true, false);
        } catch (Exception e) {
            log.error("[BatchBuilder] SparseBatch Create error", e);
        }
    }

}

/**
 * generate the datapoint instance
 * @param invalidIdx
 * @param userFeature user feature
 * @param adOfflineFeature ad offline feature
 * @param adRealTimeFeature ad realtime feature
 * @return
 */
protected List<DataPoint> getDataPoint(Set<String> invalidIdx,FeatureCacheItem userFeature, Map<String, Object> adOfflineFeature,Map<String, FeatureCacheItem> adRealTimeFeature){
    List<DataPoint> dataPointList = new ArrayList<DataPoint>();
    for(Entry<String, Object> entry : adOfflineFeature.entrySet()) {
        String item = entry.getKey();
        FeatureCacheItem itemFeatures = new FeatureCacheItem(entry.getValue().toString());
        if(adRealTimeFeature.keySet().contains(item))
        {
            itemFeatures.merge(adRealTimeFeature.get(item));
        }
        int[] indices = ArrayUtils.addAll(userFeature.getKeys(), itemFeatures.getKeys());
        float[] values = ArrayUtils.addAll(userFeature.getVals(), itemFeatures.getVals());
        if(indices.length > 0  && indices.length == values.length){
            dataPointList.add(new DataPoint(indices, values));
        } else {
            invalidIdx.add(item);
        }
    }
    return dataPointList;
}

}

class FeatureCacheItem {

    private int[] keys;

public int[] getKeys() {
    return keys;
}

public void setKeys(int[] keys) {
    this.keys = keys;
}

private float[] vals;
public float[] getVals() {
    return vals;
}

public void setVals(float[] vals) {
    this.vals = vals;
}

public FeatureCacheItem(int[] keys,float[] vals){
    this.keys = keys;
    this.vals = vals;
}

public FeatureCacheItem(String val) {
    try {
        FeatureOuterClass.Feature feature = FeatureOuterClass.Feature.parseFrom(
            val.getBytes(StandardCharsets.ISO_8859_1.name()));
        keys = Ints.toArray(feature.getKeyList());
        vals = Floats.toArray(feature.getValList());
    } catch (Exception e) {
        keys = new int[0];
        vals = new float[0];
    }
}
public void merge(FeatureCacheItem featureCacheItem){
    this.keys = ArrayUtils.addAll(this.keys,featureCacheItem.getKeys());
    this.vals = ArrayUtils.addAll(this.vals,featureCacheItem.getVals());
}

}

hcho3 commented 4 years ago

I wonder if some SparseBatch objects are being garbage-collected away.

sunnyDX commented 4 years ago

> I wonder if some SparseBatch objects are being garbage-collected away.

This is also something I have been considering. Maybe it's a JNI problem?

newwayy commented 4 years ago

> I wonder if some SparseBatch objects are being garbage-collected away.

Yes, that's it. The C++ side is still using the CSRBatch object while, occasionally, the corresponding SparseBatch has already been garbage-collected by the JVM.

At first I wondered whether the problem was in this code:
https://github.com/dmlc/treelite/blob/mainline/runtime/java/treelite4j/src/main/java/ml/dmlc/treelite4j/java/SparseBatch.java#L52
protected void finalize() throws Throwable { super.finalize(); dispose(); }

I changed the order to dispose(); super.finalize();, but the problem is still there.

Thanks.

hcho3 commented 4 years ago

The issue is that SparseBatch objects are currently accessed in zero-copy fashion, and some SparseBatch objects are being garbage-collected away, resulting in dangling references. In the upcoming code refactoring, I am creating a separate data matrix class, and it will require one data copy. Making a copy may hurt performance but will eliminate dangling references.
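Until that refactor lands, one possible caller-side mitigation (only a sketch, under the assumption that premature finalization of the SparseBatch is what frees the native buffers mid-call) is to keep the batch strongly reachable until predict() has returned. The predict call below reuses the signature from the snippet above; Reference.reachabilityFence requires Java 9+, and on Java 8 a synchronized block on the batch is a commonly used substitute.

import java.lang.ref.Reference;

import ml.dmlc.treelite4j.java.Predictor;
import ml.dmlc.treelite4j.java.SparseBatch;

public class KeepAliveSketch {

    // Does not fix the zero-copy design; it only tries to keep `batch` from
    // becoming unreachable (and hence finalizable) while the native call runs.
    public static float[][] predictKeepingBatchAlive(Predictor predictor, SparseBatch batch)
            throws Exception {
        try {
            return predictor.predict(batch, true, false);
        } finally {
            // Java 9+: guarantees `batch` remains reachable up to this point.
            Reference.reachabilityFence(batch);
            // Java 8 alternative (no reachabilityFence available):
            // synchronized (batch) { /* keep-alive */ }
        }
    }
}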

hcho3 commented 4 years ago

In https://github.com/dmlc/treelite/pull/196, I created a separate data matrix class (DMatrix) that manages its own memory. Each time a matrix is constructed, the arrays passed in as parameters are copied into it. This way, garbage collection will not lead to dangling references.
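To illustrate the general idea only (this is a sketch of the copy-on-construct pattern, not the actual DMatrix code from the pull request; the class name OwnedCsrMatrix is hypothetical): the constructor copies the caller's arrays into buffers the object itself owns, so collecting or moving the original arrays can no longer leave the native side with a dangling reference.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;
import java.nio.LongBuffer;

public class OwnedCsrMatrix {

    private final FloatBuffer data;   // copied values
    private final IntBuffer colInd;   // copied feature indices
    private final LongBuffer rowPtr;  // copied row offsets

    public OwnedCsrMatrix(float[] data, int[] colInd, long[] rowPtr) {
        // Allocate direct (off-heap) buffers owned by this object and copy the
        // caller's arrays into them; the caller's arrays may then be
        // garbage-collected without affecting this matrix.
        this.data = ByteBuffer.allocateDirect(data.length * Float.BYTES)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();
        this.data.put(data);
        this.colInd = ByteBuffer.allocateDirect(colInd.length * Integer.BYTES)
                .order(ByteOrder.nativeOrder()).asIntBuffer();
        this.colInd.put(colInd);
        this.rowPtr = ByteBuffer.allocateDirect(rowPtr.length * Long.BYTES)
                .order(ByteOrder.nativeOrder()).asLongBuffer();
        this.rowPtr.put(rowPtr);
    }
}

Whether the copy lives on the Java heap or off-heap is an implementation detail; the important property is that the predictor never holds a pointer into memory whose lifetime is tied to the caller's GC-managed objects.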

hcho3 commented 4 years ago

The recent refactor (#196, #198, #199, #201, #203) created a dedicated data matrix class (DMatrix) that manages its own memory. As a result, garbage collection will no longer result in dangling references.

wangqiaoshi commented 2 years ago

> protected void finalize() throws Throwable { super.finalize(); dispose(); } I changed the sequence, but the problem is still there.

@sunnyDX Hello, could you add me on DingTalk? I have run into the same problem.