dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoost model inference issue post upgrading to XGBoost 2.0.1 #9950

Closed Belphegor21 closed 1 week ago

Belphegor21 commented 7 months ago

After upgrading to DJL library 0.25.0, which uses XGBoost 2.0.1 (https://github.com/deepjavalibrary/djl/blob/v0.25.0/gradle.properties#L25), I'm seeing an increased failure rate when calling the model.

Sample stack trace

...
Caused by: ai.djl.engine.EngineException: XGBoost Engine error: 
    at ml.dmlc.xgboost4j.java.JniUtils.checkCall(JniUtils.java:36)
    at ml.dmlc.xgboost4j.java.JniUtils.inference(JniUtils.java:86)
    at ai.djl.ml.xgboost.XgbSymbolBlock.forwardInternal(XgbSymbolBlock.java:68)
    at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:79)
    at ai.djl.nn.Block.forward(Block.java:127)
    at ai.djl.inference.Predictor.predictInternal(Predictor.java:144)
    at ai.djl.inference.Predictor.batchPredict(Predictor.java:171)
    ... 110 more
Caused by: ml.dmlc.xgboost4j.java.XGBoostError: vector::_M_fill_insert
    at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
    at ml.dmlc.xgboost4j.java.JniUtils.checkCall(JniUtils.java:34)
    ... 116 more

Prior to the upgrade there were no issues, but now I see a 0.005% failure rate. The failure rate is low, but earlier it was 0%. So what caused it?

trivialfis commented 7 months ago

@wbo4958 Have you seen this before?

wbo4958 commented 7 months ago

I haven't seen this issue. It seems DJL has reworked XGBoost, according to

Caused by: ml.dmlc.xgboost4j.java.XGBoostError: vector::_M_fill_insert
    at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
    at ml.dmlc.xgboost4j.java.JniUtils.checkCall(JniUtils.java:34)

since XGBoost doesn't have JniUtils

wbo4958 commented 7 months ago

@Belphegor21 could you provide minimal code to reproduce it?

Belphegor21 commented 7 months ago

As mentioned previously, this happens only 0.005% of the time. Most of the time the code works fine. I'm unable to reproduce this myself, but since it is used in a production system with constant traffic, I see errors popping up. My concern is that prior to the DJL library upgrade this had a 0.000% failure rate, which increased to 0.005% post upgrade.

trivialfis commented 7 months ago

Is it the same error message _M_fill_insert every time it happens?

Belphegor21 commented 7 months ago

Yes.

trivialfis commented 7 months ago

Is it possible that the inference machine is running out of memory?

Belphegor21 commented 7 months ago

The mem usage graph never goes above 67%.

trivialfis commented 7 months ago

Does the inference server rely on xgboost prediction being thread-safe?

Belphegor21 commented 7 months ago

There is some level of thread safety involved, since each machine has 8 server threads, so 8 production requests are executed concurrently on the same machine. That being said, the model inference using XGBoost runs on the main server thread. We don't do any multi-threading when calling the model. Does that answer your question?
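For illustration only: if the predictor object turned out not to be safe to share across the 8 server threads (an assumption; nothing in this thread confirms it), one common way to rule out sharing is to give each thread its own instance. The sketch below uses a plain `Function<float[], float[]>` as a hypothetical stand-in for the real predictor, and `PerThreadPredictor` is an illustrative name, not a DJL or XGBoost class.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical wrapper: one predictor instance per calling thread. */
final class PerThreadPredictor {
    // Each thread lazily creates its own predictor on first use and reuses it afterwards.
    private final ThreadLocal<Function<float[], float[]>> local;

    PerThreadPredictor(Supplier<Function<float[], float[]>> factory) {
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Runs inference on the predictor that belongs to the current thread. */
    float[] predict(float[] features) {
        return local.get().apply(features);
    }
}
```

In a real service the factory would build whatever predictor object the inference library provides; the point of the sketch is only that no two threads ever touch the same instance.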

trivialfis commented 7 months ago

Thank you for sharing. I don't have any explanation off the top of my head. This error happens when there is corruption in the stack or when one tries to allocate an unrealistically large amount of data. https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a02085.html

So far we haven't seen anything related in our tests, with and without sanitizers. Would be great if you could record the data that causes the error when it happens.

Belphegor21 commented 7 months ago

Yup, that was my thought as well. I'll start logging the input that caused the failure. Maybe that will show something.
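A minimal sketch of what such logging could look like, assuming the failure surfaces as a `RuntimeException` from the model call. The `Function<float[], float[]>` parameter is a hypothetical stand-in for the real inference call, and `LoggingPredictor` is an illustrative name rather than part of DJL or XGBoost.

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LoggingPredictor {
    private static final Logger LOG = Logger.getLogger(LoggingPredictor.class.getName());

    /** Runs the (hypothetical) model call and logs the exact input if it throws. */
    static float[] predictAndLogFailure(Function<float[], float[]> model, float[] features) {
        try {
            return model.apply(features);
        } catch (RuntimeException e) {
            // Record what is needed to replay the request offline:
            // the calling thread, the input length, and the raw feature values.
            LOG.log(Level.SEVERE, "Inference failed; thread=" + Thread.currentThread().getName()
                    + " length=" + features.length
                    + " features=" + Arrays.toString(features), e);
            throw e; // Rethrow so existing error handling is unchanged.
        }
    }
}
```

Logging the thread name alongside the features helps tell apart a data-dependent failure from one that depends on what else was running at the time.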

trivialfis commented 7 months ago

Feel free to let me know once there's an easier way to reproduce it. You can contact me in private in case the data is proprietary.

Belphegor21 commented 7 months ago

Not able to reproduce it. I've logged the input that caused the issue, but retrying with that same input works. It seems to be a transient error. I've also noticed a SIGSEGV fault caused by DJL. This was causing service availability issues, since my containers would shut down with this fault. I know that this is caused by DJL because, after I pushed a revert of the upgrade to production, those errors stopped and I haven't had an availability issue since.

With DJL 0.25: [image: Screenshot 2024-01-29 at 11 26 32 AM]

With DJL 0.23 (Revert): [image: Screenshot 2024-01-29 at 11 25 10 AM]

The SIGSEGV looks like

13 Jan 2024 22:28:29,985 [DEBUG] d92a7fa4-4deb-4471-87f6-b6462710daa3 82a4fa21-065e-4947-97e7-902208c65bf3 (MainExecutorService-17) com.amazonaws.request: Received successful response: 200, AWS Request ID: 4E29OFAT5PFI5E3ROQEPPKS27FVV4KQNSO5AEMVJF66Q9ASUAAJG
13 Jan 2024 22:28:29,985 [DEBUG] d92a7fa4-4deb-4471-87f6-b6462710daa3 82a4fa21-065e-4947-97e7-902208c65bf3 (MainExecutorService-17) com.amazonaws.requestId: x-amzn-RequestId: 4E29OFAT5PFI5E3ROQEPPKS27FVV4KQNSO5AEMVJF66Q9ASUAAJG
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2d4952b35d, pid=1, tid=1267
#
# JRE version: OpenJDK Runtime Environment 1.0.1516.0 (17.0.9+13) (build 17.0.9+13-LTS)
# Java VM: OpenJDK 64-Bit Server VM 1.0.1516.0 (17.0.9+13-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:

There isn't anything after that in the logs.
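Since retrying the logged input on its own succeeds, one way to probe for a concurrency-dependent failure is to replay that input from several threads at once. The following is a sketch under stated assumptions: `ConcurrentReplay` is a hypothetical helper, and the `Function<float[], float[]>` argument stands in for the real model call. At a failure rate of 0.005% (about 1 in 20,000 calls), the total number of calls would need to be well above 20,000 to expect a hit, and only if the production rate carries over to the replay, which is not guaranteed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class ConcurrentReplay {
    /** Replays one input from `threads` threads, `iterations` times each; returns the failure count. */
    static int replay(Function<float[], float[]> model, float[] features, int threads, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Integer>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(CompletableFuture.supplyAsync(() -> {
                    int failures = 0;
                    for (int i = 0; i < iterations; i++) {
                        try {
                            // Clone so that every call sees its own copy of the input.
                            model.apply(features.clone());
                        } catch (RuntimeException e) {
                            failures++;
                        }
                    }
                    return failures;
                }, pool));
            }
            int total = 0;
            for (CompletableFuture<Integer> f : futures) {
                total += f.join(); // Wait for each worker and add up its failures.
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ConcurrentReplay.replay(model, loggedInput, 8, 10_000)` would mirror the 8 server threads mentioned earlier. Note that a native crash such as the SIGSEGV above terminates the JVM and cannot be counted this way.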

trivialfis commented 7 months ago

Thank you for sharing, it seems an issue in the DJL repo would be more helpful?

Belphegor21 commented 7 months ago

Good point, I created one there: https://github.com/deepjavalibrary/djl/issues/2969

Belphegor21 commented 7 months ago

Is there an incompatibility between the model version and the XGBoost version? We were using a model based on XGBoost 1.7.6, and DJL 0.25.x uses XGBoost 2.0.1 (if memory serves).

trivialfis commented 7 months ago

No issue; newer XGBoost versions can load old models.

trivialfis commented 1 week ago

Feel free to reopen if this is caused by XGBoost.