Closed Belphegor21 closed 1 week ago
@wbo4958 Have you seen this before?
I haven't seen this issue. It seems DJL has reworked the xgboost4j bindings, judging by the stack trace:
Caused by: ml.dmlc.xgboost4j.java.XGBoostError: vector::_M_fill_insert
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.JniUtils.checkCall(JniUtils.java:34)
since upstream XGBoost doesn't have a JniUtils class.
@Belphegor21 could you provide a minimal piece of code to reproduce it?
As mentioned previously, this happens only 0.005% of the time. Most of the time the code works fine. I'm unable to reproduce this myself, but since it is used in a production system with constant traffic, I see errors popping up. My concern is that prior to the DJL library upgrade this had a 0.000% failure rate, which increased to 0.005% post upgrade.
Is it the same error message (_M_fill_insert) every time it happens?
Yes.
Is it possible that the inference machine is running out of memory?
The mem usage graph never goes above 67%.
Does the inference server rely on xgboost prediction being thread-safe?
There is some level of thread safety involved, since each machine has 8 server threads, so 8 production requests are being executed concurrently on the same machine. That said, the model inference using XGBoost runs on the main server thread; we don't do any multi-threading while calling the model. Does that answer your question?
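One way to rule concurrency in or out as the cause: temporarily serialize every prediction behind a single lock. The sketch below is hypothetical and not DJL's API; the `SerializedPredictor` name and the `Function` delegate are illustrative stand-ins for however the service actually invokes the model. If the sporadic _M_fill_insert errors vanish with this in place, the native predict path is likely being entered concurrently somewhere.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Hypothetical diagnostic wrapper (not a DJL class): serialize all
// predict calls behind one lock so at most one thread is ever inside
// the native XGBoost code at a time.
public class SerializedPredictor<I, O> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Function<I, O> delegate;

    public SerializedPredictor(Function<I, O> delegate) {
        this.delegate = delegate;
    }

    public O predict(I input) {
        lock.lock();
        try {
            return delegate.apply(input);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real model call.
        SerializedPredictor<float[], Float> p =
                new SerializedPredictor<>(f -> f[0] * 2);
        System.out.println(p.predict(new float[] {21f})); // prints 42.0
    }
}
```

This costs throughput, so it is a diagnostic, not a fix; but it cheaply distinguishes a thread-safety bug from a data-dependent one.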
Thank you for sharing; I don't have any explanation off the top of my head. This error happens when there is stack corruption or when one tries to allocate unrealistically large data. https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a02085.html
So far we haven't seen anything related in our tests, with and without sanitizers. Would be great if you could record the data that causes the error when it happens.
Yup that was my thought as well. I'll start logging the input which caused the failure. Maybe that could show something.
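A minimal sketch of that logging idea, assuming the model is invoked through some function taking a feature vector (the `predictOrLog` helper and logger name are hypothetical, not part of DJL or xgboost4j): catch the failure at the call site and dump the exact input so the failing request can be replayed offline.

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.logging.Logger;

// Hedged sketch: wrap the (hypothetical) predict call and log the exact
// feature vector whenever the native error surfaces, then rethrow.
public class LoggingPredict {
    private static final Logger LOG = Logger.getLogger("inference");

    static float predictOrLog(Function<float[], Float> model, float[] features) {
        try {
            return model.apply(features);
        } catch (RuntimeException e) {
            // Record the input verbatim so it can be replayed later.
            LOG.severe("predict failed for input " + Arrays.toString(features)
                    + ": " + e.getMessage());
            throw e;
        }
    }

    public static void main(String[] args) {
        // Stand-in model for demonstration.
        float result = predictOrLog(f -> f[0] + f[1], new float[] {1f, 2f});
        System.out.println(result); // prints 3.0
    }
}
```

For a truly transient native fault, the replayed input may still succeed, as reported below; in that case capturing process-level state (heap usage, thread count) alongside the input may be more telling.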
Feel free to let me know once there's an easier way to reproduce it, you can contact me in private in case of the data is proprietary.
Not able to reproduce it. I've logged the input which caused the issue, but retrying with that same input works. It seems to be a transient error. Also, I've noticed SIGSEGV faults caused by DJL. This was causing service availability issues, since my containers would shut down with this fault. I know this was caused by DJL because after I pushed a revert of the upgrade to production, those errors stopped, and I haven't had an availability issue since.
With DJL 0.25:
With DJL 0.23 (Revert):
The SIGSEGV looks like this:
13 Jan 2024 22:28:29,985 [DEBUG] d92a7fa4-4deb-4471-87f6-b6462710daa3 82a4fa21-065e-4947-97e7-902208c65bf3 (MainExecutorService-17) com.amazonaws.request: Received successful response: 200, AWS Request ID: 4E29OFAT5PFI5E3ROQEPPKS27FVV4KQNSO5AEMVJF66Q9ASUAAJG
13 Jan 2024 22:28:29,985 [DEBUG] d92a7fa4-4deb-4471-87f6-b6462710daa3 82a4fa21-065e-4947-97e7-902208c65bf3 (MainExecutorService-17) com.amazonaws.requestId: x-amzn-RequestId: 4E29OFAT5PFI5E3ROQEPPKS27FVV4KQNSO5AEMVJF66Q9ASUAAJG
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f2d4952b35d, pid=1, tid=1267
#
# JRE version: OpenJDK Runtime Environment 1.0.1516.0 (17.0.9+13) (build 17.0.9+13-LTS)
# Java VM: OpenJDK 64-Bit Server VM 1.0.1516.0 (17.0.9+13-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
There isn't anything after that in the logs.
Thank you for sharing; it seems an issue in the DJL repo would be more helpful.
Good point, created one there: https://github.com/deepjavalibrary/djl/issues/2969
Is there an incompatibility between the model version and the XGBoost version? We were using a model trained with XGBoost 1.7.6, and DJL 0.25.x uses XGBoost 2.0.1 (if memory serves).
No issue, new XGB can load old models.
Feel free to reopen if this is caused by XGBoost.
After upgrading to DJL library 0.25.0, which uses XGBoost 2.0.1 (https://github.com/deepjavalibrary/djl/blob/v0.25.0/gradle.properties#L25), I'm seeing increased failures while calling the model.
Sample stack trace:
Prior to the upgrade there were no issues, but now I see a 0.005% failure rate. The failure rate is low, but it was 0% before the upgrade. What could have caused this?