elastic / ml-cpp

Machine learning C++ code
Other
149 stars 62 forks source link

[ML] DFA job failed with the error: failed to load model [...] definition #2626

Closed wwang500 closed 6 months ago

wwang500 commented 7 months ago

Version

Latest 8.13.0-SNAPSHOT

Step to reproduce

Observed:

Job will fail with the message: Updated analytics task state to [failed] with reason [[dfa_classification_special_char_2019_1707902226_000_0] failed running inference on model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557]; cause was [failed to load model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557] definition]] .

in es log:

[2024-02-14T09:17:09,433][INFO ][o.e.x.m.d.s.ReindexingStep] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Started reindexing
[2024-02-14T09:17:13,020][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Started loading data
[2024-02-14T09:17:13,573][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Started analyzing
[2024-02-14T09:17:13,573][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Waiting for result processor to complete
[2024-02-14T09:17:14,446][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001] creating index, cause [initialize_data_stream], templates [.deprecation-indexing-template], shards [1]/[1]
[2024-02-14T09:17:14,446][INFO ][o.e.c.m.MetadataCreateDataStreamService] [node-0] adding data stream [.logs-deprecation.elasticsearch-default] with write index [.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001], backing indices [], and aliases []
[2024-02-14T09:17:14,447][INFO ][o.e.c.r.a.AllocationService] [node-0] updating number_of_replicas to [0] for indices [.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001]
[2024-02-14T09:17:14,549][INFO ][o.e.x.i.IndexLifecycleTransition] [node-0] moving index [.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [.deprecation-indexing-ilm-policy]
[2024-02-14T09:17:14,700][INFO ][o.e.c.r.a.AllocationService] [node-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001][0]]"
[2024-02-14T09:17:14,754][INFO ][o.e.x.i.IndexLifecycleTransition] [node-0] moving index [.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001] from [{"phase":"new","action":"complete","name":"complete"}] to [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}] in policy [.deprecation-indexing-ilm-policy]
[2024-02-14T09:17:14,794][INFO ][o.e.c.m.MetadataMappingService] [node-0] [.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001/S8td_YqLTQmyhNXoODPRhA] update_mapping [_doc]
[2024-02-14T09:17:14,846][INFO ][o.e.x.i.IndexLifecycleTransition] [node-0] moving index [.ds-.logs-deprecation.elasticsearch-default-2024.02.14-000001] from [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}] to [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}] in policy [.deprecation-indexing-ilm-policy]
[2024-02-14T09:18:04,570][INFO ][o.e.x.s.s.SecurityIndexManager] [node-0] security index does not exist, creating [.security-profile-8] with alias [.security-profile]
[2024-02-14T09:18:04,579][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.security-profile-8] creating index, cause [api], templates [], shards [1]/[0]
[2024-02-14T09:18:04,754][INFO ][o.e.c.r.a.AllocationService] [node-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.security-profile-8][0]]])." previous.health="YELLOW" reason="shards started [[.security-profile-8][0]]"
[2024-02-14T09:19:02,121][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] [data_frame_analyzer/83119] [CBoostedTreeImpl.cc@371] Stopping fine-tune hyperparameters on round 4 out of 18
[2024-02-14T09:19:02,679][INFO ][o.e.x.m.d.p.ChunkedTrainedModelPersister] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] finished storing trained model with id [dfa_classification_special_char_2019_1707902226_000_0-1707920342557]
[2024-02-14T09:19:02,726][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Started writing results
[2024-02-14T09:19:03,569][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Result processor has completed
[2024-02-14T09:19:03,572][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Closing process
[2024-02-14T09:19:03,572][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] [data_frame_analyzer/83119] [Main.cc@261] [[{"name":"E_DFTPMEstimatedPeakMemoryUsage","description":"The upfront estimate of the peak memory training the predictive model would use","value":88039355}
,{"name":"E_DFTPMPeakMemoryUsage","description":"The peak memory training the predictive model used","value":36074039}
,{"name":"E_DFTPMTimeToTrain","description":"The time it took to train the predictive model","value":108740}
,{"name":"E_DFTPMTrainedForestNumberTrees","description":"The total number of trees in the trained forest","value":4}
,]]
[2024-02-14T09:19:03,573][INFO ][o.e.x.m.p.AbstractNativeProcess] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] State output finished
[2024-02-14T09:19:03,573][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Closed process
[2024-02-14T09:19:03,648][INFO ][o.e.x.m.d.i.InferenceRunner] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Started inference on test data against model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557]
[2024-02-14T09:19:03,659][ERROR][o.e.x.m.d.i.InferenceRunner] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Error running inference on model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557]

org.elasticsearch.ElasticsearchStatusException: failed to load model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557] definition
        at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.lambda$loadWithoutCaching$15(ModelLoadingService.java:565) ~[?:?]
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.persistence.TrainedModelProvider.lambda$getTrainedModelForInference$15(TrainedModelProvider.java:582) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.doSearch(ChunkedTrainedModelRestorer.java:202) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.lambda$restoreModelDefinition$0(ChunkedTrainedModelRestorer.java:134) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]
[2024-02-14T09:19:03,662][ERROR][o.e.x.m.d.DataFrameAnalyticsTask] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Setting task to failed
org.elasticsearch.ElasticsearchStatusException: [dfa_classification_special_char_2019_1707902226_000_0] failed running inference on model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557]; cause was [failed to load model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557] definition]
        at org.elasticsearch.xpack.ml.dataframe.inference.InferenceRunner.run(InferenceRunner.java:126) ~[?:?]
        at org.elasticsearch.xpack.ml.dataframe.steps.InferenceStep.lambda$runInference$5(InferenceStep.java:99) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]
Caused by: org.elasticsearch.ElasticsearchStatusException: failed to load model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557] definition
        at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.lambda$loadWithoutCaching$15(ModelLoadingService.java:565) ~[?:?]
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.persistence.TrainedModelProvider.lambda$getTrainedModelForInference$15(TrainedModelProvider.java:582) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.doSearch(ChunkedTrainedModelRestorer.java:202) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.lambda$restoreModelDefinition$0(ChunkedTrainedModelRestorer.java:134) ~[?:?]
        ... 4 more
[2024-02-14T09:19:03,714][INFO ][o.e.x.m.d.DataFrameAnalyticsTask] [node-0] [dfa_classification_special_char_2019_1707902226_000_0] Updated analytics task state to [failed] with reason [[dfa_classification_special_char_2019_1707902226_000_0] failed running inference on model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557]; cause was [failed to load model [dfa_classification_special_char_2019_1707902226_000_0-1707920342557] definition]]
edsavage commented 7 months ago

While I haven't got to the root cause of this issue I have obtained what may be something of a clue.

Additional debug messages were printed from the CJsonOutputStreamWrapper class by way of adding an additional member

std::ostringstream m_DebugStream;

and writing to it in addition to the ostream in m_ConcurrentOutputStream, then logging the contents of m_DebugStream in CJsonOutputStreamWrapper::flush

This does produce a lot of noise but one thing of particular interest is the compressed_inference_model string. Copying the contents of that string to a file, say cim.txt and then decoding it with

base64 -d -i cim.txt -o - | gunzip -c

gives the raw JSON document. When attempting to validate this document using say jq, gives an error.

parse error: Invalid escape at line 1, column 359

Using an IDE, such as CLion to format the JSON, shows that there are unescaped special characters in the keys, e.g.

"contains\ and a /": 0.050394797956339991,

Here, the backslash needs escaping.

I think a solution may be to use boost::json::serialize on the keys, in a similar way to how it is done for string values.

droberts195 commented 7 months ago

I think a solution may be to use boost::json::serialize on the keys, in a similar way to how it is done for string values.

Yes, that sounds highly likely, especially as the failing test is called "special_characters_2019".

It looks like both these places need changing:

https://github.com/elastic/ml-cpp/blob/bd3238b46739abf9f4096cc4f9bde335873ad67a/include/core/CBoostJsonWriterBase.h#L190-L194

and:

https://github.com/elastic/ml-cpp/blob/bd3238b46739abf9f4096cc4f9bde335873ad67a/include/core/CBoostJsonWriterBase.h#L348-L360

They both need to be more like:

https://github.com/elastic/ml-cpp/blob/bd3238b46739abf9f4096cc4f9bde335873ad67a/include/core/CBoostJsonWriterBase.h#L337-L346