elastic / ml-cpp

Machine learning C++ code
Other
7 stars 62 forks source link

[ML] Fix source of multiple unit test failures on macos #2765

Closed edsavage closed 1 month ago

edsavage commented 1 month ago

On recent versions of macOS, e.g. Sonoma 14.7 with clang version on or about 16.0.0 we experience multiple unit test failures in the model and api libraries that stem from a failed dynamic_cast when running code optimized in any way - i.e. not Debug builds. An example test failure is

../../../cmake-build-relwithdebinfo/test/lib/model/unittest/ml_test_model --run_test=CEventRateAnomalyDetectorTest/testPersist
Running 1 test case...
2024-10-16 01:54:42,963125 UTC [91925] DEBUG /Users/eds/src/elasticsearch/ml-cpp/lib/test/CTestObserver.cc@28 +---------------------------------------------+
2024-10-16 01:54:42,963145 UTC [91925] DEBUG /Users/eds/src/elasticsearch/ml-cpp/lib/test/CTestObserver.cc@29 |  CEventRateAnomalyDetectorTest/testPersist  |
2024-10-16 01:54:42,963151 UTC [91925] DEBUG /Users/eds/src/elasticsearch/ml-cpp/lib/test/CTestObserver.cc@30 +---------------------------------------------+
2024-10-16 01:54:42,963300 UTC [91925] DEBUG /Users/eds/src/elasticsearch/ml-cpp/lib/model/CAnomalyDetector.cc@119 CAnomalyDetector(): count by status for '', first time = 1346713620, bucketLength = 3600, m_LastBucketEndTime = 1346716800
2024-10-16 01:54:42,979013 UTC [91925] ERROR /Users/eds/src/elasticsearch/ml-cpp/lib/maths/time_series/CTimeSeriesMultibucketFeatureSerialiser.cc@67 Unknown feature with type 'NSt3__110unique_ptrIN2ml5maths11time_series29CTimeSeriesMultibucketFeatureIdEENS_14default_deleteIS5_EEEE
2024-10-16 01:54:42,979255 UTC [91925] DEBUG /Users/eds/src/elasticsearch/ml-cpp/lib/model/CAnomalyDetector.cc@119 CAnomalyDetector(): count by status for '', first time = 0, bucketLength = 3600, m_LastBucketEndTime = 0
2024-10-16 01:54:42,979621 UTC [91925] ERROR /Users/eds/src/elasticsearch/ml-cpp/lib/maths/time_series/CTimeSeriesModel.cc@1354 Failed to restore MULTIBUCKET_FEATURE_6_3_TAG
2024-10-16 01:54:42,979635 UTC [91925] ERROR /Users/eds/src/elasticsearch/ml-cpp/lib/model/CIndividualModel.cc@383 Failed to restore FEATURE_MODELS_TAG
2024-10-16 01:54:42,979641 UTC [91925] ERROR /Users/eds/src/elasticsearch/ml-cpp/lib/model/CEventRateModel.cc@128 Failed to restore INDIVIDUAL_STATE_TAG
2024-10-16 01:54:42,979647 UTC [91925] ERROR /Users/eds/src/elasticsearch/ml-cpp/lib/model/CAnomalyDetector.cc@219 Failed to restore live models from
2024-10-16 01:54:42,979651 UTC [91925] ERROR /Users/eds/src/elasticsearch/ml-cpp/lib/model/CAnomalyDetector.cc@189 Invalid model ensemble section in
/Users/eds/src/elasticsearch/ml-cpp/lib/model/unittest/CEventRateAnomalyDetectorTest.cc:275: fatal error: in "CEventRateAnomalyDetectorTest/testPersist": critical check traverser.traverseSubLevel( std::bind(&ml::model::CAnomalyDetector::acceptRestoreTraverser, &restoredDetector, EMPTY_STRING, std::placeholders::_1)) has failed

From the above the first ERROR message is most relevant Unknown feature with type NSt3__110unique_ptrIN2ml5maths11time_series29CTimeSeriesMultibucketFeatureIdEENS_14default_deleteIS5_EEEE

Demangling the symbol is useful..

c++filt -t NSt3__110unique_ptrIN2ml5maths11time_series29CTimeSeriesMultibucketFeatureIdEENS_14default_deleteIS5_EEEE
std::__1::unique_ptr<ml::maths::time_series::CTimeSeriesMultibucketFeature<double>, std::__1::default_delete<ml::maths::time_series::CTimeSeriesMultibucketFeature<double>>>

Looking at CTimeSeriesMultibucketFeatureSerialiser.cc around line 67 shows the source of error, a failure to dynamically cast base type CTimeSeriesMultibucketFeature<double> * to derived CTimeSeriesMultibucketScalarMean *

As mentioned above, this error does not occur when running code compiled with no optimizations. Googling gives a hint - https://stackoverflow.com/questions/44489597/why-dont-c-compilers-optimize-this-dynamic-cast-from-a-final-class. The suggestion is that the final keyword can have some impact on dynamic_cast in optimized code.

Removing final from the definition of CTimeSeriesMultibucketMean does indeed resolve the problem.

This PR provides a solution to the problem, which may be useful in the short to mid term. Ideally we'd be able to use full C++ language features without restriction.