Open droberts195 opened 1 year ago
Pinging @elastic/ml-core (Team:ML)
"ml/inference_crud/Test put with defer_definition_decompression with invalid definition and no memory estimate" muted by #94855, as it fails 2 out of 3 times in the 3 node serverless test clusters.
Related PR https://github.com/elastic/elasticsearch/pull/96804
I don't think that PR will fix this problem though, as it doesn't change the `TrainedModelDefinition` class.
`PutTrainedModelConfig` is a master node action, so validation of configs takes place on the master node. Therefore, if the node that the request initially arrives on is not the master node, the config that gets validated is one that has been serialized across the network. Unfortunately, serialization across the network does not accurately preserve the submitted trained model config.
This can result in strange error messages. For example, in the test `Test put with defer_definition_decompression with invalid definition and no memory estimate` there is no compressed definition supplied and the error is that there's a mismatch between regression and classification. If this test request happens to get sent to the master node directly then the error is a sensible one. However, if the test request is sent to a non-master node then a different error is returned, which does not make sense at all.
The problem arises because `TrainedModelDefinition.LazyModelDefinition.writeTo` serializes the result of `getCompressedDefinition()`, and that creates a compressed definition even if one did not exist originally. This should not be done before validation.
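To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the real Elasticsearch code: `LazyDefinitionSketch`, `LazyDefinition`, `roundTrip` and `validate` are hypothetical names, Base64 encoding stands in for real compression, and plain `DataOutputStream`/`DataInputStream` stand in for the transport layer. It only illustrates how a `writeTo` that calls a lazily-creating getter can make the deserialized object look different from the submitted one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical, simplified stand-in; NOT the real TrainedModelDefinition.LazyModelDefinition.
public class LazyDefinitionSketch {

    static final class LazyDefinition {
        private String compressed;   // what the user supplied; may be null
        private final String parsed; // parsed form; may be null

        LazyDefinition(String compressed, String parsed) {
            this.compressed = compressed;
            this.parsed = parsed;
        }

        boolean hasCompressedDefinition() {
            return compressed != null;
        }

        // Lazily creates the compressed form if it is absent.
        // Base64 is only a placeholder for real compression (assumption).
        String getCompressedDefinition() {
            if (compressed == null && parsed != null) {
                compressed = Base64.getEncoder()
                    .encodeToString(parsed.getBytes(StandardCharsets.UTF_8));
            }
            return compressed;
        }

        // Serializing via the lazy getter is the side effect discussed above.
        void writeTo(DataOutputStream out) throws IOException {
            out.writeUTF(getCompressedDefinition());
        }

        static LazyDefinition readFrom(DataInputStream in) throws IOException {
            return new LazyDefinition(in.readUTF(), null);
        }
    }

    // Simulates forwarding the request from a non-master node to the master node.
    static LazyDefinition roundTrip(LazyDefinition def) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                def.writeTo(out);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return LazyDefinition.readFrom(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Toy validation that branches on whether a compressed definition exists.
    static String validate(LazyDefinition def) {
        return def.hasCompressedDefinition()
            ? "compressed definition present"
            : "no compressed definition supplied";
    }

    public static void main(String[] args) {
        LazyDefinition submitted = new LazyDefinition(null, "{\"invalid\":true}");
        System.out.println(validate(submitted));            // no compressed definition supplied
        System.out.println(validate(roundTrip(submitted))); // compressed definition present
    }
}
```

In this sketch the same submitted config takes a different validation branch depending on whether it was serialized first, which mirrors why the error differs between master and non-master nodes. It also suggests why the fix probably belongs in the serialization path rather than in the validation logic.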