While ML nodes auto-scale, a trained model's footprint remains constant. For ELSER and E5 in particular, it is determined by the number of threads and allocations the user sets at deployment time. As a result, resources stay committed even during periods when they are not actually needed, and this issue applies to all trained models. Trained model deployments need to auto-scale CPU: they should dynamically release unused resources, or shift them between ingest and query-time workloads as demand changes.
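For context, this is roughly how those settings are pinned today when starting a deployment (Kibana Dev Tools console syntax; the model ID and the numbers are illustrative, not recommendations):

```
POST _ml/trained_models/.elser_model_2/deployment/_start?number_of_allocations=4&threads_per_allocation=2
```

Because `number_of_allocations` and `threads_per_allocation` are fixed at start time, the deployment holds 4 × 2 = 8 vCPUs whether or not any ingest or search traffic is flowing, which is exactly the committed-but-idle footprint described above.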