sidoruka opened this issue 4 years ago
The following implementation of this issue is proposed:

1) A new entity NodeDescription is introduced:

```
NodeDescription:
    Long id
    // set of requirements for the node
    RunInstance instance
    int numberOfInstances
    // Start or Stop
    Action action
```
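
A minimal Java sketch of how this entity could look (the Action enum layout, the getter names, and the field ordering are assumptions on top of the list above; RunInstance is the existing Cloud Pipeline class describing instance requirements):

```java
// Hypothetical shape of the proposed entity; final field names and types may differ.
public class NodeDescription {

    // Requested action for the described nodes
    public enum Action { Start, Stop }

    private Long id;
    private RunInstance instance;   // set of requirements for the node (instance type, disk, ...)
    private int numberOfInstances;  // how many nodes the action applies to
    private Action action;          // Start or Stop

    public Long getId() { return id; }
    public RunInstance getInstance() { return instance; }
    public int getNumberOfInstances() { return numberOfInstances; }
    public Action getAction() { return action; }
}
```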
2) When an Admin creates a schedule for a specific instance type and disk size, a new NodeDescription is created and persisted to the DB
3) NodeDescription.id is used as the schedulableId in RunSchedule to be able to schedule the action
4) A new type of Job, NodeJob, is implemented
5) A new field Queue<NodeDescription> freeNodeActions is added; this field will be shared between the Autoscaler and NodeJob
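
A minimal wiring sketch for this shared queue, assuming both components run in the same JVM; the holder class and its method names are illustrative, not the actual Cloud Pipeline API:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative holder for the shared queue; in practice this could be a singleton
// bean injected into both NodeJob (producer) and the Autoscaler (consumer).
public class FreeNodeActionQueue {

    // Thread-safe, since the job and the autoscaler loop run on different threads.
    private final Queue<NodeDescription> freeNodeActions = new ConcurrentLinkedQueue<>();

    // Called by NodeJob when a scheduled action fires
    public void submit(NodeDescription description) {
        freeNodeActions.add(description);
    }

    // Called by the Autoscaler on each loop iteration; returns null if there is nothing to do
    public NodeDescription poll() {
        return freeNodeActions.poll();
    }
}
```

NodeJob would then call submit(...) when a schedule fires, while the autoscaler loop would poll() the queue inside handleNodeActions() below.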
6) NodeJob simply populates this queue as many times as specified in the NodeDescription. The Autoscaler works with this queue in the following manner:
```java
void handleNodeActions() {
    // Take the next scheduled action produced by NodeJob (null if the queue is empty)
    NodeDescription nd = freeNodeActions.poll();
    if (nd != null) {
        switch (nd.action) {
            case Start:
                for (int i = 0; i < nd.numberOfInstances; i++) {
                    startNode(nd.instance, nd.id);
                }
                break;
            case Stop:
                for (int i = 0; i < nd.numberOfInstances; i++) {
                    markNodesForTermination(nd.id);
                }
                break;
            default:
                throw new IllegalArgumentException();
        }
    }
    // Terminate previously marked nodes that have no active runs
    terminateAllMarkedNodesIfPossible();
}
```
In the startNode method we will create a new node, if it is possible (e.g. if maxNodeCount > currentClusterSize), and mark it with the NodeDescription id.
In the markNodesForTermination method, all nodes with the specified NodeDescription.id will be marked with the readyToTerminate tag.
And in the terminateAllMarkedNodesIfPossible method, all nodes with the readyToTerminate tag will be terminated if possible (i.e. if no run is currently executing on that node).
This approach with marking nodes solves the problem of a node that should be killed due to the schedule but is still busy with some run: in that case it will be terminated as soon as it is done with the run (if it is not possible to do it right now, it will be done on the next iteration of the autoscaler loop).
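
To make the mark-then-terminate flow concrete, here is a hedged, self-contained sketch of the three helpers over a toy in-memory node registry; the real implementation would go through the cloud provider / Kubernetes APIs used by the autoscaler, and startNode would also receive the RunInstance requirements, which are dropped here for brevity:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model, only to illustrate the mark-then-terminate lifecycle described above.
class NodeLifecycleSketch {

    static class Node {
        Long nodeDescriptionId;     // label linking the node to its NodeDescription
        boolean readyToTerminate;   // set on Stop, consumed by terminateAllMarkedNodesIfPossible
        boolean hasActiveRun;       // a node busy with a run must not be terminated yet
    }

    private final Map<String, Node> nodes = new ConcurrentHashMap<>();
    private final int maxNodeCount = 10;    // stands in for the cluster size limit preference

    void startNode(Long nodeDescriptionId) {
        if (nodes.size() >= maxNodeCount) {
            return;                         // cluster limit reached, skip this action
        }
        Node node = new Node();             // stands in for provisioning a cloud instance
        node.nodeDescriptionId = nodeDescriptionId;
        nodes.put("node-" + System.nanoTime(), node);
    }

    void markNodesForTermination(Long nodeDescriptionId) {
        nodes.values().stream()
             .filter(n -> nodeDescriptionId.equals(n.nodeDescriptionId))
             .forEach(n -> n.readyToTerminate = true);
    }

    void terminateAllMarkedNodesIfPossible() {
        // Nodes still executing a run stay marked and are retried on the next autoscaler iteration
        nodes.values().removeIf(n -> n.readyToTerminate && !n.hasActiveRun);
    }
}
```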
Usage of the scheduled approach may not cover all the requirements for persistent node management, so I'd suggest implementing new functionality for this.
A new entity PersistentNode is added with the following fields:

- count - number of nodes
- regionId
- instanceType
- instanceDisk
- priceType
- dockerImage - to support pre-pulled images (TBD)
- start - a CRON (TBD) expression specifying the time from which the node shall be active (e.g. Monday 10 AM)
- end - a CRON (TBD) expression specifying the time from which the schedule is not active (e.g. Friday 6 PM)

Scale up:

- PersistentNodes are fetched and the cluster state is verified. If the current number of nodes with the required configuration in the cluster is lower than the count specified in the PersistentNode, new free nodes are created with respect to the total cluster size limit
- run_id labels for free nodes - review the current approach

Scale down:

- Nodes matching a PersistentNode shall be left in the cluster if the total number of such nodes is less than or equal to the node count specified in the PersistentNode
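
A hypothetical Java sketch of this entity together with the scale-up pass described above; the field types, the getter, and the ClusterView facade (countFreeNodesMatching, hasSpareCapacity, createFreeNode) are assumptions, not existing Cloud Pipeline API:

```java
import java.util.List;

// Hypothetical shape of the proposed entity plus the scale-up pass; final names and types may differ.
public class PersistentNode {

    private int count;           // desired number of persistent free nodes of this configuration
    private Long regionId;
    private String instanceType;
    private int instanceDisk;    // disk size, GB
    private String priceType;    // e.g. on-demand vs. spot
    private String dockerImage;  // to support pre-pulled images (TBD)
    private String startCron;    // CRON (TBD): when the nodes shall become active, e.g. Monday 10 AM
    private String endCron;      // CRON (TBD): when the schedule stops being active, e.g. Friday 6 PM

    public int getCount() { return count; }

    // Illustrative scale-up pass: for each PersistentNode, top the cluster up to `count`
    // matching free nodes without exceeding the total cluster size limit.
    static void scaleUp(List<PersistentNode> persistentNodes, ClusterView cluster) {
        for (PersistentNode pn : persistentNodes) {
            int missing = pn.getCount() - cluster.countFreeNodesMatching(pn);
            for (int i = 0; i < missing && cluster.hasSpareCapacity(); i++) {
                cluster.createFreeNode(pn);
            }
        }
    }

    // Placeholder for whatever cluster facade the autoscaler exposes (assumed, not real API)
    interface ClusterView {
        int countFreeNodesMatching(PersistentNode node);
        boolean hasSpareCapacity();
        void createFreeNode(PersistentNode node);
    }
}
```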
@sidoruka server part backported to release/0.16
Test cases were created by #1929 and are located here.
Docs were added via #1516.
Background

At the moment, Cloud Pipeline allows controlling the number of persistent compute nodes in the cluster, i.e. a certain number (cluster.min.size) of nodes of a specified size (cluster.instance.type / cluster.instance.hdd) will always be available in the cluster (even if there is no workload). This is useful to speed up the compute instance creation process (as the nodes are already up and running). However, we need to make this mechanism a bit more flexible.
Approach

The persistent nodes configuration shall be managed via the Cluster State tab, instead of the global Preferences (this shall be available to the ROLE_ADMIN users only).
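
As a rough illustration of the ROLE_ADMIN restriction on whatever management API ends up backing this tab (the controller path, class, and method names below are hypothetical, not actual Cloud Pipeline endpoints), assuming the usual Spring Security setup:

```java
import java.util.Collections;
import java.util.List;

import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical management endpoint; shown only to illustrate that persistent node
// configuration is exposed to ROLE_ADMIN users only.
@RestController
@RequestMapping("/cluster/persistent-nodes")
public class PersistentNodeController {

    @GetMapping
    @PreAuthorize("hasAuthority('ROLE_ADMIN')")
    public List<PersistentNode> list() {
        // delegate to a service reading PersistentNode entities from the DB
        return Collections.emptyList();
    }

    @PostMapping
    @PreAuthorize("hasAuthority('ROLE_ADMIN')")
    public PersistentNode create(@RequestBody PersistentNode node) {
        // delegate to a service persisting the entity and triggering a cluster re-sync
        return node;
    }
}
```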