YujiOshima opened 7 years ago
@YujiOshima - I have a few questions:
Is there anything specific about the CUDA flavor that means we can't combine it with the swarm flavor? For example, if we only need to run some commands (e.g. apt-get install ...), can't we include them in the template instead?
What are the requirements on health checks that are special for this use case?
What about labeling of the nodes / Docker engines on these nodes?
What are the rolling update / update semantics for the use cases for a) training and b) inference (serving)? Is your use case focusing only on training? I think it would be nice to look at how we can support serving / inference as well, since that's when rolling updates etc make the most sense (pushing out a new model, updating tensorflow serving etc.)
I believe tensorflow supports checkpointing during training. I think we would have to take this into account because we don't want hours of training to be destroyed by a simple config change on our side. For this, I can see how a flavor plugin's Drain would be helpful.
Telemetry -- how can we combine metrics such as CPU/GPU utilization with events in infrakit?
Storage -- how are we provisioning disks and possibly other blob storage like Ceph so that the TF jobs can easily access storage?
How does the user launch a training session -- through a docker run or by deploying a k8s batch job? Is the cluster multi-tenant?
@chungers
I want to separate the NVIDIA driver layer from the application layer, so that swarm and the application can be managed without being aware of the NVIDIA driver version. For example, by default Docker needs to volume-mount /usr/lib/nvidia-3xx.
Health checks are pretty difficult, as the nvidia tooling does not have a remote API as far as I know. Maybe we need to install an agent and check nvidia-smi, the libraries, etc.
I think the Docker swarm flavor or the instance plugin is responsible for that.
Yes, inference and serving are a big use case. Docker swarm or kubernetes is responsible for rolling updates of the tensorflow application; the GPU flavor is responsible for NVIDIA driver and CUDA updates. Of course it needs to support those updates and needs to cooperate with the swarm flavor.
Yes, it's an important problem. For now, it is managed at the tensorflow layer inside the container. Although ideally users could drain a node without being aware of the jobs running on it, we need to carefully consider the design across the physical devices, the cluster orchestrator, and the applications -- for example, whether to have shared storage or to always share state via a kv-store, and on which layer to provide it.
Same as for health checks -- how about installing some agents on the nodes?
This is not only about Tensorflow but also about how Infrakit treats resources in general. I don't think there is any requirement specific to machine learning applications like current TF.
In my view, through a k8s batch job. And in my use case, it is multi-user but not multi-tenant.
@YujiOshima
Thank you for your comments. I think there are a couple of areas we can work on:
Health checks are pretty difficult, as the nvidia tooling does not have a remote API as far as I know. Maybe we need to install an agent and check nvidia-smi, the libraries, etc.
Same as for health checks -- how about installing some agents on the nodes?
Unless there are specific requirements, I think instead of developing a local agent, we should look at integration with Prometheus. We can then install node_exporter and nvidia_exporter on the hosts. Monitoring and alerting would then follow the standard Prometheus setup (e.g. Grafana, etc.).
If we care about cluster autoscaling, then I see an area of integration where alerts or thresholds from Prometheus can trigger events to Infrakit to scale up the cluster. This can go in two directions:
Infrakit to container orchestrator ==> drain a particular node prior to update
Container orchestrator to Infrakit ==> scale up / down
For the first one, we already have the Drain method defined in the Flavor plugin for k8s and swarm, so we just need to add the implementations. We can use the existing swarm and k8s flavors.
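As a rough sketch of what the swarm side of this could look like (the exact infrakit flavor SPI signature and how a node ID is resolved from an instance description are assumptions here), a Drain implementation would mark the swarm node as unavailable via the Docker engine API before the instance is destroyed:

```go
// Rough sketch only: drains a swarm node by setting its availability to "drain"
// before the group controller destroys the instance. How this hooks into the
// infrakit flavor SPI and how the node ID is obtained are assumptions.
package swarmflavor

import (
	"context"

	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

// drainNode marks the swarm node as unavailable for new tasks so that the
// orchestrator reschedules its containers elsewhere before removal.
func drainNode(ctx context.Context, cli *client.Client, nodeID string) error {
	node, _, err := cli.NodeInspectWithRaw(ctx, nodeID)
	if err != nil {
		return err
	}
	spec := node.Spec
	spec.Availability = swarm.NodeAvailabilityDrain
	return cli.NodeUpdate(ctx, nodeID, node.Version, spec)
}
```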
For the second case, the outside-facing API is not yet defined. As a concrete example, for k8s, we can create an API that easily maps to the cloudprovider SPI of their cluster autoscaler: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloud_provider.go#L29
This is what I meant when I suggested "Infrakit Apps": an application-specific API that can be implemented on top of Infrakit primitives. So this would be the 'scale group' API for applications that sit on top of infrakit in the stack. It would also interface with the metrics/health in part 1 so that cluster scale up/down can be triggered by thresholds, etc.
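As a hypothetical sketch of what such a 'scale group' API could look like (all names and methods below are illustrative, chosen only to roughly mirror the node-group operations the cluster-autoscaler cloudprovider SPI expects):

```go
// Hypothetical sketch of a "scale group" API that an Infrakit App could expose
// to a container orchestrator. The names and methods are illustrative only; they
// roughly mirror the node-group operations in the k8s cluster-autoscaler
// cloudprovider SPI (target size, scale up, delete specific nodes).
package app

// ScaleGroup is what an autoscaler (or any higher-level system) would call;
// an implementation would translate these calls into infrakit group operations.
type ScaleGroup interface {
	// ID identifies the infrakit group backing this scale group.
	ID() string
	// TargetSize returns the currently desired number of instances.
	TargetSize() (int, error)
	// IncreaseSize requests delta additional instances.
	IncreaseSize(delta int) error
	// DeleteNodes removes the given instances (by logical ID) and shrinks the target.
	DeleteNodes(ids []string) error
	// MinSize / MaxSize bound what the autoscaler may request.
	MinSize() int
	MaxSize() int
}
```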
I'd like to get this started and hopefully reconcile with the work you've already done in #474 so that they fit cleanly within the overall architecture of infrakit. Is that ok with you?
Thank you @chungers !
Rather than developing an original agent, I agree we should use Prometheus metrics. But I am not sure of the relationship between Infrakit and Prometheus. Do we deploy Prometheus separately from Infrakit, and Infrakit's health check communicates with the Prometheus server?
Infrakit to container orchestrator ==> drain a particular node prior to update
Container orchestrator to Infrakit ==> scale up / down
Please let me confirm my understanding. The first is that the drain function of the flavor sends an instruction to the container orchestrator, moves the containers from one node to another, and then deletes the node. The second is to send Kubernetes autoscale instructions to an Infrakit app instead of to a cloud provider, right? For example, if you deploy k8s on AWS you have two choices: whether to make the cloud provider AWS or Infrakit. Without AWS it will be impossible to link natively with ECR, and I am worried about whether this will confuse users. The idea of providing, in the App, an API for communicating with the cluster for each container orchestrator is agreeable, as it increases flexibility.
@YujiOshima
Rather than developing an original agent, I agree we should use Prometheus metrics. But I am not sure of the relationship between Infrakit and Prometheus. Do we deploy Prometheus separately from Infrakit, and Infrakit's health check communicates with the Prometheus server?
I think maybe the first thing we should do is to define "health". What are the metrics that you'd care about (e.g. via NVML) that would be appropriate for infrakit? In other words, what metrics, besides the host disappearing altogether, would you care to monitor so that when they change past certain thresholds, infrakit should start spinning up a new host? From the list at NVML, I can see that metrics like active compute processes and GPU utilization would be useful.
Once we have defined what "health" is, we can decide how to collect this data and report it to infrakit. Because node_exporter and the nvidia_exporter both listen on network ports, we can have a flavor plugin that polls or scrapes the data from the nodes. This is similar to what the Prometheus server does, but if our flavor plugins can scrape the data directly, then the Prometheus server can be optional.
Does this seem reasonable? It's pretty easy to build a collector of Prometheus data and in turn implement our flavor's Health method. I think we can even build it generically so that it works for whatever Prometheus monitors -- we just need to have the agent/exporters running on the hosts.
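A minimal sketch of that generic collector idea, assuming the exporter address comes from the instance description; the metric name "nvidia_gpu_utilization" and the 95% threshold are placeholders, since the real names depend on which exporter is used:

```go
// Minimal sketch of a generic Prometheus-exporter collector that a flavor's
// health check could use. Metric name and threshold are placeholders; the
// exporter address would come from the instance description.
package health

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// scrapeGauge fetches <addr>/metrics and returns the value of the first sample
// whose name starts with the given metric name (labels and timestamps ignored).
func scrapeGauge(addr, metric string) (float64, error) {
	resp, err := http.Get("http://" + addr + "/metrics")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metric) {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		return strconv.ParseFloat(fields[len(fields)-1], 64)
	}
	return 0, fmt.Errorf("metric %q not found at %s", metric, addr)
}

// healthy reports whether GPU utilization is below a threshold; which metrics
// and thresholds define "health" is exactly the policy question discussed above.
func healthy(addr string) (bool, error) {
	util, err := scrapeGauge(addr, "nvidia_gpu_utilization") // placeholder metric name
	if err != nil {
		return false, err
	}
	return util < 95.0, nil
}
```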
Please let me confirm my understanding. The first is that the drain function of the flavor sends an instruction to the container orchestrator, moves the containers from one node to another, and then deletes the node.
Yes.
The second is to send Kubernetes autoscale instructions to an Infrakit app instead of to a cloud provider, right? For example, if you deploy k8s on AWS you have two choices: whether to make the cloud provider AWS or Infrakit.
Yes - in essence infrakit acts like an autoscaling group. On AWS, yes, you could use their ASG, but we can support specialized use cases like retaining the EBS volume where you may have checkpointed and restorable containers. This may be especially useful for training that has been running for some time, where you want to be able to resume cleanly. I don't think you can do this easily with an ASG. And of course, in the on-prem case there are no autoscaling groups, so we can definitely fill that gap.
Without AWS it will be impossible to link natively with ECR, and I am worried about whether this will confuse users.
Are you using ECR? Do we need some integration with it? Are you also using GPU instances on AWS, or do you have on-prem / bare-metal hosts?
The idea of providing, in the App, an API for communicating with the cluster for each container orchestrator is agreeable, as it increases flexibility.
We will revisit your infrakit app PR because it is related in terms of how these "apps" fit with the rest of the architecture. I will start a PR soon and have you take a look to see if it makes sense. The tricky part is that we need a single endpoint for higher-level systems to access, but in HA mode our daemons run on different hosts, so this API endpoint needs to be able to route traffic to the current leader.
I think the API will be REST instead of JSON-RPC or GRPC. What do you think?
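A rough sketch of that leader-routing idea, assuming some way to discover the current leader's address exists (currentLeader below is a placeholder, as are the addresses and ports); the endpoint simply reverse-proxies each incoming REST call to whichever daemon currently leads:

```go
// Rough sketch of a single API endpoint that forwards requests to the current
// leader in an HA deployment. currentLeader() is a placeholder for whatever
// leadership-discovery mechanism infrakit exposes.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// currentLeader is hypothetical; it should return the address of the leading daemon.
func currentLeader() (*url.URL, error) {
	return url.Parse("http://10.0.0.1:9090") // placeholder address and port
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		leader, err := currentLeader()
		if err != nil {
			http.Error(w, "no leader known", http.StatusServiceUnavailable)
			return
		}
		// Re-resolve the leader on every request so failover is transparent to callers.
		httputil.NewSingleHostReverseProxy(leader).ServeHTTP(w, r)
	})
	http.ListenAndServe(":8080", nil)
}
```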
I would like to make it possible to manage GPU clusters using Infrakit. As an idea, starting with NVIDIA GPUs only, create a flavor that installs the NVIDIA driver and CUDA. For example, by combining the cuda flavor and the swarm flavor, you can build a GPU cluster with swarm. An image of the run options:
infrakit-flavor-cuda --driver=369.95 --cuda=8.0
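A sketch of how such a plugin's entry point could turn those flags into an install step; the flag names follow the example above, while the package names and the way the script would be attached to the instance's init/user-data are assumptions, not the real plugin:

```go
// Hypothetical sketch of the proposed cuda flavor's entry point: it takes the
// driver and CUDA versions as flags and renders an install script that would be
// merged into each instance's init/user-data. Package names and the mechanism
// for attaching the script to the instance spec are assumptions.
package main

import (
	"flag"
	"fmt"
)

func installScript(driver, cuda string) string {
	// The exact apt packages differ by distro; these names are illustrative only.
	return fmt.Sprintf(`#!/bin/sh
set -e
apt-get update
apt-get install -y nvidia-%s cuda-toolkit-%s
nvidia-smi
`, driver, cuda)
}

func main() {
	driver := flag.String("driver", "369.95", "NVIDIA driver version to install")
	cuda := flag.String("cuda", "8.0", "CUDA toolkit version to install")
	flag.Parse()

	// In the real plugin this script would be injected via the flavor's prepare
	// step so it can be combined with the swarm flavor's own init.
	fmt.Println(installScript(*driver, *cuda))
}
```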