DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.

NVIDIA Docker Build #3695

Closed · jpuerto-psc closed this issue 3 years ago

jpuerto-psc commented 3 years ago

Good Morning Toil Team,

I wanted to reach out and get some guidance on provisioning nodes with a specific Docker build. For our CODEX pipeline we rely on our custom CWL install, which adds GPU options; this in turn relies on the NVIDIA Container Toolkit to accept those options.

I was looking to see whether this is at all possible through the Toil framework. From what I understand so far, I would need to install this onto a worker node (since that is what would actually be running the pipeline)? Or does the leader node need to have this installed?

Sorry for any naive questions - still trying to get familiar with the architecture and AWS terminology.

Thanks in advance!

Best regards,

Juan

Issue is synchronized with this Jira Task · Issue Number: TOIL-948

adamnovak commented 3 years ago

If you have an AMI you want to use with Toil's cluster management, you can set TOIL_AWS_AMI so that toil launch-cluster will use it. So you could take Flatcar, replace its Docker with the modified NVidia one, and use that. You can also use TOIL_APPLIANCE_SELF to set the Docker image that Toil jobs run inside of, so you could use that to set one where the modified Docker client is available.
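
For example, a minimal sketch of that setup (the AMI ID, appliance image, cluster name, zone, and key pair below are placeholders, not tested values):

```
# Point Toil's provisioner at a custom Flatcar-based AMI with the NVIDIA runtime baked in (placeholder ID)
export TOIL_AWS_AMI=ami-0123456789abcdef0
# Point Toil at an appliance image whose docker client understands the NVIDIA options (placeholder tag)
export TOIL_APPLIANCE_SELF=quay.io/yourorg/toil-nvidia:5.3.0
# Launch the cluster as usual
toil launch-cluster my-gpu-cluster \
    --provisioner aws \
    --zone us-west-2a \
    --leaderNodeType t2.medium \
    --keyPairName my-aws-key
```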

We also have TOIL_CUSTOM_DOCKER_INIT_COMMAND and TOIL_CUSTOM_INIT_COMMAND for customizing the setup process for the worker containers, but we don't have any way to tell the worker containers themselves to start up with additional Docker options (here for Mesos clusters; I'm not sure if any of this NVidia stuff would work with Kubernetes clusters).
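
As a rough, untested sketch of how those hooks are set (the commands are placeholders only):

```
# Extra command run while a node is being set up, before the Toil appliance starts (placeholder)
export TOIL_CUSTOM_INIT_COMMAND='echo "custom node setup goes here"'
# Extra command run as part of the Docker-side setup for the worker containers (placeholder)
export TOIL_CUSTOM_DOCKER_INIT_COMMAND='echo "custom docker setup goes here"'
```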

But if you swap out the base AMI for one with the customized container runtime, and the Toil container for one where the docker command understands NVidia's new options, you could run a Toil job that in turn runs a docker command to start a container using these custom Docker options.
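
Concretely, the end state would be that a Toil job on such a worker could shell out to something like this (the image and GPU selection are illustrative):

```
# Run a sibling container with the GPUs exposed via the NVIDIA container toolkit
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```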

Then the hard part is accessing all that from CWL; what exactly have you done (on top of vanilla cwltool I assume?) to get custom CWL extensions like this working? Maybe you could just install a modified version of the cwltool module on the leader and in the worker containers?

adamnovak commented 3 years ago

It looks like NVidia's container runtime modifications can be used with Kubernetes. For that to work with toil launch-cluster clusters I think you would have to change the Toil base AMI to one with the right container runtime, and modify the Kubernetes setup script we use to install their operator that sets up the drivers on the workers.

Or you could set up your own Kubernetes cluster, and set up the container runtime and the drivers on the nodes yourself.
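
For the self-managed route, one hedged option (assuming Helm and a working kubeconfig; chart names and versions may have moved since) is NVIDIA's GPU operator, which installs the drivers, container runtime hooks, and device plugin on the nodes for you:

```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace --wait
```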

However, with Kubernetes, we don't run containers that Toil jobs start as siblings; everything gets run in the one pod via singularity. So you'd have to make sure that the GPUs were exposed to the Toil pods (possibly by modifying Toil to tell Kubernetes to do that) and that they were accessible to your singularity-containerized or uncontainerized CWL workflow steps.
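
For the Singularity side, the usual knob is Singularity's --nv flag, which bind-mounts the host's NVIDIA libraries and devices into the container; as a hedged sanity check on a node (the image is illustrative):

```
# Verify the GPU is visible from inside a Singularity container
singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi
```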

jpuerto-psc commented 3 years ago

@adamnovak Thank you for this feedback. It seems that the standard Flatcar AMIs that are offered on AWS do not support EC2 instances that have GPUs. That makes things a bit more complicated than expected.

The AMI that comes with the Nvidia tools pre-installed does not seem to work with Toil. I would assume this is because the OS for that AMI is: Linux/Unix, Ubuntu 18.04.

Taking all of this into account - it seems that the solution here would be to create a custom Flatcar image, upload(?) that as an AWS AMI, and then try to use that for Toil. What are your thoughts on this?

adamnovak commented 3 years ago

If you can get NVidia's tools installed on Flatcar, you could take a snapshot and make that an AMI, and then use that AMI.
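
As a hedged sketch, registering such an instance as an AMI with the AWS CLI would look something like this (the instance ID and names are placeholders):

```
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name flatcar-stable-nvidia \
    --description "Flatcar with NVIDIA drivers and container toolkit installed"
```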

Does the Flatcar AMI not let you run it on the GPU instances? Or is it that it will run but it lacks GPU drivers?

Getting Toil to deploy a cluster using a base that isn't Flatcar would be possible, but right now we lean heavily on Flatcar's Ignition setup tool, and Ubuntu's cloud-init won't understand our user data instructions to do things like deploy systemd units. I think we also insist on there being a user named "core". We'd need to make a bunch of Toil changes, and it's not clear that we'd really want to keep any of them in the main Toil line of development.

jpuerto-psc commented 3 years ago

@adamnovak The Flatcar AMI cannot be run on GPU instances. It specifically has those EC2 instance types disabled.

I did try taking a snapshot of an EC2 instance on which I installed those tools, but ran into the same issue: it can't be run on EC2 instances with GPUs.

Completely understood on not wanting to get Toil to deploy clusters on non-Flatcar AMIs. It does seem like Flatcar's Pro AMI will be adding GPU support in the future, but there's no indication of when that is expected.

Re: your questions about how we are implementing GPU support in CWL, we are using a custom CWL version written by our team. I'm not familiar with the details, since I didn't write it, but it is functional in the workflows running on our on-prem resources.

Would it make sense to fork Toil and try to go through the set-up for integrating it with the Nvidia AMI?

adamnovak commented 3 years ago

@jpuerto-psc Which GPU instance type specifically is Flatcar refusing to let you launch? In us-west-2 with Flatcar AMI ami-019657181ea76e880 from the feed at https://stable.release.flatcar-linux.net/amd64-usr/current/flatcar_production_ami_hvm_us-west-2.txt, I was able to manually launch a p2.xlarge instance, which has an NVidia GPU in it.

ssh core@54.190.14.155
The authenticity of host '54.190.14.155 (54.190.14.155)' can't be established.
ECDSA key fingerprint is SHA256:Ly1sjA60/MzcHbbeNjq9zdebjqdx4GI78HKKRXIUXeQ.
ECDSA key fingerprint is MD5:62:53:32:45:4e:72:95:1b:43:39:47:89:ca:7a:92:63.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.190.14.155' (ECDSA) to the list of known hosts.
Flatcar Container Linux by Kinvolk stable (2765.2.6)
core@ip-172-31-11-3 ~ $ lspci; wget -q -O - http://169.254.169.254/latest/meta-data/instance-type ; echo
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Device 1d0f:ec20
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
p2.xlarge
core@ip-172-31-11-3 ~ $ 

There's also https://github.com/shelmangroup/coreos-gpu-installer and a description at https://docs.giantswarm.io/advanced/gpu/ that show how to install the NVidia drivers onto Flatcar by running a container, although you'd have to work backwards from the Kubernetes YAML to a docker command if you wanted to use it here. But they definitely have Flatcar running on AWS GPU instances.

The Flatcar AMIs will lock you out of the ARM Graviton instances, since the stable ones are all x86_64, but the GPU instances seem to be available when I look at them.

jpuerto-psc commented 3 years ago

@adamnovak I am attempting this in us-east-2. I tried with a p3.2xlarge and a g4dn.xlarge - AMI: ami-0a7a2bfaad8fdd51a

It's interesting that you were able to launch that p2.xlarge instance; when I attempt to launch the latest AMI release in us-east-2, that instance type is disabled as an option:

[screenshot: the instance type shown as disabled in the EC2 launch console]

When I try to do this via the toil launch-cluster command, I get an error:

[2021-07-16T09:19:45-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-07-16T09:19:45-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-07-16T09:19:45-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.3.0-2c0f712d953fb06c74ae884bf1156b21a5bcaec6-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-07-16T09:19:45-0400] [MainThread] [I] [toil.utils.toilLaunchCluster] Creating cluster jp-lh-hubmap-test-cluster...
[2021-07-16T09:19:46-0400] [MainThread] [I] [toil] Using default user-defined custom docker init command of  as TOIL_CUSTOM_DOCKER_INIT_COMMAND is not set.
[2021-07-16T09:19:46-0400] [MainThread] [I] [toil] Using default user-defined custom init command of  as TOIL_CUSTOM_INIT_COMMAND is not set.
[2021-07-16T09:19:46-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-07-16T09:19:46-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-07-16T09:19:46-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.3.0-2c0f712d953fb06c74ae884bf1156b21a5bcaec6-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-07-16T09:19:46-0400] [MainThread] [I] [toil.lib.ec2] Creating p2.xlarge instance(s) ...
Traceback (most recent call last):
  File "/Users/jpuerto/toil-test/venv/bin/toil", line 8, in <module>
    sys.exit(main())
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilMain.py", line 30, in main
    module.main()
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilLaunchCluster.py", line 170, in main
    awsEc2ExtraSecurityGroupIds=options.awsEc2ExtraSecurityGroupIds)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 267, in launchCluster
    tags=leader_tags)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/lib/retry.py", line 256, in call
    return func(*args, **kwargs)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/lib/ec2.py", line 393, in create_instances
    return ec2_resource.create_instances(**prune(request))
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnsupportedOperation) when calling the RunInstances operation: The instance configuration for this AWS Marketplace product is not supported. Please see the AWS Marketplace site for more information about supported instance types, regions, and operating systems.

I am planning on using flatcar-forklift to manage the Nvidia driver installs. This seems to build off the coreos-gpu-installer repository.

adamnovak commented 3 years ago

@jpuerto-psc Did you pull that AMI from the Marketplace? Are you using the Marketplace for a good reason?

It looks like the latest normal AMI for us-east-2, according to https://stable.release.flatcar-linux.net/amd64-usr/current/flatcar_production_ami_hvm_us-east-2.txt, is ami-02eb704ee029f6b9e, which has been current since June 15th, according to the modification date listed on https://stable.release.flatcar-linux.net/amd64-usr/current/

The AWS Marketplace probably has a system where you can mark an AMI as restricted/licensed for some instance types and not others. But the normal AMIs can't be restricted like that, I don't think.

adamnovak commented 3 years ago

Toil will fall back to checking the marketplace if it can't find a region in https://stable.release.flatcar-linux.net/amd64-usr/current/flatcar_production_ami_all.json but I see us-east-2 in there.
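
For what it's worth, a quick way to see what that feed lists for a region (assuming the JSON keeps its current layout and jq is installed):

```
curl -s https://stable.release.flatcar-linux.net/amd64-usr/current/flatcar_production_ami_all.json \
    | jq -r '.amis[] | select(.name == "us-east-2") | .hvm'
```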

jpuerto-psc commented 3 years ago

@adamnovak Yes - I was using the AMI from the Marketplace. I am not using it for any particular reason.

Thanks so much for your advice here. I went ahead and used an AMI listed there from a previous version (forklift doesn't seem to have a release for the latest version of Flatcar) and was able to successfully compile and build the Nvidia drivers! I'll have to play around a bit more to get the container toolkit working, but this is a good first step. Closing out this issue with much thanks!

adamnovak commented 3 years ago

@jpuerto-psc I was looking into what proper GPU scheduling support in Toil would need, and I came across Google's "device plugin" for nVidia GPUs on Kubernetes, which still needs the drivers installed on the host but doesn't seem to require replacing the whole container runtime: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#nvidia-gpu-device-plugin-used-by-gce
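
Outside of GCE, NVIDIA publishes its own device-plugin DaemonSet that plays the same role; a hedged sketch of deploying it (the version in the manifest URL is an assumption and may need updating):

```
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
```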

jpuerto-psc commented 3 years ago

For those that are looking to do this, here is what I did to get this relatively functional:

It seems that, currently, the Nvidia install does not support passing the --gpus flag as part of your docker command, so you will want to avoid this. Other than that, however, we have been able to run our workflows using toil while accessing the GPU.
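
For reference, a hedged sketch of the older nvidia-docker2-style invocation that avoids the --gpus flag (the image is illustrative):

```
# Select the NVIDIA runtime and expose GPUs through environment variables instead of --gpus
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0-base nvidia-smi
```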