NVIDIA / ngc-container-replicator

NGC Container Replicator
BSD 3-Clause "New" or "Revised" License

NGC Replicator

Clones nvcr.io using either DGX (compute.nvidia.com) or NGC (ngc.nvidia.com) API keys.

The replicator will make an offline clone of the NGC/DGX container registry. In its current form, the replicator will download every CUDA container image as well as each Deep Learning framework image in the NVIDIA project.

Tarfiles will be saved in /output inside the container, so be sure to volume mount that directory. In the following example, we will collect our images in /tmp on the host.

Use --min-version to limit the number of versions to download. In the example below, we clone only DL framework images of version 17.12 and later.

docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/output \
    deepops/replicator --project=nvidia --min-version=17.12 \
                       --api-key=<your-dgx-or-ngc-api-key>
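
As a rough illustration of the version cut-off, the sketch below filters a list of tags against a minimum release. It assumes DL framework tags begin with a two-digit YY.MM release such as 17.12 or 18.01-py3; the helper names and sample tags are hypothetical, and the replicator's own comparison logic may differ.

```python
import re


def parse_release(tag):
    """Return (year, month) from a tag like '17.12' or '18.01-py3', else None."""
    match = re.match(r"^(\d{2})\.(\d{2})", tag)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_min_version(tag, min_version):
    """True when the tag's YY.MM release is at or after min_version."""
    release = parse_release(tag)
    minimum = parse_release(min_version)
    if release is None or minimum is None:
        # Assumption: tags without a YY.MM prefix are skipped in this sketch.
        return False
    return release >= minimum


# Hypothetical tag list, for illustration only.
tags = ["17.10", "17.12", "18.01-py3", "latest"]
kept = [tag for tag in tags if meets_min_version(tag, "17.12")]
print(kept)
```

Tags that do not start with YY.MM (for example latest) are dropped in this sketch; the replicator may treat them differently.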

You can also filter on specific images. To keep only images whose names contain the strings "tensorflow", "pytorch", or "tensorrt", add one --image flag per string, e.g.

docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/output \
    deepops/replicator --project=nvidia --min-version=17.12 \
                       --image=tensorflow --image=pytorch --image=tensorrt \
                       --dry-run \
                       --api-key=<your-dgx-or-ngc-api-key>

Note: the --dry-run option lets you see what will happen without committing to a lengthy download.

By default, the --image flag does a substring match in order to ensure you match all images that may be desired. Sometimes, however, you only want to download a specific image with no substring matching. In this case, you can add the --strict-name-match flag, e.g.

docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/output \
    deepops/replicator --project=nvidia --min-version=17.12 \
                       --image=tensorflow \
                       --strict-name-match \
                       --dry-run \
                       --api-key=<your-dgx-or-ngc-api-key>
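
The difference between the default substring match and --strict-name-match can be sketched as follows. This is only an illustration of the two matching modes described above; the function and the image names are hypothetical and are not taken from the replicator's source.

```python
def select_images(available, filters, strict=False):
    """Pick image names matching any filter.

    strict=False: a filter matches if it is a substring of the name.
    strict=True:  a filter matches only if it equals the name exactly.
    """
    if not filters:
        # Assumption: no filter means every image is selected.
        return list(available)
    if strict:
        return [name for name in available if name in filters]
    return [name for name in available if any(f in name for f in filters)]


# Hypothetical image names, for illustration only.
available = ["tensorflow", "tensorflow-extras", "pytorch"]

loose = select_images(available, ["tensorflow"])
exact = select_images(available, ["tensorflow"], strict=True)
print(loose)
print(exact)
```

Running with --dry-run first is a cheap way to confirm which mode selects the images you expect.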

Note: a state.yml file will be created in the output directory. This saved state is used to avoid pulling images that were previously pulled. If you wish to re-pull and save an image, delete the entry in state.yml corresponding to the image_name and tag you wish to refresh.
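
The effect of the saved state can be pictured with a small sketch. The layout of state.yml is not described here, so the example models the state as an in-memory set of (image_name, tag) pairs; that representation and the function name are assumptions made for illustration only.

```python
def images_to_pull(candidates, state):
    """Return the (image_name, tag) pairs that are not yet recorded in state."""
    return [pair for pair in candidates if pair not in state]


# Hypothetical saved state and candidate list.
state = {("tensorflow", "17.12"), ("pytorch", "17.12")}
candidates = [
    ("tensorflow", "17.12"),
    ("tensorflow", "18.01"),
    ("pytorch", "17.12"),
]

first_run = images_to_pull(candidates, state)

# Removing an entry from the state forces that image to be pulled again.
state.discard(("pytorch", "17.12"))
second_run = images_to_pull(candidates, state)
print(first_run)
print(second_run)
```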

Kubernetes Deployment

If you don't already have a deepops namespace, create one now.

kubectl create namespace deepops

Next, create a secret with your NGC API key:

kubectl -n deepops create secret generic ngc-secret \
    --from-literal=apikey=<your-api-key-goes-here>

Next, create a persistent volume claim that will live outside the lifecycle of the CronJob. If you are using DeepOps, you can use a Rook/Ceph PVC similar to:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ngc-replicator-pvc
  namespace: deepops
  labels:
    app: ngc-replicator
spec:
  storageClassName: rook-raid0-retain  # <== Replace with your StorageClass
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 32Mi

Finally, create a CronJob that executes the replicator on a schedule. This example runs the replicator once a day at 04:00 (schedule "0 4 * * *"). Note: This example uses Rook block storage to provide a persistent volume that holds state.yml between executions. This ensures you will only download new container images. For more details, see our DeepOps project.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: replicator-config
  namespace: deepops
data:
  ngc-update.sh: |
    #!/bin/bash
    ngc_replicator                                        \
      --project=nvidia                                    \
      --min-version=$(date +"%y.%m" -d "1 month ago")     \
      --py-version=py3                                    \
      --image=tensorflow --image=pytorch --image=tensorrt \
      --no-exporter                                       \
      --registry-url=registry.local  # <== Replace with your local repo
---
apiVersion: batch/v1  # use batch/v1beta1 on clusters older than Kubernetes 1.21
kind: CronJob
metadata:
  name: ngc-replicator
  namespace: deepops
  labels:
    app: ngc-replicator
spec:
  schedule: "0 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: ""
          containers:
            - name: replicator
              image: deepops/replicator
              imagePullPolicy: Always
              command: [ "/bin/sh", "-c", "/ngc-update/ngc-update.sh" ]
              env:
              - name: NGC_REPLICATOR_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: apikey
              volumeMounts:
              - name: registry-config
                mountPath: /ngc-update
              - name: docker-socket
                mountPath: /var/run/docker.sock
              - name: ngc-replicator-storage
                mountPath: /output
          volumes:
            - name: registry-config
              configMap:
                name: replicator-config
                defaultMode: 0777
            - name: docker-socket
              hostPath:
                path: /var/run/docker.sock
                type: File
            - name: ngc-replicator-storage
              persistentVolumeClaim:
                claimName: ngc-replicator-pvc
          restartPolicy: Never
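
The ConfigMap script above derives --min-version from the current date with GNU date ("1 month ago", formatted as %y.%m). A minimal Python sketch of the same idea is shown below; the function name is hypothetical and the code is not part of the replicator.

```python
from datetime import date


def previous_month_version(today):
    """Return the previous calendar month as a 'YY.MM' string."""
    year, month = today.year, today.month - 1
    if month == 0:
        # January rolls back to December of the previous year.
        year, month = year - 1, 12
    return f"{year % 100:02d}.{month:02d}"


print(previous_month_version(date(2018, 1, 15)))
print(previous_month_version(date.today()))
```

GNU date resolves relative months by adjusting the month field and then normalizing the result, so on the last days of a longer month "1 month ago" may still land in the current month. The calendar-based sketch always steps back exactly one month.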

Developer Quickstart

make dev
py.test

TODOs