Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Unable to allocate a vGPU, even though there are available vGPUs #602

Closed: cr7258 closed this issue 1 week ago

cr7258 commented 1 week ago

What happened:

I set up a single-node GPU cluster using Kind. When I attempt to allocate 1 vGPU to a Pod, it works as expected. However, when I try to allocate more than 1 vGPU to the Pod, it fails, even though there are enough vGPUs available.

The node reports 10 vGPUs in both allocatable and capacity:

  allocatable:
    cpu: "4"
    ephemeral-storage: 40901312Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 14951736Ki
    nvidia.com/gpu: "10"
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 40901312Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 14951736Ki
    nvidia.com/gpu: "10"
    pods: "110"
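
For reference, the vGPU count above can also be read back directly with jsonpath (node name as in my cluster); a command along these lines should print 10:

kubectl get node gpu-cluster-control-plane -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'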

Create a Pod that requests 2 vGPUs.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2
          nvidia.com/gpumem: 1000
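
For comparison, the same spec with nvidia.com/gpu: 1 schedules and runs fine on this cluster. The pod name gpu-pod-single below is just for illustration; only the nvidia.com/gpu value differs from the failing spec:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-single
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1       # the only change from the failing spec
          nvidia.com/gpumem: 1000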

Events of the failing 2-vGPU Pod:

Events:
  Type     Reason            Age   From            Message
  ----     ------            ----  ----            -------
  Warning  FailedScheduling  80s   hami-scheduler  0/1 nodes are available: preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FilteringFailed   81s   hami-scheduler  no available node, all node scores do not meet

What you expected to happen:

A Pod requesting more than 1 vGPU should be scheduled, since the node advertises 10 allocatable vGPUs.

How to reproduce it (as minimally and precisely as possible):

1. Create a Kind GPU cluster: https://github.com/cr7258/hands-on-lab/blob/main/ai/gpu/script/ubuntu-kind-gpu-cluster.sh
2. Install HAMi: helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.31.2 -n kube-system
3. Label the node: kubectl label node gpu-cluster-control-plane gpu=on
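
Before creating the test Pod I also verify that the HAMi components are up; a quick check like the one below should show the hami-device-plugin and hami-scheduler Pods in Running state:

kubectl get pods -n kube-system | grep hami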

Anything else we need to know?:

The output of nvidia-smi -a

root@gpu-demo:~# nvidia-smi -a
==============NVSMI LOG==============

Timestamp                                 : Sun Nov 10 18:43:55 2024
Driver Version                            : 565.57.01
CUDA Version                              : 12.7

Attached GPUs                             : 1
GPU 00000000:00:07.0
    Product Name                          : Tesla T4
    Product Brand                         : NVIDIA
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : HMM
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322621091831
    GPU UUID                              : GPU-9a3603f4-f66e-d48b-11ea-13e327064cca
    Minor Number                          : 0
    VBIOS Version                         : 90.04.96.00.9F
    MultiGPU Board                        : No
    Board ID                              : 0x7
    Board Part Number                     : 900-2G183-0000-001
    GPU Part Number                       : 1EB8-895-A1
    FRU Part Number                       : N/A
    Platform Info
        RACK Serial Number                : N/A
        Chassis Physical Slot Number      : N/A
        Compute Slot Index                : N/A
        Node Index                        : N/A
        Peer Type                         : N/A
        Module Id                         : 1
    Inforom Version
        Image Version                     : G183.0200.00.02
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GPU Recovery Action                   : None
    GSP Firmware Version                  : 565.57.01
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x07
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x1EB810DE
        Bus Id                            : 00000000:00:07.0
        Sub System Id                     : 0x12A210DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 250 KB/s
        Rx Throughput                     : 250 KB/s
        Atomic Caps Outbound              : N/A
        Atomic Caps Inbound               : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 15360 MiB
        Reserved                          : 445 MiB
        Used                              : 1 MiB
        Free                              : 14916 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 33 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 96 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 14.31 W
        Current Power Limit               : 70.00 W
        Requested Power Limit             : 70.00 W
        Default Power Limit               : 70.00 W
        Min Power Limit                   : 60.00 W
        Max Power Limit                   : 70.00 W
    GPU Memory Power Readings 
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
    Default Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1590 MHz
        SM                                : 1590 MHz
        Memory                            : 5001 MHz
        Video                             : 1470 MHz
    Max Customer Boost Clocks
        Graphics                          : 1590 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None
    Capabilities
        EGM                               : disabled
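
The key points in the output above are a single attached GPU (Tesla T4) with 15360 MiB of framebuffer memory. A compact query such as the following should show the same at a glance:

nvidia-smi --query-gpu=index,name,uuid,memory.total --format=csv
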
Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)

root@gpu-cluster-control-plane:/# cat /etc/containerd/config.toml 
disabled_plugins = []
imports = ["/etc/containerd/config.toml"]
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  format = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.gc.v1.scheduler"]
    deletion_threshold = 0
    mutation_threshold = 100
    pause_threshold = 0.02
    schedule_delay = "0s"
    startup_delay = "100ms"

  [plugins."io.containerd.grpc.v1.cri"]
    cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
    device_ownership_from_security_context = false
    disable_apparmor = false
    disable_cgroup = false
    disable_hugetlb_controller = true
    disable_proc_mount = false
    disable_tcp_service = true
    drain_exec_sync_io_timeout = "0s"
    enable_cdi = false
    enable_selinux = false
    enable_tls_streaming = false
    enable_unprivileged_icmp = false
    enable_unprivileged_ports = false
    ignore_deprecation_warnings = []
    ignore_image_defined_volumes = false
    image_pull_progress_timeout = "5m0s"
    image_pull_with_sync_fs = false
    max_concurrent_downloads = 3
    max_container_log_line_size = 16384
    netns_mounts_under_state_dir = false
    restrict_oom_score_adj = false
    sandbox_image = "registry.k8s.io/pause:3.10"
    selinux_category_range = 1024
    stats_collect_period = 10
    stream_idle_timeout = "4h0m0s"
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    systemd_cgroup = false
    tolerate_missing_hugetlb_controller = true
    unset_seccomp_profile = ""

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      conf_template = ""
      ip_pref = ""
      max_conf_num = 1
      setup_serially = false

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = true
      ignore_blockio_not_enabled_errors = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        privileged_without_host_devices_all_devices_allowed = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""
        sandbox_mode = ""
        snapshotter = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          base_runtime_spec = "/etc/containerd/cri-base.json"
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = ""
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
          base_runtime_spec = "/etc/containerd/cri-base.json"
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = ""
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
          base_runtime_spec = "/etc/containerd/cri-base.json"
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = ""
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = "/etc/containerd/cri-base.json"
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = ""
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.test-handler]
          base_runtime_spec = "/etc/containerd/cri-base.json"
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = ""
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.test-handler.options]
            SystemdCgroup = true

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        privileged_without_host_devices_all_devices_allowed = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""
        sandbox_mode = ""
        snapshotter = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""

  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"

  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"

  [plugins."io.containerd.internal.v1.tracing"]

  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"

  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false

  [plugins."io.containerd.nri.v1.nri"]
    disable = true
    disable_connections = false
    plugin_config_path = "/etc/nri/conf.d"
    plugin_path = "/opt/nri/plugins"
    plugin_registration_timeout = "5s"
    plugin_request_timeout = "2s"
    socket_path = "/var/run/nri/nri.sock"

  [plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "runc"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false

  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
    sched_core = false

  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]

  [plugins."io.containerd.service.v1.tasks-service"]
    blockio_config_file = ""
    rdt_config_file = ""

  [plugins."io.containerd.snapshotter.v1.blockfile"]
    fs_type = ""
    mount_options = []
    root_path = ""
    scratch_file = ""

  [plugins."io.containerd.snapshotter.v1.native"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    mount_options = []
    root_path = ""
    sync_remove = false
    upperdir_label = false

  [plugins."io.containerd.tracing.processor.v1.otlp"]

  [plugins."io.containerd.transfer.v1.local"]
    config_path = ""
    max_concurrent_downloads = 3
    max_concurrent_uploaded_layers = 3

    [[plugins."io.containerd.transfer.v1.local".unpack_config]]
      differ = ""
      platform = "linux/amd64"
      snapshotter = "overlayfs"

[proxy_plugins]

  [proxy_plugins.fuse-overlayfs]
    address = "/run/containerd-fuse-overlayfs.sock"
    platform = ""
    type = "snapshot"

    [proxy_plugins.fuse-overlayfs.exports]

[stream_processors]

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar"

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar+gzip"

[timeouts]
  "io.containerd.timeout.bolt.open" = "0s"
  "io.containerd.timeout.metrics.shimstats" = "2s"
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0
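
The part of this config that matters for HAMi is that the NVIDIA container runtime is the default runtime; a quick grep over the file confirms it:

root@gpu-cluster-control-plane:/# grep default_runtime_name /etc/containerd/config.toml
      default_runtime_name = "nvidia"
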
The hami-device-plugin container logs

root@gpu-cluster-control-plane:/# kubectl logs -n kube-system hami-device-plugin-sz4s7            
Defaulted container "device-plugin" out of: device-plugin, vgpu-monitor
I1110 10:16:12.092377    6855 client.go:53] BuildConfigFromFlags failed for file /root/.kube/config: stat /root/.kube/config: no such file or directory using inClusterConfig
I1110 10:16:12.095684    6855 main.go:158] Starting FS watcher.
I1110 10:16:12.095755    6855 main.go:168] Start working on node gpu-cluster-control-plane
I1110 10:16:12.095764    6855 main.go:169] Starting OS watcher.
I1110 10:16:12.096173    6855 main.go:184] Starting Plugins.
I1110 10:16:12.096192    6855 main.go:242] Loading configuration.
I1110 10:16:12.096387    6855 vgpucfg.go:134] flags= [--mig-strategy value      the desired strategy for exposing MIG devices on GPUs that support it:
                [none | single | mixed] (default: "none") [$MIG_STRATEGY] --fail-on-init-error  fail the plugin if an error is encountered during initialization, otherwise block indefinitely (default: true) [$FAIL_ON_INIT_ERROR] --nvidia-driver-root value  the root path for the NVIDIA driver installation (typical values are '/' or '/run/nvidia/driver') (default: "/") [$NVIDIA_DRIVER_ROOT] --pass-device-specs      pass the list of DeviceSpecs to the kubelet on Allocate() (default: false) [$PASS_DEVICE_SPECS] --device-list-strategy value [ --device-list-strategy value ]    the desired strategy for passing the device list to the underlying runtime:
                [envvar | volume-mounts | cdi-annotations] (default: "envvar") [$DEVICE_LIST_STRATEGY] --device-id-strategy value       the desired strategy for passing device IDs to the underlying runtime:
                [uuid | index] (default: "uuid") [$DEVICE_ID_STRATEGY] --gds-enabled    ensure that containers are started with NVIDIA_GDS=enabled (default: false) [$GDS_ENABLED] --mofed-enabled      ensure that containers are started with NVIDIA_MOFED=enabled (default: false) [$MOFED_ENABLED] --config-file value       the path to a config file as an alternative to command line options or environment variables [$CONFIG_FILE] --cdi-annotation-prefix value       the prefix to use for CDI container annotation keys (default: "cdi.k8s.io/") [$CDI_ANNOTATION_PREFIX] --nvidia-ctk-path value    the path to use for the nvidia-ctk in the generated CDI specification (default: "/usr/bin/nvidia-ctk") [$NVIDIA_CTK_PATH] --container-driver-root value  the path where the NVIDIA driver root is mounted in the container; used for generating CDI specifications (default: "/driver-root") [$CONTAINER_DRIVER_ROOT] --node-name value  node name (default: "gpu-cluster-control-plane") [$NodeName] --device-split-count value  the number for NVIDIA device split (default: 2) [$DEVICE_SPLIT_COUNT] --device-memory-scaling value     the ratio for NVIDIA device memory scaling (default: 1) [$DEVICE_MEMORY_SCALING] --device-cores-scaling value    the ratio for NVIDIA device cores scaling (default: 1) [$DEVICE_CORES_SCALING] --disable-core-limit     If set, the core utilization limit will be ignored (default: false) [$DISABLE_CORE_LIMIT] --resource-name value  the name of field for number GPU visible in container (default: "nvidia.com/gpu") --help, -h    show help --version, -v print the version]
I1110 10:16:12.096566    6855 vgpucfg.go:143] DeviceMemoryScaling 1
I1110 10:16:12.096761    6855 vgpucfg.go:108] Device Plugin Configs: {[{m5-cloudinfra-online02 1.8 0 10 none 0xc00040e4b0}]}
I1110 10:16:12.096774    6855 main.go:258] Updating config with default resource matching patterns.
config= [{* nvidia.com/gpu}]
I1110 10:16:12.097024    6855 main.go:269] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "ResourceName": "nvidia.com/gpu",
  "DebugMode": null
}
I1110 10:16:12.097038    6855 main.go:272] Retrieving plugins.
I1110 10:16:12.097880    6855 factory.go:107] Detected NVML platform: found NVML library
I1110 10:16:12.097936    6855 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1110 10:16:12.128547    6855 server.go:185] Starting GRPC server for 'nvidia.com/gpu'
I1110 10:16:12.129400    6855 server.go:133] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1110 10:16:12.131984    6855 server.go:141] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I1110 10:16:12.132025    6855 register.go:187] Starting WatchAndRegister
I1110 10:16:12.141867    6855 register.go:132] MemoryScaling= 1 registeredmem= 15360
I1110 10:16:12.188051    6855 register.go:160] nvml registered device id=1, memory=15360, type=Tesla T4, numa=0
.....

I1110 11:05:18.708838    6855 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1110 11:05:48.709275    6855 register.go:132] MemoryScaling= 1 registeredmem= 15360
I1110 11:05:48.753383    6855 register.go:160] nvml registered device id=1, memory=15360, type=Tesla T4, numa=0
I1110 11:05:48.753420    6855 register.go:167] "start working on the devices" devices=[{"id":"GPU-9a3603f4-f66e-d48b-11ea-13e327064cca","count":10,"devmem":15360,"devcore":100,"type":"NVIDIA-Tesla T4","health":true}]
I1110 11:05:48.756608    6855 util.go:163] Encoded node Devices: GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:
I1110 11:05:48.756645    6855 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-10 11:05:48.756620661 +0000 UTC m=+2976.676870749 hami.io/node-nvidia-register:GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:]
I1110 11:05:48.768588    6855 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1110 11:06:18.768947    6855 register.go:132] MemoryScaling= 1 registeredmem= 15360
I1110 11:06:18.812908    6855 register.go:160] nvml registered device id=1, memory=15360, type=Tesla T4, numa=0
I1110 11:06:18.812944    6855 register.go:167] "start working on the devices" devices=[{"id":"GPU-9a3603f4-f66e-d48b-11ea-13e327064cca","count":10,"devmem":15360,"devcore":100,"type":"NVIDIA-Tesla T4","health":true}]
I1110 11:06:18.816106    6855 util.go:163] Encoded node Devices: GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:
I1110 11:06:18.816143    6855 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-10 11:06:18.816115385 +0000 UTC m=+3006.736365466 hami.io/node-nvidia-register:GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:]
I1110 11:06:18.831675    6855 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1110 11:06:48.832857    6855 register.go:132] MemoryScaling= 1 registeredmem= 15360
I1110 11:06:48.876723    6855 register.go:160] nvml registered device id=1, memory=15360, type=Tesla T4, numa=0
I1110 11:06:48.876761    6855 register.go:167] "start working on the devices" devices=[{"id":"GPU-9a3603f4-f66e-d48b-11ea-13e327064cca","count":10,"devmem":15360,"devcore":100,"type":"NVIDIA-Tesla T4","health":true}]
I1110 11:06:48.879981    6855 util.go:163] Encoded node Devices: GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:
I1110 11:06:48.880011    6855 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-10 11:06:48.879991102 +0000 UTC m=+3036.800241183 hami.io/node-nvidia-register:GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:]
I1110 11:06:48.894491    6855 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1110 11:07:18.894883    6855 register.go:132] MemoryScaling= 1 registeredmem= 15360
I1110 11:07:18.941502    6855 register.go:160] nvml registered device id=1, memory=15360, type=Tesla T4, numa=0
I1110 11:07:18.941553    6855 register.go:167] "start working on the devices" devices=[{"id":"GPU-9a3603f4-f66e-d48b-11ea-13e327064cca","count":10,"devmem":15360,"devcore":100,"type":"NVIDIA-Tesla T4","health":true}]
I1110 11:07:18.945303    6855 util.go:163] Encoded node Devices: GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:
I1110 11:07:18.945337    6855 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-10 11:07:18.945313825 +0000 UTC m=+3066.865563907 hami.io/node-nvidia-register:GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:]
I1110 11:07:18.958883    6855 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1110 11:07:48.959797    6855 register.go:132] MemoryScaling= 1 registeredmem= 15360
I1110 11:07:49.003669    6855 register.go:160] nvml registered device id=1, memory=15360, type=Tesla T4, numa=0
I1110 11:07:49.003701    6855 register.go:167] "start working on the devices" devices=[{"id":"GPU-9a3603f4-f66e-d48b-11ea-13e327064cca","count":10,"devmem":15360,"devcore":100,"type":"NVIDIA-Tesla T4","health":true}]
I1110 11:07:49.006835    6855 util.go:163] Encoded node Devices: GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:
I1110 11:07:49.006864    6855 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-10 11:07:49.006846369 +0000 UTC m=+3096.927096449 hami.io/node-nvidia-register:GPU-9a3603f4-f66e-d48b-11ea-13e327064cca,10,15360,100,NVIDIA-Tesla T4,0,true:]
I1110 11:07:49.021041    6855 register.go:197] Successfully registered annotation. Next check in 30s seconds...
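
The registration loop above keeps patching the node with the hami.io/node-nvidia-register annotation (one physical device with count 10, 15360 MiB memory, 100 cores). Reading that annotation back from the node object with something like this should match what the scheduler sees:

kubectl get node gpu-cluster-control-plane -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'
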
The hami-scheduler container logs

root@gpu-cluster-control-plane:/# kubectl logs -n kube-system hami-scheduler-74b5f7df7-m67ff 
Defaulted container "kube-scheduler" out of: kube-scheduler, vgpu-scheduler-extender
I1110 10:10:21.611821       1 flags.go:64] FLAG: --allow-metric-labels="[]"
I1110 10:10:21.611870       1 flags.go:64] FLAG: --allow-metric-labels-manifest=""
I1110 10:10:21.611877       1 flags.go:64] FLAG: --authentication-kubeconfig=""
I1110 10:10:21.611882       1 flags.go:64] FLAG: --authentication-skip-lookup="false"
I1110 10:10:21.611892       1 flags.go:64] FLAG: --authentication-token-webhook-cache-ttl="10s"
I1110 10:10:21.611898       1 flags.go:64] FLAG: --authentication-tolerate-lookup-failure="true"
I1110 10:10:21.611903       1 flags.go:64] FLAG: --authorization-always-allow-paths="[/healthz,/readyz,/livez]"
I1110 10:10:21.611922       1 flags.go:64] FLAG: --authorization-kubeconfig=""
I1110 10:10:21.611927       1 flags.go:64] FLAG: --authorization-webhook-cache-authorized-ttl="10s"
I1110 10:10:21.611933       1 flags.go:64] FLAG: --authorization-webhook-cache-unauthorized-ttl="10s"
I1110 10:10:21.611938       1 flags.go:64] FLAG: --bind-address="0.0.0.0"
I1110 10:10:21.611947       1 flags.go:64] FLAG: --cert-dir=""
I1110 10:10:21.611953       1 flags.go:64] FLAG: --client-ca-file=""
I1110 10:10:21.611958       1 flags.go:64] FLAG: --config="/config/config.yaml"
I1110 10:10:21.611963       1 flags.go:64] FLAG: --contention-profiling="true"
I1110 10:10:21.611969       1 flags.go:64] FLAG: --disable-http2-serving="false"
I1110 10:10:21.611974       1 flags.go:64] FLAG: --disabled-metrics="[]"
I1110 10:10:21.611985       1 flags.go:64] FLAG: --emulated-version="[]"
I1110 10:10:21.611993       1 flags.go:64] FLAG: --feature-gates=""
I1110 10:10:21.612003       1 flags.go:64] FLAG: --help="false"
I1110 10:10:21.612008       1 flags.go:64] FLAG: --http2-max-streams-per-connection="0"
I1110 10:10:21.612017       1 flags.go:64] FLAG: --kube-api-burst="100"
I1110 10:10:21.612035       1 flags.go:64] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I1110 10:10:21.612041       1 flags.go:64] FLAG: --kube-api-qps="50"
I1110 10:10:21.612047       1 flags.go:64] FLAG: --kubeconfig=""
I1110 10:10:21.612052       1 flags.go:64] FLAG: --leader-elect="true"
I1110 10:10:21.612057       1 flags.go:64] FLAG: --leader-elect-lease-duration="15s"
I1110 10:10:21.612063       1 flags.go:64] FLAG: --leader-elect-renew-deadline="10s"
I1110 10:10:21.612071       1 flags.go:64] FLAG: --leader-elect-resource-lock="leases"
I1110 10:10:21.612076       1 flags.go:64] FLAG: --leader-elect-resource-name="hami-scheduler"
I1110 10:10:21.612081       1 flags.go:64] FLAG: --leader-elect-resource-namespace="kube-system"
I1110 10:10:21.612087       1 flags.go:64] FLAG: --leader-elect-retry-period="2s"
I1110 10:10:21.612092       1 flags.go:64] FLAG: --log-flush-frequency="5s"
I1110 10:10:21.612098       1 flags.go:64] FLAG: --log-json-info-buffer-size="0"
I1110 10:10:21.612108       1 flags.go:64] FLAG: --log-json-split-stream="false"
I1110 10:10:21.612113       1 flags.go:64] FLAG: --log-text-info-buffer-size="0"
I1110 10:10:21.612119       1 flags.go:64] FLAG: --log-text-split-stream="false"
I1110 10:10:21.612123       1 flags.go:64] FLAG: --logging-format="text"
I1110 10:10:21.612128       1 flags.go:64] FLAG: --master=""
I1110 10:10:21.612133       1 flags.go:64] FLAG: --permit-address-sharing="false"
I1110 10:10:21.612137       1 flags.go:64] FLAG: --permit-port-sharing="false"
I1110 10:10:21.612141       1 flags.go:64] FLAG: --pod-max-in-unschedulable-pods-duration="5m0s"
I1110 10:10:21.612147       1 flags.go:64] FLAG: --profiling="true"
I1110 10:10:21.612151       1 flags.go:64] FLAG: --requestheader-allowed-names="[]"
I1110 10:10:21.612156       1 flags.go:64] FLAG: --requestheader-client-ca-file=""
I1110 10:10:21.612159       1 flags.go:64] FLAG: --requestheader-extra-headers-prefix="[x-remote-extra-]"
I1110 10:10:21.612166       1 flags.go:64] FLAG: --requestheader-group-headers="[x-remote-group]"
I1110 10:10:21.612178       1 flags.go:64] FLAG: --requestheader-username-headers="[x-remote-user]"
I1110 10:10:21.612188       1 flags.go:64] FLAG: --secure-port="10259"
I1110 10:10:21.612194       1 flags.go:64] FLAG: --show-hidden-metrics-for-version=""
I1110 10:10:21.612198       1 flags.go:64] FLAG: --tls-cert-file=""
I1110 10:10:21.612203       1 flags.go:64] FLAG: --tls-cipher-suites="[]"
I1110 10:10:21.612211       1 flags.go:64] FLAG: --tls-min-version=""
I1110 10:10:21.612216       1 flags.go:64] FLAG: --tls-private-key-file=""
I1110 10:10:21.612221       1 flags.go:64] FLAG: --tls-sni-cert-key="[]"
I1110 10:10:21.612229       1 flags.go:64] FLAG: --v="4"
I1110 10:10:21.612240       1 flags.go:64] FLAG: --version="false"
I1110 10:10:21.612248       1 flags.go:64] FLAG: --vmodule=""
I1110 10:10:21.612262       1 flags.go:64] FLAG: --write-config-to=""
I1110 10:10:22.073858       1 serving.go:386] Generated self-signed cert in-memory
W1110 10:10:22.074759       1 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1110 10:10:22.521614       1 requestheader_controller.go:247] Loaded a new request header values for RequestHeaderAuthRequestController
I1110 10:10:22.528040       1 scheduler.go:499] "Creating extender" extender={"URLPrefix":"https://127.0.0.1:443","FilterVerb":"filter","PreemptVerb":"","PrioritizeVerb":"","Weight":1,"BindVerb":"bind","EnableHTTPS":true,"TLSConfig":{"Insecure":true,"ServerName":"","CertFile":"","KeyFile":"","CAFile":"","CertData":null,"KeyData":null,"CAData":null},"HTTPTimeout":"30s","NodeCacheCapable":true,"ManagedResources":[{"Name":"nvidia.com/gpu","IgnoredByScheduler":true},{"Name":"nvidia.com/gpumem","IgnoredByScheduler":true},{"Name":"nvidia.com/gpucores","IgnoredByScheduler":true},{"Name":"nvidia.com/gpumem-percentage","IgnoredByScheduler":true},{"Name":"nvidia.com/priority","IgnoredByScheduler":true},{"Name":"cambricon.com/vmlu","IgnoredByScheduler":true},{"Name":"hygon.com/dcunum","IgnoredByScheduler":true},{"Name":"hygon.com/dcumem","IgnoredByScheduler":true},{"Name":"hygon.com/dcucores","IgnoredByScheduler":true},{"Name":"iluvatar.ai/vgpu","IgnoredByScheduler":true}],"Ignorable":false}
I1110 10:10:22.529270       1 framework.go:392] "the scheduler starts to work with those plugins" Plugins={"PreEnqueue":{"Enabled":[{"Name":"SchedulingGates","Weight":0}],"Disabled":null},"QueueSort":{"Enabled":[{"Name":"PrioritySort","Weight":0}],"Disabled":null},"PreFilter":{"Enabled":[{"Name":"NodeAffinity","Weight":0},{"Name":"NodePorts","Weight":0},{"Name":"NodeResourcesFit","Weight":0},{"Name":"VolumeRestrictions","Weight":0},{"Name":"NodeVolumeLimits","Weight":0},{"Name":"VolumeBinding","Weight":0},{"Name":"VolumeZone","Weight":0},{"Name":"PodTopologySpread","Weight":0},{"Name":"InterPodAffinity","Weight":0}],"Disabled":null},"Filter":{"Enabled":[{"Name":"NodeUnschedulable","Weight":0},{"Name":"NodeName","Weight":0},{"Name":"TaintToleration","Weight":0},{"Name":"NodeAffinity","Weight":0},{"Name":"NodePorts","Weight":0},{"Name":"NodeResourcesFit","Weight":0},{"Name":"VolumeRestrictions","Weight":0},{"Name":"NodeVolumeLimits","Weight":0},{"Name":"VolumeBinding","Weight":0},{"Name":"VolumeZone","Weight":0},{"Name":"PodTopologySpread","Weight":0},{"Name":"InterPodAffinity","Weight":0}],"Disabled":null},"PostFilter":{"Enabled":[{"Name":"DefaultPreemption","Weight":0}],"Disabled":null},"PreScore":{"Enabled":[{"Name":"TaintToleration","Weight":0},{"Name":"NodeAffinity","Weight":0},{"Name":"NodeResourcesFit","Weight":0},{"Name":"VolumeBinding","Weight":0},{"Name":"PodTopologySpread","Weight":0},{"Name":"InterPodAffinity","Weight":0},{"Name":"NodeResourcesBalancedAllocation","Weight":0}],"Disabled":null},"Score":{"Enabled":[{"Name":"TaintToleration","Weight":3},{"Name":"NodeAffinity","Weight":2},{"Name":"NodeResourcesFit","Weight":1},{"Name":"VolumeBinding","Weight":1},{"Name":"PodTopologySpread","Weight":2},{"Name":"InterPodAffinity","Weight":2},{"Name":"NodeResourcesBalancedAllocation","Weight":1},{"Name":"ImageLocality","Weight":1}],"Disabled":null},"Reserve":{"Enabled":[{"Name":"VolumeBinding","Weight":0}],"Disabled":null},"Permit":{"Enabled":null,"Disabled":null},"PreBind":{"Enabled":[{"Name":"VolumeBinding","Weight":0}],"Disabled":null},"Bind":{"Enabled":[{"Name":"DefaultBinder","Weight":0}],"Disabled":null},"PostBind":{"Enabled":null,"Disabled":null},"MultiPoint":{"Enabled":null,"Disabled":null}}
I1110 10:10:22.533067       1 configfile.go:94] "Using component config" config=<
        apiVersion: kubescheduler.config.k8s.io/v1
        clientConnection:
          acceptContentTypes: ""
          burst: 100
          contentType: application/vnd.kubernetes.protobuf
          kubeconfig: ""
          qps: 50
        enableContentionProfiling: true
        enableProfiling: true
        extenders:
        - bindVerb: bind
          enableHTTPS: true
          filterVerb: filter
          httpTimeout: 30s
          managedResources:
          - ignoredByScheduler: true
            name: nvidia.com/gpu
          - ignoredByScheduler: true
            name: nvidia.com/gpumem
          - ignoredByScheduler: true
            name: nvidia.com/gpucores
          - ignoredByScheduler: true
            name: nvidia.com/gpumem-percentage
          - ignoredByScheduler: true
            name: nvidia.com/priority
          - ignoredByScheduler: true
            name: cambricon.com/vmlu
          - ignoredByScheduler: true
            name: hygon.com/dcunum
          - ignoredByScheduler: true
            name: hygon.com/dcumem
          - ignoredByScheduler: true
            name: hygon.com/dcucores
          - ignoredByScheduler: true
            name: iluvatar.ai/vgpu
          nodeCacheCapable: true
          tlsConfig:
            insecure: true
          urlPrefix: https://127.0.0.1:443
          weight: 1
        kind: KubeSchedulerConfiguration
        leaderElection:
          leaderElect: true
          leaseDuration: 15s
          renewDeadline: 10s
          resourceLock: leases
          resourceName: hami-scheduler
          resourceNamespace: kube-system
          retryPeriod: 2s
        parallelism: 16
        percentageOfNodesToScore: 0
        podInitialBackoffSeconds: 1
        podMaxBackoffSeconds: 10
        profiles:
        - pluginConfig:
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              kind: DefaultPreemptionArgs
              minCandidateNodesAbsolute: 100
              minCandidateNodesPercentage: 10
            name: DefaultPreemption
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              hardPodAffinityWeight: 1
              ignorePreferredTermsOfExistingPods: false
              kind: InterPodAffinityArgs
            name: InterPodAffinity
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              kind: NodeAffinityArgs
            name: NodeAffinity
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              kind: NodeResourcesBalancedAllocationArgs
              resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
            name: NodeResourcesBalancedAllocation
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              ignoredResources:
              - nvidia.com/gpu
              - nvidia.com/gpumem
              - nvidia.com/gpucores
              - nvidia.com/gpumem-percentage
              - nvidia.com/priority
              - cambricon.com/vmlu
              - hygon.com/dcunum
              - hygon.com/dcumem
              - hygon.com/dcucores
              - iluvatar.ai/vgpu
              kind: NodeResourcesFitArgs
              scoringStrategy:
                resources:
                - name: cpu
                  weight: 1
                - name: memory
                  weight: 1
                type: LeastAllocated
            name: NodeResourcesFit
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              defaultingType: System
              kind: PodTopologySpreadArgs
            name: PodTopologySpread
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              bindTimeoutSeconds: 600
              kind: VolumeBindingArgs
            name: VolumeBinding
          plugins:
            bind: {}
            filter: {}
            multiPoint:
              enabled:
              - name: SchedulingGates
                weight: 0
              - name: PrioritySort
                weight: 0
              - name: NodeUnschedulable
                weight: 0
              - name: NodeName
                weight: 0
              - name: TaintToleration
                weight: 3
              - name: NodeAffinity
                weight: 2
              - name: NodePorts
                weight: 0
              - name: NodeResourcesFit
                weight: 1
              - name: VolumeRestrictions
                weight: 0
              - name: NodeVolumeLimits
                weight: 0
              - name: VolumeBinding
                weight: 0
              - name: VolumeZone
                weight: 0
              - name: PodTopologySpread
                weight: 2
              - name: InterPodAffinity
                weight: 2
              - name: DefaultPreemption
                weight: 0
              - name: NodeResourcesBalancedAllocation
                weight: 1
              - name: ImageLocality
                weight: 1
              - name: DefaultBinder
                weight: 0
            permit: {}
            postBind: {}
            postFilter: {}
            preBind: {}
            preEnqueue: {}
            preFilter: {}
            preScore: {}
            queueSort: {}
            reserve: {}
            score: {}
          schedulerName: hami-scheduler
 >
I1110 10:10:22.534793       1 server.go:167] "Starting Kubernetes Scheduler" version="v1.31.2"
I1110 10:10:22.534807       1 server.go:169] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1110 10:10:22.538671       1 requestheader_controller.go:172] Starting RequestHeaderAuthRequestController
I1110 10:10:22.538758       1 shared_informer.go:313] Waiting for caches to sync for RequestHeaderAuthRequestController
I1110 10:10:22.538691       1 configmap_cafile_content.go:205] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I1110 10:10:22.538819       1 tlsconfig.go:203] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1731233422\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1731233421\" (2024-11-10 09:10:21 +0000 UTC to 2025-11-10 09:10:21 +0000 UTC (now=2024-11-10 10:10:22.538788141 +0000 UTC))"
I1110 10:10:22.538830       1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1110 10:10:22.538687       1 configmap_cafile_content.go:205] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1110 10:10:22.538858       1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file

......
I1110 11:06:18.828647       1 eventhandlers.go:99] "Update event for node" node="gpu-cluster-control-plane"
I1110 11:06:18.846253       1 eventhandlers.go:99] "Update event for node" node="gpu-cluster-control-plane"
I1110 11:06:22.704937       1 scheduling_queue.go:1312] "Pod moved to an internal scheduling queue" pod="default/gpu-pod" event="UnschedulableTimeout" queue="Active" hint=1
I1110 11:06:22.705027       1 schedule_one.go:83] "About to try and schedule pod" pod="default/gpu-pod"
I1110 11:06:22.705039       1 schedule_one.go:96] "Attempting to schedule pod" pod="default/gpu-pod"
I1110 11:06:22.707899       1 preemption.go:221] "Preemption will not help schedule pod on any node" logger="PostFilter.DefaultPreemption" pod="default/gpu-pod"
I1110 11:06:22.707945       1 schedule_one.go:1055] "Unable to schedule pod; no fit; waiting" pod="default/gpu-pod" err="0/1 nodes are available: preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling."
I1110 11:06:22.708044       1 schedule_one.go:1122] "Updating pod condition" pod="default/gpu-pod" conditionType="PodScheduled" conditionStatus="False" conditionReason="Unschedulable"
I1110 11:06:36.566498       1 reflector.go:871] k8s.io/client-go/informers/factory.go:160: Watch close - *v1.StatefulSet total 10 items received
I1110 11:06:48.892992       1 eventhandlers.go:99] "Update event for node" node="gpu-cluster-control-plane"
I1110 11:06:48.909668       1 eventhandlers.go:99] "Update event for node" node="gpu-cluster-control-plane"

Environment:

wawa0210 commented 1 week ago
Can you execute this command and upload the log?

kubectl logs -n kube-system hami-scheduler-74b5f7df7-m67ff -c vgpu-scheduler-extender

It seems there is a conflict in GPU resources, but no detailed log is recorded in the Pod events.

cr7258 commented 1 week ago

Here is the vgpu-scheduler-extender log:
I1111 04:14:35.990771       1 client.go:53] BuildConfigFromFlags failed for file /root/.kube/config: stat /root/.kube/config: no such file or directory using inClusterConfig
I1111 04:14:35.992321       1 scheduler.go:63] New Scheduler
I1111 04:14:35.993143       1 reflector.go:289] Starting reflector *v1.Node (1h0m0s) from pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229
I1111 04:14:35.993271       1 reflector.go:325] Listing and watching *v1.Node from pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229
I1111 04:14:35.999747       1 reflector.go:289] Starting reflector *v1.Pod (1h0m0s) from pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229
I1111 04:14:35.999960       1 reflector.go:325] Listing and watching *v1.Pod from pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229
I1111 04:14:36.093414       1 shared_informer.go:341] caches populated
I1111 04:14:36.093447       1 shared_informer.go:341] caches populated
I1111 04:14:36.101270       1 route.go:42] Into Predicate Route outer func
I1111 04:14:36.101707       1 metrics.go:231] Initializing metrics for scheduler
I1111 04:14:36.101938       1 metrics.go:65] Starting to collect metrics for scheduler
I1111 04:14:36.101669       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 04:14:36" nodeName="gpu-cluster-control-plane"
I1111 04:14:36.103709       1 pods.go:105] Getting all scheduled pods with 0 nums
I1111 04:14:36.108040       1 main.go:86] listen on 0.0.0.0:443
I1111 04:14:36.134195       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 04:14:36.134405       1 scheduler.go:246] node gpu-cluster-control-plane device NVIDIA come node info=&{gpu-cluster-control-plane [{GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 0 10 15360 100 NVIDIA-Tesla T4 0 true NVIDIA}]} total=[{GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 0 10 15360 100 NVIDIA-Tesla T4 0 true NVIDIA}]
I1111 04:14:37.111227       1 route.go:44] Into Predicate Route inner func
I1111 04:14:37.113695       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="d04944cd-f20a-4b37-806d-4c2d442ce62d" namespaces="default"
I1111 04:14:37.113716       1 device.go:170] Counting iluvatar devices
I1111 04:14:37.113722       1 device.go:245] Counting mlu devices
I1111 04:14:37.113730       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 04:14:37.113761       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 04:14:37.113773       1 device.go:179] Counting dcu devices
I1111 04:14:37.113828       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 04:14:37.113843       1 score.go:32] devices status
I1111 04:14:37.113901       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 04:14:37.113911       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 04:14:37.113922       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 04:14:37.113932       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 04:14:37.113947       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 04:14:37.113977       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 04:14:37.114000       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 04:14:37.114008       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 04:14:37.114197       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 04:15:05.908157       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 04:15:05" nodeName="gpu-cluster-control-plane"
I1111 04:15:05.923070       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 04:15:24.283018       1 route.go:131] Start to handle webhook request on /webhook
I1111 04:15:24.284990       1 webhook.go:63] Processing admission hook for pod default/gpu-pod, UID: 803d7a11-3398-4836-9446-050c25391467
I1111 04:15:24.292893       1 route.go:44] Into Predicate Route inner func
I1111 04:15:24.293199       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="acb621b8-a0e8-4d47-bf4d-72cd2ec022f9" namespaces="default"
I1111 04:15:24.293217       1 device.go:245] Counting mlu devices
I1111 04:15:24.293226       1 device.go:250] idx= nvidia.com/gpu val= {{1 0} {<nil>} 1 DecimalSI} {{0 0} {<nil>}  }
I1111 04:15:24.293243       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 04:15:24.293257       1 device.go:179] Counting dcu devices
I1111 04:15:24.293268       1 device.go:170] Counting iluvatar devices
I1111 04:15:24.293287       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":1,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 04:15:24.293307       1 score.go:32] devices status
I1111 04:15:24.293371       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 04:15:24.293392       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 04:15:24.293403       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 04:15:24.293420       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 04:15:24.293427       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 1.651042
I1111 04:15:24.293446       1 score.go:70] "Allocating device for container request" pod="default/gpu-pod" card request={"Nums":1,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}
I1111 04:15:24.293468       1 score.go:74] "scoring pod" pod="default/gpu-pod" Memreq=1000 MemPercentagereq=101 Coresreq=0 Nums=1 device index=0 device="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264"
I1111 04:15:24.293478       1 score.go:40] Type contains NVIDIA-Tesla T4 NVIDIA
I1111 04:15:24.293488       1 score.go:46] idx NVIDIA true true
I1111 04:15:24.293497       1 score.go:62] checkUUID result is true for NVIDIA type
I1111 04:15:24.293510       1 score.go:126] "first fitted" pod="default/gpu-pod" device="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264"
I1111 04:15:24.293566       1 score.go:137] "device allocate success" pod="default/gpu-pod" allocate device={"NVIDIA":[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]}
I1111 04:15:24.293577       1 scheduler.go:485] nodeScores_len= 1
I1111 04:15:24.293585       1 scheduler.go:488] schedule default/gpu-pod to gpu-cluster-control-plane map[NVIDIA:[[{0 GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 NVIDIA 1000 0}]]]
I1111 04:15:24.293618       1 util.go:186] Encoded container Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,NVIDIA,1000,0:
I1111 04:15:24.293625       1 util.go:209] Encoded pod single devices GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,NVIDIA,1000,0:;
I1111 04:15:24.293646       1 pods.go:63] Pod added: Name: gpu-pod, UID: acb621b8-a0e8-4d47-bf4d-72cd2ec022f9, Namespace: default, NodeID: gpu-cluster-control-plane
I1111 04:15:24.300955       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:24.301298       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="FilteringSucceed" message="Successfully filtered to following nodes: [gpu-cluster-control-plane] for default/gpu-pod "
I1111 04:15:24.302365       1 scheduler.go:375] "Bind" pod="gpu-pod" namespace="default" podUID="acb621b8-a0e8-4d47-bf4d-72cd2ec022f9" node="gpu-cluster-control-plane"
I1111 04:15:24.307173       1 device.go:245] Counting mlu devices
I1111 04:15:24.307184       1 device.go:250] idx= nvidia.com/gpu val= {{1 0} {<nil>} 1 DecimalSI} {{0 0} {<nil>}  }
I1111 04:15:24.307195       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 04:15:24.328235       1 nodelock.go:65] "Node lock set" node="gpu-cluster-control-plane"
I1111 04:15:24.334285       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:24.336440       1 scheduler.go:430] After Binding Process
I1111 04:15:24.337323       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="BindingSucceed" message="Successfully binding node [gpu-cluster-control-plane] to default/gpu-pod"
I1111 04:15:24.337718       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:24.354560       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:24.365949       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:24.395938       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:26.068870       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:35.972144       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 04:15:35" nodeName="gpu-cluster-control-plane"
I1111 04:15:35.987788       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 04:15:36.360449       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:45.787308       1 util.go:277] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Type":"NVIDIA","Usedmem":1000,"Usedcores":0}]]}
I1111 04:15:45.793036       1 pods.go:72] Deleted pod gpu-pod with node ID gpu-cluster-control-plane
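
For reference while reading the recurring "Encoded node Devices" heartbeats: the comma-separated fields appear to line up with the "device status" JSON above (ID, Count, Totalmem, Totalcore, Type, Numa, Health), i.e. one entry per physical GPU, with Count=10 being the vGPU split (presumably the deviceSplitCount). A small illustrative decoder under that assumption, not HAMi's actual util.go:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// NodeDevice holds the fields the "Encoded node Devices" line appears to
// carry, matched against the decoded "device status" JSON earlier in the log.
type NodeDevice struct {
	UUID      string
	Count     int // vGPU slots
	Totalmem  int // MiB
	Totalcore int // percent
	Type      string
	Numa      int
	Health    bool
}

// parseNodeDevices decodes strings such as
// "GPU-a02b...,10,15360,100,NVIDIA-Tesla T4,0,true:", assuming devices are
// ':'-separated and fields ','-separated; it is a sketch for reading the log.
func parseNodeDevices(s string) ([]NodeDevice, error) {
	var out []NodeDevice
	for _, entry := range strings.Split(strings.TrimSuffix(s, ":"), ":") {
		f := strings.Split(entry, ",")
		if len(f) != 7 {
			return nil, fmt.Errorf("unexpected field count %d in %q", len(f), entry)
		}
		count, _ := strconv.Atoi(f[1])
		mem, _ := strconv.Atoi(f[2])
		core, _ := strconv.Atoi(f[3])
		numa, _ := strconv.Atoi(f[5])
		health, _ := strconv.ParseBool(f[6])
		out = append(out, NodeDevice{f[0], count, mem, core, f[4], numa, health})
	}
	return out, nil
}

func main() {
	devs, err := parseNodeDevices("GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:")
	if err != nil {
		panic(err)
	}
	// One physical device with 10 vGPU slots, which is why the scheduler
	// reports node device nums=1 while allocatable shows nvidia.com/gpu: 10.
	fmt.Printf("%d physical device(s), %d vGPU slots on the first\n", len(devs), devs[0].Count)
}
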
I1111 04:15:59.470368       1 route.go:131] Start to handle webhook request on /webhook
I1111 04:15:59.470899       1 webhook.go:63] Processing admission hook for pod default/gpu-pod, UID: e6271658-4bad-4a44-aeca-d70ccb94da76
I1111 04:15:59.479682       1 route.go:44] Into Predicate Route inner func
I1111 04:15:59.479990       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 04:15:59.480018       1 device.go:245] Counting mlu devices
I1111 04:15:59.480027       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 04:15:59.480044       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 04:15:59.480076       1 device.go:179] Counting dcu devices
I1111 04:15:59.480083       1 device.go:170] Counting iluvatar devices
I1111 04:15:59.480103       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 04:15:59.480132       1 score.go:32] devices status
I1111 04:15:59.480155       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 04:15:59.480165       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 04:15:59.480178       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 04:15:59.480189       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 04:15:59.480195       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 04:15:59.480212       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 04:15:59.480228       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 04:15:59.480238       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 04:15:59.480364       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
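
One more detail from both filter passes: the "Counting ... devices" and "collect requestreqs" lines show how the container limits are folded into the scheduler's per-container request, nvidia.com/gpu into Nums and nvidia.com/gpumem into Memreq, while MemPercentagereq surfaces as 101, which looks like a "percentage not requested" placeholder since the pod sets no percentage resource. A rough mapping sketch with plain integers standing in for resource.Quantity (the field names and the 101 default are read off the log, not taken from HAMi's code):

package main

import "fmt"

// DeviceRequest mirrors the fields printed by the "collect requestreqs" line.
type DeviceRequest struct {
	Nums             int64
	Memreq           int64
	MemPercentagereq int64
	Coresreq         int64
}

// fromLimits builds the request from the container's resource limits; plain
// integers are used here instead of Kubernetes resource.Quantity for brevity.
func fromLimits(limits map[string]int64) DeviceRequest {
	req := DeviceRequest{MemPercentagereq: 101} // 101 appears to mean "not requested"
	if v, ok := limits["nvidia.com/gpu"]; ok {
		req.Nums = v
	}
	if v, ok := limits["nvidia.com/gpumem"]; ok {
		req.Memreq = v
	}
	return req
}

func main() {
	// Limits from the failing pod spec: nvidia.com/gpu: 2, nvidia.com/gpumem: 1000.
	fmt.Printf("%+v\n", fromLimits(map[string]int64{"nvidia.com/gpu": 2, "nvidia.com/gpumem": 1000}))
	// -> {Nums:2 Memreq:1000 MemPercentagereq:101 Coresreq:0}, matching the log line above
}
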
[... repeated lines trimmed: the same filter cycle for default/gpu-pod recurs roughly every five minutes (04:21, 04:26, 04:31, 04:36, 04:41, 04:46, 04:51, 04:56, 05:01), each time ending in the identical score.go:158 rejection (request devices nums=2, node device nums=1) and a FilteringFailed event, interleaved with the 30-second node-handshake / "Encoded node Devices" heartbeats that keep reporting GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true and occasional client-go reflector "Watch close" messages; the next cycle, at 05:06, follows ...]
I1111 05:06:07.164229       1 route.go:44] Into Predicate Route inner func
I1111 05:06:07.164553       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:06:07.164575       1 device.go:179] Counting dcu devices
I1111 05:06:07.164584       1 device.go:170] Counting iluvatar devices
I1111 05:06:07.164591       1 device.go:245] Counting mlu devices
I1111 05:06:07.164598       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:06:07.164616       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:06:07.164640       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:06:07.164658       1 score.go:32] devices status
I1111 05:06:07.164684       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:06:07.164697       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:06:07.164706       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:06:07.164714       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:06:07.164719       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:06:07.164736       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:06:07.164751       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:06:07.164761       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:06:07.164923       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:06:12.354349       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:06:12" nodeName="gpu-cluster-control-plane"
I1111 05:06:12.369524       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:06:42.418465       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:06:42" nodeName="gpu-cluster-control-plane"
I1111 05:06:42.433730       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:07:11.024410       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Node total 30 items received
I1111 05:07:12.482152       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:07:12" nodeName="gpu-cluster-control-plane"
I1111 05:07:12.496906       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:07:42.542529       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:07:42" nodeName="gpu-cluster-control-plane"
I1111 05:07:42.555983       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:08:12.601053       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:08:12" nodeName="gpu-cluster-control-plane"
I1111 05:08:12.616218       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:08:42.661048       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:08:42" nodeName="gpu-cluster-control-plane"
I1111 05:08:42.675449       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:09:12.722830       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:09:12" nodeName="gpu-cluster-control-plane"
I1111 05:09:12.736951       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:09:42.783394       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:09:42" nodeName="gpu-cluster-control-plane"
I1111 05:09:42.838871       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:10:12.846117       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:10:12" nodeName="gpu-cluster-control-plane"
I1111 05:10:12.860832       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:10:42.909113       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:10:42" nodeName="gpu-cluster-control-plane"
I1111 05:10:42.924042       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:11:07.170348       1 route.go:44] Into Predicate Route inner func
I1111 05:11:07.170669       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:11:07.170687       1 device.go:245] Counting mlu devices
I1111 05:11:07.170695       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:11:07.170712       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:11:07.170726       1 device.go:179] Counting dcu devices
I1111 05:11:07.170735       1 device.go:170] Counting iluvatar devices
I1111 05:11:07.170758       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:11:07.170775       1 score.go:32] devices status
I1111 05:11:07.170797       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:11:07.170809       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:11:07.170820       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:11:07.170832       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:11:07.170846       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:11:07.170866       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:11:07.170882       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:11:07.170894       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:11:07.171070       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:11:12.968201       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:11:12" nodeName="gpu-cluster-control-plane"
I1111 05:11:12.984560       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:11:43.030296       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:11:43" nodeName="gpu-cluster-control-plane"
I1111 05:11:43.044560       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:12:13.092524       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:12:13" nodeName="gpu-cluster-control-plane"
I1111 05:12:13.107167       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:12:18.025477       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Node total 30 items received
I1111 05:12:43.157068       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:12:43" nodeName="gpu-cluster-control-plane"
I1111 05:12:43.173019       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:13:13.222238       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:13:13" nodeName="gpu-cluster-control-plane"
I1111 05:13:13.235500       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:13:43.288826       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:13:43" nodeName="gpu-cluster-control-plane"
I1111 05:13:43.301960       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:14:13.352386       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:14:13" nodeName="gpu-cluster-control-plane"
I1111 05:14:13.369199       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:14:36.024402       1 reflector.go:378] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: forcing resync
I1111 05:14:36.046540       1 reflector.go:378] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: forcing resync
I1111 05:14:43.411690       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:14:43" nodeName="gpu-cluster-control-plane"
I1111 05:14:43.428680       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:14:50.060045       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Pod total 10 items received
I1111 05:15:13.471632       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:15:13" nodeName="gpu-cluster-control-plane"
I1111 05:15:13.486301       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:15:43.547789       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:15:43" nodeName="gpu-cluster-control-plane"
I1111 05:15:43.561602       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:16:07.176886       1 route.go:44] Into Predicate Route inner func
I1111 05:16:07.177195       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:16:07.177212       1 device.go:245] Counting mlu devices
I1111 05:16:07.177218       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:16:07.177236       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:16:07.177251       1 device.go:179] Counting dcu devices
I1111 05:16:07.177260       1 device.go:170] Counting iluvatar devices
I1111 05:16:07.177285       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:16:07.177308       1 score.go:32] devices status
I1111 05:16:07.177330       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:16:07.177364       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:16:07.177378       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:16:07.177389       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:16:07.177396       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:16:07.177414       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:16:07.177429       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:16:07.177441       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:16:07.177533       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:16:13.606379       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:16:13" nodeName="gpu-cluster-control-plane"
I1111 05:16:13.620662       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:16:43.684822       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:16:43" nodeName="gpu-cluster-control-plane"
I1111 05:16:43.701817       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:17:13.744649       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:17:13" nodeName="gpu-cluster-control-plane"
I1111 05:17:13.759481       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:17:43.804875       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:17:43" nodeName="gpu-cluster-control-plane"
I1111 05:17:43.819142       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:18:13.865355       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:18:13" nodeName="gpu-cluster-control-plane"
I1111 05:18:13.880590       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:18:43.926250       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:18:43" nodeName="gpu-cluster-control-plane"
I1111 05:18:43.942466       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:19:13.991148       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:19:13" nodeName="gpu-cluster-control-plane"
I1111 05:19:14.005397       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:19:44.050017       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:19:44" nodeName="gpu-cluster-control-plane"
I1111 05:19:44.063295       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:19:47.027018       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Node total 40 items received
I1111 05:20:14.109941       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:20:14" nodeName="gpu-cluster-control-plane"
I1111 05:20:14.123451       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:20:44.171198       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:20:44" nodeName="gpu-cluster-control-plane"
I1111 05:20:44.185894       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:21:07.182455       1 route.go:44] Into Predicate Route inner func
I1111 05:21:07.182806       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:21:07.182829       1 device.go:179] Counting dcu devices
I1111 05:21:07.182837       1 device.go:170] Counting iluvatar devices
I1111 05:21:07.182845       1 device.go:245] Counting mlu devices
I1111 05:21:07.182853       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:21:07.182869       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:21:07.182891       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:21:07.182917       1 score.go:32] devices status
I1111 05:21:07.182996       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:21:07.183013       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:21:07.183026       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:21:07.183038       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:21:07.183046       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:21:07.183067       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:21:07.183080       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:21:07.183092       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:21:07.183250       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:21:14.230710       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:21:14" nodeName="gpu-cluster-control-plane"
I1111 05:21:14.246472       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:21:44.292372       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:21:44" nodeName="gpu-cluster-control-plane"
I1111 05:21:44.311286       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:22:14.351650       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:22:14" nodeName="gpu-cluster-control-plane"
I1111 05:22:14.366222       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:22:44.412691       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:22:44" nodeName="gpu-cluster-control-plane"
I1111 05:22:44.428362       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:23:14.474068       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:23:14" nodeName="gpu-cluster-control-plane"
I1111 05:23:14.487693       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:23:44.533868       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:23:44" nodeName="gpu-cluster-control-plane"
I1111 05:23:44.547212       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:24:14.593877       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:24:14" nodeName="gpu-cluster-control-plane"
I1111 05:24:14.609210       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:24:38.061088       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Pod total 10 items received
I1111 05:24:44.654300       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:24:44" nodeName="gpu-cluster-control-plane"
I1111 05:24:44.669329       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:25:14.718831       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:25:14" nodeName="gpu-cluster-control-plane"
I1111 05:25:14.734070       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:25:44.776953       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:25:44" nodeName="gpu-cluster-control-plane"
I1111 05:25:44.789951       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:26:07.188974       1 route.go:44] Into Predicate Route inner func
I1111 05:26:07.189281       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:26:07.189301       1 device.go:245] Counting mlu devices
I1111 05:26:07.189310       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:26:07.189329       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:26:07.189356       1 device.go:179] Counting dcu devices
I1111 05:26:07.189364       1 device.go:170] Counting iluvatar devices
I1111 05:26:07.189401       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:26:07.189423       1 score.go:32] devices status
I1111 05:26:07.189449       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:26:07.189465       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:26:07.189478       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:26:07.189490       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:26:07.189499       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:26:07.189516       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:26:07.189527       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:26:07.189535       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:26:07.189686       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:26:14.846855       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:26:14" nodeName="gpu-cluster-control-plane"
I1111 05:26:14.862137       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:26:44.907729       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:26:44" nodeName="gpu-cluster-control-plane"
I1111 05:26:44.922370       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:27:14.973323       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:27:14" nodeName="gpu-cluster-control-plane"
I1111 05:27:14.989124       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:27:45.029662       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:27:45" nodeName="gpu-cluster-control-plane"
I1111 05:27:45.042714       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:28:15.089788       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:28:15" nodeName="gpu-cluster-control-plane"
I1111 05:28:15.102386       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:28:45.149556       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:28:45" nodeName="gpu-cluster-control-plane"
I1111 05:28:45.167388       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:28:57.028902       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Node total 49 items received
I1111 05:29:15.212709       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:29:15" nodeName="gpu-cluster-control-plane"
I1111 05:29:15.227952       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:29:45.273825       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:29:45" nodeName="gpu-cluster-control-plane"
I1111 05:29:45.289652       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:30:15.335321       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:30:15" nodeName="gpu-cluster-control-plane"
I1111 05:30:15.350179       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:30:22.062590       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Pod total 6 items received
I1111 05:30:45.395472       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:30:45" nodeName="gpu-cluster-control-plane"
I1111 05:30:45.411008       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:31:07.193775       1 route.go:44] Into Predicate Route inner func
I1111 05:31:07.194090       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:31:07.194109       1 device.go:179] Counting dcu devices
I1111 05:31:07.194117       1 device.go:170] Counting iluvatar devices
I1111 05:31:07.194125       1 device.go:245] Counting mlu devices
I1111 05:31:07.194134       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:31:07.194152       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:31:07.194180       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:31:07.194200       1 score.go:32] devices status
I1111 05:31:07.194221       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:31:07.194233       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:31:07.194245       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:31:07.194258       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:31:07.194294       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:31:07.194316       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:31:07.194352       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:31:07.194367       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:31:07.194491       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:31:15.456937       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:31:15" nodeName="gpu-cluster-control-plane"
I1111 05:31:15.472764       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:31:45.519145       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:31:45" nodeName="gpu-cluster-control-plane"
I1111 05:31:45.534487       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:32:15.577732       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:32:15" nodeName="gpu-cluster-control-plane"
I1111 05:32:15.591478       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:32:45.638685       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:32:45" nodeName="gpu-cluster-control-plane"
I1111 05:32:45.653546       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:33:15.702553       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:33:15" nodeName="gpu-cluster-control-plane"
I1111 05:33:15.717785       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:33:45.763481       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:33:45" nodeName="gpu-cluster-control-plane"
I1111 05:33:45.778291       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:34:15.824029       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:34:15" nodeName="gpu-cluster-control-plane"
I1111 05:34:15.839307       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:34:45.883605       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:34:45" nodeName="gpu-cluster-control-plane"
I1111 05:34:45.900010       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:35:15.950574       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:35:15" nodeName="gpu-cluster-control-plane"
I1111 05:35:15.964933       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:35:46.017514       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:35:46" nodeName="gpu-cluster-control-plane"
I1111 05:35:46.032916       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:35:57.030221       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Node total 37 items received
I1111 05:36:07.201007       1 route.go:44] Into Predicate Route inner func
I1111 05:36:07.201321       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:36:07.201351       1 device.go:170] Counting iluvatar devices
I1111 05:36:07.201360       1 device.go:245] Counting mlu devices
I1111 05:36:07.201370       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:36:07.201388       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:36:07.201403       1 device.go:179] Counting dcu devices
I1111 05:36:07.201427       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:36:07.201449       1 score.go:32] devices status
I1111 05:36:07.201468       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:36:07.201481       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:36:07.201492       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:36:07.201504       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:36:07.201518       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:36:07.201535       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:36:07.201554       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:36:07.201571       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:36:07.201672       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:36:16.080589       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:36:16" nodeName="gpu-cluster-control-plane"
I1111 05:36:16.094542       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:36:46.145982       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:36:46" nodeName="gpu-cluster-control-plane"
I1111 05:36:46.160533       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:37:16.209514       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:37:16" nodeName="gpu-cluster-control-plane"
I1111 05:37:16.224127       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:37:46.270702       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:37:46" nodeName="gpu-cluster-control-plane"
I1111 05:37:46.288611       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:38:16.330925       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:38:16" nodeName="gpu-cluster-control-plane"
I1111 05:38:16.346117       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:38:46.393239       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:38:46" nodeName="gpu-cluster-control-plane"
I1111 05:38:46.410574       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:39:16.458604       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:39:16" nodeName="gpu-cluster-control-plane"
I1111 05:39:16.474054       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:39:33.063952       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Pod total 11 items received
I1111 05:39:46.519084       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:39:46" nodeName="gpu-cluster-control-plane"
I1111 05:39:46.533930       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:40:16.580085       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:40:16" nodeName="gpu-cluster-control-plane"
I1111 05:40:16.593464       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:40:46.642141       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:40:46" nodeName="gpu-cluster-control-plane"
I1111 05:40:46.658361       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:41:07.208118       1 route.go:44] Into Predicate Route inner func
I1111 05:41:07.208443       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:41:07.208461       1 device.go:245] Counting mlu devices
I1111 05:41:07.208467       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:41:07.208484       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:41:07.208499       1 device.go:179] Counting dcu devices
I1111 05:41:07.208509       1 device.go:170] Counting iluvatar devices
I1111 05:41:07.208532       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:41:07.208549       1 score.go:32] devices status
I1111 05:41:07.208570       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:41:07.208582       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:41:07.208593       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:41:07.208606       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:41:07.208615       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:41:07.208632       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:41:07.208644       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:41:07.208655       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:41:07.208819       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:41:16.704185       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:41:16" nodeName="gpu-cluster-control-plane"
I1111 05:41:16.718579       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:41:46.764711       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:41:46" nodeName="gpu-cluster-control-plane"
I1111 05:41:46.781074       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:42:16.827449       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:42:16" nodeName="gpu-cluster-control-plane"
I1111 05:42:16.846788       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:42:46.891731       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:42:46" nodeName="gpu-cluster-control-plane"
I1111 05:42:46.906090       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:43:16.951782       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:43:16" nodeName="gpu-cluster-control-plane"
I1111 05:43:16.967642       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:43:47.027862       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:43:47" nodeName="gpu-cluster-control-plane"
I1111 05:43:47.042667       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:44:17.103384       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:44:17" nodeName="gpu-cluster-control-plane"
I1111 05:44:17.117724       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:44:47.168108       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:44:47" nodeName="gpu-cluster-control-plane"
I1111 05:44:47.185039       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:44:48.065204       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Pod total 6 items received
I1111 05:45:17.232880       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:45:17" nodeName="gpu-cluster-control-plane"
I1111 05:45:17.247747       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:45:40.031403       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Node total 51 items received
I1111 05:45:47.293509       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:45:47" nodeName="gpu-cluster-control-plane"
I1111 05:45:47.314238       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:46:07.212841       1 route.go:44] Into Predicate Route inner func
I1111 05:46:07.213308       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:46:07.213483       1 device.go:179] Counting dcu devices
I1111 05:46:07.213496       1 device.go:170] Counting iluvatar devices
I1111 05:46:07.213503       1 device.go:245] Counting mlu devices
I1111 05:46:07.213511       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:46:07.213592       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:46:07.213632       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:46:07.213649       1 score.go:32] devices status
I1111 05:46:07.213759       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:46:07.213816       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:46:07.213853       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:46:07.213894       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:46:07.213906       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:46:07.213925       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:46:07.213981       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:46:07.214013       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:46:07.214165       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:46:17.354182       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:46:17" nodeName="gpu-cluster-control-plane"
I1111 05:46:17.369528       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:46:47.416787       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:46:47" nodeName="gpu-cluster-control-plane"
I1111 05:46:47.432219       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:47:17.482011       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:47:17" nodeName="gpu-cluster-control-plane"
I1111 05:47:17.497414       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:47:47.540480       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:47:47" nodeName="gpu-cluster-control-plane"
I1111 05:47:47.555653       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:48:17.599825       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:48:17" nodeName="gpu-cluster-control-plane"
I1111 05:48:17.613092       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:48:47.660537       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:48:47" nodeName="gpu-cluster-control-plane"
I1111 05:48:47.678262       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:49:17.725039       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:49:17" nodeName="gpu-cluster-control-plane"
I1111 05:49:17.740639       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:49:47.786408       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:49:47" nodeName="gpu-cluster-control-plane"
I1111 05:49:47.801913       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:49:52.067062       1 reflector.go:790] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Watch close - *v1.Pod total 7 items received
I1111 05:50:17.848006       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:50:17" nodeName="gpu-cluster-control-plane"
I1111 05:50:17.861955       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:50:47.909875       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:50:47" nodeName="gpu-cluster-control-plane"
I1111 05:50:47.925823       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:51:07.217244       1 route.go:44] Into Predicate Route inner func
I1111 05:51:07.217570       1 scheduler.go:445] "begin schedule filter" pod="gpu-pod" uuid="9c2f8cbd-23bf-4ea6-9423-099b99b1e558" namespaces="default"
I1111 05:51:07.217589       1 device.go:245] Counting mlu devices
I1111 05:51:07.217600       1 device.go:250] idx= nvidia.com/gpu val= {{2 0} {<nil>} 2 DecimalSI} {{0 0} {<nil>}  }
I1111 05:51:07.217617       1 device.go:250] idx= nvidia.com/gpumem val= {{1 3} {<nil>} 1k DecimalSI} {{0 0} {<nil>}  }
I1111 05:51:07.217633       1 device.go:179] Counting dcu devices
I1111 05:51:07.217643       1 device.go:170] Counting iluvatar devices
I1111 05:51:07.217663       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":1000,"MemPercentagereq":101,"Coresreq":0}}]
I1111 05:51:07.217681       1 score.go:32] devices status
I1111 05:51:07.217703       1 score.go:34] "device status" device id="GPU-a02b30b8-5b20-131c-0e47-6bc99948e264" device detail={"Device":{"ID":"GPU-a02b30b8-5b20-131c-0e47-6bc99948e264","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":15360,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-Tesla T4","Health":true},"Score":0}
I1111 05:51:07.217715       1 node_policy.go:61] node gpu-cluster-control-plane used 0, usedCore 0, usedMem 0,
I1111 05:51:07.217730       1 node_policy.go:73] node gpu-cluster-control-plane computer score is 0.000000
I1111 05:51:07.217742       1 gpu_policy.go:70] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 user 0, userCore 0, userMem 0,
I1111 05:51:07.217749       1 gpu_policy.go:76] device GPU-a02b30b8-5b20-131c-0e47-6bc99948e264 computer score is 2.651042
I1111 05:51:07.217767       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1
I1111 05:51:07.217780       1 score.go:225] "calcScore:node not fit pod" pod="default/gpu-pod" node="gpu-cluster-control-plane"
I1111 05:51:07.217791       1 scheduler.go:479] All node scores do not meet for pod gpu-pod
I1111 05:51:07.217940       1 event.go:307] "Event occurred" object="default/gpu-pod" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I1111 05:51:17.972672       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:51:17" nodeName="gpu-cluster-control-plane"
I1111 05:51:17.988619       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
I1111 05:51:48.033309       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2024.11.11 05:51:48" nodeName="gpu-cluster-control-plane"
I1111 05:51:48.050057       1 util.go:163] Encoded node Devices: GPU-a02b30b8-5b20-131c-0e47-6bc99948e264,10,15360,100,NVIDIA-Tesla T4,0,true:
wawa0210 commented 1 week ago
I1111 05:51:07.217767       1 score.go:158] "request devices nums cannot exceed the total number of devices on the node." pod="default/gpu-pod" request devices nums=2 node device nums=1

The log shows that there is only one physical GPU on the node, but the pod requests 2 (nvidia.com/gpu: 2), so the scheduler filter rejects the node.

The 10 vGPUs registered on the node mean that the single physical GPU can be shared by up to 10 pods; they do not mean that one pod can request more vGPUs than there are physical GPUs on the node.
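
For illustration, a request that fits this single-GPU node could look like the sketch below: each pod asks for at most nvidia.com/gpu: 1 (one slice of the single T4), and up to 10 such pods can share the card given the registered count of 10. The pod names are placeholders; the resource names nvidia.com/gpu and nvidia.com/gpumem are the ones used elsewhere in this issue, with gpumem given in MB.

# Two pods, each taking one vGPU slice of the same physical T4.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-a            # placeholder name
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1        # at most 1 per pod on a single-GPU node
          nvidia.com/gpumem: 1000  # device memory for this slice, in MB
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-b            # placeholder name; shares the same physical GPU
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000

A pod that genuinely needs nvidia.com/gpu: 2 can only be scheduled on a node with at least two physical GPUs; the per-GPU sharing factor (10 here) comes from the device plugin's split-count setting, which should be configurable through the HAMi chart's deviceSplitCount value.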

cr7258 commented 1 week ago

The 10 vGPUs registered on the node mean that the single physical GPU can be shared by up to 10 pods; they do not mean that one pod can request more vGPUs than there are physical GPUs on the node.

Ah, got it. I initially misunderstood it. Thank you for your explanation.