Required information

Distribution: Ubuntu
Distribution version: 20.04
The output of "lxc info" or if that fails:

config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICHjCCAaSgAwIBAgIQBkK0q1MFZ6Xbq9GhcvkUTzAKBggqhkjOPQQDAzA9MRww
    GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMR0wGwYDVQQDDBRyb290QGluc3B1
    ci1ORjU0NjhNNTAeFw0yMzA2MDMxNjE4NTVaFw0zMzA1MzExNjE4NTVaMD0xHDAa
    BgNVBAoTE2xpbnV4Y29udGFpbmVycy5vcmcxHTAbBgNVBAMMFHJvb3RAaW5zcHVy
    LU5GNTQ2OE01MHYwEAYHKoZIzj0CAQYFK4EEACIDYgAE/2v23vUutsnHiipjn3zx
    LdssijLWILKVZ+gLphyL5t59LGwVJMw6KANI17RePlFdIWIzGbRy6fLzuWCOBcLn
    /rbUSMFS4hD9XXwWKTmKd4XHPWqhzzlcaOPk619+bkvfo2kwZzAOBgNVHQ8BAf8E
    BAMCBaAwEwYDVR0lBAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAyBgNVHREE
    KzApgg9pbnNwdXItTkY1NDY4TTWHBH8AAAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYI
    KoZIzj0EAwMDaAAwZQIwMKCrCZbYe4q9n5ZwCYXQzA5JXiOMVoyWmnW/kfG87J9k
    LyskFBL2NMGER98QH1srAjEA6DTamPs4B2yXiT2O8/qwSqLMBKhfoBZwPGzjtcNw
    5kWg7LjjgDZguYosDP2p/oLO
    -----END CERTIFICATE-----
  certificate_fingerprint: 2525f52ba39aa4c0dbeeb77f8d8418b7d4994aa099af16a727f549a85192340c
  driver: lxc | qemu
  driver_version: 5.0.2 | 8.0.0
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-73-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: inspur-NF5468M5
  server_pid: 4125
  server_version: "5.14"
  storage: zfs
  storage_version: 2.1.5-1ubuntu6~22.04.1
  storage_supported_drivers:
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
    remote: false
  - name: zfs
    version: 2.1.5-1ubuntu6~22.04.1
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false
  - name: ceph
    version: 17.2.5
    remote: true
  - name: cephfs
    version: 17.2.5
    remote: true
  - name: cephobject
    version: 17.2.5
    remote: true

Issue description

I used LXD to create a container. Inside that LXD container, I used the NVIDIA NGC PyTorch image (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) to create another container with Docker. When I run docker run --gpus all --name test1 -it --runtime=nvidia nvcr.io/nvidia/pytorch:23.05-py3 and enter the container, it shows:


ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ Initialization error (error 3) ]]

Calling torch.cuda.is_available() inside the Docker container then returns False:

>>> torch.cuda.is_available()
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:115: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>>
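
The interactive check above can be wrapped in a small script so the same test can be repeated in the LXD container and in the Docker container. This is only a minimal sketch and not part of the original report: the function name `report_cuda` is hypothetical, and it assumes PyTorch may or may not be installed where it runs.

```python
import importlib.util


def report_cuda():
    """Return a dict describing CUDA visibility, or None if torch is missing."""
    # Assumption: torch may be absent (for example in the outer LXD container).
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    available = torch.cuda.is_available()
    return {
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        "cuda_available": available,
        # Only query the device count when initialization succeeded
        "device_count": torch.cuda.device_count() if available else 0,
    }


if __name__ == "__main__":
    print(report_cuda())
```

In the failing Docker container described here, `cuda_available` would be expected to be False, matching the REPL session above.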

Running nvidia-smi in the LXD container outputs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           Off| 00000000:40:00.0 Off |                   On |
| N/A   27C    P0               31W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           Off| 00000000:B1:00.0 Off |                   On |
| N/A   26C    P0               36W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG|
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Running nvcc --version in the LXD container shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

Running nvidia-smi in the Docker container outputs:

Wed Jun  7 12:17:34 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           Off| 00000000:40:00.0 Off |                   On |
| N/A   27C    P0               31W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           Off| 00000000:B1:00.0 Off |                   On |
| N/A   26C    P0               36W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG|
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
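
Both nvidia-smi outputs above report MIG mode as Enabled on each GPU while the MIG devices table says "No MIG devices found". Whether that combination is related to the CUDA initialization error is an assumption, not something this report confirms. A minimal sketch for spotting it in saved nvidia-smi text follows; the function name is hypothetical and the string matching depends on the table layout shown here.

```python
def mig_enabled_without_devices(smi_text: str) -> bool:
    """True if the text shows MIG mode Enabled but lists no MIG devices."""
    lines = smi_text.splitlines()
    # In this table layout, MIG mode is the last cell of a GPU entry's third row.
    mig_enabled = any(line.rstrip().endswith("Enabled |") for line in lines)
    no_devices = any("No MIG devices found" in line for line in lines)
    return mig_enabled and no_devices


# Excerpt copied from the nvidia-smi output in this report.
sample = """\
|   0  NVIDIA A100-SXM4-80GB           Off| 00000000:40:00.0 Off |                   On |
| N/A   27C    P0               31W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
|  No MIG devices found                                                                 |
"""

print(mig_enabled_without_devices(sample))  # prints True
```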

Running nvcc --version in the Docker container shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Note that this Docker container runs inside an LXC container.

Information to attach

[ ] Any relevant kernel output (dmesg)

[ ] Container log (lxc info NAME --show-log)


Name: chenjincong
Status: RUNNING
Type: container
Architecture: x86_64
PID: 11616
Created: 2023/06/05 16:32 CST
Last Used: 2023/06/07 12:08 CST

Resources:
  Processes: 156
  Disk usage:
    root: 550.46GiB
  CPU usage:
    CPU usage (in seconds): 6233
  Memory usage:
    Memory (current): 2.87GiB
    Memory (peak): 3.36GiB
  Network usage:
    docker0:
      Type: broadcast
      State: UP
      MAC address: 02:42:99:a9:b4:65
      MTU: 1500
      Bytes received: 625.32kB
      Bytes sent: 32.49MB
      Packets received: 12895
      Packets sent: 21297
      IP addresses:
        inet: 172.17.0.1/16 (global)
        inet6: fe80::42:99ff:fea9:b465/64 (link)
    eth0:
      Type: broadcast
      State: UP
      Host interface: veth7570b1bc
      MAC address: 00:16:3e:99:81:19
      MTU: 1500
      Bytes received: 12.64GB
      Bytes sent: 410.08MB
      Packets received: 8441356
      Packets sent: 5625534
      IP addresses:
        inet: 10.124.188.222/24 (global)
        inet6: fd42:eb5f:d57b:3c13:216:3eff:fe99:8119/64 (global)
        inet6: fe80::216:3eff:fe99:8119/64 (link)
    lo:
      Type: loopback
      State: UP
      MTU: 65536
      Bytes received: 120.83kB
      Bytes sent: 120.83kB
      Packets received: 1018
      Packets sent: 1018
      IP addresses:
        inet: 127.0.0.1/8 (local)
        inet6: ::1/128 (local)

Log:

lxc chenjincong 20230607040824.928 ERROR conf - ../src/src/lxc/conf.c:turn_into_dependent_mounts:3948 - No such file or directory - Failed to recursively turn old root mount tree into dependent mount. Continuing...

[ ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20230602_07:43)
  image.name: ubuntu-focal-amd64-default-20230602_07:43
  image.os: ubuntu
  image.release: focal
  image.serial: "20230602_07:43"
  image.variant: default
  security.nesting: "true"
  security.privileged: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
  volatile.base_image: 88b26c8cd8737818c062f547b1f7cb472ed3dc82bd66bcf95779dff4ae6cc5c5
  volatile.cloud-init.instance-id: 217fc8fc-978e-4e85-a543-a408c4b9ca41
  volatile.eth0.host_name: veth7570b1bc
  volatile.eth0.hwaddr: 00:16:3e:99:81:19
  volatile.idmap.base: "0"
  volatile.idmap.current: '[]'
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 053d9a82-df16-4fd3-8184-96d47cb1f799
  volatile.uuid.generation: 053d9a82-df16-4fd3-8184-96d47cb1f799
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  gpu0:
    gputype: physical
    pci: "40:00.0"
    type: gpu
  gpu1:
    gputype: physical
    pci: B1:00.0
    type: gpu
  root:
    path: /
    pool: zfs-pool
    size: 30TB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

canonical / lxd

CUDA failed to initialize. #11804
