GPUOpen-LibrariesAndSDKs / MxGPU-Virtualization


infinite loop when capabilities error in function validate_link_status #7

Open flintcq opened 6 years ago

flintcq commented 6 years ago

The code in question is in the function validate_link_status, lines 405 to 421 of gim_reset.c:


do {
    // to get position of a capability
    kcl_pci_read_config_byte(adapt->p2p_bridge_dev, pos, &data_8);
    if (data_8 == 0) // i guess, no capabilities left, then stop
        break;

    pos = data_8;

    gim_info("pos %x\n", pos);

    /* Go to next cap */
    // to get the capability id, and stored in data_8
    kcl_pci_read_config_byte(adapt->p2p_bridge_dev, pos, &data_8);
    if (data_8 == PCI_CAP_ID_EXP) // is a pci express capability id, then stop
        break;
    /* Set next cap's position */
    pos = pos + 1;

} while (1);

/* PCI_CAP_ID_EXP found? */
if (pos) {

Initially pos = PCI_CAPABILITY_LIST, and the code iterates over all capabilities to find the one whose ID is PCI_CAP_ID_EXP. Unfortunately, when the capability data is broken, the pos read for a capability is always 0xff, which causes an infinite loop. A workaround is to add a limit on the number of iterations, which I borrowed from the function vfio_cap_init in vfio_pci_config.c, by adding:

   loops = (PCI_CFG_SPACE_SIZE - PCI_STD_HEADER_SIZEOF) / PCI_CAP_SIZEOF;

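and changing "} while (1);" to "} while (loops--);".

However, this makes the following condition unpredictable, because pos no longer indicates whether the capability was found when the loop ends by running out of iterations:

/* PCI_CAP_ID_EXP found? */
if (pos) {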
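
One way to avoid relying on pos after a bounded loop is to have the walk return the capability offset only when PCI_CAP_ID_EXP was actually seen, and 0 in every other case. The following is a minimal sketch under stated assumptions, not the gim driver's code: cfg_read8 and find_pcie_cap are hypothetical names, the config space is modelled as a plain 256-byte array instead of going through kcl_pci_read_config_byte, and the constants use the values from the Linux kernel headers. It assumes the standard capability layout (ID byte at the capability offset, next pointer in the byte after it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined in the Linux kernel PCI headers. */
#define PCI_CAPABILITY_LIST    0x34  /* offset of the first capability pointer */
#define PCI_CAP_ID_EXP         0x10  /* PCI Express capability ID */
#define PCI_STD_HEADER_SIZEOF  64
#define PCI_CFG_SPACE_SIZE     256
#define PCI_CAP_SIZEOF         4

/* Hypothetical stand-in for a config-space byte read (assumption:
 * config space is a 256-byte array rather than a real device). */
static uint8_t cfg_read8(const uint8_t *cfg, unsigned int off)
{
    assert(cfg != NULL);
    return cfg[off & 0xff];
}

/* Hypothetical helper: returns the offset of the PCI Express capability,
 * or 0 if it is missing, the list is broken, or the loop limit is hit. */
static unsigned int find_pcie_cap(const uint8_t *cfg)
{
    /* Same bound as the workaround above: (256 - 64) / 4 = 48 entries. */
    int loops = (PCI_CFG_SPACE_SIZE - PCI_STD_HEADER_SIZEOF) / PCI_CAP_SIZEOF;
    unsigned int pos = cfg_read8(cfg, PCI_CAPABILITY_LIST);

    while (loops-- > 0) {
        uint8_t id;

        /* 0 terminates the list; a pointer into the header is invalid. */
        if (pos < PCI_STD_HEADER_SIZEOF)
            return 0;
        pos &= ~3u;                  /* capabilities are dword aligned */

        id = cfg_read8(cfg, pos);
        if (id == 0xff)              /* broken data: reads return all ones */
            return 0;
        if (id == PCI_CAP_ID_EXP)
            return pos;              /* found: the only non-zero return */

        pos = cfg_read8(cfg, pos + 1);   /* follow the next pointer */
    }
    return 0;                        /* limit reached, e.g. a cyclic list */
}
```

With a helper shaped like this, the caller's "if (pos)" check would be true only when the capability was found, because a broken list, a cyclic list, and an exhausted loop limit all return 0.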
flintcq commented 6 years ago

With only the additions and changes I mentioned above, it causes a stack problem:

[16045.295678] Thread overran stack, or stack corrupted
[16045.295683] Oops: 0000 [#1] SMP
[16045.295858] Modules linked in: gim(O) vfio_pci vfio_iommu_type1 vfio_virqfd vfio xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables binfmt_misc snd_hda_codec_hdmi input_leds bridge stp llc snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_intel coretemp snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd sb_edac edac_core ioatdma soundcore shpchp lpc_ich mac_hid kvm_intel kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456
[16045.295948] async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic nouveau crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 mxm_wmi lrw video gf128mul glue_helper ttm ablk_helper cryptd drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb dca usbhid ptp pps_core ahci hid i2c_algo_bit libahci fjes wmi
[16045.295956] CPU: 11 PID: 1051 Comm: kworker/11:2 Tainted: G O 4.4.1174.4.0-vgpu #1
[16045.295960] Hardware name: Supermicro X10DAi/X10DAI, BIOS 3.0a 02/05/2018
[16045.296005] Workqueue: events sched_work_handler [gim]
[16045.296010] task: ffff8808542d0e00 ti: ffff880855a40000 task.ti: ffff880855a40000
[16045.296020] RIP: 0010:[] [] cpuacct_charge+0x23/0x40
[16045.296025] RSP: 0018:ffff88085f8c3d70 EFLAGS: 00010046
[16045.296030] RAX: 0000000000010548 RBX: ffff8808542d0e60 RCX: 0000000055bac338
[16045.296034] RDX: ffffffff81e54f60 RSI: 00000000003e97cc RDI: ffff8808542d0e00
[16045.296039] RBP: ffff88085f8c3d70 R08: ffff88085f8d72c0 R09: 0000000000000001
[16045.296044] R10: 00000000000000ca R11: 0000000000000000 R12: ffff88085f8d7330
[16045.296048] R13: 00000000003e97cc R14: ffff8808542d0e00 R15: ffff88085f8d1728
[16045.296055] FS: 0000000000000000(0000) GS:ffff88085f8c0000(0000) knlGS:0000000000000000
[16045.296063] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16045.296067] CR2: 000000022fc9dc80 CR3: 0000000002e0a000 CR4: 0000000000160670
[16045.296070] Stack:
[16045.296079] ffff88085f8c3db0 ffffffff810b7a53 0000000000000000 ffff88085f8d72c0
[16045.296087] ffff8808542d0e60 ffff88085f8d7330 0000000000000000 ffff88085f8d1728
[16045.296110] ffff88085f8c3e30 ffffffff810be434 0000000000000001 ffff8808542d0e00
[16045.296111] Call Trace:
[16045.296126]
[16045.296126] [] update_curr+0xe3/0x170
[16045.296136] [] task_tick_fair+0x44/0x8e0
[16045.296147] [] ? wake_up+0x44/0x50
[16045.296159] [] ? sched_clock+0x9/0x10
[16045.296170] [] ? sched_clock_cpu+0x8f/0xa0
[16045.296181] [] scheduler_tick+0x62/0xe0
[16045.296192] [] ? tick_sched_do_timer+0x30/0x30
[16045.296200] [] update_process_times+0x51/0x60
[16045.296211] [] tick_sched_handle.isra.14+0x25/0x60
[16045.296220] [] tick_sched_timer+0x3d/0x70
[16045.296230] [] hrtimer_run_queues+0x104/0x290
[16045.296241] [] hrtimer_interrupt+0xa8/0x1a0
[16045.296253] [] local_apic_timer_interrupt+0x3e/0x60
[16045.296263] [] smp_apic_timer_interrupt+0x43/0x60
[16045.296273] [] apic_timer_interrupt+0xbf/0xd0
[16045.296287]
[16045.296288] [] ? console_unlock+0x313/0x550
[16045.296298] [] vprintk_emit+0x2d7/0x520
[16045.296309] [] vprintk_default+0x29/0x40
[16045.296322] [] printk+0x5a/0x76
[16045.296358] [] wait_cmd_complete+0x3bd/0x461 [gim]
[16045.296398] [] idle_vf+0x19d/0x1cc [gim]
[16045.296433] [] stop_current_vf+0x73/0x166 [gim]
[16045.296466] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.296499] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.296533] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.296565] [] stop_current_vf+0x8c/0x166 [gim]
[16045.296596] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.296629] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.296663] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.296694] [] stop_current_vf+0x8c/0x166 [gim]
[16045.296726] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.296759] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.296798] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.296829] [] stop_current_vf+0x8c/0x166 [gim]
[16045.296861] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.296894] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.296928] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.296959] [] stop_current_vf+0x8c/0x166 [gim]
[16045.296991] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.297024] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.297058] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.297089] [] stop_current_vf+0x8c/0x166 [gim]
............
[16045.306396] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.306432] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.306466] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.306500] [] stop_current_vf+0x8c/0x166 [gim]
[16045.306532] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.306568] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.306602] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.306636] [] stop_current_vf+0x8c/0x166 [gim]
[16045.306668] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.306704] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.306738] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.306772] [] stop_current_vf+0x8c/0x166 [gim]
[16045.306804] [] remove_from_run_list+0x10b/0x31f [gim]
[16045.306840] [] gim_sched_reset_gpu+0x214/0x228 [gim]
[16045.306874] [] gim_sched_reset+0x1e/0x21 [gim]
[16045.306908] [] world_switch+0x17d/0x238 [gim]
[16045.306941] [] triger_world_switch+0xc5/0xd8 [gim]
[16045.306976] [] sched_work_handler+0x1a/0x1c [gim]
[16045.306991] [] process_one_work+0x16b/0x490
[16045.307001] [] worker_thread+0x4b/0x4d0
[16045.307012] [] ? process_one_work+0x490/0x490
[16045.307026] [] kthread+0xe7/0x100
[16045.307037] [] ? kthread_create_on_node+0x1e0/0x1e0
[16045.307046] [] ret_from_fork+0x55/0x80
[16045.307055] [] ? kthread_create_on_node+0x1e0/0x1e0