SoftRoCE / rxe-dev

Development Repository for RXE

SoftRoCE fails while working with NVMe-oF #63

Open raki2322 opened 7 years ago

raki2322 commented 7 years ago

Hi, I am trying to set up NVMe-oF using Soft-RoCE, but I have not been able to get it working because of the following error:

rdma_rxe: qp#17 moved to error state
nvme nvme0: identify controller failed

When I debugged it with KASAN, it shows this:

[    7.345365] rdma_rxe: qp#17 moved to error state
[    8.847464] nvme nvme0: identify controller failed
[    8.861048] BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x12f3/0x1880 [rdma_rxe] at addr ffff88001f787838

which points into the InfiniBand driver code:

(gdb) list *(rxe_post_send+0x12f3)
0x1e133 is in rxe_post_send (drivers/infiniband/sw/rxe/rxe_verbs.c:685).
680             switch (wr->opcode) {
681             case IB_WR_RDMA_WRITE_WITH_IMM:
682                     wr->ex.imm_data = ibwr->ex.imm_data;
683             case IB_WR_RDMA_READ:

Please help me find a solution to this.
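A few diagnostics can help narrow this down. This is only a sketch; it assumes the soft-RoCE device is named rxe0 and that rxe_cfg (from the librxe userspace tools) and ibv_devinfo (from the ibverbs utilities) are installed. Adjust the names to match your setup.

# Confirm the rxe device exists and which netdev it is bound to
rxe_cfg status

# Check the port/link state of the soft-RoCE device
ibv_devinfo -d rxe0

# Watch rdma_rxe and nvme kernel messages while reproducing the failure
dmesg -w | grep -E 'rdma_rxe|nvme'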

Seth5141 commented 6 years ago

Hello,

I ran into the same issue while attempting to set up a machine to test SPDK functionality with RDMA NICs. I have included a bash script that creates an NVMe-oF target and host on the same machine and attempts to discover the NVMe-oF loopback device. When run on a machine with an RDMA-enabled NIC (Mellanox ConnectX-3 Pro), the script works and the host is able to discover the loopback device. I ran the same script on two different machines using soft-RoCE and got the same error reported by raki2322 each time. The error occurs at the nvme discover command at the end of the script; everything before that command appears to execute properly.

Passing machine: Fedora 25, Linux 4.9.7, Mellanox ConnectX-3 Pro NIC
Failing machine: Fedora 25, Linux 4.11.5, soft-RoCE

I would really appreciate your expertise in finding a workaround for this issue.

Thank you
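One thing worth double-checking on the soft-RoCE machines: the script below only loads the generic IB/RDMA modules and the Mellanox drivers, so the rxe device has to be created separately before the nvmet port can bind to the target IP. A minimal sketch of that step, assuming the rdma_rxe module and the rxe_cfg tool are available and eth0 is the underlying Ethernet interface (substitute your own interface name):

# Load the soft-RoCE driver and attach an rxe device to eth0
modprobe rdma_rxe
rxe_cfg start
rxe_cfg add eth0

# Verify the new device shows up under /sys/class/infiniband (typically as rxe0)
rxe_cfg status

On newer kernels and rdma-core releases, where rxe_cfg has been retired in favor of iproute2, the equivalent is: rdma link add rxe0 type rxe netdev eth0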

#!/bin/bash 

set -v

NVMF_PORT=4420
NVMF_IP_PREFIX="143.182.136"
NVMF_IP_LEAST_ADDR=117
NVMF_FIRST_TARGET_IP=$NVMF_IP_PREFIX.$NVMF_IP_LEAST_ADDR
RPC_PORT=5260
subsystemname=nqn.2016-06.io.spdk:testnqn

modprobe ib_cm
modprobe ib_core
modprobe ib_ucm
modprobe ib_umad
modprobe ib_uverbs
modprobe iw_cm
modprobe rdma_cm
modprobe rdma_ucm

# Detect Mellanox NICs; "$nvmf_nic_bdfs" stays empty on a machine without one
# (for example a soft-RoCE setup, where an rxe device is used instead) and the
# Mellanox-specific loop below then does nothing.
nvmf_nic_bdfs=""
if hash lspci 2>/dev/null; then
    nvmf_nic_bdfs=`lspci | grep Ethernet | grep Mellanox | awk -F ' ' '{print "0000:"$1}'`
fi

mlx_core_driver="mlx4_core"
mlx_ib_driver="mlx4_ib"
mlx_en_driver="mlx4_en"

# for nvmf target loopback test, suppose we only have one type of card.
for nvmf_nic_bdf in $nvmf_nic_bdfs
do
    result=`lspci -vvv -s $nvmf_nic_bdf | grep 'Kernel modules' | awk -F ' ' '{print $3}'`
    if [ "$result" == "mlx5_core" ]; then
        mlx_core_driver="mlx5_core"
        mlx_ib_driver="mlx5_ib"
        mlx_en_driver=""
    fi
    break;
done

modprobe $mlx_core_driver
modprobe $mlx_ib_driver
if [ -n "$mlx_en_driver" ]; then
    modprobe $mlx_en_driver
fi

# The mlx4 driver takes an extra few seconds to load after modprobe returns,
# otherwise ifconfig operations will do nothing.
sleep 5

let count=$NVMF_IP_LEAST_ADDR
for nic_type in `ls /sys/class/infiniband`; do
    for nic_name in `ls /sys/class/infiniband/${nic_type}/device/net`; do
        ifconfig $nic_name $NVMF_IP_PREFIX.$count netmask 255.255.254.0 up

        # dump configuration for debug log
        ifconfig $nic_name
        let count=$count+1
    done
done

modprobe null_blk nr_devices=1
modprobe nvmet
modprobe nvmet-rdma
modprobe nvme-fabrics
modprobe nvme-rdma
sleep 5

#nvmetcli restore test_nqn.json

#ln -s /sys/kernel/config/nvmet/subsystems/nqn.2016-06.io.spdk:testnqn /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2016-06.io.spdk:testnqn

if [ ! -d /sys/kernel/config/nvmet/subsystems/$subsystemname ]; then
    mkdir /sys/kernel/config/nvmet/subsystems/$subsystemname
fi
echo 1 > /sys/kernel/config/nvmet/subsystems/$subsystemname/attr_allow_any_host

if [ ! -d /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1 ]; then
    mkdir /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1
fi

echo -n /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1/enable

if [ ! -d /sys/kernel/config/nvmet/ports/1 ]; then
    mkdir /sys/kernel/config/nvmet/ports/1
fi

echo -n rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo -n ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo -n $NVMF_FIRST_TARGET_IP > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo -n $NVMF_PORT > /sys/kernel/config/nvmet/ports/1/addr_trsvcid

ln -s /sys/kernel/config/nvmet/subsystems/$subsystemname /sys/kernel/config/nvmet/ports/1/subsystems/$subsystemname

sleep 5
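The nvme discover command referenced above is not included in the paste. For illustration only, a typical nvme-cli invocation against this target configuration would look something like the following (the exact command from the failing runs is not shown, so treat this as an assumption; the parameters reuse the variables defined at the top of the script):

# Illustrative discover step (not part of the original script): query the
# target over RDMA at the address/port configured above.
nvme discover -t rdma -a $NVMF_FIRST_TARGET_IP -s $NVMF_PORT

# Connecting to the test subsystem would then be along the lines of:
# nvme connect -t rdma -n $subsystemname -a $NVMF_FIRST_TARGET_IP -s $NVMF_PORT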
wanqunintel commented 6 years ago

Has anyone else run into this issue or found a solution for it? I am currently setting up the environment on a Vagrant VM to run NVMe-oF and I am hitting the same issue.