raki2322 opened this issue 7 years ago
Hello,
I ran into the same issue while setting up a machine to test SPDK functionality with RDMA NICs. I have included a bash script that creates an NVMe-oF target and host on the same machine and then attempts to discover the NVMe-oF loopback device. On a machine with an RDMA-enabled NIC (Mellanox ConnectX-3 Pro), the script runs correctly and the host is able to discover the loopback device. I ran the same script on two different machines using soft-RoCE and got the same error encountered by raki2322 each time. The error occurs at the nvme discover command at the end of the script; everything before that command appears to execute properly.
Passing machine: Fedora 25, Linux 4.9.7, Mellanox ConnectX-3 Pro NIC
Failing machine: Fedora 25, Linux 4.11.5, soft-RoCE
I would really appreciate your expertise in finding a workaround for this issue.
Thank you
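For anyone reproducing the failing configuration: the soft-RoCE (rxe) device has to be layered on top of an ordinary Ethernet interface before the script below is run. A minimal sketch, assuming the rxe_cfg tool from rdma-core/librxe is available and eth0 is the Ethernet interface to use (both are assumptions, not details from the report above):
modprobe rdma_rxe        # soft-RoCE kernel module
rxe_cfg start            # initialize the rxe service
rxe_cfg add eth0         # create an rxe device on top of eth0 (example interface name)
rxe_cfg status           # verify the new rxe device is bound to eth0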
#!/bin/bash
set -v
NVMF_PORT=4420
NVMF_IP_PREFIX="143.182.136"
NVMF_IP_LEAST_ADDR=117
NVMF_FIRST_TARGET_IP=$NVMF_IP_PREFIX.$NVMF_IP_LEAST_ADDR
RPC_PORT=5260
subsystemname=nqn.2016-06.io.spdk:testnqn
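# Load the core kernel RDMA/InfiniBand modules (connection managers, verbs, user-space access).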
modprobe ib_cm
modprobe ib_core
modprobe ib_ucm
modprobe ib_umad
modprobe ib_uverbs
modprobe iw_cm
modprobe rdma_cm
modprobe rdma_ucm
if ! hash lspci; then
    exit 0
fi
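# Detect whether the installed Mellanox NIC is mlx4- or mlx5-based and load the matching drivers.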
nvmf_nic_bdfs=`lspci | grep Ethernet | grep Mellanox | awk -F ' ' '{print "0000:"$1}'`
mlx_core_driver="mlx4_core"
mlx_ib_driver="mlx4_ib"
mlx_en_driver="mlx4_en"
if [ -z "$nvmf_nic_bdfs" ]; then
    exit 0
fi
# For the NVMe-oF target loopback test, assume only one type of Mellanox card is installed.
for nvmf_nic_bdf in $nvmf_nic_bdfs; do
    result=`lspci -vvv -s $nvmf_nic_bdf | grep 'Kernel modules' | awk -F ' ' '{print $3}'`
    if [ "$result" == "mlx5_core" ]; then
        mlx_core_driver="mlx5_core"
        mlx_ib_driver="mlx5_ib"
        mlx_en_driver=""
    fi
    break
done
modprobe $mlx_core_driver
modprobe $mlx_ib_driver
if [ -n "$mlx_en_driver" ]; then
    modprobe $mlx_en_driver
fi
# The mlx4 driver takes a few extra seconds to finish initializing after modprobe returns;
# without this delay the ifconfig operations below do nothing.
sleep 5
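# Bring up every RDMA-capable network interface with an address from $NVMF_IP_PREFIX.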
let count=$NVMF_IP_LEAST_ADDR
for nic_type in `ls /sys/class/infiniband`; do
    for nic_name in `ls /sys/class/infiniband/${nic_type}/device/net`; do
        ifconfig $nic_name $NVMF_IP_PREFIX.$count netmask 255.255.254.0 up
        # dump configuration for debug log
        ifconfig $nic_name
        let count=$count+1
    done
done
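# Load the kernel NVMe-oF target (nvmet) and host (nvme-fabrics/nvme-rdma) modules,
# plus a null block device to export as the backing namespace.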
modprobe null_blk nr_devices=1
modprobe nvmet
modprobe nvmet-rdma
modprobe nvme-fabrics
modprobe nvme-rdma
sleep 5
#nvmetcli restore test_nqn.json
#ln -s /sys/kernel/config/nvmet/subsystems/nqn.2016-06.io.spdk:testnqn /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2016-06.io.spdk:testnqn
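# Configure the target through configfs: create the subsystem, expose /dev/nullb0 as
# namespace 1, and bind the subsystem to an RDMA port.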
if [ ! -d /sys/kernel/config/nvmet/subsystems/$subsystemname ]; then
    mkdir /sys/kernel/config/nvmet/subsystems/$subsystemname
fi
echo 1 > /sys/kernel/config/nvmet/subsystems/$subsystemname/attr_allow_any_host
if [ ! -d /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1 ]; then
    mkdir /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1
fi
echo -n /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/$subsystemname/namespaces/1/enable
if [ ! -d /sys/kernel/config/nvmet/ports/1 ]; then
    mkdir /sys/kernel/config/nvmet/ports/1
fi
echo -n rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo -n ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo -n $NVMF_FIRST_TARGET_IP > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo -n $NVMF_PORT > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/$subsystemname /sys/kernel/config/nvmet/ports/1/subsystems/$subsystemname
sleep 5
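The discover command that fails is the last step of the script and is not shown above; with standard nvme-cli syntax it would look roughly like this (a sketch using the variables set earlier, not necessarily the exact invocation used):
# Query the discovery service exported by the kernel nvmet target over RDMA.
nvme discover -t rdma -a $NVMF_FIRST_TARGET_IP -s $NVMF_PORT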
Has anyone else run into this issue, or does anyone have a solution for it? I am currently setting up the environment on a Vagrant VM to run with NVMe-oF and I am hitting the same issue.
Hi, I am trying to set up NVMe-oF using soft-RoCE, but I have not been able to get it working because of the following error:
rdma_rxe: qp#17 moved to error state
nvme nvme0: identify controller failed
When I debug it with KASAN, it shows this:
[ 7.345365] rdma_rxe: qp#17 moved to error state
[ 8.847464] nvme nvme0: identify controller failed
[ 8.859829]
[ 8.861048] BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x12f3/0x1880 [rdma_rxe] at addr ffff88001f787838
The fault flows into the InfiniBand driver code, that is:
(gdb) list *(rxe_post_send+0x12f3)
0x1e133 is in rxe_post_send (drivers/infiniband/sw/rxe/rxe_verbs.c:685).
680             switch (wr->opcode) {
681             case IB_WR_RDMA_WRITE_WITH_IMM:
682                     wr->ex.imm_data = ibwr->ex.imm_data;
683             case IB_WR_RDMA_READ:
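For reference, the "identify controller failed" message is emitted by the host-side NVMe driver while it is establishing the fabrics connection; a sketch of the step that typically triggers it, using standard nvme-cli syntax rather than the exact command used here:
# Hedged example: host-side connect that issues the failing Identify Controller command.
# <target-ip> is a placeholder; the NQN is the example subsystem name from the earlier script.
nvme connect -t rdma -a <target-ip> -s 4420 -n nqn.2016-06.io.spdk:testnqn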
Please help me find a solution for this.