AIFM-sys / AIFM

AIFM: High-Performance, Application-Integrated Far Memory
MIT License

mlx5_init: IB device not found #6

Closed: sctb512 closed this issue 3 years ago

sctb512 commented 3 years ago

Hello, I tried to run this project on my nodes and got the following error:

mlx5_init: IB device not found

I found that the error is raised in the file ./shenango/runtime/net/directpath/mlx5/mlx5_init.c, in this function:

int mlx5_common_init(struct hardware_q **rxq_out, struct direct_txq **txq_out,
                 unsigned int nr_rxq, unsigned int nr_txq, bool use_rss)

The value of dev_list[0] is NULL:

dev_list[i]: (nil)

It looks like I can't get the device list.

Question 1: Why is dev_list[0] NULL? Is there any way to solve this problem?
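
One way to narrow down Question 1, as a minimal sketch that is not part of Shenango or AIFM: on Linux, RDMA devices registered with the kernel normally appear as entries under /sys/class/infiniband (for example mlx4_0 or mlx5_0). The hypothetical helper count_rdma_devices below counts the entries whose names start with a given prefix. A missing directory or a count of zero for "mlx5" would be consistent with the mlx5 directpath code not finding a device on a ConnectX-3 node, although the exact lookup done in mlx5_common_init is not shown in this thread.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration, not Shenango code).
 * Counts the entries in sysfs_dir whose names start with prefix and prints
 * each match. Returns -1 if the directory cannot be opened.
 */
int count_rdma_devices(const char *sysfs_dir, const char *prefix)
{
    DIR *dir = opendir(sysfs_dir);
    if (dir == NULL)
        return -1;

    size_t prefix_len = strlen(prefix);
    int count = 0;
    struct dirent *ent;

    while ((ent = readdir(dir)) != NULL) {
        /* Skip ".", ".." and other dot entries. */
        if (ent->d_name[0] == '.')
            continue;
        if (strncmp(ent->d_name, prefix, prefix_len) == 0) {
            printf("found %s\n", ent->d_name);
            count++;
        }
    }

    closedir(dir);
    return count;
}
```

Calling count_rdma_devices("/sys/class/infiniband", "mlx5") and then the same with "mlx4" from a small main shows which driver family the node exposes. A return value of -1 suggests that no RDMA driver is loaded at all.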

Then I found that there is only an mlx5 directory in ./shenango/runtime/net/directpath/:

common.c common.d common.o defs.h mlx5

but my nodes use mlx4:

02:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

If I change CONFIG_DIRECTPATH=y to CONFIG_DIRECTPATH=n in shared.mk, the runtime does not work.

Question 2: Is there only an mlx5 implementation? If I want to run this project on ConnectX-3 devices, can you give me some advice? (I was not able to get a CloudLab account.)

Thanks!

BinZlP commented 3 years ago

There's a build option for mlx4 devices in shenango/shared.mk. Modify it as below and rebuild shenango:

CONFIG_MLX5=n
CONFIG_MLX4=y

I'm not sure it works, but you can try.

sctb512 commented 3 years ago

Thanks for your reply. I had already changed these settings in shenango/shared.mk, which is how I was able to build successfully. Before changing them, I got the following error:

iokernel/mlx.h:5:10: fatal error: mlx5_custom.h: No such file or directory

zainryan commented 3 years ago

Sorry for the late reply. The error is caused by intermediate mlx5 files left over from your first compilation with CONFIG_MLX5=y. You can simply clone a fresh copy of the repo, set CONFIG_MLX5=n, CONFIG_MLX4=y, and CONFIG_DIRECTPATH=n, and recompile.

zainryan commented 3 years ago

In addition, our project was mostly developed on mlx5 NICs and has only limited support for mlx4, so you may observe reduced performance with mlx4. To run the program, you also have to delete the enable_directpath 1 line from all config files in AIFM/aifm/configs/. Please let me know if you have any further questions; I'm happy to answer.
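
The config cleanup described above can be sketched in C as follows. This is an illustration under stated assumptions, not code from AIFM: it assumes each config file holds one whitespace-separated "key value" setting per line, and the names is_directpath_line and strip_directpath are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the line consists of exactly the two
 * tokens "enable_directpath" and "1" (surrounding whitespace is ignored),
 * and 0 otherwise.
 */
int is_directpath_line(const char *line)
{
    char key[64], val[64], extra[2];
    /* The third conversion only succeeds if there is a trailing token. */
    int n = sscanf(line, "%63s %63s %1s", key, val, extra);

    return n == 2 &&
           strcmp(key, "enable_directpath") == 0 &&
           strcmp(val, "1") == 0;
}

/*
 * Hypothetical helper: copies a config from in to out, dropping every
 * "enable_directpath 1" line. Returns the number of lines dropped.
 * Assumes that no config line is longer than 1023 characters.
 */
int strip_directpath(FILE *in, FILE *out)
{
    char line[1024];
    int removed = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (is_directpath_line(line)) {
            removed++;
            continue;
        }
        fputs(line, out);
    }
    return removed;
}
```

A caller would open each file under AIFM/aifm/configs/, write the filtered output to a temporary file, and rename it over the original. Editing the files by hand achieves the same result.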