enfiskutensykkel / ssd-gpu-dma

Build userspace NVMe drivers and storage applications with CUDA support
BSD 2-Clause "Simplified" License

Floating Point Exception #22

Open ZaidQureshi opened 5 years ago

ZaidQureshi commented 5 years ago

I am trying to run the example nvm-identify, but I get the following output:

Resetting controller and setting up admin queues...
Floating point exception

The dmesg output is this:

[May24 16:40] traps: nvm-identify[3179] trap divide error ip:7f6d2f98a434 sp:7ffd9a74e3b0 error:0 in libnvm.so[7f6d2f985000+9000]

I am not sure what is going on. Any help would be appreciated.

enfiskutensykkel commented 5 years ago

Hi Zaid,

This probably means that there is a division by zero somewhere, but I'm not really sure where that would be. Would it be possible for you to add some print statements in the examples/identify/module.c and examples/identify/common.c files to see how far it gets before the bug occurs, and ideally which library call fails? I'm wondering whether the error occurs in the reset_ctrl function (in common.c) or in the identify_ctrl function (also in common.c).

Thanks

ZaidQureshi commented 5 years ago

I did some debugging and it seems the problem is on line 51 in include/nvm_queue.h, which has the following check:

if ((uint16_t) ((sq->tail - sq->head) % sq->max_entries) == sq->max_entries - 1)

So for some reason the tail, head, and max_entries of the admin queue are all 0.
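
(For reference, a minimal standalone snippet reproducing the symptom: if max_entries is 0, the modulo in that check divides by zero, which on x86 is reported as a divide-error trap and delivered as SIGFPE, i.e. "Floating point exception".)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* With max_entries == 0, the modulo below divides by zero; on x86 the
       CPU raises a divide-error trap, delivered to the process as SIGFPE
       ("Floating point exception"), matching the dmesg line above. */
    uint16_t tail = 0, head = 0, max_entries = 0;

    if ((uint16_t) ((tail - head) % max_entries) == (uint16_t) (max_entries - 1))
        printf("queue full\n");

    return 0;
}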

enfiskutensykkel commented 5 years ago

Thank you for your effort.

I suspect that this may be because it is not able to read controller registers properly. I assume you are using the kernel module version (and not the SmartIO version)?

In examples/identify/module.c, could you print out the members of ctrl directly after the call to nvm_ctrl_init, particularly the values of ctrl->page_size and ctrl->max_entries?
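
(For example, something along these lines, placed right after the nvm_ctrl_init call; the field names are the ones from the nvm_ctrl_t handle as they appear in the output below, and the casts are only there to keep the format strings simple.)

#include <stdio.h>
#include <nvm_types.h>  /* nvm_ctrl_t, from the libnvm headers */

/* Dump the controller handle; intended to be called right after nvm_ctrl_init(). */
static void dump_ctrl(const nvm_ctrl_t* ctrl)
{
    fprintf(stderr, "ctrl->dstrd: %u\n", (unsigned) ctrl->dstrd);
    fprintf(stderr, "ctrl->timeout: %llu\n", (unsigned long long) ctrl->timeout);
    fprintf(stderr, "ctrl->max_entries: %zu\n", (size_t) ctrl->max_entries);
    fprintf(stderr, "ctrl->page_size: %zu\n", (size_t) ctrl->page_size);
    fprintf(stderr, "ctrl->mm_size: %zu\n", (size_t) ctrl->mm_size);
    fprintf(stderr, "ctrl->mm_ptr: %p\n", (void*) ctrl->mm_ptr);
}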

ZaidQureshi commented 5 years ago

These are the values for each of the members of ctrl after the nvm_ctrl_init call:

ctrl->dstrd: 0
ctrl->timeout: 30000
ctrl->max_entries: 0
ctrl->mm_size: 8192
ctrl->mm_ptr: 0x7f511d31f000

enfiskutensykkel commented 5 years ago

Thank you again.

This is a bit strange: it appears that it managed to read the time-out value from the registers, but the maximum number of entries has become 0. I may be doing something wrong in src/ctrl.c, or maybe I've interpreted the spec wrong. There are some fields that are defined as 0 meaning 1, so I need to check whether that is the case here.

Out of curiosity, what kind of NVMe controller/disk is this?

ZaidQureshi commented 5 years ago

This is a Corsair NX500 SSD.

ZaidQureshi commented 5 years ago

So the value returned from the CAP.MQES register in initialize_handle in src/ctrl.c is 65535 for me. When we add 1, it overflows and wraps to 0. Now the question is why it returns 65535. When I remove the plus 1, I get valid values for the admin queue's max entries and the controller information is returned correctly.
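
(To illustrate the wrap-around: if the 0-based MQES value 0xFFFF is stored in a 16-bit variable and incremented, the result is 0. A minimal sketch, assuming max_entries is held in a 16-bit variable:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t mqes = 0xFFFF;           /* CAP.MQES as read from this controller */
    uint16_t max_entries = mqes + 1;  /* 65536 does not fit in 16 bits: wraps to 0 */

    printf("max_entries = %u\n", max_entries);  /* prints 0 */
    return 0;
}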

enfiskutensykkel commented 5 years ago

The reason I add 1 is that MQES is, according to the spec:

This is a 0's based value. The minimum value is 1h, indicating two entries.

Meaning that a value of 0 is illegal, a value of 1 means it should be 2, and so on.

If it returns 65535, I suspect a different type of issue, namely that all MMIO reads are returning 0xFFs (this is a common symptom when reading does not work). If you attempt to read any other values directly from the memory-mapped pointer, do they all return 0xFFs? Could you, for example, try this:

printf("%lx\n", *((uint64_t*) mm_ptr));

ZaidQureshi commented 5 years ago

This is the value printed by printf("%lx\n", *((uint64_t*) ctrl->mm_ptr)); after calling nvm_ctrl_init: 203c03ffff
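
(As a sanity check, that value decodes consistently with the earlier output when split up using the CAP field offsets from the NVMe spec, namely MQES in bits 15:0, TO in bits 31:24 in units of 500 ms, and DSTRD in bits 35:32, so the MMIO reads are clearly not returning all 0xFFs. A standalone sketch:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t cap = 0x203c03ffffULL;  /* the CAP value read above */

    unsigned mqes  = (unsigned) (cap & 0xffff);        /* 65535 */
    unsigned to    = (unsigned) ((cap >> 24) & 0xff);  /* 60 -> 60 * 500 ms = 30000 ms */
    unsigned dstrd = (unsigned) ((cap >> 32) & 0xf);   /* 0 */

    printf("MQES  = %u (maximum queue size %u)\n", mqes, mqes + 1);
    printf("TO    = %u (%u ms)\n", to, to * 500);
    printf("DSTRD = %u\n", dstrd);
    return 0;
}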

enfiskutensykkel commented 5 years ago

That's a good thing!

The controller probably supports 65536 entries, and the bug is that I'm using a 16-bit variable to read out the value and add one. I'm trying to find a datasheet or documentation to verify that it supports that many queue entries, but I suspect that this is the case.

I'll add a quick fix for it and add your name in the commit message (if you want), or you can send me a patch or a pull request instead. In the meantime, you can probably comment out the +1.

enfiskutensykkel commented 5 years ago

Thank you for your time and effort, by the way.

ZaidQureshi commented 5 years ago

Wait, but the specification states that this register is only 16 bits.

enfiskutensykkel commented 5 years ago

Yes, the register is only 16 bits, but I believe the value is allowed to be 65536. I'll look around and check the spec errata to see if it is mentioned, and I'll also have a look at what the Linux kernel NVMe driver does. Does the identify example program work for you if you don't add one?

ZaidQureshi commented 5 years ago

Yes, that example works, but I think what people typically do is always use the MQES+1 value, so I think you will need to change the type of the max_entries field in all of your code. I believe this is leading to another issue when memory is unmapped at the end of the latency benchmark, as I get the following error:

[unmap_memory] Page unmapping kernel request failed: Invalid argument

I am trying to debug it and see what is happening.

enfiskutensykkel commented 5 years ago

The spec says in the introduction chapter:

Support for 65,535 I/O queues, with each I/O queue supporting up to 64K outstanding commands.

Since that is 2^16 - 1 I/O queues, does 64K in this case perhaps mean 2^16?

Regardless, the unmapping is a known bug (see #18); I haven't had time to rewrite the kernel module yet. The problem is how I'm currently storing address mappings: I'm storing everything in one global list, when I should instead create a list per open descriptor. It happens with the nvm-latency-benchmark because I check against the PID, and for some reason CUDA programs fork into child processes, so the "wrong" PID is attempting to clean up the stored mapping.

It will clean up automatically once you unload the module, so do that every now and then if you run the nvm-latency-bench program multiple times. The warnings and errors look bad, but they are expected.
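
(A rough sketch of the per-descriptor idea mentioned above, using file->private_data so that cleanup follows the descriptor rather than the PID; the names and structure here are illustrative, not the module's actual code.)

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

/* One mapping list per open descriptor instead of a single global list
   keyed by PID, so forked CUDA child processes cannot confuse cleanup. */
struct map_ctx {
    struct list_head maps;  /* mappings created through this descriptor */
    struct mutex lock;
};

static int ctrl_open(struct inode *inode, struct file *file)
{
    struct map_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    if (!ctx)
        return -ENOMEM;

    INIT_LIST_HEAD(&ctx->maps);
    mutex_init(&ctx->lock);
    file->private_data = ctx;  /* no PID comparison needed on release */
    return 0;
}

static int ctrl_release(struct inode *inode, struct file *file)
{
    struct map_ctx *ctx = file->private_data;

    /* walk ctx->maps here, unmapping and freeing every entry */

    kfree(ctx);
    return 0;
}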

enfiskutensykkel commented 5 years ago

The permanent fix would be to change the type of max_entries to uint32_t as you suggested, or alternatively to always use + 1 when I reference it. I'll have to look into the impact of both solutions.
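
(A sketch of the first option, with hypothetical names rather than the library's actual ones: read MQES into a wider type before adding one, so 0xFFFF yields 65536 instead of wrapping to 0.)

#include <stdint.h>

/* Widen before adding one: MQES == 0xFFFF then yields 65536 instead of 0. */
static uint32_t cap_max_entries(uint64_t cap)
{
    uint32_t mqes = (uint32_t) (cap & 0xffff);  /* CAP.MQES, a 0's based value */
    return mqes + 1;                            /* at most 65536, fits in 32 bits */
}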