cambridgehackers / connectal

Connectal is a framework for software-driven hardware development.

FPGA indication dropped when lots of request/indication traffic? #133

Closed: acw1251 closed this issue 7 years ago

acw1251 commented 7 years ago

I'm trying to run one of our Riscy processors on the kc705g2 FPGA target, and I'm running into some weird issues that seem to go away when I do things to the executable that make it run more slowly.

The main issue shows up in a function that loads a program onto the FPGA's BRAM. This function makes many calls to a "memWrite" function. This memWrite function calls a request interface method to send the write request to the FPGA, and then it waits on a semaphore until the FPGA sends back the appropriate indication.
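
To picture the pattern, here is a minimal self-contained sketch of that send-then-wait protocol. The names (sendMemWriteRequest, memWrite) are illustrative stand-ins, not the Connectal-generated proxy API, and the FPGA's indication is simulated with a thread:

```cpp
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>
#include <thread>

static sem_t responseSem;

// Stand-in for the Connectal-generated request proxy method; in the real
// program this call crosses PCIe to the FPGA's request portal.
static void sendMemWriteRequest(uint32_t addr, uint32_t data) {
    printf("request: addr=0x%x data=0x%x\n", addr, data);
    // Simulate the FPGA: deliver the memResponse indication from another
    // thread, the way the indication-handling thread would.
    std::thread([] { sem_post(&responseSem); }).detach();
}

// One outstanding request at a time: send, then block until the
// corresponding indication posts the semaphore.
static void memWrite(uint32_t addr, uint32_t data) {
    sendMemWriteRequest(addr, data);
    sem_wait(&responseSem);
}

int main() {
    sem_init(&responseSem, 0, 0);
    memWrite(0x0, 0x297);
    memWrite(0x4, 0xd028293);
    printf("both writes acknowledged\n");
    return 0;
}
```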

Without any debugging, our ubuntu.exe program successfully loads the FPGA bitfile and hangs during the function that loads the program onto BRAM on the FPGA.

If I attach gdb to this process while it is stalled, I see that the main thread is waiting for the semaphore in the memWrite function, and the other thread is in the Poll() function.

If I run ubuntu.exe under gdb from the start, ubuntu.exe runs as intended.

If I run ubuntu.exe under strace, it also runs as intended.

If I add C++ print statements to the functions that send the request and receive the indication, ubuntu.exe sometimes stalls while waiting for an indication, but sometimes it runs to completion. When it stalls, sometimes it's early in the program loading, and sometimes it's late.

I'm not sure what is going on, but it appears that there is some bug that doesn't appear when ubuntu.exe is running slowly. I'm going to try to produce a simple example of this bug and I'll try a bit of debugging on my own.

I'm using version 17.01.2 of connectal + one extra commit that I'll send in a pull request in a couple minutes. On the FPGA machine, I'm using the most recent version of the connectal drivers from the PPA.

debugging.txt has example output from ubuntu.exe when it stalls and the corresponding pcieflat output.

acw1251 commented 7 years ago

I created a small design to try to reproduce this bug in a simple environment, and I was unsuccessful.

The first attempt was a module with one request portal and one indication portal. Each had one method that took an 8-bit value. Each time the software called the request method with data x, the hardware would call the indication method with the same data x. The software would send a request and then wait for the indication using a semaphore. This ran on the kc705g2 platform without any issues.

I then made the request and indication methods take four 64-bit values, to see if the issue came from having lots of argument data. This also failed to reproduce the bug on the kc705g2 platform.

There are only three things I can think of that may be required to reproduce this bug (things our processor has that this simple example doesn't):

1) Adding shared memory - the processor example has 2 read and 1 write shared memory portals.
2) Adding more portals - the processor example has many more request and indication portals.
3) Adding hardware bugs from the processor - while there is a very strong correlation between software setup and the observation of bugs, there still could be some issue with the hardware design.

jameyhicks commented 7 years ago

I've seen similar behavior with missing indications under heavy load but it is sporadic enough that I have not tracked down the problem.

Both gdb and strace use ptrace to control the process. Somehow that seems to affect the timing (or something) enough to avoid the problem.

Just one thread is running? And there should be only one request at a time given the wait on the semaphore?

jameyhicks commented 7 years ago

Could you add request and response counters to the hardware and a method to read them? I would read them out periodically and on timeout to see if the counters on the hardware side deviate from the software side.
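
On the software side, one way to get the "on timeout" readout is to replace the blocking sem_wait with sem_timedwait and query the debug counters whenever the wait times out. A sketch under those assumptions (readHwCounters is a hypothetical debug-portal call, not an existing Connectal API):

```cpp
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

static sem_t responseSem;

// Hypothetical stand-in for a debug-portal request that asks the hardware
// for its request/response counters (the reply would arrive as an indication).
static void readHwCounters(void) {
    fprintf(stderr, "debug portal: requesting hw counters\n");
}

// Wait up to timeoutSec for the indication; on timeout, dump the software
// side counts and ask the hardware for its counters so the two can be compared.
static bool waitForResponse(unsigned timeoutSec,
                            unsigned long swReqs, unsigned long swResps) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeoutSec;
    while (sem_timedwait(&responseSem, &deadline) != 0) {
        if (errno == EINTR) continue;   // interrupted by a signal: retry
        fprintf(stderr, "timeout: sw requests=%lu, sw responses=%lu\n",
                swReqs, swResps);
        readHwCounters();
        return false;
    }
    return true;
}

int main() {
    sem_init(&responseSem, 0, 0);
    sem_post(&responseSem);                       // simulate one indication
    printf("indication arrived: %d\n", waitForResponse(1, 1, 1));
    printf("indication arrived: %d\n", waitForResponse(1, 2, 1));  // times out
    return 0;
}
```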

I suspect the indication is being lost, but maybe it's the request.

You could also try setting trace_hardware=1 in cpp/transportHardware.c.

acw1251 commented 7 years ago

gdb showed there were two threads running: one was the main thread, and one was the thread that handled indications. I don't know if those two threads were actually running at the same time when the program was run with gdb or strace.

The software ensures only one request is given at a time since it waits on a semaphore for the response to come back before the next request.

I'll add the counters to the request and response method calls. I'll let you know if I make any progress.

jankcorn commented 7 years ago

When trying to track this down, could you try changing cpp/poller.cpp at line 50? Currently it reads timeout = -1; please try timeout = 100; instead.

If this change works, it means that the h/w is fine and there is a problem on the s/w side, either with receiving interrupts or with a race condition when processing events from an interrupt.
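
For context, that timeout value is presumably the one handed to poll(2) in the poller's event loop, which is why the two settings behave so differently: with -1 the thread sleeps until an interrupt arrives, so a lost interrupt parks it forever, while 100 forces a wakeup every 100 ms that re-checks the portals regardless. A simplified model (checkPortals and the loop shape are illustrative, not the actual poller.cpp code):

```cpp
#include <poll.h>
#include <stdio.h>

// Illustrative stand-in for the code that drains pending indications.
static void checkPortals(void) { printf("checking portals\n"); }

// Simplified model of the event loop around line 50 of cpp/poller.cpp.
static void pollerLoop(int portalFd, int timeout, int iterations) {
    struct pollfd fds = { portalFd, POLLIN, 0 };
    while (iterations-- > 0) {
        int rc = poll(&fds, 1, timeout);
        // timeout == -1: poll() returns only when an event arrives, so a
        //   lost or unsignaled interrupt parks this thread here forever.
        // timeout == 100: poll() also returns every 100 ms with rc == 0,
        //   so checkPortals() still runs even if an interrupt was missed.
        if (rc < 0) { perror("poll"); return; }
        checkPortals();
    }
}

int main() {
    pollerLoop(/*portalFd=*/0, /*timeout=*/100, /*iterations=*/5);
    return 0;
}
```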


acw1251 commented 7 years ago

I made the change to poller.cpp and rebuilt the exe, and now it looks like my requests to the FPGA are being corrupted and are triggering other indications. These indications would be expected behavior if the requests are corrupted such that the addresses are greater than 0x6000_0000. Here is some output from my ubuntu.exe that corresponds to requests and indications:

1:  [INFO] Platform::memWrite( addr = 0x0, data = 0x297 )
2:  [INFO] Platform::memResponse( write = true, data = 0x297 )
3:  [INFO] Platform::memWrite( addr = 0x4, data = 0xd028293 )
4:  [INFO] Platform::memResponse( write = true, data = 0xd028293 )
5:  [INFO] Platform::memWrite( addr = 0x8, data = 0x30529073 )
6:  [INFO] MMIO requeset: write = 0x0, length = 0x4, addr = 0x3052907300000001, data = 0x1
7:  [ERROR] MMIO read failed, addr = 0x3052907300000001
8:  [INFO] Platform::memResponse( write = true, data = 0x1 )
9:  [INFO] Platform::memWrite( addr = 0xc, data = 0x93 )
10: [INFO] MMIO requeset: write = 0x0, length = 0x4, addr = 0x100000001, data = 0x1
11: [ERROR] MMIO read failed, addr = 0x100000001
12: [INFO] Platform::memResponse( write = true, data = 0x1 )
    ...

Lines 1 and 3 are requests to write a 32-bit word to the FPGA's BRAM. Lines 2 and 4 are the indications that say the write is done. If a request tries to write to an address outside of the FPGA's BRAM, the write is routed to whatever that address corresponds to. In my processor, I have requests to all addresses above 0x6000_0000 sent to the host through connectal so I can emulate memory-mapped IO devices. So if I send a Platform::memWrite request to address 0x6000_0000, we will see an MMIO request in the connectal software.

In line 5 we send a write to address 8, but in line 6 we get an MMIO request from the processor for address 0x3052907300000001. This looks like corrupted data. My thought is that the memWrite() request is being received by the hardware as a memWrite( addr = 0x3052907300000001, data = 0x? ) request, and the RISC-V processor is then handling that request by sending the external MMIO request corresponding to line 6. It looks like this is the case because the RISC-V processor always sets the length field to 4 for the MMIO requests, and that value never appears to be corrupted.

This debugging.txt has output from ubuntu.exe and the corresponding pcieflat output. I took a look at it to see what it is doing for line 5 above, and I think it is doing what it's supposed to, but I'm not familiar enough with pcieflat to know for sure.

acw1251 commented 7 years ago

Setting trace_hardware=1 in cpp/transportHardware.c makes the bug disappear, just like strace and gdb.

acw1251 commented 7 years ago

I added registers to keep track of the number of memory requests and responses to see which one is dropped, and I added a "connectal debug" portal to get the values of these registers. I set up a Ctrl-C signal handler that gets the number of requests and responses and prints the values to the screen, so I can choose when to stop it.
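
A sketch of one safe way to wire that up (the names and the counter plumbing are illustrative): the SIGINT handler only sets a flag, since portal calls aren't async-signal-safe, and the thread that wakes from sem_wait with EINTR prints the software counts and issues the hypothetical debug-portal request:

```cpp
#include <errno.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t dumpRequested = 0;
static sem_t responseSem;

// Async-signal-safe handler: just set a flag; portal calls happen later.
static void onSigint(int) { dumpRequested = 1; }

// Hypothetical debug-portal request for the hardware request/response counts.
static void sendReadCountersRequest(void) {
    printf("debug portal: read hw counters\n");
}

// A blocked waiter wakes with EINTR when Ctrl-C arrives and issues the dump.
static void waitForResponse(unsigned long swReqs, unsigned long swResps) {
    while (sem_wait(&responseSem) != 0 && errno == EINTR) {
        if (dumpRequested) {
            dumpRequested = 0;
            printf("sw requests=%lu, sw responses=%lu\n", swReqs, swResps);
            sendReadCountersRequest();
        }
    }
}

int main() {
    sem_init(&responseSem, 0, 0);
    struct sigaction sa = {};
    sa.sa_handler = onSigint;   // no SA_RESTART, so sem_wait returns EINTR
    sigaction(SIGINT, &sa, nullptr);
    sem_post(&responseSem);     // simulate an indication so the demo exits
    waitForResponse(1, 0);
    return 0;
}
```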

If timeout=-1, the system is so stalled that I don't get responses when I send new requests to the connectal debug portal. If timeout=100, the debug registers work as expected, but the system is getting corrupted data somewhere along the line.

It looks like the issue I'm seeing puts the system into a bad state that it never recovers from until the FPGA is reprogrammed: if timeout=-1, the system stalls; if timeout=100, the system gets corrupted data.

acw1251 commented 7 years ago

I added some additional requests to the end of the timeout=-1 case that would be easy to find in pcieflat, and I see the requests, but I don't see any responses in pcieflat.

I don't know exactly how pcieflat works. Does that mean the board is definitely receiving the requests but it is not sending responses?

acw1251 commented 7 years ago

I got a much smaller example that has this bug. It is the same as my old example, except I added multiple portals that all do the same simple echo of 4 64-bit words.

Here is the connectal project: fastecho2.tar.gz

Would you like me to add this to the tests directory in connectal? In theory, it should work without stalling, but it doesn't, so I think it would be a good addition.

jameyhicks commented 7 years ago

Please add it to the tests directory and I'll give it a try.

acw1251 commented 7 years ago

In the fastecho test I just added, the bug doesn't show up if you remove the --mainclockperiod=32ns connectal flag here: https://github.com/cambridgehackers/connectal/blob/master/tests/fastecho/Makefile#L19

acw1251 commented 7 years ago

I just tried fastecho on kc705g2 with the 32 ns clock under strace and gdb, and they both eventually stalled. It looks like this fastecho bug is a clock domain crossing bug that may or may not be related to the bug I saw in the processor. I compiled with Xilinx Vivado 14.02, and I saw some clock-related warnings with the 32 ns clock that I didn't see with the default clock. Most of them were "Clock enable controlled CDC structure detected".

jameyhicks commented 7 years ago

When you had corrupt data in the log, it made me think of a clock domain crossing problem. You're using the 32 ns clock in the Riscy processor?


acw1251 commented 7 years ago

Yes.

jameyhicks commented 7 years ago

I committed a change to the test to use a Xilinx FIFO instead of the BSV SyncFIFO, but it still fails.

When it fails for you, does it hang the machine? I think my first test with SyncFIFO did hang the machine, but the recent tests are failing without wedging the pcie chipset.

Running some more tests ...

acw1251 commented 7 years ago

When it fails, it doesn't hang the machine. I can easily quit the program and reprogram the FPGA without rebooting.

I saw the Xilinx FIFOs included in GetPutWithClocks, and I got them working yesterday (after making changes very similar to yours), with the same results. I got a lot of Bluespec compilation errors about AXI methods that should be always ready but weren't, so I tried using the non-AXI interface for the Xilinx FIFO as well. When I messed up this implementation (I forgot to actually dequeue from the FIFO in my mkConnectionWithClocks), ubuntu.exe failed with an error saying /dev/portal_ couldn't be found. Once I fixed my bug in the implementation, it gave results similar to the other Xilinx FIFO.

I also tried adding extra indications to the fastecho test (creating the fastmultiecho test) and it looks like the issue may be tied to the request failing, not the indications. In the fastmultiecho test, each request triggers an indication from all four indication interfaces. The test waits for all four indications to come back before sending the next request. Each time the program stalled, it had received none of the four indications it expected.

jameyhicks commented 7 years ago

I added some probes in PcieToMem because there are a couple of places where it will drop an incoming PCIE transaction if a FIFO is full, but those were never triggered.

There is a case in PcieSplitter (which demuxes CPU->FPGA requests from CPU->FPGA responses) where it drops incoming packets that are not for one of the memory regions on the board's PCIE endpoint. That does get triggered even when the design is not running. The PCIE chipset does broadcasts, and I believe that is what is triggering this case. The splitter does not drop packets just because a FIFO is full, as far as I know.

It does seem like the error is in the request path somehow.

acw1251 commented 7 years ago

I pushed some files I added for debugging to my fork: https://github.com/acw1251/connectal/tree/iss-133-debugging

This includes my version of the Xilinx Sync FIFO without the Axi Stream interface and my library AddImplicitCondition.bsv that I used for adding backpressure to PcieTop.bsv.

acw1251 commented 7 years ago

I was playing with the connection between host.tpciehost.master and portalTop.slave in PcieTop.bsv, and I have tried many different configurations that all fail to reproduce the bug. The most interesting one is where the pcie and the portals are in the same clock domain (pcie clock), but the connection between host.tpciehost.master and portalTop.slave passes through a different clock domain (portal clock).

All my attempts to reproduce the bug in a simpler design can be found on my debugging branch. I think next I'm going to look at the connections between portalTop.masters and host.tpciehost.slave.

acw1251 commented 7 years ago

I think I'm onto something now. It looks like you only get errors if both the connection from host.tpciehost.master to portalTop.slave and the connection from portalTop.masters to host.tpciehost.slave have FIFOs between them. I think it may be due to requests getting reordered between the two connections. For example, if these pairs of interfaces are directly connected, then each side will agree on the ordering of their requests with respect to the other side's requests (requests will be received by the other side instantly). If there are FIFOs between the interfaces, then both sides could send a request at the same time, and when they finally receive the other's request, they each think their request should happen first.

I'm not familiar with the exact communication that goes on between these interfaces. Do you think request reordering could be the source of the problem?

This commit has the simplest change to reproduce the bug. If you compile this with the portal clock frequency the same as the pcie clock frequency, you will see the bug.

acw1251 commented 7 years ago

The last comment wasn't exactly correct. I guess I missed something. Just having a FIFO between host.tpciehost.master and portalTop.slave is enough to reproduce this bug.

I'm making progress on a workaround for this issue in my iss-133-workaround branch. Right now I'm forcing host.tpciehost.master to have at most one outstanding request at a time to fix the bug. I'm going to see if I can reduce it to restricting just reads or just writes so the workaround doesn't affect performance too much.

jameyhicks commented 7 years ago

Sounds like you're onto something!

acw1251 commented 7 years ago

I am onto something! I was able to simplify the one-request-at-a-time constraint to just one write at a time, as seen in this commit. From this, I think I found the problem:

I think mkPhysMemSlaveMux is accepting more writeDatas than it should per writeReq.

When pcie is sending write requests, it always sends write requests and write data in order. When pcie is directly connected to the PhysMemSlaveMux, pcie doesn't send the next data until PhysMemSlaveMux is ready for the next request, so when the next data is sent, the previous write request is already gone. When you connect the two with FIFOs, you can enqueue multiple write requests and multiple write data beats. When PhysMemSlaveMux sees the first writeReq, it will start dequeuing write data until the server responds that the write is done. In this case, some write data can overtake a writeReq and cause the write portion of the mux to deadlock.
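
To make that concrete, here is a small software model of the rule that avoids the deadlock (illustrative C++, not the actual BSV in mkPhysMemSlaveMux): with FIFOs in front of the mux, the request and data streams are buffered independently, so the write path has to attribute exactly one burst's worth of data beats to the request at the head of the queue instead of draining data until the downstream write completes.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>

struct WriteReq { uint64_t addr; unsigned beats; };  // beats = burst length

// Software model of the mux's write path when fed by independent FIFOs.
// The safe rule: consume exactly req.beats data beats for the request at
// the head of the request FIFO; never attribute a buffered data beat to a
// request that hasn't been dequeued yet.
static void drainWrites(std::queue<WriteReq>& reqs, std::queue<uint64_t>& data) {
    while (!reqs.empty()) {
        const WriteReq& req = reqs.front();
        if (data.size() < req.beats) return;  // burst not fully buffered yet
        for (unsigned i = 0; i < req.beats; i++) {
            printf("write addr=0x%lx data=0x%lx\n",
                   (unsigned long)(req.addr + 4 * i),
                   (unsigned long)data.front());
            data.pop();
        }
        reqs.pop();  // request retires only after all of its beats
    }
}

int main() {
    std::queue<WriteReq> reqs;
    std::queue<uint64_t> data;
    // With FIFOs in between, both requests and both data beats can be
    // buffered at once, which never happens in the direct connection.
    reqs.push({0x0, 1});  reqs.push({0x4, 1});
    data.push(0x297);     data.push(0xd028293);
    drainWrites(reqs, data);
    return 0;
}
```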

I'll work on fixing mkPhysMemSlaveMux tonight, and I'll send a PR if it works.

jameyhicks commented 7 years ago

Makes sense.

jameyhicks commented 7 years ago

Excellent work!
