Open m8pple opened 6 years ago
However, there is no such thing as a partial flit being sent (I think?), so it could also make sense to loop until any partial flit is completed.
Yes, I think this simple approach would work fine. Sounds like it should be straightforward, I'll take a look.
Thanks for the suggestion. At some point, I'll need to validate the HostLink/PCIe performance to make sure it's as expected, so this is a good thing for me to keep in mind. As I recently discovered, DRAM performance was much worse than expected (and some simple changes made a big difference) so it's really important to measure everything!
Finally supported in commit 20c7f0b.
Performance of graph download in POLite improves by 10x.
Both bulk send and receive now supported. There is scope to move to 8Gbps PCIe lanes instead of 5Gbps lanes in the bridge board. It's just clicking a checkbox in QSys. Will leave this issue open as a reminder to try this at some point...
For good performance we need to blast messages into and out of PCI fairly fast, and we probably want to send/recv multiple messages in one go, perhaps on different threads.
At the moment, on the recv side we have:
void recv(void* flit);
void recvMsg(void* msg, uint32_t numBytes);
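For example, receiving a single message today might look roughly like this (the hostLink object, the headerLength helper, and the flit/buffer sizes are illustrative assumptions, not the actual API):

uint8_t header[16];                         // first flit of the message (flit size assumed here)
hostLink.recv(header);                      // one read to get the header flit
uint32_t numBytes = headerLength(header);   // hypothetical helper: body size from the header
uint8_t body[256];
hostLink.recvMsg(body, numBytes);           // another read to get the body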
So for every message we have at least one call to get the header, then another to get the body (unless the message is flit-sized). Ultimately these boil down to read() calls on the PCI stream, so there is a fair amount of per-call overhead.

Given the receiver has to deal with some kind of parsing, it would make sense to have a "firehose" type call, whereby they can get a whole bunch of flits, then deal with them later (possibly on another thread). That way we have the best chance of saturating PCI Express bandwidth.
My suggestion is a function:
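Roughly along these lines, say (the name recvBulk, the maxFlits parameter, and the return convention below are just illustrative placeholders, not a confirmed signature):

// Hypothetical sketch: receive up to maxFlits complete flits into the
// caller's buffer, returning how many were actually received, so the caller
// can parse them later (possibly on another thread).
uint32_t recvBulk(void* flits, uint32_t maxFlits);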
On the backend this would hopefully result in just one read() getting a whole bunch of flits.

I'm a bit unsure on the interaction with sub-flit-sized reads from read(), though, as I see there is the standard looping logic within HostLink::recv. Possibly it requires a partial buffer within HostLink, so that partial flits are stored there, then completed and returned when the rest of the bytes turn up. However, there is no such thing as a partial flit being sent (I think?), so it could also make sense to loop until any partial flit is completed.

Also, this might be seen as premature optimisation - I'm only looking at 0.3 in detail at the moment, so possibly this isn't an issue in practice and we are still bottlenecked on raw PCI performance, rather than API performance.
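To make the "loop until any partial flit is completed" idea concrete, here is a very rough sketch (assuming a POSIX read() on a file descriptor fd and a bytesPerFlit constant inside HostLink - neither of which reflect the real internals):

#include <unistd.h>   // read()
#include <cstdint>    // uint8_t, uint32_t

// Hypothetical only: drain up to maxFlits flits in one big read, then keep
// reading until any trailing partial flit has been completed, so callers
// only ever see whole flits.
uint32_t recvBulk(void* flits, uint32_t maxFlits) {
  uint8_t* buf = (uint8_t*) flits;
  ssize_t got = read(fd, buf, maxFlits * bytesPerFlit);
  if (got <= 0) return 0;
  while (got % bytesPerFlit != 0) {
    ssize_t more = read(fd, buf + got, bytesPerFlit - (got % bytesPerFlit));
    if (more <= 0) break;   // real code would need proper error/EOF handling
    got += more;
  }
  return got / bytesPerFlit;
}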