POETSII / tinsel

Manythread RISC-V overlay for FPGA clusters

Where to put SRAM in the memory map? #42

Closed: mn416 closed this issue 6 years ago

mn416 commented 6 years ago

@m8pple I'd love to have your opinion on this:

The DE5 has 32MB of 256-bit wide SRAM that can sustain full throughput independent of access pattern. The reason why I want to use this is that the max DRAM throughput suffers a bit with all 1024 threads accessing different partitions at the same time. I'm planning to look at doubling the cache line size to help with that, and actually 50% throughput from each DDR3 DIMM is enough to satisfy 64 cores, but still I'd like to give access to the SRAM too.

I've hooked the SRAM up to Tinsel in the sram branch but the question is, where should I put it in the memory map? (Maybe in future we might use it as an L2 cache, but for now I just want to do something simple.)

The way I'm currently doing it is to map the first 64KB of each thread's partition to both SRAM and DRAM, interleaved at the cache-line granularity. A pointer to the beginning of a thread's partition is returned by tinselHeapBase(), and my POETS frontends store most of their data there. I think this will give excellent performance because it balances the load between the SRAM and DRAM. But your particle simulator uses the stack for most of its data, so it would benefit more from the last 64KB of each thread partition being mapped to SRAM and DRAM. So how about a compromise: map the first 32KB of each partition and the last 32KB of each partition to SRAM and DRAM?
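To make that concrete, here is a rough sketch of what an application would do under this mapping. The 64KB figure and tinselHeapBase() are as described above; the exact return type of tinselHeapBase() and the DeviceState record are just assumptions for illustration.

```c
// Sketch: keep a thread's hot data in the first 64KB of its partition, which
// under the proposed mapping is interleaved across SRAM and DRAM at cache-line
// granularity. The return type of tinselHeapBase() is assumed here, and
// DeviceState is a hypothetical record.
#include <stdint.h>

#define HOT_REGION_BYTES (64 * 1024)   // first 64KB of the thread's partition

extern void* tinselHeapBase(void);     // start of the calling thread's partition

typedef struct { uint32_t state[6]; uint32_t flags; } DeviceState;

void initHotData(uint32_t numDevices) {
  DeviceState* devices = (DeviceState*) tinselHeapBase();  // lands in the hot region
  uint32_t maxHot = HOT_REGION_BYTES / sizeof(DeviceState);
  uint32_t n = numDevices < maxHot ? numDevices : maxHot;
  // Anything beyond the first 64KB falls back to plain DRAM-backed space.
  for (uint32_t i = 0; i < n; i++) devices[i].flags = 0;
}
```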

Any other thoughts much appreciated.

Matt

m8pple commented 6 years ago

Allocating the particle-sim state on the stack was more of a stop-gap, as it was the easiest place to put a lot of data, and most of the DRAM space is allocated to the stack. Eventually it would migrate to a dynamically allocated BLOB which goes somewhere in the heap space, so I don't think that current application behaviour should drive this hardware decision. It's easy enough for the application to change where the stacks are in the first few instructions of user code, so I was thinking about changing the stack sizes if more space was needed to fit the application data.

My initial inclination would be to simply take the SRAM and map it into the address space as a single 32MB section. It is then up to the application developer to work out how to use it. For example, the orchestrator might choose to put the device states and properties there, along with other hot data - we have to assume that the orchestrator has good visibility over both the application's data and the hardware, so will be able to make good per-application decisions about how to allocate the different types of memory.

You could even map the SRAM in twice: you get 32MB of pure SRAM, then have the SRAM mapped again as interleaved 64MB of DRAM+SRAM. At run-time the orchestrator gets to choose how it wants to use it, and has sufficient knowledge to realise that they are aliased. It also gives more freedom for A-B testing, where we need to run the application twice to work out which method is better.

One would hope that the tops of the stacks will keep themselves hot, so they should not cause that much memory pressure if they only contain normal local variables. Given we're on a 4-way (I think?) set-associative cache, you'd hope that data wouldn't evict the stack that often.

So to me it would make sense to expose the different types of memory explicitly, and ask the infrastructure to use it intelligently. However, this is rather assuming that the software infrastructure is more advanced than it currently is, so...

If you want to do a per-thread split following tinselHeapBase, then I would suggest not trying to cache the top of the stack. I think we should be aiming for explicit performance decisions, so just put it in one place and rely on the applications to actually exploit it. My preference would be for a single 32MB or 64MB section, but if it is per thread, then keep it as one per-thread block.

This is more just based on feeling though - I don't have any hard data.

mn416 commented 6 years ago

Thanks David, very helpful.

I think you're right that it's best to let the application decide how to use the different kinds of memory, and putting all the SRAM in a single section is simplest.

I want to encourage applications to use both SRAM and DRAM at the same time, so I like the sound of a single 64MB section whose even lines are DRAM and whose odd lines are SRAM. The application can then just consider this 64MB section as "extra-efficient memory".
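For reference, the even/odd split boils down to a one-bit test on the line index. A minimal sketch, where the line size and base address are placeholder values rather than the real memory map:

```c
// Sketch of the even-DRAM / odd-SRAM rule for a 64MB interleaved section.
// LINE_BYTES and INTERLEAVED_BASE are placeholder assumptions.
#include <stdint.h>

#define LINE_BYTES        32            // assumed cache-line size
#define INTERLEAVED_BASE  0x80000000u   // hypothetical base of the 64MB section

typedef enum { BACKED_BY_DRAM, BACKED_BY_SRAM } Backing;

static Backing backingOf(uint32_t addr) {
  uint32_t line = (addr - INTERLEAVED_BASE) / LINE_BYTES;
  return (line & 1) ? BACKED_BY_SRAM : BACKED_BY_DRAM;   // even line -> DRAM, odd -> SRAM
}
```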

On the other hand I could have a single 32MB SRAM-only section and just explain in the docs that applications should try to interleave access to SRAM and DRAM as much as possible. Will ponder some more on this.

Matt

mn416 commented 6 years ago

And like you say, mapping it twice would provide both solutions... maybe that is the way to go.

m8pple commented 6 years ago

(Reading this back, this may all be obvious. However, I've never encountered this kind of interleaving, so I am trying to guess how it might interact with "typical" applications, where "typical" usually ends up meaning "bloody-minded applications that manage to hit the perfect wrong stride pattern").

In terms of the interleaved mapping, presumably this would happen at the DRAM controller level, so from the point of view of a d-cache cluster the interleaving would be invisible? So that would mean that d-cache lines are still allocated for SRAM-backed lines, just like for DRAM-backed lines?

I was trying to imagine how the whole system would behave, and en masse the threads would (often) behave somewhat like a stochastic SIMT machine. We expect that locations which are hot for one thread will be hot for other threads, and where one thread is cold, others will be cold. They are statistically interleaved, but in a lot of applications they will tend to have some kind of emergent or explicit synchronisation. cf. @STFleming's firefly stuff, which I think is a reasonable model for how even asynchronous applications are likely to end up working.

A mild worry (though not a big one) is that we end up with pathological cases where an application manages to thrash the DRAM-backed lines while somehow avoiding the SRAM-backed lines. Given that all threads look basically the same, experience suggests that all the threads would probably manage to magically hit the stride which gives an access pattern where nobody actually uses the lines mapped to SRAM...

So for the interleaved case, it might be worth thinking about some kind of randomised interleaving, so that the line -> (SRAM|DRAM) decision varies as you move through RAM. For example, if you choose a memory range [0,n) in terms of line addresses, you could create a bijection [0,n) -> [0,n) via a cheap hash function, then map [0,n/2) to SRAM and [n/2,n) to DRAM. A disadvantage is that it destroys locality in DRAM accesses, so there are more page switches in DDR. Meh, maybe apply permutations at the level of a DRAM page? No idea how big a DRAM page is - below the level I know a lot about...
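One cheap way to get such a bijection, when the number of lines n is a power of two, is an invertible mixer over the line index. The sketch below is purely illustrative: the constants are arbitrary, and in hardware this would be a small fixed permutation rather than this exact function.

```c
// Randomised line -> (SRAM|DRAM) split over n = 2^k lines (assumes k >= 2).
// Each step (xor-shift, multiply by an odd constant) is invertible mod 2^k,
// so the whole mixer is a bijection on [0, 2^k) and exactly half the lines
// end up mapped to SRAM.
#include <stdint.h>

static uint32_t permuteLine(uint32_t line, unsigned k) {
  uint32_t mask = (k == 32) ? 0xFFFFFFFFu : ((1u << k) - 1);
  line &= mask;
  line ^= line >> (k / 2);              // invertible xor-shift
  line = (line * 0x9E3779B1u) & mask;   // odd multiplier: invertible mod 2^k
  line ^= line >> (k / 2);
  return line;
}

// 1 if this line index should be backed by SRAM, 0 for DRAM.
static int lineInSram(uint32_t line, unsigned k) {
  return permuteLine(line, k) < (1u << (k - 1));
}
```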

But if the SRAM is mapped in twice, then (as a user) I would interpret it as:

Ultimately, if you understand at the odd-even line level which lines are hot/critical, then you would split them between SRAM and DRAM explicitly. But if you don't understand at the line level, you are in danger of hitting aliasing with an odd-even line split.

Sorry, this is a bit of a wander; I've never really encountered this kind of interleave before, so I am wondering how it will behave in practice.

As always, more data probably needed :)


mn416 commented 6 years ago

> In terms of the interleaved mapping, presumably this would happen at the DRAM controller level, so from the point of view of a d-cache cluster the interleaving would be invisible? So that would mean that d-cache lines are still allocated for SRAM-backed lines, just like for DRAM-backed lines?

Yeah, that's right. There are 4x 64-bit SRAMs and I'm just concatenating them to give a 256-bit data bus. So I need the cache for this to work well.

The idea of the interleaved SRAM/DRAM section is to double the bandwidth for this particular region of memory. The only way (I think) that this wouldn't work well is if the application data structures happened to be aligned so that, say, the odd lines were hardly being used. But that is something that can usually be controlled by the programmer or loader, right? Or am I missing something?
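For example, one (hypothetical) way the programmer could control it is to pad each record to an odd number of cache lines, so that a linear sweep over a line-aligned array alternates between even (DRAM-backed) and odd (SRAM-backed) lines rather than sticking to one parity. The line size here is an assumption:

```c
// Sketch: a record padded to exactly one (an odd number of) cache line(s), so
// consecutive records in a line-aligned array alternate DRAM/SRAM parity.
// LINE_BYTES is an assumed value, not Tinsel's actual line size.
#include <stdint.h>

#define LINE_BYTES 32

typedef struct {
  uint32_t payload[6];                              // 24 bytes of real data
  uint8_t  pad[LINE_BYTES - 6 * sizeof(uint32_t)];  // pad record up to one line
} Record;

_Static_assert(sizeof(Record) % LINE_BYTES == 0, "record must fill whole lines");
_Static_assert((sizeof(Record) / LINE_BYTES) % 2 == 1, "odd line count per record");
```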

I'm reflashing Aesop's FPGAs at the moment with an image that has the first 64KB of each thread's partition mapped to interleaved SRAM/DRAM. It will be nice to see the performance numbers, but even if it works well, I'm still open to having a single interleaved 64MB section per board.

mn416 commented 6 years ago

I didn't really follow your discussion about implementing both a 32MB and a 64MB section. Would you imagine that both sections would be accessed at the same time? I had thought that the implementation would just decide to use one or the other.

m8pple commented 6 years ago

Implementation = hardware configuration, or implementation = software running?

Assuming the latter: whether it uses one or both at the same time would come down to: 1) whether the hardware allows both sections to be used in that way; and 2) whether the software is able to work out which data should go in the 32MB section versus the 64MB section.

Whether the hardware allows both to be used at the same time would mainly come down to whether it was possible to determine when SRAM lines in the two sections alias over each other. So in the odd-even case that would be quite simple, and even in a randomised mapping it could be managed (e.g. randomise the line interleave within 1MB sections or something). So that seems at least feasible.
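As a concrete illustration of the odd-even case: odd line 2i+1 of the interleaved section is the same physical SRAM line as line i of the direct section, so finding the alias is just a shift and an add. Base addresses and line size below are made-up placeholders, not the actual memory map:

```c
// Sketch: map an SRAM-backed (odd-line) address in the interleaved section to
// its alias in the direct SRAM section. All constants are hypothetical.
#include <stdint.h>

#define LINE_BYTES        32
#define SRAM_DIRECT_BASE  0x40000000u   // hypothetical 32MB direct SRAM section
#define INTERLEAVED_BASE  0x80000000u   // hypothetical 64MB interleaved section

// Returns the direct-SRAM alias of an interleaved address, or 0 if the address
// falls on an even (DRAM-backed) line and so has no SRAM alias.
static uint32_t sramAlias(uint32_t interleavedAddr) {
  uint32_t line = (interleavedAddr - INTERLEAVED_BASE) / LINE_BYTES;
  if ((line & 1) == 0) return 0;                 // even line -> DRAM, no alias
  uint32_t sramLine = line >> 1;                 // odd line 2i+1 -> SRAM line i
  uint32_t offset   = interleavedAddr % LINE_BYTES;
  return SRAM_DIRECT_BASE + sramLine * LINE_BYTES + offset;
}
```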

Whether the software can exploit it is a different matter. In general that kind of subtle allocation of data to slightly different memory sections would be very difficult, but we are supposed to be vertically integrated and cross-layer optimising here. So if we were to assume that we get hold of a Sufficiently Advanced Orchestrator (TM), I'd hope that it could exploit those subtle differences in size and performance. Maybe?

We could invoke the magic of Machine Learning for data placement, and wander around in black capes looking shifty. Probably reinforcement learning...

So I guess ultimately the answer to "Would you imagine that both sections would be accessed at the same time?" is: "yes". For a sufficiently advanced orchestrator and/or a sufficiently advanced programmer.

Practical example. We have looked at an application, and know that it needs:

To me simultaneous dual access to interleaved and direct SRAM seems like an opportunity with only two costs:

STFleming commented 6 years ago

In the past, people have told me that I look shifty and I've always fancied trying out a black cape.

mn416 commented 6 years ago

Thanks for all comments & suggestions.

The version that uses SRAMs works and is in a branch, but there are no plans to merge it at present -- we found ways to improve DRAM performance instead.

Still, the partition interleaving can be useful for DRAM alone. The implementation I settled on is now described in Appendix C. As explained there, either the direct OR the interleaved mapping should be used, but not both. This is because the address translation is done after the cache, not before, so mixing the two mappings would let the same underlying line be cached at two different addresses.