Open hgreving2304 opened 5 years ago
I already implemented this functionality but haven't yet cleaned it up for a PR. I can handle this issue, should you wish (depending on urgency), but I only have some time during the weekend.
Always great to have contributions like that. We do need it sooner rather than later. Are you able to share what you have, maybe sneak-peek style, so we can evaluate whether it's what we want and go from there?
An old version is available publicly here. My private fork which I use for my PhD has bug fixes and improved performance. I'll try and get a branch in during the weekend.
Ok. We also need to be able to spill something other than XMM registers: specifically, the AVX-512 mask registers (k0, k1, ... k7). Maybe consider having an API that is able to spill an arbitrary register opnd_t? I am assigning this issue to you for now.
Hello @hgreving2304 - I created a branch with drreg capable of spilling and restoring XMM registers. Shall we take a step-by-step approach and focus on getting the branch merged and then see what your needs are specifically? If so, I'll include tests now before any further API changes. Uploading once tests are done.
Pushed the branch now.
Hi, can I look at the branch somewhere? I didn't see it. The problem with making this specific to xmm is that I was hoping we could somehow make it more general. Not sure if that's the best approach, but have you thought about that in your design?
Thank you for your reply. The branch has been pushed to this repository. https://github.com/DynamoRIO/dynamorio/tree/i3844-drreg-xmm-spill
I think achieving a general framework would essentially require an incremental approach: first support XMM, then YMM, then ZMM?
BTW: I am not saying that the current branch will fix this issue completely.
Just had a quick look. In any case, thanks for the preliminary patch! It looks like you're duplicating the API for xmms. I think this should be ok; we might not need to support all simd registers, and could limit it to a maximum of 4 simds for now. Do you have a use case that requires more? From the start, I would make the new slots zmm-wide and make sure it is easily extendable to supporting ymm and zmm registers. Also, the API needs to be straightforwardly extendable so we can add a few mask register slots as well. Would you be able to set up a PR?
> Also, the API needs to be straightforwardly extendable so we can add a few mask register slots as well.
I think I know what you mean here, but I am not really sure what you expect or how to do it at the moment. I need to think a bit more about the design. For instance, mask registers would require their own slots and cannot use existing ones, say those of gprs. Therefore, there will have to be some work despite the goal of extensibility.
I think it's ok to have a simd_slots[MAX_SIMD_SLOTS] and a mask_slots[MAX_OPMASK_SLOTS] area. Basically like mcontext, but we prob. don't need the entire register range for now.
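A block like that could look roughly like this. This is a minimal sketch only: the macro and type names are taken from this discussion and from dr_mcontext_t's simd/opmask style, not from an actual drreg implementation. The slots are made ZMM-wide (64 bytes) from the start, per the earlier comment, so XMM/YMM spills need no layout change later; AVX-512 k registers are 64-bit, so mask slots are 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SIMD_SLOTS 4   /* illustrative cap discussed in this thread */
#define MAX_OPMASK_SLOTS 2

typedef struct {
    uint8_t u8[64]; /* one ZMM-wide spill slot */
} simd_slot_t;

/* Hypothetical spill-block layout, mirroring mcontext's simd + opmask split
 * but without the full register range. */
typedef struct {
    simd_slot_t simd_slots[MAX_SIMD_SLOTS];
    uint64_t mask_slots[MAX_OPMASK_SLOTS];
} simd_spill_block_t;
```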
Just checking the slot use: looks like a 2-step, with a direct-access pointer out to the new block. I think we have to go that route for these large registers, since some platforms have limited numbers of TLS direct-access slots.
This is similar to what Dr Memory does for storing shadow values for XMM registers.
Important point, I missed that when glancing at it earlier. The simd data should be - I was expecting it is - in per_thread_t, for example. Or - maybe better? - in a new simd_spill_block_t that gets allocated on demand.
> Important point, I missed that when glancing at it earlier. The simd data should be - I was expecting it is - in per_thread_t, for example. Or - maybe better? - in a new simd_spill_block_t that gets allocated on demand.
It is already pointed at by a field in per_thread_t, so I'm not sure what you're asking for: to inline it into that struct? By being separately allocated it could be made lazy. Maybe it should be a per-thread alloc instead of on global heap.
Also, with a separate allocation we avoid alignment issues and can ensure the use of movdqa instead of movdqu when spilling.
Yes, I would either inline it into per_thread_t, or if separate, allocate it lazily. Otherwise, why add another indirection if it's always allocated anyway? Yes, alignment is better (xref #438). And it should prob. be dr_thread_alloc().
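The lazy, aligned allocation being discussed could be sketched as below. This is illustrative only: real drreg code would use dr_thread_alloc() and its own per_thread_t, whereas here plain C11 aligned_alloc() stands in, and simd_spill_block_t is a hypothetical name. The 64-byte alignment is what allows aligned spill stores (movdqa/vmovdqa) instead of movdqu.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t simd_slots[4][64]; /* zmm-wide slots */
    uint64_t mask_slots[2];
} simd_spill_block_t;

typedef struct {
    simd_spill_block_t *simd_block; /* NULL until first SIMD reservation */
    /* ... other drreg per-thread state ... */
} per_thread_t;

static simd_spill_block_t *
get_simd_block(per_thread_t *pt)
{
    if (pt->simd_block == NULL) {
        /* Allocate lazily on first use so clients that never touch SIMD
         * registers pay nothing.  C11 aligned_alloc requires the size to be
         * a multiple of the alignment, so round up to 64 bytes. */
        size_t sz = (sizeof(simd_spill_block_t) + 63) & ~(size_t)63;
        pt->simd_block = aligned_alloc(64, sz);
    }
    return pt->simd_block;
}
```

Repeated calls return the same block, so only the first SIMD reservation in a thread pays the allocation cost.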
So Derek, with respect to your comment, you're ok with it as suggested, right?
My comment would be: I would make the new area zmm-compatible from the beginning and code it in a way that it can easily be extended to [yz]mm later, and so that a mask register area can be added, too.
> So Derek, with respect to your comment, you're ok with it as suggested, right?
That comment is about not putting the SIMD spill slots directly into TLS, which is what the branch is already doing. Unfortunately I think we have to pay that indirection cost.
Yep, we're on the same page then.
@johnfxgalea , WDYT, your thumbs up suggests you're on board as well?
Okay, that is clear.
With regard to saving mask registers: do you expect an API such as drreg_reserve_mask_register? Are we going to have dedicated spill blocks per register class, or shall k masks try to use the slots of gprs to avoid indirection?
> Do you expect an API such as drreg_reserve_mask_register?
I think that would be good. Or add it later, but make it easy to do.
> Are we going to have dedicated spill blocks per register class or shall k masks try to use slots of gprs to avoid indirection?
I think it's better to make them separate and not mix with GPRs. But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
> Are we going to have dedicated spill blocks per register class or shall k masks try to use slots of gprs to avoid indirection?
> I think it's better to make them separate and not mix with GPRs.
Separating them only because they are logically distinct? That does not seem worthwhile to me: it is more efficient to use the same pool, and the code would be simpler as well, so I don't see any simplicity advantage to separating. The only reason to make them separate is if there's a concern about using too many TLS slots, right?
> But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
This should be up to the user: hardcoding a limit does not seem like a good idea.
> Are we going to have dedicated spill blocks per register class or shall k masks try to use slots of gprs to avoid indirection?
> I think it's better to make them separate and not mix with GPRs.
> Separating them only because they are logically distinct? That does not seem worthwhile to me: it is more efficient to use the same pool, and the code would be simpler as well, so I don't see any simplicity advantage to separating. The only reason to make them separate is if there's a concern about using too many TLS slots, right?
I think the main reason I said that is because they are of different sizes. I was looking at it as treating them the same way they are treated in mcontext: a simd + opmask structure.
> But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
> This should be up to the user: hardcoding a limit does not seem like a good idea.
Do you mean dynamically via drreg_options_t? The only problem: what about nested drreg_init calls then? We would have to re-allocate and copy existing data.
> But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
> This should be up to the user: hardcoding a limit does not seem like a good idea.
> Do you mean dynamically via drreg_options_t? The only problem: what about nested drreg_init calls then? We would have to re-allocate and copy existing data.
Re-allocating would be expensive (would require a synchall to redirect all threads). I would think that either they are in the TLS slots shared with GPR regs, where the user picks how many total, or if they're in a separate alloc drreg should allocate space equal to the max # of mask regs on any supported cpu since a few bytes per thread is worth never having to realloc.
Aren't k masks always 64-bit in size (16 bits normally used), while gpr slots can be 32-bit on 32-bit architectures? Using the same slots could get a bit confusing due to the size differences, no?
> Aren't k masks always 64-bit in size (16 bits normally used), while gpr slots can be 32-bit on 32-bit architectures? Using the same slots could get a bit confusing due to the size differences, no?
Yes.
With respect to this, let's assume we allocate separately: would the code become simpler if the spill area treated masks and zmms the same, wasting 64 - 8 = 56 bytes for each mask slot? I guess this is a question for @johnfxgalea?
Personally, I would keep everything separate and also avoid lazy initialisation; this keeps things simple and extendable. By default, I would let drreg only set up slots for GPRs, but also enable the passing of some flags to drreg_init indicating which auxiliary spillage is also necessary. According to these flags, additional indirect slots will be allocated. Passing something like DRREG_SPILL_XMM | DRREG_SPILL_K_MASKS will instruct drreg to allocate indirect, well-sized slots for XMM and mask registers. If the user just wants to spill XMM registers, there is no need to blindly allocate 512 bits for the ZMM limit, but simply 128.
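A sketch of that flag-based init: the flag names come from this discussion (DRREG_SPILL_XMM, DRREG_SPILL_K_MASKS), while the sizing helper is hypothetical. The point is that with flags, drreg can size the indirect spill area to exactly what the client asked for.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names as proposed in this thread; not a shipped drreg API. */
typedef enum {
    DRREG_SPILL_XMM     = 0x1,
    DRREG_SPILL_K_MASKS = 0x2,
} drreg_spill_flags_t;

/* Hypothetical helper: compute the indirect spill-area size from the flags
 * passed to drreg_init, instead of blindly reserving zmm-wide slots. */
static size_t
aux_spill_area_size(unsigned flags, size_t num_simd, size_t num_masks)
{
    size_t sz = 0;
    if (flags & DRREG_SPILL_XMM)
        sz += num_simd * 16; /* 128-bit XMM slots, not 512-bit ZMM ones */
    if (flags & DRREG_SPILL_K_MASKS)
        sz += num_masks * 8; /* AVX-512 k registers are 64-bit */
    return sz;
}
```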
I would keep everything separate (i.e. an xmm area and a mask area) but do lazy initialization :) In principle I like the drreg_init idea with flags; the only problem, see above, is that re-allocating does not seem like a good idea. Remember that drreg_init() calls can be nested. But if we're always allocating everything anyway, what do we gain by having the flags?
If the question is just whether to have the xmm area and mask area as one combined area or not, I don't have a strong preference; I could live with either.
Derek, any strong preference?
I'll wait for what @derekbruening has to say but I think I have a better understanding now.
> Derek, any strong preference?
No, just try to anticipate future needs so we don't have to change the interface. Flags may give flexibility to change the implementation later without the interface having to change.
Okay, so I think using flags does help avoid changing the interface for future extensions. Still work in progress, but let me know what you think particularly for init_vector functions.
I would like to avoid defining drreg_reserve_xmm_register and drreg_reserve_mask_register functions. Instead, we should have one single function, namely drreg_reserve_register, and drreg should internally trigger the appropriate reservation depending on the passed register, i.e., using branch statements:
```c
if (reg_is_gpr(reg)) {
    /* reserve via the existing gpr slots */
} else if (reg_is_strictly_xmm(reg)) {
    /* reserve via the indirect simd spill block */
} else {
    return DRREG_ERROR;
}
```
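A slightly fuller, self-contained sketch of that dispatch: reg_is_gpr() and reg_is_strictly_xmm() are real DynamoRIO IR predicates, but everything below is a stand-alone stub for illustration - the enum values, the opmask predicate, and the per-class comments are hypothetical, not drreg's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in register ids and predicates; real code would use DynamoRIO's
 * reg_id_t and IR predicates instead. */
typedef enum { REG_RAX, REG_XMM0, REG_K1, REG_INVALID } reg_id_t;
typedef enum { DRREG_SUCCESS, DRREG_ERROR } drreg_status_t;

static bool reg_is_gpr(reg_id_t r) { return r == REG_RAX; }
static bool reg_is_strictly_xmm(reg_id_t r) { return r == REG_XMM0; }
static bool reg_is_opmask(reg_id_t r) { return r == REG_K1; }

/* One drreg_reserve_register-style entry point dispatching per class. */
static drreg_status_t
reserve_register(reg_id_t reg)
{
    if (reg_is_gpr(reg))
        return DRREG_SUCCESS; /* existing GPR path: direct TLS slots */
    else if (reg_is_strictly_xmm(reg))
        return DRREG_SUCCESS; /* SIMD path: indirect zmm-wide spill block */
    else if (reg_is_opmask(reg))
        return DRREG_SUCCESS; /* mask path: 64-bit slots */
    else
        return DRREG_ERROR;   /* unsupported register class */
}
```

The client-facing benefit is a single reserve/unreserve pair regardless of register class, with the class-specific spill mechanics hidden inside drreg.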
PS: Sorry for the delayed progress, I am currently overrun with work.
I also updated drreg_is_register_dead.
So your plan is that the user will maintain different vectors per spill class?
> we should have one single function, namely drreg_reserve_register
I think this is good.
> So your plan is that the user will maintain different vectors per spill class?
> we should have one single function, namely drreg_reserve_register
I think so, because different register classes have different numbers of registers. We could also keep things even simpler and always initialise the vector assuming the largest register class. -- Thinking about it now, I think the latter is better.
I think different vectors is better, as it is more clear.
Just in case I wasn't clear, I mean having different vectors, but the user won't need to pass an enum register class value to the init_vector function.
No strong preference. If you do decide to make only one _ex function, just include in the description that the function initializes max(drreg_class0, drreg_class1, ...)
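The single _ex-style init could look like this sketch: one vector capacity sized for the largest register class, as suggested above. The class counts and the helper name are illustrative only, not drreg's actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-class register counts (x86-64 with AVX-512). */
#define NUM_GPRS 16
#define NUM_SIMDS 32
#define NUM_OPMASKS 8

/* Hypothetical helper for a single init_vector_ex: the vector is sized as
 * max(drreg_class0, drreg_class1, ...), so one vector works for any class
 * and the user never passes a register-class enum to it. */
static size_t
init_vector_capacity(void)
{
    size_t cap = NUM_GPRS;
    if (NUM_SIMDS > cap)
        cap = NUM_SIMDS;
    if (NUM_OPMASKS > cap)
        cap = NUM_OPMASKS;
    return cap;
}
```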
Sorry for the delay. I finally have some free time so I can address some PRs which are backlogged.
Hi @johnfxgalea , thanks again for helping DynamoRIO. Do you have a rough ETA for this?
I will soon be able to send a PR for the XMM spill, hopefully by today/tomorrow - there are some things I need to discuss with you regarding the state restoration of other types of registers, but that's not a big concern for the first PR.
xref #3823 , these patches may overlap, beware.
Okay no problem, thanks for letting me know!
Okay I made some progress and I think the interface is finalised. Pushing in 15 mins. Need to sort some things out internally and include tests for SIMD spillage.
The biggest change is the inclusion of the DRREG_SPILL_CLASS.
All current tests related to drreg are passing on my machine, which is a good sign given some refactoring. As stated above, I have some finalising left to do, mainly relating to some more fixes and testing.
Ok. Your contributions are very appreciated.
For a contribution like this, would you expect to include all new API functions in api/docs/release.dox? or just one line stating the new feature?
Any API change needs to be mentioned in the release docs.
Specifically, #2985 is currently hitting this issue. That case needs some capability to spill - in this case - an xmm and a mask register value across app instructions (hence mcontext cannot be used). This issue covers providing a method to spill into the TLS data that drreg already has, managed by drmgr.
Interface TBD:
Specifically allow for reserving xmm and mask registers? (Also needs an unreserve method.)
- drreg_reserve_xmm_register()
- drreg_reserve_mask_register()

Or instruct drreg to spill an arbitrary number of bytes from an opnd_t (or opnd_size_in_bytes() of the opnd_t)? (Also needs a restore method.)
- drreg_reserve_spill_bytes()?
Other ideas?
Case #2985 needs this cross-app, hence no automatic management is needed there. Should a new API be written to provide automatic spill/restore management when called from the insertion phase?