Open hgreving2304 opened 5 years ago
I already implemented this functionality but haven't yet cleaned it up for a PR. I can handle this issue, should you wish (depending on urgency), but I only have some time during the weekend.
Always great to have contributions like that. We do need it sooner rather than later. Are you able to share what you have, maybe sneak-peek style, so we can evaluate whether it's what we want and go from there?
An old version is available publicly here. My private fork which I use for my PhD has bug fixes and improved performance. I'll try and get a branch in during the weekend.
Ok. We also need to be able to spill something other than XMM registers: specifically, the AVX-512 mask registers (k0, k1, ... k7). Maybe consider having an API that is able to spill an arbitrary register opnd_t? I am assigning this issue to you for now.
Hello @hgreving2304 - I created a branch with drreg capable of spilling and restoring XMM registers. Shall we take a step-by-step approach and focus on getting the branch merged and then see what your needs are specifically? If so, I'll include tests now before any further API changes. Uploading once tests are done.
Pushed the branch now.
Hi, can I look at the branch somewhere? I didn't see it. The problem with making this specific to xmm is that I was hoping we could somehow make it more general. Not sure if that's the best approach, but have you thought about that in your design?
Thank you for your reply. The branch has been pushed to this repository. https://github.com/DynamoRIO/dynamorio/tree/i3844-drreg-xmm-spill
I think achieving a general framework would essentially require an incremental approach: first support XMM, then YMM, then ZMM?
BTW: I am not saying that the current branch will fix this issue completely.
Just had a quick look. In any case, thanks for the preliminary patch! It looks like you're duplicating the API for xmms. I think this should be ok; we might not need to support all simd registers, and could limit it to a maximum of 4 simds for now. Do you have a use case that requires more? From the start, I would make the new slots zmm-wide and make sure it is easily extendable to supporting ymm and zmm registers. Also, the API needs to be straightforwardly extendable so we can add a few mask register slots as well. Would you be able to set up a PR?
> Also, the API needs to be straightforwardly extendable so we can add a few mask register slots as well.
I think I know what you mean here, but I am not really sure what you expect or how to do it at the moment. I need to think a bit more about the design. For instance, mask registers would require their own slots and cannot use existing ones, say those of gprs. Therefore, there will have to be some work despite the goal of extensibility.
I think it's ok to have a simd_slots[MAX_SIMD_SLOTS] and a mask_slots[MAX_OPMASK_SLOTS] area. Basically like mcontext, but we prob. don't need the entire register range for now.
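A block like that could look roughly like this. This is a minimal sketch only: the macro and type names are taken from this discussion and from dr_mcontext_t's simd/opmask style, not from an actual drreg implementation. The slots are made ZMM-wide (64 bytes) from the start, per the earlier comment, so XMM/YMM spills need no layout change later; AVX-512 k registers are 64-bit, so mask slots are 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SIMD_SLOTS 4   /* illustrative cap discussed in this thread */
#define MAX_OPMASK_SLOTS 2

typedef struct {
    uint8_t u8[64]; /* one ZMM-wide spill slot */
} simd_slot_t;

/* Hypothetical spill-block layout, mirroring mcontext's simd + opmask split
 * but without the full register range. */
typedef struct {
    simd_slot_t simd_slots[MAX_SIMD_SLOTS];
    uint64_t mask_slots[MAX_OPMASK_SLOTS];
} simd_spill_block_t;
```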
Just checking the slot use: looks like a 2-step, with a direct-access pointer out to the new block. I think we have to go that route for these large registers, since some platforms have limited numbers of TLS direct-access slots.
This is similar to what Dr Memory does for storing shadow values for XMM registers.
Important point, I missed that when glancing at it earlier. The simd data should be - I was expecting it is - in per_thread_t, for example. Or - maybe better? - in a new simd_spill_block_t that gets allocated on demand.
> Important point, I missed that when glancing at it earlier. The simd data should be - I was expecting it is - in per_thread_t, for example. Or - maybe better? - in a new simd_spill_block_t that gets allocated on demand.
It is already pointed at by a field in per_thread_t, so I'm not sure what you're asking for: to inline it into that struct? By being separately allocated it could be made lazy. Maybe it should be a per-thread alloc instead of on global heap.
Also, with a separate allocation we avoid alignment issues and can ensure the use of movdqa instead of movdqu when spilling.
Yes, I would either inline it into per_thread_t, or if separate, allocate it lazily. Otherwise, why add another indirection if it's always allocated anyway? Yes, alignment is better (xref #438). And it should prob. be dr_thread_alloc().
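The lazy, aligned allocation being discussed could be sketched as below. This is illustrative only: real drreg code would use dr_thread_alloc() and its own per_thread_t, whereas here plain C11 aligned_alloc() stands in, and simd_spill_block_t is a hypothetical name. The 64-byte alignment is what allows aligned spill stores (movdqa/vmovdqa) instead of movdqu.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t simd_slots[4][64]; /* zmm-wide slots */
    uint64_t mask_slots[2];
} simd_spill_block_t;

typedef struct {
    simd_spill_block_t *simd_block; /* NULL until first SIMD reservation */
    /* ... other drreg per-thread state ... */
} per_thread_t;

static simd_spill_block_t *
get_simd_block(per_thread_t *pt)
{
    if (pt->simd_block == NULL) {
        /* Allocate lazily on first use so clients that never touch SIMD
         * registers pay nothing.  C11 aligned_alloc requires the size to be
         * a multiple of the alignment, so round up to 64 bytes. */
        size_t sz = (sizeof(simd_spill_block_t) + 63) & ~(size_t)63;
        pt->simd_block = aligned_alloc(64, sz);
    }
    return pt->simd_block;
}
```

Repeated calls return the same block, so only the first SIMD reservation in a thread pays the allocation cost.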
So Derek, with respect to your comment, you're ok with it as suggested, right?
My comment would be: I would make the new area zmm-compatible from the beginning and code it in a way that it can easily be extended to [yz]mm later, and so that a mask register area can be added, too.
> So Derek, with respect to your comment, you're ok with it as suggested, right?
That comment is about not putting the SIMD spill slots directly into TLS, which is what the branch is already doing. Unfortunately I think we have to pay that indirection cost.
Yep, we're on the same page then.
@johnfxgalea , WDYT, your thumbs up suggests you're on board as well?
Okay, that is clear.
With regard to saving mask registers: do you expect an API such as drreg_reserve_mask_register? Are we going to have dedicated spill blocks per register class, or shall k masks try to use the slots of gprs to avoid indirection?
> Do you expect an API such as drreg_reserve_mask_register?
I think that would be good. Or add it later, but make it easy to do.
> Are we going to have dedicated spill blocks per register class or shall k masks try to use slots of gprs to avoid indirection?
I think it's better to make them separate and not mix with GPRs. But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
> Are we going to have dedicated spill blocks per register class or shall k masks try to use slots of gprs to avoid indirection?
> I think it's better to make them separate and not mix with GPRs.
Separating them only because they are logically distinct? That does not seem worthwhile to me: it is more efficient to use the same pool, and the code would be simpler as well, so I don't see any simplicity advantage to separating. The only reason to make them separate is if there's a concern about using too many TLS slots, right?
> But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
This should be up to the user: hardcoding a limit does not seem like a good idea.
> Are we going to have dedicated spill blocks per register class or shall k masks try to use slots of gprs to avoid indirection?
> I think it's better to make them separate and not mix with GPRs.
> Separating them only because they are logically distinct? That does not seem worthwhile to me: it is more efficient to use the same pool, and the code would be simpler as well, so I don't see any simplicity advantage to separating. The only reason to make them separate is if there's a concern about using too many TLS slots, right?
I think the main reason I said that is because they are of different sizes. I was looking at it as treating them the same way they are treated in mcontext: a simd + opmask structure.
> But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
> This should be up to the user: hardcoding a limit does not seem like a good idea.
Do you mean dynamically via drreg_options_t? The only problem: what about nested drreg_init calls then? We would have to re-allocate and copy existing data.
> But again, I think we could limit the number of supported spills. I don't think more than 1 or 2 for masks is needed.
> This should be up to the user: hardcoding a limit does not seem like a good idea.
> Do you mean dynamically via drreg_options_t? The only problem: what about nested drreg_init calls then? We would have to re-allocate and copy existing data.
Re-allocating would be expensive (would require a synchall to redirect all threads). I would think that either they are in the TLS slots shared with GPR regs, where the user picks how many total, or if they're in a separate alloc drreg should allocate space equal to the max # of mask regs on any supported cpu since a few bytes per thread is worth never having to realloc.
Aren't k masks always 64-bit in size (16 bits normally used), while gpr slots can be 32-bit on 32-bit architectures? Using the same slots could get a bit confusing due to the size differences, no?
> Aren't k masks always 64-bit in size (16 bits normally used), while gpr slots can be 32-bit on 32-bit architectures? Using the same slots could get a bit confusing due to the size differences, no?
Yes.
With respect to this, let's assume we allocate separately: would the code become simpler if the spill area treated masks and zmms the same, wasting 64 - 8 = 56 bytes for each mask slot? I guess this is a question for @johnfxgalea?
Personally, I would keep everything separate and also avoid lazy initialisation; this keeps things simple and extendable. By default, I would let drreg only set up slots for GPRs, but also enable the passing of some flags to drreg_init indicating which auxiliary spillage is also necessary. According to these flags, additional indirect slots will be allocated. Passing something like DRREG_SPILL_XMM | DRREG_SPILL_K_MASKS will instruct drreg to allocate indirect, well-sized slots for XMM and mask registers. If the user just wants to spill XMM registers, there is no need to blindly allocate 512 bits for the ZMM limit, but simply 128.
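A sketch of that flag-based init: the flag names come from this discussion (DRREG_SPILL_XMM, DRREG_SPILL_K_MASKS), while the sizing helper is hypothetical. The point is that with flags, drreg can size the indirect spill area to exactly what the client asked for.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names as proposed in this thread; not a shipped drreg API. */
typedef enum {
    DRREG_SPILL_XMM     = 0x1,
    DRREG_SPILL_K_MASKS = 0x2,
} drreg_spill_flags_t;

/* Hypothetical helper: compute the indirect spill-area size from the flags
 * passed to drreg_init, instead of blindly reserving zmm-wide slots. */
static size_t
aux_spill_area_size(unsigned flags, size_t num_simd, size_t num_masks)
{
    size_t sz = 0;
    if (flags & DRREG_SPILL_XMM)
        sz += num_simd * 16; /* 128-bit XMM slots, not 512-bit ZMM ones */
    if (flags & DRREG_SPILL_K_MASKS)
        sz += num_masks * 8; /* AVX-512 k registers are 64-bit */
    return sz;
}
```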
I would keep everything separate (i.e. an xmm area and a mask area) but do lazy initialization :) In principle I like the drreg_init idea with flags; the only problem, see above, is that re-allocating does not seem like a good idea. Remember that drreg_init() calls can be nested. But if we're always allocating everything anyway, what do we gain by having the flags?
If the question is just whether to have the xmm area and mask area as one combined area or not, I don't have a strong preference; I could live with either.
Derek, any strong preference?
I'll wait for what @derekbruening has to say but I think I have a better understanding now.
> Derek, any strong preference?
No, just try to anticipate future needs so we don't have to change the interface. Flags may give flexibility to change the implementation later without the interface having to change.
Okay, so I think using flags does help avoid changing the interface for future extensions. Still work in progress, but let me know what you think particularly for init_vector functions.
I would like to avoid defining drreg_reserve_xmm_register and drreg_reserve_mask_register functions. Instead, we should have one single function, namely drreg_reserve_register, and drreg should internally trigger the appropriate reservation depending on the passed register, i.e., using branch statements:
```c
if (reg_is_gpr(reg)) {
    /* reserve via the existing gpr slots */
} else if (reg_is_strictly_xmm(reg)) {
    /* reserve via the indirect simd spill block */
} else {
    return DRREG_ERROR;
}
```
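A slightly fuller, self-contained sketch of that dispatch: reg_is_gpr() and reg_is_strictly_xmm() are real DynamoRIO IR predicates, but everything below is a stand-alone stub for illustration - the enum values, the opmask predicate, and the per-class comments are hypothetical, not drreg's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in register ids and predicates; real code would use DynamoRIO's
 * reg_id_t and IR predicates instead. */
typedef enum { REG_RAX, REG_XMM0, REG_K1, REG_INVALID } reg_id_t;
typedef enum { DRREG_SUCCESS, DRREG_ERROR } drreg_status_t;

static bool reg_is_gpr(reg_id_t r) { return r == REG_RAX; }
static bool reg_is_strictly_xmm(reg_id_t r) { return r == REG_XMM0; }
static bool reg_is_opmask(reg_id_t r) { return r == REG_K1; }

/* One drreg_reserve_register-style entry point dispatching per class. */
static drreg_status_t
reserve_register(reg_id_t reg)
{
    if (reg_is_gpr(reg))
        return DRREG_SUCCESS; /* existing GPR path: direct TLS slots */
    else if (reg_is_strictly_xmm(reg))
        return DRREG_SUCCESS; /* SIMD path: indirect zmm-wide spill block */
    else if (reg_is_opmask(reg))
        return DRREG_SUCCESS; /* mask path: 64-bit slots */
    else
        return DRREG_ERROR;   /* unsupported register class */
}
```

The client-facing benefit is a single reserve/unreserve pair regardless of register class, with the class-specific spill mechanics hidden inside drreg.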
PS: Sorry for the delayed progress, I am currently overrun with work.
I also updated drreg_is_register_dead.
So your plan is that the user will maintain different vectors per spill class?
> we should have one single function, namely drreg_reserve_register
I think this is good.
> So your plan is that the user will maintain different vectors per spill class?
> we should have one single function, namely drreg_reserve_register
I think so, because different register classes have different numbers of registers. We could also keep things even simpler and always initialise the vector assuming the largest register class. -- Thinking about it now, I think the latter is better.
I think different vectors is better, as it is more clear.
Just in case I wasn't clear, I mean having different vectors, but the user won't need to pass an enum register class value to the init_vector function.
No strong preference. If you do decide to make only one _ex function, just include in the description that the function initializes max(drreg_class0, drreg_class1, ...)
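The single _ex-style init could look like this sketch: one vector capacity sized for the largest register class, as suggested above. The class counts and the helper name are illustrative only, not drreg's actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-class register counts (x86-64 with AVX-512). */
#define NUM_GPRS 16
#define NUM_SIMDS 32
#define NUM_OPMASKS 8

/* Hypothetical helper for a single init_vector_ex: the vector is sized as
 * max(drreg_class0, drreg_class1, ...), so one vector works for any class
 * and the user never passes a register-class enum to it. */
static size_t
init_vector_capacity(void)
{
    size_t cap = NUM_GPRS;
    if (NUM_SIMDS > cap)
        cap = NUM_SIMDS;
    if (NUM_OPMASKS > cap)
        cap = NUM_OPMASKS;
    return cap;
}
```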
Sorry for the delay. I finally have some free time so I can address some PRs which are backlogged.
Hi @johnfxgalea , thanks again for helping DynamoRIO. Do you have a rough ETA for this?
I will soon be able to send a PR for the XMM spill, hopefully by today/tomorrow - there are some things I need to discuss with you regarding the state restoration of other types of registers, but that's not a big concern for the first PR.
xref #3823 , these patches may overlap, beware.
Okay no problem, thanks for letting me know!
Okay I made some progress and I think the interface is finalised. Pushing in 15 mins. Need to sort some things out internally and include tests for SIMD spillage.
The biggest change is the inclusion of the DRREG_SPILL_CLASS.
All current tests related to drreg are passing on my machine, which is a good sign given some refactoring. As stated above, I have some finalising left to do, mainly relating to some more fixes and testing.
Ok. Your contributions are very appreciated.
For a contribution like this, would you expect to include all new API functions in api/docs/release.dox? or just one line stating the new feature?
Any API change needs to be mentioned in the release docs.
Specifically, #2985 is currently hitting this issue. That case needs some capability to spill - in this case - an xmm and a mask register value across app instructions (hence mcontext cannot be used). This issue covers providing a method to spill into the TLS data that drreg already has, managed by drmgr.
Interface TBD:
Specifically allow for reserving xmm and mask registers? (Also needs an unreserve method.)
- drreg_reserve_xmm_register()
- drreg_reserve_mask_register()

Or instruct drreg to spill an arbitrary number of bytes from an opnd_t (or opnd_size_in_bytes() of the opnd_t)? (Also needs a restore method.)
- drreg_reserve_spill_bytes()?
Other ideas?
Case #2985 needs this cross-app, hence no automatic management is needed there. Should a new API be written to provide automatic spill/restore management when called from the insertion phase?