DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

ldrex..strex pair constraints challenge instrumentation and even core operation #1698

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

An ldrex..strex pair has some constraints that make inserting instrumentation in between, or even core DR operation, challenging:

From the ARMv8 manual, section E2.10.5:

  An implementation might clear an exclusive monitor between the LoadExcl
  instruction and the StoreExcl instruction, without any application-related
  cause. For example, this might happen because of cache evictions.  Software
  must, in any single thread of execution, avoid having any explicit memory
  accesses or cache maintenance instructions between the LoadExcl instruction
  and the associated StoreExcl instruction.

  Implementations can benefit from keeping the LoadExcl and StoreExcl
  operations close together in a single thread of execution. This minimizes
  the likelihood of the exclusive monitor state being cleared between the
  LoadExcl instruction and the StoreExcl instruction. Therefore, for best
  performance, ARM strongly recommends a limit of 128 bytes between
  LoadExcl and StoreExcl instructions in a single thread of execution.

One-time switches back to dispatch (during code discovery, or synchall or something) should be fine, because the ldrex..strex code has to loop and be prepared to fail a few times. The problems are all related to inserted memory operations that execute every single time.

Thus we have problems with anything that inserts memory operations inside the monitor region: stolen register mangling (including on the ldrex/strex themselves), client inline instrumentation, and clean calls.

There could be multiple cbr's in between the ldrex and strex, so do we give up on getting the whole sequence into a single bb? Should we study sequences we see in the wild to see whether making some assumptions would make the problem drastically easier? Should we have a special trace head trigger and stop supporting -disable_traces?

One suggestion is to add a check before each instrumentation memory reference that skips it if we're in the middle of an ldrex/strex pair: but implementing that check without introducing memory references of its own gets tricky.

zhaoqin commented 9 years ago

It looks like the hardware is forgiving if there are only a few memory references in between (based on my experiments with memtrace instrumentation: inline instrumentation is fine, but a large clean call causes problems), so I would not worry about the stolen-reg handling.

Just for memtrace, if we assume there is no memory reference in between and the hardware is forgiving, we can just not insert the clean call.

derekbruening commented 9 years ago

We need to know what "only a few" means, and how it varies with different hardware: let's not build something that only runs on one machine. It could be that just a few memory refs that happen to collide on cache lines with ldrex will cause a hang, so it's not clear yet that we can consider inline instrumentation or even stolen reg mangling to be safe.

algr commented 9 years ago

One option would be to compile to a fast case and a fallback case, and take the fallback case if the fast case takes too many trips through the loop, indicating it's spinning as a result of something the code itself is doing rather than because of action of other threads. Depending on the core microarchitecture and the actual addresses, it may be that the fast case works most of the time. Question is, what does the code in the fallback case look like?

derekbruening commented 9 years ago

First we would need to group the whole loop, which runs counter to DR's normal single-basic-block approach: normally we do not speculatively build blocks prior to execution. While we do not expect large numbers of branches or things like syscalls in between ldrex and strex, certainly there can be more than one branch (we have observed at least 2 in normal, compiled apps), meaning we're talking about a 3+ basic block region on which we need to impose some novel-to-DR constraints.

zhaoqin commented 9 years ago

If there are conditional branches in between, it might be really hard to speculatively build a code fragment of multiple basic blocks. In the worst case, we might need to build a code fragment with 3 bbs for 1 cbr, and 7 bbs for 2 cbrs, i.e., one ldrex having a separate strex on each path.

derekbruening commented 8 years ago

We need to clearly document this issue and ideally add checks that identify problems in client code, as many new users build tools that have a clean call before every instruction or every memory reference, and as noted above, the many memory operations of a clean call between the app's ldrex...strex can result in an infinite loop.

derekbruening commented 8 years ago

I planned to add an instruction trace sample (#1933), which hopefully can be the basis for similar tools that new users build and should remove the most common reason people put clean calls before every single instruction.

derekbruening commented 8 years ago

A buffer filling API that includes the use of faults for handling full buffers (#513) could help here by eliminating explicit instrumentation, and especially clean calls, in tracing tools in problematic places like between ldrex..strex pairs. It is possible that the buffer could keep filling up at exactly the same point, but that is unlikely, or at least it seems less likely than the problematic scenarios of the current instrumentation schemes.

egrimley commented 7 years ago

We have run into a real-world problem with "Stolen register mangling [...] on ldrex/strex themselves" on AArch64. On Ubuntu 14.04 the C library uses X28 in an LDAXR/STXR pair, with the result that "bin64/drrun -- echo foo" does not work: it goes into an infinite loop before exiting. For what it's worth, "bin64/drrun -steal_reg 27 -- echo foo" did work, but we found another case where X27 is used in an LDAXR/STXR pair... It would be great to have a fix for this before the next release. So, any suggestions for a quick fix, even a dirty one?
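
For context, the problematic shape is roughly the following (a made-up sketch, not the actual glibc code), where x28, DR's stolen register, is the base of the exclusive pair, so stolen-register mangling would insert TLS memory references inside the monitor region:

1:      ldaxr   w0, [x28]                // exclusive load through the stolen register
        cbnz    w0, 2f
        stxr    w1, w2, [x28]            // paired store-exclusive, also through x28
        cbnz    w1, 1b
2: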

derekbruening commented 7 years ago

Qin's comment above https://github.com/DynamoRIO/dynamorio/issues/1698#issuecomment-104719066 led us to not worry about small numbers of memrefs like in stolen reg mangling -- what is the difference between what he observed and what you are observing? Is AArch64 more fragile, or did he just not stress test enough?

egrimley commented 7 years ago

It depends on the hardware implementation rather than the architecture version or AArch32/AArch64. The spec says that LoadExcl / StoreExcl loops are guaranteed to make forward progress only if there are no explicit memory accesses between the Load-Exclusive and the Store-Exclusive. The hardware we had tested on so far lets you get away with a small number of memory accesses, but we have now started testing on AArch64 hardware that is much less forgiving. We haven't encountered it yet, but there may also be AArch32 hardware that is similarly strict. In any case, a solution for AArch64 would probably also work on AArch32.

The spec also forbids indirect branches between the Load-Exclusive and the Store-Exclusive, which helps us.

I'm wondering whether identifying and mangling a "macro-instruction" would cover all cases likely to be met in practice. By a "macro-instruction" (my own provisional name, not a standard term) I mean a block of instructions in which all the branches are direct branches to an instruction in the block or to the instruction just after the block. One could perhaps identify a "macro-instruction" that contains both the Load-Exclusive and the Store-Exclusive and treat this as a single instruction for the purposes of stolen-register-mangling. This would cover the cases I've seen so far, though I haven't searched widely. A block of five instructions would handle the worst cases I've seen.

derekbruening commented 7 years ago

I'm sure we can get plain DR to work: the biggest challenge is instrumentation. How would clients operate on a multi-instr bundle? The precedent for that is jecxz/loop* on x86 but those expansions have the same operands and semantics as the original and thus have no effect on instrumentation. Here we can't just treat the block as a single instr with a ton of operands, because the semantics matter (think about taint tracking or Dr. Memory).

If we give up on instrumentation getting the actual load and register values, we could try to auto-magically make two versions: the real ldrex..strex with no instrumentation, followed by an emulation that uses regular ldr..str. So the client instruments the ldrex...strex as normal, using the fundamental approach of iterating over individual instructions, and then there's a pass afterward that changes the instrumentation to the emulated version and adds the raw version in front. We'd have to think through how the cbr, speculation, etc. would work: this would still require a super-block (with the loop as a self-loop) and so more of a trace, which will complicate things.

The "emulation" approach would work for shadow propagation (taint tracking, Dr. Memory), memtrace, instrace: most tools don't need precise values. It just seems too hard to get the values in the general case: we'd have to store every result of each instr in ldrex...strex into a spare register and just hope there are enough registers.

If even a single memref in between is no good, it does seem like there is no solution that does not construct a super-block, unfortunately...

derekbruening commented 7 years ago

The emulation would have to save the registers to preserve the "real" ldrex..strex sequence, but the spills would be after the strex and thus safe. It will incur overhead, though, so maybe we'd want the fastpath approach suggested by Al, with a counter to see whether it's looping?

egrimley commented 7 years ago

In some cases that I've seen the smallest single-entry single-exit block containing the exclusive load and exclusive store consists of 7 instructions, 2 of which are conditional branches. So the code to rewrite the instrumentation would have to be fairly generic.

egrimley commented 7 years ago

I have a draft patch that bundles loops on AArch64 containing an exclusive load and store into a macro-instruction. There's no instrumentation for anything in these macro-instructions, but it lets the tests pass (except client.truncate). It's more of a work-around than a proper fix but perhaps it's useful for some people.

Here's an example of the contents of one of these macro-instructions (though this particular one doesn't use the stolen register):

1:      ldaxr   w1, [x2]
        cmp     w1, wzr
        b.ne    2f
        stxr    w3, w0, [x2]
        cbnz    w3, 1b
2:

egrimley commented 7 years ago

Valgrind has exactly the same problem, and possible solutions have been discussed since Feb 2015. See: https://bugs.kde.org/show_bug.cgi?id=369459

egrimley commented 7 years ago

An alternative to building a macro-instruction is to translate LDXR into an instruction sequence that does an ordinary load and saves the address, the loaded value, and the width of the memory transaction into three TLS slots, and to translate STXR into an instruction sequence that checks that the address and the width agree with the saved values and then does an atomic compare-and-swap (itself implemented using LDXR and STXR), so that the simulated STXR succeeds if the value in memory hasn't changed since the simulated LDXR. This is not strictly speaking equivalent to the original code, but it would probably work in practice, at the cost of three extra TLS slots and the code expansion.
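
A rough sketch of the LDXR half of that translation, assuming x28 is the TLS base, the slot names are placeholders, and spills of the scratch register are elided (the STXR half is sketched further down the thread):

        // App instruction was:  ldxr  w1, [x2]
        ldr     w1, [x2]                           // ordinary, non-exclusive load
        str     x2, [x28, #<monitor-addr-slot>]    // remember the monitored address
        str     x1, [x28, #<monitor-value-slot>]   // remember the loaded value
        mov     x4, #4                             // access width in bytes (x4 is a scratch reg)
        str     x4, [x28, #<monitor-size-slot>]    // remember the width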

derekbruening commented 7 years ago

Right, that is what is discussed in the Valgrind issue, emulating with a different atomic sequence. Xref atomic emulation in the recent QEMU paper, Sec 4.3 of "Cross-ISA Machine Emulation for Multicores" Cota, et al.: http://www.cs.columbia.edu/~luca/research/cota_CGO17.pdf

egrimley commented 7 years ago

We need a link here to https://github.com/DynamoRIO/dynamorio/pull/2259, which implements the "macro-instruction" work-around. We are still undecided between this and the "compare-and-swap" alternative.

derekbruening commented 7 years ago

We need a link here to #2259, which implements the "macro-instruction" work-around. We are still undecided between this and the "compare-and-swap" alternative.

A macro-instruction does not work with clients (as discussed above: https://github.com/DynamoRIO/dynamorio/issues/1698#issuecomment-273830730) so it does not seem like a long-term solution: it is a temporary solution for plain DR with no tool.

derekbruening commented 6 years ago

A user hit this on Cavium ThunderX: https://groups.google.com/forum/#!topic/dynamorio-users/bp2pjyj8zFk

fhahn commented 6 years ago

I looked into most of the remaining failures with -unsafe_build_ldstex.

I guess we could work around those somehow, but I have been thinking a bit more about what would be necessary to implement the approach discussed in https://github.com/DynamoRIO/dynamorio/issues/1698#issuecomment-283771396:

Where would be the best place to implement the expansion of ldxr/stxr? We have to prevent clients from instrumenting the ldxr/stxr in the expanded sequence. Could we do it after all instrumentation has been inserted by the clients? If we use compare-and-swap (either via ldxr/stxr or CAS), IIUC we could miss cases, e.g. when a different thread stores the existing value in between. But Sec 4.3 of "Cross-ISA Machine Emulation for Multicores" by Cota et al. indicates that this is usually not a problem in practice. Also, would it make sense to try their 'Pico-CAS' approach, replacing the stxr & conditional branch with a CAS instruction followed by a check of whether we managed to swap the value? This again would suffer from the ABA problem.
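
For reference, a hypothetical Pico-CAS-style replacement might look like the following (registers follow the small example earlier in the thread, w4 is a scratch register whose spill is elided, and the ARMv8.1 LSE extension is required):

        // In place of:  stxr w3, w0, [x2] ; cbnz w3, 1b
        mov     w4, w1                   // w1 = value from the emulated ldxr; cas wants it as the compare operand
        cas     w4, w0, [x2]             // if [x2] == w4, store w0; w4 receives the old value
        cmp     w4, w1
        b.ne    1b                       // swap did not happen: retry (subject to the ABA caveat)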

As an optimization, we could only do the expansion if instrumentation has been added to a ldxr/stxr block.

derekbruening commented 6 years ago

Could we do it after all instrumentation has been inserted by the clients?

During mangling inside core DR? It seems a good choice at first glance.

derekbruening commented 4 years ago

I'm wondering whether another possible solution here is to take an approach inspired by what we're using for restartable sequences ("rseq"), detailed at https://github.com/DynamoRIO/dynamorio/wiki/Restartable-Sequences. Adapting that "run twice" approach here would mean having two versions of the code: one that's instrumented but may not make forward progress, and another that runs without instrumentation and should then complete (might still have to loop a small number of times though in case of kernel/processor actions losing the monitor: we should be prepared for that).

This is similar to @algr's fallback idea when not making progress (https://github.com/DynamoRIO/dynamorio/issues/1698#issuecomment-107379637) where the fallback is to run natively.

The downside is that the instrumentation does not always accurately observe the original program behavior: it may always see a monitor failure and loop and never see the success case.

Complications include identifying the looping conditional branch. If every sequence is as simple as the example above:

1:      ldaxr   w1, [x2]
        cmp     w1, wzr
        b.ne    2f
        stxr    w3, w0, [x2]
        cbnz    w3, 1b
2:

Then what we'd do is run normally and instrument normally, but have DR's mangling insert, in the "stxr;cbnz" block, extra code with a counter or something that detects looping; if it finds looping, it redirects control to an embedded native copy of the whole ldaxr...cbnz sequence. I don't think we want an internal loop there (that breaks bounded-time assumptions inside blocks): if the native monitor fails it would have to loop all the way back around. That should be ok.
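
A hedged sketch of that inserted check in the "stxr;cbnz" block (x28 = TLS base; the slot name, threshold, label, and scratch register are invented, and its spill is elided):

        ldr     w9, [x28, #<spin-count-slot>]    // per-thread iteration counter
        add     w9, w9, #1
        str     w9, [x28, #<spin-count-slot>]    // these TLS refs may themselves clear the monitor,
        cmp     w9, #64                          // which is fine: that is what the fallback is for
        b.hs    native_copy                      // looping: redirect to the embedded native copy
        stxr    w3, w0, [x2]                     // original app store-exclusive
        cbnz    w3, 1b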

There seem to be a lot of complexities and what-ifs here: just like rseq. Just throwing it out as an idea. Maybe converting to CAS is so much simpler we should just do that.

Desperado1985 commented 4 years ago

Greetings,

I seem to have this problem with the recent DynamoRIO-AArch64-Linux-7.91.18342 build, for an application with 150+ threads and drcov. Adding the -unsafe_build_ldstex flag to the drrun parameters doesn't help; in fact, drrun has problems reading the -root parameter if I add the flag before it. The program does seem to proceed slightly further than without the flag, though. Do you have any further recommendations? Regards.

derekbruening commented 4 years ago

Looks like a duplicate comment -- simpler to have one thread: https://github.com/DynamoRIO/dynamorio/issues/3005#issuecomment-604752183

derekbruening commented 4 years ago

From #4263: we may be able to translate some exclusive monitor loops into the new ARMv8.1 atomic opcodes and avoid the mid-loop-instrumentation problems.

derekbruening commented 3 years ago

I'm planning to take a stab at implementing the compare-and-swap solution.

AssadHashmi commented 3 years ago

I'm planning to take a stab at implementing the compare-and-swap solution.

Feel free to prod me if you'd like me to run any changes on a variety of AArch64 h/w, e.g. ThunderX2, Softiron.

derekbruening commented 3 years ago

I wrote up a design doc on the ideas in this issue: https://github.com/DynamoRIO/dynamorio/wiki/Exclusive-Monitors

I included in the doc a sample compare-and-swap transformation which is how I planned to implement it. @AssadHashmi maybe you could take a look and see if it looks reasonable.

AssadHashmi commented 3 years ago

@AssadHashmi maybe you could take a look and see if it looks reasonable.

Will do @derekbruening. It deserves full attention so I'll get back to you tomorrow.

AssadHashmi commented 3 years ago

Thanks for a really well-written/communicated description of the problem and its possible solutions!

The proposed solution B (Compare-and-Swap Simulation) seems to be the best choice in terms of v8.x h/w agnosticism, as well as being the least intrusive to DR.

The example transformation sequence is for store-if-zero read-modify-write synchronization. Are you planning to have multiple transformation sequence templates for other usages of r-m-w synchronization, e.g. atomic arithmetic ops? Or will this one sequence be modified for all? Since we don't know what all of these usages could be, perhaps we could build a collection of transformation code templates to cover each type as they're needed, rather than one sequence which may get increasingly complex?

I appreciate the disadvantages of solution C but if there is value in using the dedicated atomic instructions available from v8.1 onwards, then detecting h/w capabilities before picking a transformation may be a possibility. It's a fairly straightforward case of using the MRS instruction to read the appropriate instruction set attribute register. In general though, a v8.x agnostic solution is preferable unless there are compelling reasons to use particular v8 version instructions.

The weaker sequence does not have the exact same semantics of the original, but it is pretty close, and the claim is that the difference almost never matters for real programs.

Can we fallback (automatically or with user option) to the current -unsafe_build_ldstex sequence?

The number of instructions could be reduced by using paired load/stores if the spill slots and monitor addr/val slots are consecutive in memory. For:

str     x2, [x28, #<monitor-addr-slot>]
str     x1, [x28, #<monitor-value-slot>]

to:

stp x2, x1, [x28, #<monitor-addr-slot>]

And:

ldr     x1, [x28, #<x1-spill-slot>]
ldr     x0, [x28, #<x0-spill-slot>]

to:

ldp x1, x0, [x28, #<x1-spill-slot>]

Similarly for:

str     x1, [x28, #<x1-spill-slot>] // Spill for common restore.
str     x0, [x28, #<x0-spill-slot>]

Should the b.ne 2f have a larger offset?

derekbruening commented 3 years ago

The example transformation sequence is for store-if-zero read-modify-write synchronization. Are you planning to have multiple transformation sequence templates for other usages of r-m-w synchronization, e.g. atomic arithmetic ops? Or will this one sequence be modified for all? Since we don't know what all of these usages could be, perhaps we could build a collection of transformation code templates to cover each type as they're needed, rather than one sequence which may get increasingly complex?

I thought the transformation would be identical for arithmetic ops? Or am I missing something? I figured the same rules would apply to any sequence:

  1. If you see an LD.X. opcode:
     A. Transform the LD.X. to the equivalent opcode without the X (i.e., keeping the size and whether it is an acquire).
     B. Insert the sequence from the wiki to store the address, value (result of the load), and size to dedicated TLS slots.

  2. If you see a ST.X. opcode:
     A. Insert checks that the address and size match: if not, pretend the ST.X. failed.
     B. Add a LD.X. and comparison to the value in TLS. Hmm: here we need to remember the opcode of the original LD.X.? On mismatch, fail.

Would we ever need to change the ST.X. opcode (e.g., whether it has release semantics)?
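
To make rule 2 concrete, here is a hedged sketch of the ST.X side (not the design doc's actual sequence; slot names are placeholders, x4/w5/w6 are scratch registers with spills elided, and the acquire/release choices are illustrative), for an app instruction stxr w3, w0, [x2] of width 4:

        ldr     x4, [x28, #<monitor-addr-slot>]   // recorded address
        cmp     x4, x2
        b.ne    4f                                // address mismatch: simulated ST.X fails
        ldr     x4, [x28, #<monitor-size-slot>]   // recorded width
        cmp     x4, #4
        b.ne    4f                                // size mismatch: fail
        ldr     w5, [x28, #<monitor-value-slot>]  // value seen by the LD.X
3:      ldaxr   w6, [x2]                          // CAS loop implementing the store
        cmp     w6, w5
        b.ne    4f                                // memory changed since the LD.X: fail
        stxr    w3, w0, [x2]
        cbnz    w3, 3b                            // on success w3 is already 0, as the app expects
        b       5f
4:      mov     w3, #1                            // report failure in the app's status register
5: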

derekbruening commented 3 years ago

I appreciate the disadvantages of solution C but if there is value in using the dedicated atomic instructions available from v8.1 onwards, then detecting h/w capabilities before picking a transformation may be a possibility

The challenges listed in the doc are that it's difficult to transform multiple instructions after instrumentation, and doing it before means tools don't see the original opcodes. Though if the main types of tools, like tracing and taint tracking, are all fine with that and the semantics are the same (except for opcode counting), maybe it would work out OK, i.e., if the dataflow for taint metadata is identical. I figure we could try something like this later: my plan was to start with CAS, which has only one downside (the ABA problem).

derekbruening commented 3 years ago

Can we fallback (automatically or with user option) to the current -unsafe_build_ldstex sequence?

Yes, I definitely agree to putting the CAS under an option so we can control it, in case the ABA turns out to matter.

derekbruening commented 3 years ago

Thanks for the stp tips: we should be able to do that.

Should the b.ne 2f have a larger offset?

It will be a separate block and will have regular DR branch mangling.

derekbruening commented 3 years ago

It will be a separate block and will have regular DR branch mangling.

Maybe the doc didn't make that clear, that that transformation is really in two pieces. That's one of the advantages of the CAS: that it acts separately on the load and store and it doesn't matter how many branches are in between and doesn't need to act across blocks.

johnfxgalea commented 3 years ago

Just a fly-by comment: for taint tracking and tracing, the impact of Solution C is not very severe (provided that the resulting transformed code has the same semantics). I guess an issue arises if a taint policy is particularly concerned with load-exclusive/store-exclusive sequences, but that is a specific use case and not the norm. However, Solution A would be detrimental, as instrumentation code placed prior to the sequence would require emulation to get app data such as addresses.

AssadHashmi commented 3 years ago

I thought the transformation would be identical for arithmetic ops? Or am I missing something? I figured the same rules would apply to any sequence

Maybe the doc didn't make that clear, that that transformation is really in two pieces.

Ah ok. So my understanding was flawed in that I imposed a constraint which doesn't exist, i.e. limiting the span between, and scope of what can happen between, LD.X and ST.X.

That raises the question: are there limits to what instructions and events can happen between LD.X and ST.X ? (The main worry would be nesting I guess but you've already stressed that can't happen because each hardware thread supports only one monitor.)

yury-khrustalev commented 3 years ago

Just a couple of minor notes about the design doc:

  1. The exclusive monitor can be cleared by the CLREX instruction, not only by the corresponding store-exclusive (or by any non-exclusive memory access).
  2. We shouldn't assume that software will behave as recommended. So whatever solution is implemented, it should cope with all kinds of scenarios.

I don't fully understand how any of the proposed A, B or C solutions will work for anything other than the few considered examples. Setting and clearing the exclusive monitor happens at execution time rather than at instrumentation time. So a "load-exclusive store-exclusive atomic sequence" doesn't really exist, because, depending on the logic right after the load-exclusive, the execution may "jump" to various places in code (which may not yet be seen by DynamoRIO), so the "tail" of the sequence is essentially dynamic.

I think what would really be helpful is the ability to defer execution of instrumentation for anything between "exclusive monitor set" and "exclusive monitor released", meaning that we would instrument basic blocks as we usually do, but with the ability to know when we enter the sequence and when we exit from it (maybe a new type of event?).

yury-khrustalev commented 3 years ago

It would be consistent to expose instr_is_exclusive_load (a counterpart to the API function instr_is_exclusive_store).

derekbruening commented 3 years ago

The exclusive monitor can be cleared by the CLREX instruction, not only by the corresponding store-exclusive (or by any non-exclusive memory access).

Right, but I don't think this impacts the solutions much: e.g., B (CAS) should clear its TLS values on CLREX but that only matters for pathological cases where the app would fail on CLREX..STXR and DR incorrectly thinks the STXR matches a LDXR prior to the CLREX. I'll add it to the doc.
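
Clearing could be as simple as the following sketch (the slot name matches the placeholders used earlier in the thread; treating a zero address as "no monitor recorded" is an assumption):

        clrex                                     // keep the app's clrex
        str     xzr, [x28, #<monitor-addr-slot>]  // invalidate the recorded monitor address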

We shouldn't assume that software will behave as recommended. So whatever solution is implemented, it should cope with all kinds of scenarios.

This is indeed the general DR philosophy, to handle any possible program, but unfortunately there are problems that simply can't be solved for all cases. E.g., for restartable sequences (see the wiki design page) we had to make assumptions on ABI conventions listing the rseq regions for us; the only alternative requires monitoring every single store which would introduce prohibitive, unacceptable overhead likely breaking many tool deployments. Similarly here, there may not be solutions short of unacceptable schemes like serializing all the application threads or other non-starters. We can have layered multi-pronged imperfect solutions as the doc talks about. I think starting with B (CAS) with A (-unsafe_build_ldstex) as a fallback is a reasonable starting place: neither solution is perfect.

So a "load-exclusive store-exclusive atomic sequence" doesn't really exist, because, depending on the logic right after the load-exclusive, the execution may "jump" to various places in code (which may not yet be seen by DynamoRIO), so the "tail" of of the sequence is essentially dynamic.

This is why Solution B, the CAS, is my preference: it does not rely on seeing the entire sequence as it acts on the load and store separately and can easily handle different dynamic paths pairing up loads and stores (or even unpaired). It should handle any sequence and not just the examples. Do you have a counter-example where it fails, other than the ABA issue?

derekbruening commented 3 years ago

That raises the question: are there limits to what instructions and events can happen between LD.X and ST.X ? (The main worry would be nesting I guess but you've already stressed that can't happen because each hardware thread supports only one monitor.)

Limits required by Solution B (CAS)? No, it will handle anything. Which is why it is my preferred solution. I am pretty sure that the ABA issue is the only problem. As for the other solutions: they will all have limits, because they need to identify the entire sequence up front. The manual does say that indirect branches in between void the guarantee of forward progress, so we would not expect to see them, but you could imagine a pathological case with indirect branches preventing simple static decoding from the load to find the paired store(s), and even with all direct branches you could have some serpentine, complex code with conditional branches.

derekbruening commented 3 years ago

I think what would really be helpful is the ability to defer execution of instrumentation for anything between "exclusive monitor set" and "exclusive monitor released", meaning that we would instrument basic blocks as we usually do, but with the ability to know when we enter the sequence and when we exit from it (maybe a new type of event?).

Separating instrumentation from its target instruction can get very complex. Imagine 32-bit ARM with predicated instructions inside the monitor region where the flags change several times in the region. Separating then involves emulating all the flags changes to properly predicate or otherwise handle the instrumentation. Branches inside the region complicate dataflow instrumentation like taint tracking, which has to try to figure out which paths were taken once it reaches a point where it can write to memory again. It does not seem tenable.

derekbruening commented 3 years ago

So I did find a complication with Solution B: multiple load opcodes for the same store. Potential solutions:

  1. Flush and replace with multi-version inside the fragment
  2. Return to dispatch and use gencode created at init for every opcode, with special translation support

For the initial version of the implementation I'll at least detect this.

derekbruening commented 3 years ago

So I did find a complication with Solution B: multiple load opcodes for the same store. Potential solutions:

  1. Flush and replace with multi-version inside the fragment
  2. Return to dispatch and use gencode created at init for every opcode, with special translation support

For the initial version of the implementation I'll at least detect this.

New proposal: always use the acquire version of the load for the CAS. This removes all the opcode complexity. We have the size from the store opcode. This shouldn't affect correctness and would just be a performance issue from the extra barrier I assume.

derekbruening commented 3 years ago

Ideas for a local small test:

It seems difficult to have more sophisticated synch tests that will show up in CI.

Are there some known stress tests of ldstex somewhere?

derekbruening commented 3 years ago

One interesting issue is this app code:

      1:
        ldaxr    w1, [x0]
        add      w2, w1, #0x1
        mov      x0, #0
        stxr     w3, w2, [x0]
        cbnz     w3, 1b

Natively, the stxr faults, despite the base mismatch. The simple mangling from the proposal has us fail the base-equal check and so not even execute a store, causing the app to loop and crash on the ldaxr. You could tweak this to make us loop forever where the app crashes on the stxr.

My plan is to add a cloned stxr on the mismatch path to ensure fault fidelity.
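
Roughly, for the example above (with <monitor-addr-slot> the hypothetical TLS slot from the earlier sketches and x4 a scratch register whose spill is elided):

        ldr     x4, [x28, #<monitor-addr-slot>]   // address recorded at the ldaxr
        cmp     x4, x0                            // x0 is the base the stxr actually uses
        b.eq    6f                                // addresses match: take the normal CAS path
        stxr    w3, w2, [x0]                      // cloned stxr: faults exactly where the app would (x0 == 0 here)
        mov     w3, #1                            // if it did not fault, report failure like a cleared monitor
        b       7f                                // skip the CAS path
6:      // normal CAS path goes here (as in the earlier sketches)
7: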

derekbruening commented 3 years ago

Another issue: when the ldex or stex uses the stolen reg, my current impl swaps the stolen reg to a different spilled reg across the whole inserted sequence, keeping the app ldex+stex as they were, just like all other stolen-reg mangling. This means all the added TLS refs use the swapped reg as the base, so they are not recognized by the spill/restore identification code during translation for restoring spilled regs, and won't be recognized by clients either (via the instr_is_reg_spill_or_restore() API function). I'm thinking this should be revisited: it may be better to change the stolen reg in the app instrs in this case, to allow spill/restore identification to occur naturally.

derekbruening commented 3 years ago

While implementing this I have run into quite a few missing features and bugs in the AArch64 port:

I fixed some of these: