Non-overlapping version of memory.copy?

penzn commented 5 years ago

C/C++ requires non-overlapping source and destination to memcpy. Rust supports both overlapping and non-overlapping copy operations. Would anybody mind a PR with non-overlapping variant of memory.copy?

conrad-watt commented 5 years ago

C/C++ and Rust make the non-overlapping copy faster by specifying undefined behaviour in the case that there is actually an overlap. This doesn't really fit Wasm's design philosophy. Alternatively, if such an instruction were spec'd to trap immediately in the overlapping case, it would be hard to make it faster than the regular memory.copy (since you could just do the same check and go to a fast path instead of trapping).

AndrewScheidecker commented 5 years ago

I was thinking about this in the context of the other thread about inlining memory.copy, but the problem is that without a dynamic overlap check, it's possible to end up with a worse form of non-determinism than racing changes to memory bounds/protection: using a non-overlapping copy instruction with address ranges that actually overlap may behave differently on different runtimes, and WASM code that "works" in one runtime might not work in another.

It would be possible to avoid that inter-runtime non-determinism by defining such a non-overlapping memory.copy to be semantically equivalent to a one-byte-at-a-time copy, but that constrains the implementation to something that is inefficient for the small copies that benefit the most from getting rid of the dynamic direction check.

lars-t-hansen commented 5 years ago

C/C++ requires non-overlapping source and destination to memcpy. Rust supports both overlapping and non-overlapping copy operations. Would anybody mind a PR with non-overlapping variant of memory.copy?

What @conrad-watt said. I don't think this yields any benefits in the context of wasm, where we have to check. It's a bit like regular memory accesses: they may declare that they are properly aligned, but they could be lying (or mistaken) about that. Yet we have to make them work anyway.

binji commented 4 years ago

Seems like there's not much to be done here, closing.

laughinghan commented 4 years ago

Hi sorry, I don't understand the resolution to this thread.

@conrad-watt @lars-t-hansen My understanding is that a dynamic overlap check that traps is practically free because it's branch-predictor-friendly (because in practice, in hot code, it should never trap). Whereas dynamic overlap checks to decide whether or not to allow parallelized copy, and whether to forward- or reverse-copy, would likely see all cases and couldn't be optimized by the branch predictor.

The same consideration applies to alignment, too. I understand that for large bulk moves, branch predictor optimization would be negligible, but #1 seems to say we explicitly assume these operations will be used for small moves too. This was also discussed re alignment in #111, but that seems to have been closed after consensus was achieved on a change to bounds-checking.

Am I missing something?

conrad-watt commented 4 years ago

I have only amateur knowledge of branch prediction, but this would only be a potential issue in situations where the checks aren't inlined at the site, right? If the short moves are of constant length, I'd expect the checks to be inlined or unnecessary (based on the resolution of https://github.com/WebAssembly/bulk-memory-operations/issues/111). So to see a real performance hit, you'd need a program performing a weird enough mixture of overlapping/non-overlapping small variable-size moves to defeat the branch predictor and have this be observable. My knee-jerk reaction is that such programs are unlikely to exist, and if they were common then engines would probably want to more aggressively inline anyway, to avoid paying the cost of calling an intrinsic for such small moves.

laughinghan commented 4 years ago

I am no expert either, but I don't think inlining renders the checks unnecessary, and isn't the point of this intrinsic to be faster than equivalent Wasm code, not to impose an additional cost that engines have to weigh?

I don't think programs have to be particularly weird to defeat the branch predictor (and related mechanisms like the cache prefetcher), but I'm not in a position to prove it either way. You could be right that inlining obviates the principle that a dynamic check to trap is free whereas a dynamic check that alters control flow and precludes a fast path is expensive. That wasn't the conclusion I drew from #111, where it seems like LLVM is explicitly choosing not to use memory.copy for short constant copies because of the loss of alignment and overlap information (https://github.com/WebAssembly/bulk-memory-operations/issues/111#issuecomment-531238828), but I should probably let implementation experts like @eqrion speak for themselves.

tlively commented 4 years ago

The resolution to this thread had nothing to do with branch predictors or inlining. Wasm cannot have a version of memory.copy that assumes the memory regions do not overlap because in practice the memory regions might overlap and WebAssembly has to define an exact behavior in that case. As @conrad-watt said in his first comment, languages that provide a version of memcpy that assumes non-overlapping memory regions make it undefined behavior to call that function with overlapping regions. That is not an option for WebAssembly because WebAssembly does not have undefined behavior.

conrad-watt commented 4 years ago

To be fair to @laughinghan, I think their comment posits that a non-overlapping version of memory.copy that does check for non-overlap and traps if that is not satisfied could still be faster than a general memory.copy that checks for non-overlap and chooses between fast path/slow path, because the latter could have problems with branch prediction. I'd still be surprised if this was an issue in real-world code.

CryZe commented 4 years ago

Well memcpy is really hot in Rust and the branch isn‘t super predictable as it gets called in so many different ways. So it might actually make a difference. I guess you would just have to define the behavior of what happens if it is indeed overlapping despite using the non-overlapping variant.

tlively commented 4 years ago

Oh, I see. Sorry for misunderstanding that. It would be interesting to get data on how common overlapping memcpy is in practice and whether a trap-on-overlap variant would be useful.

conrad-watt commented 4 years ago

If this did turn out to be a real perf concern, my preference would be that a non-overlapping-optimised variant appears in the semantics as a "hint" to memory.copy, similar to the alignment hints for load/store. So the Wasm-level semantics would still be the same, and there would still be a slow-path fallback, but an engine could see the hint and, if it wanted to optimise for it, inline the checks/use a different intrinsic/OOL.

Maybe such a "hinted variant" would have to be encoded as a separate instruction in the binary format either way, just because this proposal is so far along.

lars-t-hansen commented 4 years ago

I think that if we want a nonoverlapping variant of memory.copy, then we should have some data to back up the utility of it, and we should do this as a separate proposal, as we really need to ship this one. ie we should kick this issue up to the design repo, probably.

I probably agree with @conrad-watt that a hint is more useful than a trap-on-unaligned, and that hints could be overlapping/not + alignment of pointers. As usual, there is a memory index field in memory.copy that can be repurposed as a flag byte / indicator of additional information, and now that we are aware of this issue we can take that into account when we work on the multi-memory feature, which will need to use the memory index.

But if the purpose of this variant is performance, it would also be useful to determine (through experimentation) whether there's an edge to be had from trapping rather than hinting.

WebAssembly / bulk-memory-operations

Non-overlapping version of memory.copy? #121