sebcrozet opened this issue 7 years ago
Just thinking out loud here: in theory, one should be able to replace every copy by a move or clone (whatever lifetimes allow) and then relax the trait bounds from `Copy` to `Clone`, and it should just work, because the clones will reduce to copies for `Copy` types.
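A minimal before/after sketch of the idea (illustrative only; not nalgebra's actual code):

fn first_copy<T: Copy>(v: &[T]) -> T {
    v[0] // implicit copy
}

fn first_clone<T: Clone>(v: &[T]) -> T {
    v[0].clone() // explicit clone; for Copy types this should reduce to a copy
}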
Of course, the implementations will end up being more verbose. The exception is any kind of unsafe code that relies on types being `Copy`. Do we have such code?
Yes, that's the idea.
We do have some code that relies on the scalar type being `Copy`, though. I believe it is limited to places where the `std::mem` module is used, especially uses of `mem::uninitialized()`. In particular, matrix creation with non-initialized components is commonly used throughout the library, so we have to be careful when non-`Copy` types are considered, in case they have a `Drop` implementation.
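A minimal sketch of the hazard (hypothetical helper; not nalgebra's actual internals):

use std::ptr;

// Move `value` into a slot that currently holds uninitialized memory.
// SAFETY: `slot` must be valid for writes, and its old contents must
// never be read or dropped.
unsafe fn init_slot<T>(slot: *mut T, value: T) {
    // WRONG for types with a Drop impl: `*slot = value;` would first drop
    // the old (uninitialized) contents, which is undefined behavior.
    // Correct: move the value in without touching the old bytes.
    ptr::write(slot, value);
}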
Any updates on this?
@StevenDoesStuffs No, this requires a massive amount of work, so we haven't had the time to explore it yet.
I made a minimal-ish test case that gets different debug-mode codegen (including an explicit call to `<usize as Clone>::clone`) for `T: Copy` vs. `T: Clone` bounds. This could probably be addressed via an inlining pass in rustc.
The closest rustc issue under the query `is:issue is:open Clone inline label:I-slow` seems to be https://github.com/rust-lang/rust/issues/37538, but that's about the release-mode inlining threshold. The closest rustc issue under the query `is:issue is:open Clone debug label:I-slow` seems to be https://github.com/rust-lang/rust/issues/41160, but that's also about release-mode codegen. The issue https://github.com/rust-lang/rust/issues/59375 (found via `is:issue is:open inline debug label:I-slow`) is more relevant in that it's about debug-mode performance, but it's about `PartialEq::eq`.
If using `Clone` bounds causes a debug-mode performance regression, but has no release-mode impact, will it still be considered? If not, I'll try making a rustc patch that at least inlines clone in this example before trying the `Copy` -> `Clone` refactoring.
Example:
fn memcpyish_clone<T: Clone>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = a.clone();
    }
}

fn memcpyish_copy<T: Copy>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = *a;
    }
}

#[no_mangle]
#[inline(never)]
fn memcpyish_clone_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_clone::<usize>(x, y)
}

#[no_mangle]
#[inline(never)]
fn memcpyish_copy_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_copy::<usize>(x, y)
}

fn main() {
    let x = [1, 2, 3];
    let mut y = [0, 0, 0];
    let mut z = [0, 0, 0];
    memcpyish_copy_usize(&x, &mut y);
    memcpyish_clone_usize(&x, &mut z);
}
/*
$ rustc copy_debug_codegen_20191121.rs
$ objdump -d copy_debug_codegen_20191121 | grep '<_ZN[^>]*memcpyish_[^>]*>:' -A53
$ r2 copy_debug_codegen_20191121
[0x000062b0]> pd 5 @ sym.memcpyish_clone_usize
;-- memcpyish_clone_usize:
0x00006590 50 pushq %rax
0x00006591 e84afeffff callq sym.copy_debug_codegen_20191121::memcpyish_clone::hf5036453dec78982
0x00006596 58 popq %rax
0x00006597 c3 retq
0x00006598 0f1f84000000. nopl 0(%rax, %rax)
The block for _ZN27copy_debug_codegen_2019112114memcpyish_copy17h08e0b47cae3dd97dE is shorter, and does not contain a call to _ZN4core5clone5impls54_$LT$impl$u20$core..clone..Clone$u20$for$u20$usize$GT$5clone17h80760c01c3acfef1E
With -O, rustc recognizes that the optimized function bodies are identical, and emits two symbols for the same block of code:
$ rustc -O copy_debug_codegen_20191121.rs
$ r2 copy_debug_codegen_20191121
[0x000063f0]> s sym.memcpyish_copy_usize
[0x00006520]> pd 14
;-- memcpyish_clone_usize:
;-- memcpyish_copy_usize:
0x00006520 4889f0 movq %rsi, %rax
0x00006523 4839ce cmpq %rcx, %rsi
0x00006526 480f47c1 cmovaq %rcx, %rax
0x0000652a 4885c0 testq %rax, %rax
,=< 0x0000652d 7418 je 0x6547
| 0x0000652f 50 pushq %rax
| 0x00006530 4889fe movq %rdi, %rsi
| 0x00006533 48c1e003 shlq $3, %rax
| 0x00006537 4889d7 movq %rdx, %rdi
| 0x0000653a 4889c2 movq %rax, %rdx
| 0x0000653d ff15edc72200 callq 0x00006543 ; reloc.memcpy ; [0x232d30:8]=0
| 0x00006543 4883c408 addq $8, %rsp
`-> 0x00006547 c3 retq
0x00006548 0f1f84000000. nopl 0(%rax, %rax)
Interestingly, even in debug mode, `-C inline-threshold=` does change the binary (in that memcpyish_{copy,clone}_usize contain the actual implementation instead of a 4-instruction stub), but increasing it further doesn't seem to inline `Clone`:
$ rustc -C inline-threshold=5 copy_debug_codegen_20191121.rs
*/
@aweinstock314
If using `Clone` bounds causes a debug-mode performance regression, but has no release-mode impact, will it still be considered?
It will be considered only if the regression is not too significant (up to 10%) in debug mode.
However, we can accept a regression if the support of non-`Copy` types requires a feature to be enabled. That way the user will have to enable this feature explicitly before using non-`Copy` types, and thus explicitly accept that this will cause a performance regression in debug mode. Existing users of `Copy` types won't be affected by the regression because they won't have the feature enabled.
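One possible mechanism (a sketch with a hypothetical feature name; not a concrete plan):

// Select the scalar bound at compile time, so Copy-only users keep the
// current codegen. The feature name "non-copy-scalars" is invented here.
#[cfg(not(feature = "non-copy-scalars"))]
mod scalar_bound {
    pub trait ScalarBound: Copy {}
    impl<T: Copy> ScalarBound for T {}
}

#[cfg(feature = "non-copy-scalars")]
mod scalar_bound {
    pub trait ScalarBound: Clone {}
    impl<T: Clone> ScalarBound for T {}
}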
If not, I'll try making a rustc patch that at least inlines clone in this example before trying the `Copy` -> `Clone` refactoring.
Yeah, it would be great if, when the type is `Copy`, its `.clone()` were automatically inlined. Though I'm not sure this is always desirable for debugging purposes.
I guess an alternative could be to have a trait like the following, and use `.inlined_clone()` instead of `.clone()`:
trait InlinedClone: Clone {
    fn inlined_clone(&self) -> Self;
}

impl<T: Clone> InlinedClone for T {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        self.clone()
    }
}

impl<T: Copy> InlinedClone for T {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        *self
    }
}
My guess is that `#[inline(always)]` also works in debug mode, so a `Copy` type will use the copy instead of the clone. Of course, this is not possible without specialization (the two impls overlap), which is why it may not be the best option.
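For reference, a sketch of how the overlapping impls could be written on nightly with the unstable specialization feature (unstable and incomplete, so not a real option here):

#![feature(specialization)]

trait InlinedClone: Clone {
    fn inlined_clone(&self) -> Self;
}

// General case: fall back to a real clone call.
impl<T: Clone> InlinedClone for T {
    #[inline(always)]
    default fn inlined_clone(&self) -> Self {
        self.clone()
    }
}

// Copy implies Clone, so this impl is strictly more specific and may
// override the `default` method above.
impl<T: Copy> InlinedClone for T {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        *self
    }
}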
Yet another option would be to add the `inlined_clone` method to the existing `Scalar` trait directly:
pub trait Scalar: PartialEq + Debug + Any {
    /// Tests if `Self` is the same as the type `T`.
    ///
    /// Typically used to test whether `Self` is an `f32` or an `f64` with `N::is::<f32>()`.
    #[inline]
    fn is<T: Scalar>() -> bool {
        TypeId::of::<Self>() == TypeId::of::<T>()
    }

    fn inlined_clone(&self) -> Self;
}
impl<T: Copy + PartialEq + Debug + Any> Scalar for T {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        *self
    }
}
This should still allow the user to implement `Scalar` for non-`Copy` types, because I think an impl like the following won't generate an overlapping-impl error:
impl Scalar for MyNonCopyType {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        self.clone()
    }
}
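For illustration, a hypothetical call site (the function and its bounds are invented here): generic code would call `.inlined_clone()` instead of `.clone()`, so `Copy` scalars should compile to a plain copy even in debug builds:

// Invented example: sum a non-empty slice of scalars.
fn sum<N: Scalar + std::ops::Add<Output = N>>(v: &[N]) -> N {
    let mut acc = v[0].inlined_clone();
    for x in &v[1..] {
        acc = acc + x.inlined_clone();
    }
    acc
}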
However, we can accept a regression if the support of non-`Copy` types requires a feature to be enabled.
How would that be accomplished without risking breaking downstream code when the feature is enabled?
I tweaked the codegen demo to add the `InlinedClone` trait approach; it generates a handful of extra `mov`s relative to the `Copy` version (and a jump to a sequential basic block, the `eb00`, which ought to be a no-op), but no call to `clone`.
I'm optimistic that this won't cause much of a debug-mode regression, so I'll try implementing the approach of adding `inlined_clone` to `Scalar`.
If I'm understanding correctly, adding `inlined_clone` to `Scalar` shouldn't break downstream crates, because all downstream crates are currently using `Copy` `Scalar`s, which will be covered by the blanket impl. Removing `inlined_clone` if a rustc pass gets accepted for this case would be a breaking change. Is this worth using `#[doc(hidden)]` or `#[deprecated]` in anticipation of a proper fix upstream, or is that long-term enough that using a major version to remove it is fine?
It's worth noting that `MaybeUninit<T>` is only `Clone`, not `Copy`, even if `T: Copy`, so a newtype wrapper around `MaybeUninit<T>` to make it `Scalar` (to address #556) seems to depend on this. A draft of that approach is at https://github.com/aweinstock314/nalgebra/commit/7d040b7d13e6fb6e143f1c14987bcc2f48dc585b, but it's not wired up to anything because of this dependency.
Codegen demo follows:
trait InlinedClone {
    fn inlined_clone(&self) -> Self;
}

impl<T: Copy> InlinedClone for T {
    #[inline(always)]
    fn inlined_clone(&self) -> T {
        *self
    }
}
fn memcpyish_inlined_clone<T: InlinedClone>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = a.inlined_clone();
    }
}

fn memcpyish_clone<T: Clone>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = a.clone();
    }
}

fn memcpyish_copy<T: Copy>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = *a;
    }
}

#[no_mangle]
#[inline(never)]
fn memcpyish_inlined_clone_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_inlined_clone::<usize>(x, y)
}

#[no_mangle]
#[inline(never)]
fn memcpyish_clone_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_clone::<usize>(x, y)
}

#[no_mangle]
#[inline(never)]
fn memcpyish_copy_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_copy::<usize>(x, y)
}

fn main() {
    let x = [1, 2, 3];
    let mut y1 = [0, 0, 0];
    let mut y2 = [0, 0, 0];
    let mut y3 = [0, 0, 0];
    memcpyish_copy_usize(&x, &mut y1);
    memcpyish_clone_usize(&x, &mut y2);
    memcpyish_inlined_clone_usize(&x, &mut y3);
}
/*
ending part of memcpyish_inlined_clone_usize:
| | : ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_inlined_clone::h0c8f347077976026 (0x6f6e)
| `----> 0x00006f7a 488b8424c800. movq local_c8h, %rax ; [0xc8:8]=0
| : 0x00006f82 488b8c24d000. movq local_d0h, %rcx ; [0xd0:8]=0x315b4 segment_end.LOAD0
| : 0x00006f8a 488b00 movq 0(%rax), %rax
| : 0x00006f8d 48894c2410 movq %rcx, local_10h
| : 0x00006f92 4889442408 movq %rax, local_8h
| ,==< 0x00006f97 eb00 jmp 0x6f99
| |: ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_inlined_clone::h0c8f347077976026 (0x6f97)
| `--> 0x00006f99 488b442410 movq local_10h, %rax ; [0x10:8]=0x1003e0003
| : 0x00006f9e 488b4c2408 movq local_8h, %rcx ; [0x8:8]=0
| : 0x00006fa3 488908 movq %rcx, 0(%rax)
\ `=< 0x00006fa6 eb92 jmp 0x6f3a
ending part of memcpyish_copy_usize:
| | : ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_copy::h32096c80aa649f39 (0x712e)
| `----> 0x0000713a 488b8424b800. movq local_b8h, %rax ; [0xb8:8]=0
| : 0x00007142 488b8c24c000. movq local_c0h, %rcx ; [0xc0:8]=0
| : 0x0000714a 488b00 movq 0(%rax), %rax
| : 0x0000714d 488901 movq %rax, 0(%rcx)
\ `=< 0x00007150 eba8 jmp 0x70fa
ending part of memcpyish_clone_usize:
| | : ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_clone::h36d27fecff3c0f2c (0x704e)
| `----> 0x0000705a 488bbc24c800. movq local_c8h, %rdi ; [0xc8:8]=0
| : 0x00007062 488b8424d000. movq local_d0h, %rax ; [0xd0:8]=0x315b4 segment_end.LOAD0
| : 0x0000706a 4889442410 movq %rax, local_10h
| : 0x0000706f e8acf6ffff callq sym.core::clone::impls::__impl_core::clone::Clone_for_usize_::clone::h3d9a1e411c34f5b2
| : 0x00007074 4889442408 movq %rax, local_8h
| ,==< 0x00007079 eb00 jmp 0x707b
| |: ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_clone::h36d27fecff3c0f2c (0x7079)
| `--> 0x0000707b 488b442410 movq local_10h, %rax ; [0x10:8]=0x1003e0003
| : 0x00007080 488b4c2408 movq local_8h, %rcx ; [0x8:8]=0
| : 0x00007085 488908 movq %rcx, 0(%rax)
\ `=< 0x00007088 eb90 jmp 0x701a
*/
Is this worth using `#[doc(hidden)]` or `#[deprecated]` in anticipation of a proper fix upstream, or is that long-term enough that using a major version to remove it is fine?
Is there a specific need for external code to call it? If not, `#[doc(hidden)]` seems like the safest option. In any case, nalgebra breaks compat often enough, and rustc moves slowly enough, that a deprecation cycle should be fine if necessary.
External code won't need to call it, but it will need to impl it (for `Clone` `Scalar`s), unless we make use of type-level bools and associated-type injectivity to provide a blanket impl for `Clone` (which is ergonomically better for downstream crates, but I'd have to test that the pattern compiles away in debug mode). I guess `#[doc(hidden)]` plus a blanket `Clone` impl is the better approach (relative to later deprecation) if and only if that trait pattern compiles away in debug mode.
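One hypothetical encoding of that type-level dispatch (all names invented; whether it compiles away in debug mode is exactly the open question):

use std::fmt::Debug;

// Marker types acting as type-level booleans.
pub struct ByCopy;
pub struct ByClone;

// Helper trait parameterized by the marker, so the two blanket impls
// below do not overlap.
pub trait InlinedCloneVia<Kind> {
    fn inlined_clone_via(&self) -> Self;
}

impl<T: Copy> InlinedCloneVia<ByCopy> for T {
    #[inline(always)]
    fn inlined_clone_via(&self) -> Self {
        *self
    }
}

impl<T: Clone> InlinedCloneVia<ByClone> for T {
    #[inline(always)]
    fn inlined_clone_via(&self) -> Self {
        self.clone()
    }
}

// Each scalar picks its marker once; generic code dispatches through it.
pub trait Scalar: Clone + PartialEq + Debug + 'static {
    type CloneKind;
}

impl Scalar for f64 {
    type CloneKind = ByCopy;
}

fn inlined_clone<T>(x: &T) -> T
where
    T: Scalar + InlinedCloneVia<<T as Scalar>::CloneKind>,
{
    x.inlined_clone_via()
}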
External code won't need to call it, but it will need to impl it
Ah, good point. In that case I'd just plan for a deprecation cycle; a blanket impl sounds like a lot of complexity at best.
I've written https://github.com/rustsim/nalgebra/pull/684. What is the recommended way to benchmark debug performance? IIRC `cargo bench` only does release-mode benchmarks.
@aweinstock314 You can benchmark in "debug mode" by disabling optimizations in bench mode. Simply add this to nalgebra's `Cargo.toml`:
[profile.bench]
opt-level = 0
lto = false
Seems to be addressed by https://github.com/dimforge/nalgebra/pull/949 and https://github.com/dimforge/nalgebra/pull/962.
See the comments on Reddit.