dimforge / nalgebra

Linear algebra library for Rust.
https://nalgebra.org
Apache License 2.0

Consider supporting non-Copy scalar types. #282

Open sebcrozet opened 7 years ago

sebcrozet commented 7 years ago

See the comments on reddit.

sebcrozet commented 7 years ago

According to the comment cited above, this would be useful in order to support types like:

BigFloats [...] BigInts, BigRationals, BigDecimals, etc. dual numbers

Crates implementing (some of) these include, for example, ramp, rust-gmp, and dual_num.

milibopp commented 7 years ago

Just thinking out loud here: in theory, one should be able to replace every copy by a move or a clone (whatever the lifetimes allow) and then relax the trait bounds from Copy to Clone, and it should just work, because the clones reduce to copies for Copy types.
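
For illustration, here is a minimal hypothetical sketch (not nalgebra code) of that relaxation on a single generic routine:

fn scale_copy<T: Copy + std::ops::Mul<Output = T>>(values: &mut [T], factor: T) {
    for x in values.iter_mut() {
        *x = *x * factor; // implicit copies of `*x` and `factor`
    }
}

fn scale_clone<T: Clone + std::ops::Mul<Output = T>>(values: &mut [T], factor: T) {
    for x in values.iter_mut() {
        *x = x.clone() * factor.clone(); // explicit clones, which compile down to copies when T: Copy
    }
}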

Of course, the implementations will end up being more verbose. The main obstacle would be any kind of unsafe code that relies on the scalar type being Copy. Do we have such code?

sebcrozet commented 7 years ago

Yes, that's the idea.

We do have some code that relies on the scalar type being Copy, though. I believe those places are limited to uses of the std::mem module, especially mem::uninitialized(). In particular, matrix creation with uninitialized components is used commonly throughout the library, so we have to be careful with non-Copy types in case they have a Drop implementation.
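
To make that concrete, here is a small sketch (a hypothetical helper, not nalgebra's code, using the modern MaybeUninit API rather than mem::uninitialized()) of filling uninitialized storage with a possibly non-Copy scalar. A plain assignment into an uninitialized slot would first drop the garbage value, which is undefined behavior for a type with a Drop impl, so the value has to be moved in without dropping:

use std::mem::MaybeUninit;

// Hypothetical helper: build an array by cloning `value` into uninitialized slots.
fn filled<T: Clone, const N: usize>(value: T) -> [T; N] {
    // An array of `MaybeUninit` needs no initialization.
    let mut buf: [MaybeUninit<T>; N] = unsafe { MaybeUninit::uninit().assume_init() };
    for slot in buf.iter_mut() {
        // `write` moves the clone in without reading or dropping the
        // uninitialized bytes; assigning through a `&mut T` would be UB here.
        slot.write(value.clone());
    }
    // SAFETY: every slot was initialized in the loop above.
    buf.map(|slot| unsafe { slot.assume_init() })
}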

StevenDoesStuffs commented 5 years ago

Any updates on this?

sebcrozet commented 5 years ago

@StevenDoesStuffs No, this requires a massive amount of work, so we haven't had the time to explore it yet.

aweinstock314 commented 4 years ago

I made a minimal-ish test case that gets different debug-mode codegen (including an explicit call to <usize as Clone>::clone) for T: Copy vs. T: Clone bounds. This could probably be addressed via an inlining pass in rustc.

The closest rustc issue under the query `is:issue is:open Clone inline label:I-slow` seems to be https://github.com/rust-lang/rust/issues/37538, but that's about the release-mode inlining threshold. The closest rustc issue under the query `is:issue is:open Clone debug label:I-slow` seems to be https://github.com/rust-lang/rust/issues/41160, but that's also about release-mode codegen.

The issue https://github.com/rust-lang/rust/issues/59375 (found via `is:issue is:open inline debug label:I-slow`) is more relevant in that it's about debug-mode performance, but it's about PartialEq::eq.

If using Clone bounds causes a debug-mode performance regression, but has no release mode impact, will it still be considered? If not, I'll try making a rustc patch that at least inlines clone in this example before trying the Copy -> Clone refactoring.

Example:

fn memcpyish_clone<T: Clone>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = a.clone();
    }
}

fn memcpyish_copy<T: Copy>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = *a;
    }
}

#[no_mangle]
#[inline(never)]
fn memcpyish_clone_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_clone::<usize>(x, y)
}

#[no_mangle]
#[inline(never)]
fn memcpyish_copy_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_copy::<usize>(x, y)
}

fn main() {
    let x = [1,2,3];
    let mut y = [0,0,0];
    let mut z = [0,0,0];
    memcpyish_copy_usize(&x, &mut y);
    memcpyish_clone_usize(&x, &mut z);
}
/*
$ rustc copy_debug_codegen_20191121.rs
$ objdump -d copy_debug_codegen_20191121 | grep '<_ZN[^>]*memcpyish_[^>]*>:' -A53
$ r2 copy_debug_codegen_20191121
[0x000062b0]> pd 5 @ sym.memcpyish_clone_usize
            ;-- memcpyish_clone_usize:
            0x00006590      50             pushq %rax
            0x00006591      e84afeffff     callq sym.copy_debug_codegen_20191121::memcpyish_clone::hf5036453dec78982
            0x00006596      58             popq %rax
            0x00006597      c3             retq
            0x00006598      0f1f84000000.  nopl 0(%rax, %rax)

The block for _ZN27copy_debug_codegen_2019112114memcpyish_copy17h08e0b47cae3dd97dE is shorter, and does not contain a call to _ZN4core5clone5impls54_$LT$impl$u20$core..clone..Clone$u20$for$u20$usize$GT$5clone17h80760c01c3acfef1E

With -O, rustc recognizes that the optimized function bodies are identical, and emits two symbols for the same block of code:

$ rustc -O copy_debug_codegen_20191121.rs
$ r2 copy_debug_codegen_20191121
[0x000063f0]> s sym.memcpyish_copy_usize
[0x00006520]> pd 14
            ;-- memcpyish_clone_usize:
            ;-- memcpyish_copy_usize:
            0x00006520      4889f0         movq %rsi, %rax
            0x00006523      4839ce         cmpq %rcx, %rsi
            0x00006526      480f47c1       cmovaq %rcx, %rax
            0x0000652a      4885c0         testq %rax, %rax
        ,=< 0x0000652d      7418           je 0x6547
        |   0x0000652f      50             pushq %rax
        |   0x00006530      4889fe         movq %rdi, %rsi
        |   0x00006533      48c1e003       shlq $3, %rax
        |   0x00006537      4889d7         movq %rdx, %rdi
        |   0x0000653a      4889c2         movq %rax, %rdx
        |   0x0000653d      ff15edc72200   callq 0x00006543            ; reloc.memcpy ; [0x232d30:8]=0
        |   0x00006543      4883c408       addq $8, %rsp
        `-> 0x00006547      c3             retq
            0x00006548      0f1f84000000.  nopl 0(%rax, %rax)

Interestingly, even in debug mode, `-C inline-threshold=` does change the binary (in that memcpyish_{copy,clone}_usize contain the actual implementation instead of a 4-instruction stub), but increasing it further doesn't seem to inline Clone
$ rustc -C inline-threshold=5 copy_debug_codegen_20191121.rs
*/
sebcrozet commented 4 years ago

@aweinstock314

If using Clone bounds causes a debug-mode performance regression, but has no release mode impact, will it still be considered?

It will be considered only if the regression is not too significant (up to 10%) in debug mode.

However, we can accept a regression if the support of non-Copy types requires a feature to be enabled. That way the user will have to enable this feature explicitly before using non-Copy types, and thus explicitly accept that this will cause a performance regression in debug mode. Existing users of Copy types won't be affected by the regression because they won't have the feature enabled.
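
As a purely hypothetical sketch of what such a gate could look like (the feature name and trait are illustrative, not an actual nalgebra API, and the feature would have to be declared in Cargo.toml), the scalar bound itself could be switched by the feature:

#[cfg(not(feature = "non-copy-scalars"))]
pub trait ScalarBound: Copy + PartialEq + std::fmt::Debug {}
#[cfg(not(feature = "non-copy-scalars"))]
impl<T: Copy + PartialEq + std::fmt::Debug> ScalarBound for T {}

// With the feature enabled, Clone is enough; only users who opt in pay
// whatever debug-mode cost the explicit clones introduce.
#[cfg(feature = "non-copy-scalars")]
pub trait ScalarBound: Clone + PartialEq + std::fmt::Debug {}
#[cfg(feature = "non-copy-scalars")]
impl<T: Clone + PartialEq + std::fmt::Debug> ScalarBound for T {}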

If not, I'll try making a rustc patch that at least inlines clone in this example before trying the Copy -> Clone refactoring.

Yeah, it would be great if, when the type is Copy, its .clone() were automatically inlined. Though I'm not sure this is always desirable for debugging purposes.

I guess an alternative could be to have a trait like the following, and use .inlined_clone() instead of .clone():

trait InlinedClone: Clone {
    fn inlined_clone(&self) -> Self;
}

impl<T: Clone> InlinedClone for T {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        self.clone()
    }
}

impl<T: Copy> InlinedClone for T {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        *self
    }
}

My guess is that #[inline(always)] also works in debug mode, so Copy types will use the copy instead of the clone.

Of course this is not possible without specialization (the two impls overlap), which is why it may not be the best option.

sebcrozet commented 4 years ago

Yet another option would be to add the inlined_clone method to the existing Scalar trait directly:

use std::any::{Any, TypeId};
use std::fmt::Debug;

pub trait Scalar: PartialEq + Debug + Any {
    /// Tests if `Self` is the same type as `T`.
    ///
    /// Typically used to test if `Self` is an f32 or an f64 with `N::is::<f32>()`.
    #[inline]
    fn is<T: Scalar>() -> bool {
        TypeId::of::<Self>() == TypeId::of::<T>()
    }

    fn inlined_clone(&self) -> Self;
}

impl<T: Copy + PartialEq + Debug + Any> Scalar for T {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        *self
    }
}

This should still allow the user to implement Scalar for non-Copy types, because I think an impl like the following won't generate an overlapping-impl error:

impl Scalar for MyNonCopyType {
    #[inline(always)]
    fn inlined_clone(&self) -> Self {
        self.clone()
    }
}
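
For context, the library's generic code would then call inlined_clone wherever it currently relies on an implicit copy, e.g. (illustrative only, building on the trait sketched above):

fn fill_with<N: Scalar>(buffer: &mut [N], value: N) {
    for slot in buffer.iter_mut() {
        // For Copy scalars this is a plain copy thanks to #[inline(always)];
        // for non-Copy scalars it falls back to an actual clone().
        *slot = value.inlined_clone();
    }
}
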
Ralith commented 4 years ago

However, we can accept a regression if the support of non-Copy types requires a feature to be enabled.

How would that be accomplished without risking breaking downstream code when the feature is enabled?

aweinstock314 commented 4 years ago

I tweaked the codegen demo to add the InlinedClone trait approach; it generates a handful of extra movs relative to the Copy version (and a jump to the immediately following basic block, the eb00, which ought to be a no-op), but no call to clone.
I'm optimistic that this won't cause much of a debug-mode regression, so I'll try implementing the approach of adding inlined_clone to Scalar.

If I'm understanding correctly, adding inlined_clone to Scalar shouldn't break downstream crates because all downstream crates are currently using Copy Scalars, which will be covered by the blanket impl. Removing inlined_clone if a rustc pass gets accepted for this case would be a breaking change. Is this worth using #[doc(hidden)] or #[deprecated] in anticipation of a proper fix upstream, or is that long-term enough that using a major version to remove it is fine?

It's worth noting that MaybeUninit<T> is only Clone, not Copy, even if T: Copy, so a newtype wrapper around MaybeUninit<T> to make it Scalar (to address #556) seems to depend on this. A draft of that approach is at https://github.com/aweinstock314/nalgebra/commit/7d040b7d13e6fb6e143f1c14987bcc2f48dc585b, but it's not wired up to anything because of this dependency.

Codegen demo follows:

trait InlinedClone {
    fn inlined_clone(&self) -> Self;
}

impl<T: Copy> InlinedClone for T {
    #[inline(always)]
    fn inlined_clone(&self) -> T {
        *self
    }
}

fn memcpyish_inlined_clone<T: InlinedClone>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = a.inlined_clone();
    }
}

fn memcpyish_clone<T: Clone>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = a.clone();
    }
}

fn memcpyish_copy<T: Copy>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = *a;
    }
}

#[no_mangle]
#[inline(never)]
fn memcpyish_inlined_clone_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_inlined_clone::<usize>(x, y)
}

#[no_mangle]
#[inline(never)]
fn memcpyish_clone_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_clone::<usize>(x, y)
}

#[no_mangle]
#[inline(never)]
fn memcpyish_copy_usize(x: &[usize], y: &mut [usize]) {
    memcpyish_copy::<usize>(x, y)
}

fn main() {
    let x = [1,2,3];
    let mut y1 = [0,0,0];
    let mut y2 = [0,0,0];
    let mut y3 = [0,0,0];
    memcpyish_copy_usize(&x, &mut y1);
    memcpyish_clone_usize(&x, &mut y2);
    memcpyish_inlined_clone_usize(&x, &mut y3);
}
/*
ending part of memcpyish_inlined_clone_usize:
|    |  :   ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_inlined_clone::h0c8f347077976026 (0x6f6e)
|    `----> 0x00006f7a      488b8424c800.  movq local_c8h, %rax        ; [0xc8:8]=0
|       :   0x00006f82      488b8c24d000.  movq local_d0h, %rcx        ; [0xd0:8]=0x315b4 segment_end.LOAD0
|       :   0x00006f8a      488b00         movq 0(%rax), %rax
|       :   0x00006f8d      48894c2410     movq %rcx, local_10h
|       :   0x00006f92      4889442408     movq %rax, local_8h
|      ,==< 0x00006f97      eb00           jmp 0x6f99
|      |:   ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_inlined_clone::h0c8f347077976026 (0x6f97)
|      `--> 0x00006f99      488b442410     movq local_10h, %rax        ; [0x10:8]=0x1003e0003
|       :   0x00006f9e      488b4c2408     movq local_8h, %rcx         ; [0x8:8]=0
|       :   0x00006fa3      488908         movq %rcx, 0(%rax)
\       `=< 0x00006fa6      eb92           jmp 0x6f3a
ending part of memcpyish_copy_usize:
|    |  :   ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_copy::h32096c80aa649f39 (0x712e)
|    `----> 0x0000713a      488b8424b800.  movq local_b8h, %rax        ; [0xb8:8]=0
|       :   0x00007142      488b8c24c000.  movq local_c0h, %rcx        ; [0xc0:8]=0
|       :   0x0000714a      488b00         movq 0(%rax), %rax
|       :   0x0000714d      488901         movq %rax, 0(%rcx)
\       `=< 0x00007150      eba8           jmp 0x70fa
ending part of memcpyish_clone_usize:
|    |  :   ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_clone::h36d27fecff3c0f2c (0x704e)
|    `----> 0x0000705a      488bbc24c800.  movq local_c8h, %rdi        ; [0xc8:8]=0
|       :   0x00007062      488b8424d000.  movq local_d0h, %rax        ; [0xd0:8]=0x315b4 segment_end.LOAD0
|       :   0x0000706a      4889442410     movq %rax, local_10h
|       :   0x0000706f      e8acf6ffff     callq sym.core::clone::impls::__impl_core::clone::Clone_for_usize_::clone::h3d9a1e411c34f5b2
|       :   0x00007074      4889442408     movq %rax, local_8h
|      ,==< 0x00007079      eb00           jmp 0x707b
|      |:   ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_clone::h36d27fecff3c0f2c (0x7079)
|      `--> 0x0000707b      488b442410     movq local_10h, %rax        ; [0x10:8]=0x1003e0003
|       :   0x00007080      488b4c2408     movq local_8h, %rcx         ; [0x8:8]=0
|       :   0x00007085      488908         movq %rcx, 0(%rax)
\       `=< 0x00007088      eb90           jmp 0x701a
*/
Ralith commented 4 years ago

Is this worth using #[doc(hidden)] or #[deprecated] in anticipation of a proper fix upstream, or is that long-term enough that using a major version to remove it is fine?

Is there a specific need for external code to call it? If not, #[doc(hidden)] seems like the safest option. In any case, nalgebra breaks compat often enough, and rustc moves slowly enough, that a deprecation cycle should be fine if necessary.

aweinstock314 commented 4 years ago

External code won't need to call it, but it will need to impl it (for Clone Scalars), unless we make use of type-level bools and associated-type injectivity to provide a blanket impl for Clone (which is ergonomically better for downstream crates, but I'd have to test that the pattern actually compiles away in debug mode). I guess #[doc(hidden)] plus a blanket Clone impl is the better approach (relative to later deprecation) if and only if that trait pattern compiles away in debug.
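
For reference, a minimal sketch of that kind of pattern (all names hypothetical, not a proposed nalgebra API): the two blanket impls don't overlap because they instantiate the trait with different marker types, and each scalar selects its strategy through an associated type.

use std::fmt::Debug;

struct ViaCopy;
struct ViaClone;

// Marker-parameterized duplication: the impls below don't overlap because
// they use different marker types as the trait's type parameter.
trait DuplicateAs<Marker>: Sized {
    fn duplicate(&self) -> Self;
}

impl<T: Copy> DuplicateAs<ViaCopy> for T {
    #[inline(always)]
    fn duplicate(&self) -> Self {
        *self
    }
}

impl<T: Clone> DuplicateAs<ViaClone> for T {
    #[inline(always)]
    fn duplicate(&self) -> Self {
        self.clone()
    }
}

// Each scalar picks its strategy via an associated type.
trait MyScalar: Clone + Debug {
    type Strategy;

    fn inlined_clone(&self) -> Self
    where
        Self: DuplicateAs<Self::Strategy>,
    {
        self.duplicate()
    }
}

impl MyScalar for f64 {
    type Strategy = ViaCopy;
}

#[derive(Clone, Debug)]
struct BigNumber(Vec<u64>);

impl MyScalar for BigNumber {
    type Strategy = ViaClone;
}

fn main() {
    let a = 2.5_f64.inlined_clone(); // copy path
    let b = BigNumber(vec![1, 2, 3]).inlined_clone(); // clone path
    println!("{:?} {:?}", a, b);
}

Whether this extra indirection really compiles away in debug mode is exactly the kind of thing that would need to be measured.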

Ralith commented 4 years ago

External code won't need to call it, but it will need to impl it

Ah, good point. In that case I'd just plan for a deprecation cycle; a blanket impl sounds like a lot of complexity at best.

aweinstock314 commented 4 years ago

I've written https://github.com/rustsim/nalgebra/pull/684. What is the recommended way to benchmark debug performance? IIRC cargo bench only runs release-mode benchmarks.

sebcrozet commented 4 years ago

@aweinstock314 You can benchmark in "debug mode" by disabling optimizations in bench mode. Simply add this to the Cargo.toml of nalgebra:

[profile.bench]
opt-level = 0
lto = false
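
With that profile override in place, cargo bench will compile the benchmarks without optimizations, which should give a reasonable approximation of debug-mode performance.
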
aweinstock314 commented 3 years ago

Seems to be addressed by https://github.com/dimforge/nalgebra/pull/949 and https://github.com/dimforge/nalgebra/pull/962 .