[no_std] support? - Githubissues

LawrenceEsswood commented 7 months ago

Seems to work fine without Box as both types are sized (which requires alloc):

fn transmute<A, B>(obj: A) -> B {
    use std::hint::black_box;

    // The layout of `DummyEnum` is approximately
    // DummyEnum {
    //     is_a_or_b: u8,
    //     data: usize,
    // }
    // Note that `data` is shared between `DummyEnum::A` and `DummyEnum::B`.
    // This should hopefully be more reliable than spamming the stack with a value and hoping the memory
    // is placed correctly by the compiler.
    enum DummyEnum<A, B> {
        A(Option<A>),
        B(Option<B>),
    }

    #[inline(never)]
    fn transmute_inner<A, B>(dummy: &mut DummyEnum<A, B>, obj: A) -> B {
        let DummyEnum::B(ref_to_b) = dummy else {
            unreachable!()
        };
        let ref_to_b = expand_mut(ref_to_b);
        *dummy = DummyEnum::A(Some(obj));
        black_box(dummy);

        ref_to_b.take().unwrap()
    }

    transmute_inner(black_box(&mut DummyEnum::B(None)), obj)
}

Creative0708 commented 7 months ago

The reason the types are wrapped in boxes is because of compiler optimizations that go with Option<T>. For example, bool and u8 have the same size, but Option<bool> is 1 byte and Option<u8> is 2 bytes. (From testing with compiler explorer, Option<bool> uses 2 to indicate None).

With an implementation without Box in DummyEnum, doing cve_rs::transmute::<bool, u8>(true) panics with called `Option::unwrap()` on a `None` value, while std::mem::transmute::<bool, u8>(true) gives 1. This is bad, because we don't want our memory-safe memory unsafeties to panic.

If there is a way to work around this, please let us know in another comment/issue/pull request! Closing for now as this implementation introduces bugs.

LawrenceEsswood commented 7 months ago

Ah, I see!

I think the correct solution is repr(C) as I think that guarantees the "tagged union" representation of an enum. This ensures your option is a tag followed by value, but then you then have to make sure it "follows" at the same offset.

The solution to that is to wrap again with a struct that contains a zero-length array of the other type so that the alignment of the two wrappers is the same:

https://godbolt.org/z/MKz93zjfK

Embedded systems could really do with memory-safe memory unsafety.

Bright-Shard commented 7 months ago

That's genius! I was working on a new implementation of transmute that exploits a different known compiler bug, but couldn't get it working because of misaligned options. It now works and reliably outperforms the standard library for some reason! xD

Will hopefully PR this soon.

Benchmarks (from an m1 macbook air)

Benchmark names: - `cve_rs transmute`: My new transmute impl - `cve_rs transmute_old`: The current cve-rs implementation of transmute - `cve_rs transmute_ref`: A version of my new transmute implementation that works with references (and should be faster?) - `std transmute`: The standard library's `core::mem::transmute` function Small transmutes on numbers: - `cve_rs transmute`: 425.03ps - `cve_rs transmute_old`: 15.736 ns - `cve_rs transmute_ref`: 423.70 ps - `std transmute`: 425.11 ps

Raw benchmark output

``` cve_rs transmute f32 -> i32 time: [424.06 ps 425.03 ps 426.27 ps] change: [+0.0111% +0.2321% +0.4599%] (p = 0.05 < 0.05) Change within noise threshold. Found 13 outliers among 100 measurements (13.00%) 4 (4.00%) high mild 9 (9.00%) high severe Benchmarking cve_rs transmute_old f32 -> i32: Collecting 100 samples in estimated 5.0000 s (318 cve_rs transmute_old f32 -> i32 time: [15.704 ns 15.736 ns 15.769 ns] change: [+0.0261% +0.2573% +0.4933%] (p = 0.03 < 0.05) Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild Benchmarking cve_rs transmute_ref f32 -> i32: Collecting 100 samples in estimated 5.0000 s (12B cve_rs transmute_ref f32 -> i32 time: [423.44 ps 423.70 ps 423.98 ps] change: [-0.1679% -0.0130% +0.1182%] (p = 0.86 > 0.05) No change in performance detected. Found 8 outliers among 100 measurements (8.00%) 4 (4.00%) high mild 4 (4.00%) high severe Benchmarking std transmute f32 -> i32: Collecting 100 samples in estimated 5.0000 s (12B iterat std transmute f32 -> i32 time: [424.19 ps 425.11 ps 426.25 ps] change: [+0.1116% +0.2966% +0.4888%] (p = 0.00 < 0.05) Change within noise threshold. Found 12 outliers among 100 measurements (12.00%) 5 (5.00%) high mild 7 (7.00%) high severe ```

Larger transmutes on slices: - `cve_rs transmute`: 1.0478 µs - `cve_rs transmute_old`: 1.5562 µs - `cve_rs transmute_ref`: 1.3612 µs (Note: I dereference the slice in the benchmark, which may cause Rust to copy it, which could make this slower than it should be) - `std transmute`: 1.3622 µs

Raw benchmark output

``` Benchmarking cve_rs transmute [f64; 1024] -> [u8; 8192]: Collecting 100 samples in estimated 5. cve_rs transmute [f64; 1024] -> [u8; 8192] time: [1.0464 µs 1.0478 µs 1.0502 µs] change: [-0.1385% +0.0703% +0.3048%] (p = 0.56 > 0.05) No change in performance detected. Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low mild 3 (3.00%) high mild 4 (4.00%) high severe Benchmarking cve_rs transmute_old [f64; 1024] -> [u8; 8192]: Collecting 100 samples in estimate cve_rs transmute_old [f64; 1024] -> [u8; 8192] time: [1.5546 µs 1.5562 µs 1.5580 µs] change: [+0.1917% +0.3941% +0.6279%] (p = 0.00 < 0.05) Change within noise threshold. Found 9 outliers among 100 measurements (9.00%) 7 (7.00%) high mild 2 (2.00%) high severe Benchmarking cve_rs transmute_ref [f64; 1024] -> [u8; 8192]: Collecting 100 samples in estimate cve_rs transmute_ref [f64; 1024] -> [u8; 8192] time: [1.3604 µs 1.3612 µs 1.3620 µs] change: [-0.2034% -0.0568% +0.0963%] (p = 0.47 > 0.05) No change in performance detected. Found 6 outliers among 100 measurements (6.00%) 5 (5.00%) high mild 1 (1.00%) high severe Benchmarking std transmute [f64; 1024] -> [u8; 8192]: Collecting 100 samples in estimated 5.006 std transmute [f64; 1024] -> [u8; 8192] time: [1.3604 µs 1.3622 µs 1.3650 µs] change: [-0.0725% +0.1093% +0.3168%] (p = 0.30 > 0.05) No change in performance detected. Found 8 outliers among 100 measurements (8.00%) 5 (5.00%) high mild 3 (3.00%) high severe ```

Note: These benchmarks don't include the changes from #26 - that might make the current implementation faster than these benches show as well.

dekrain commented 7 months ago

I wonder if it could also be achieved using a Cell, as UnsafeCell is excluded from niche optimizations, as it introduces a region of mutable sharable memory, which would interfere with the containing type. (MaybeUnit also prevents layout optimizations, as it is an union and may contain uninitialized data (as in the name), but getting the data out may not be possible).

Speykious / cve-rs

[no_std] support? #13