Closed burrbull closed 2 years ago
Thank you so much for bringing this up. If I understand correctly. The cause is that the comparison functions a using TryInto
to convert the MHz or kHz to H (since to convert H, precision would be lost). This causes a lot more code to be generated than using an Into and promoting the values to u64. I actually, at one time, implemented some auto-promotion, but I wasn't happy with pulling in u64 arithmetic code. I also came to the conclusion that it's not really "the Rust way". I suspect that this increase in code size only the TryInto<Hertz> for Megahertz
implementation being pulled in, and not the calling of it.
There are implementations for conversion such as Into<Hertz<u64>> for Megahertz<u32>
. So I think you could do something like:
let clocks = rcc.cfgr.sysclk(Hertz(16_000_000u64)).pclk1(8.MHz()) .freeze();
I actually, at one time, implemented some auto-promotion, but I wasn't happy with pulling in u64 arithmetic code.
But you are happy with https://docs.rs/num-rational/0.4.0/src/num_rational/lib.rs.html#323-374 ? As I can see compiler is not happy.
You named your crate embedded
, but until it is zero-cost, it can't be embedded.
It's choice between pulling in TryInto code and u64 arithmetic. If I implement it how you suggest, someone would be upset about the pulling in of the u64 arithmetic even if they didn't use any u64.
"Zero-cost" means you don't pay for what you don't use. As I see it, you are using it. I think pulling in the u64 code even if it's not necessary is much less "zero-cost".
If there is a way to improve on num-rational
's comparison method without the u64, I'm all ears.
@eldruin @korken89 Any thoughts on this?
@burrbull Does the option I gave you above not work?
@burrbull Does the option I gave you above not work?
I don't need Into<Hertz<u64>>
. I need Into<Hertz<u32>>
.
The implementation here will optimize to single instruction for the u64
multiplication on all Cortex-M3 and up (UMULL), so there it would be very efficient.
For Cortex-M0+) I'm not sure how it would perform.
But overall, I'd say this to be better than a lot of if
/match
-code of num-rational
.
Math generally optimizes a lot better thanks to it being well defined.
@burrbull could you give it a compile test on a Cortex-M0 and see how much math-code it pulls in?
Hi, I did some comparison and it looks very promising: https://rust.godbolt.org/z/x8drzzc36
The jack-in-the-box is the __aeabi_lmul
, I need to benchmark it to see how much it takes, but since we are doing zero extended multiplication it never needs to pull in u64
arithmetic.
Hi, I did some comparison and it looks very promising: https://rust.godbolt.org/z/x8drzzc36 The jack-in-the-box is the
__aeabi_lmul
, I need to benchmark it to see how much it takes, but since we are doing zero extended multiplication it never needs to pull inu64
arithmetic.
Ideally all those calculations must be in compile time and in table above should be only 1448 (1836).
I tried to move all needed code from rational::Ratio
to fraction
(https://github.com/burrbull/embedded-time/tree/fraction2) and make as much const fn
as I can, but this situation have not changed fundamentally (only some saved code more).
I also tried to go different way and use const generics (https://github.com/stm32-rs/stm32f4xx-hal/blob/embedded-time2/src/time.rs), but result is even worst (looks like compiler often don't optimizes const generics at all).
Yeah, perfect transparency is difficult. The only way to do this that I know of is to perform settings in const generics and consts, where the final const config is fed to a "set all the registers correctly" function.
So you have:
const MY_CLOCK_CONFIG: ClockConfig = ...; // Holds the final register values
write_clock_config_to_regs(&mut reg1, &mut reg2, &mut reg3, MY_CLOCK_CONFIG);
I did some experiments with a system like that, and it worked with perfect inline-ing - but it was not that user friendly then. And it was difficult to get everything to happen in const when I tried it, this is probably easier now.
I did some experiments with a system like that, and it worked with perfect inline-ing - but it was not that user friendly then.
Yeah.
Already forgot about this: https://github.com/stm32-rs/stm32f1xx-hal/pull/337
Indeed, an approach like that is the only one I found that really inline-ed and optimized perfectly.
Though what you have in this PR will help to make embedded-time
more performant - but I don't think embedded-time
can solve this issue.
but since we are doing zero extended multiplication it never needs to pull in
u64
arithmetic
Oh, ok. I think I see. You're multiplying u32s as u64s, but the actual multiplication is still u32, with a u64 product that only gets compared. Is that accurate? If we were then multiplying/dividing the u64 product, that's the point that additional u64 arithmetic would get pulled in?
@PTaylor-us That is accurate! The compiler does see through the operation without issue and does what one would expect.
And you are correct, if you were to then use the resulting u64
in some arithmetic then true u64
math would be needed, but the comparison is simply to first compare the high 32 bits, and then the low 32 bits if the high bits were equal.
That's fantastic. Thanks, @korken89, for setting me straight.
Clippy is complaining: "cderive_hash_xor_eq". I'm assuming we need to explicitly define Hash. @burrbull, @korken89, either of you ever done that? I have not.
I believe we need an explicit Hash impl for Fraction.
Done
@PTaylor-us can you run the CI? I don't have rights to do it.
Ping @PTaylor-us
I've tried to use
embedded-time
forstm32f4xx-hal
(see https://github.com/stm32-rs/stm32f4xx-hal/pull/360), but I see problem. OnlyTryInto<Hertz>
is implemented forKilohertz
/Megahertz
. And compiler can't optimize it as expected.blinky-timer-irq
example fromstm32f4xx-hal
in the table. And code size in release mode. As you can see, this PR doesn't solve the problem, but makes it lesser.rust let clocks = rcc.cfgr.sysclk(16.MHz()) .pclk1(8.MHz()) .freeze();
rust let clocks = rcc.cfgr.sysclk(16_000_000.Hz()).pclk1(8.MHz()) .freeze();
rust let clocks = rcc.cfgr.sysclk(16.MHz()) .pclk1(8_000_000.Hz()).freeze();
rust let clocks = rcc.cfgr.sysclk(16_000_000.Hz()).pclk1(8_000_000.Hz()).freeze();