FluenTech / embedded-time

Time(ing) library (Instant/Duration/Clock/Timer/Period/Frequency) for bare-metal embedded systems
Apache License 2.0
87 stars 17 forks source link

Manual Ord & Eq for Fraction #111

Closed burrbull closed 2 years ago

burrbull commented 2 years ago

I've tried to use embedded-time for stm32f4xx-hal (see https://github.com/stm32-rs/stm32f4xx-hal/pull/360), but I see problem. Only TryInto<Hertz> is implemented for Kilohertz/Megahertz. And compiler can't optimize it as expected.

There is part of blinky-timer-irq example from stm32f4xx-hal in the table. And code size in release mode. As you can see, this PR doesn't solve the problem, but makes it lesser. code before "s" after "s" before "z" after "z"
rust let clocks = rcc.cfgr.sysclk(16.MHz()) .pclk1(8.MHz()) .freeze(); 3364 2948 3972 3780
rust let clocks = rcc.cfgr.sysclk(16_000_000.Hz()).pclk1(8.MHz()) .freeze(); 2380 1448 2644 2452
rust let clocks = rcc.cfgr.sysclk(16.MHz()) .pclk1(8_000_000.Hz()).freeze(); 3140 1448 3924 3732
rust let clocks = rcc.cfgr.sysclk(16_000_000.Hz()).pclk1(8_000_000.Hz()).freeze(); 1448 1448 1836 1836
PTaylor-us commented 2 years ago

Thank you so much for bringing this up. If I understand correctly. The cause is that the comparison functions a using TryInto to convert the MHz or kHz to H (since to convert H, precision would be lost). This causes a lot more code to be generated than using an Into and promoting the values to u64. I actually, at one time, implemented some auto-promotion, but I wasn't happy with pulling in u64 arithmetic code. I also came to the conclusion that it's not really "the Rust way". I suspect that this increase in code size only the TryInto<Hertz> for Megahertz implementation being pulled in, and not the calling of it.

There are implementations for conversion such as Into<Hertz<u64>> for Megahertz<u32>. So I think you could do something like:

let clocks = rcc.cfgr.sysclk(Hertz(16_000_000u64)).pclk1(8.MHz()) .freeze();
burrbull commented 2 years ago

I actually, at one time, implemented some auto-promotion, but I wasn't happy with pulling in u64 arithmetic code.

But you are happy with https://docs.rs/num-rational/0.4.0/src/num_rational/lib.rs.html#323-374 ? As I can see compiler is not happy.

You named your crate embedded, but until it is zero-cost, it can't be embedded.

PTaylor-us commented 2 years ago

It's choice between pulling in TryInto code and u64 arithmetic. If I implement it how you suggest, someone would be upset about the pulling in of the u64 arithmetic even if they didn't use any u64.

"Zero-cost" means you don't pay for what you don't use. As I see it, you are using it. I think pulling in the u64 code even if it's not necessary is much less "zero-cost".

If there is a way to improve on num-rational's comparison method without the u64, I'm all ears.

@eldruin @korken89 Any thoughts on this?

PTaylor-us commented 2 years ago

@burrbull Does the option I gave you above not work?

burrbull commented 2 years ago

@burrbull Does the option I gave you above not work?

I don't need Into<Hertz<u64>>. I need Into<Hertz<u32>>.

korken89 commented 2 years ago

The implementation here will optimize to single instruction for the u64 multiplication on all Cortex-M3 and up (UMULL), so there it would be very efficient. For Cortex-M0+) I'm not sure how it would perform.

But overall, I'd say this to be better than a lot of if/match-code of num-rational. Math generally optimizes a lot better thanks to it being well defined.

@burrbull could you give it a compile test on a Cortex-M0 and see how much math-code it pulls in?

korken89 commented 2 years ago

Hi, I did some comparison and it looks very promising: https://rust.godbolt.org/z/x8drzzc36 The jack-in-the-box is the __aeabi_lmul, I need to benchmark it to see how much it takes, but since we are doing zero extended multiplication it never needs to pull in u64 arithmetic.

burrbull commented 2 years ago

Hi, I did some comparison and it looks very promising: https://rust.godbolt.org/z/x8drzzc36 The jack-in-the-box is the __aeabi_lmul, I need to benchmark it to see how much it takes, but since we are doing zero extended multiplication it never needs to pull in u64 arithmetic.

Ideally all those calculations must be in compile time and in table above should be only 1448 (1836).

burrbull commented 2 years ago

I tried to move all needed code from rational::Ratio to fraction (https://github.com/burrbull/embedded-time/tree/fraction2) and make as much const fn as I can, but this situation have not changed fundamentally (only some saved code more).

I also tried to go different way and use const generics (https://github.com/stm32-rs/stm32f4xx-hal/blob/embedded-time2/src/time.rs), but result is even worst (looks like compiler often don't optimizes const generics at all).

korken89 commented 2 years ago

Yeah, perfect transparency is difficult. The only way to do this that I know of is to perform settings in const generics and consts, where the final const config is fed to a "set all the registers correctly" function.

So you have:

const MY_CLOCK_CONFIG: ClockConfig = ...; // Holds the final register values

write_clock_config_to_regs(&mut reg1, &mut reg2, &mut reg3, MY_CLOCK_CONFIG);

I did some experiments with a system like that, and it worked with perfect inline-ing - but it was not that user friendly then. And it was difficult to get everything to happen in const when I tried it, this is probably easier now.

burrbull commented 2 years ago

I did some experiments with a system like that, and it worked with perfect inline-ing - but it was not that user friendly then.

Yeah.

Already forgot about this: https://github.com/stm32-rs/stm32f1xx-hal/pull/337

korken89 commented 2 years ago

Indeed, an approach like that is the only one I found that really inline-ed and optimized perfectly.

Though what you have in this PR will help to make embedded-time more performant - but I don't think embedded-time can solve this issue.

PTaylor-us commented 2 years ago

but since we are doing zero extended multiplication it never needs to pull in u64 arithmetic

Oh, ok. I think I see. You're multiplying u32s as u64s, but the actual multiplication is still u32, with a u64 product that only gets compared. Is that accurate? If we were then multiplying/dividing the u64 product, that's the point that additional u64 arithmetic would get pulled in?

korken89 commented 2 years ago

@PTaylor-us That is accurate! The compiler does see through the operation without issue and does what one would expect. And you are correct, if you were to then use the resulting u64 in some arithmetic then true u64 math would be needed, but the comparison is simply to first compare the high 32 bits, and then the low 32 bits if the high bits were equal.

PTaylor-us commented 2 years ago

That's fantastic. Thanks, @korken89, for setting me straight.

Clippy is complaining: "cderive_hash_xor_eq". I'm assuming we need to explicitly define Hash. @burrbull, @korken89, either of you ever done that? I have not.

burrbull commented 2 years ago

I believe we need an explicit Hash impl for Fraction.

Done

korken89 commented 2 years ago

@PTaylor-us can you run the CI? I don't have rights to do it.

korken89 commented 2 years ago

Ping @PTaylor-us