Open tronje opened 4 months ago
Paging @Patryk27 (sorry for always bugging you with these topics!)
@Tronje, your best bet is building a minimal reproducer example that still triggers this compiler regression. You can do this by continuously removing code from your application until you cannot remove anything without the regression vanishing.
Then people here can investigate where this problem originates from.
Hi Rahix, thanks for your response.
@tronje, your best bet is building a minimal reproducer example that still triggers this compiler regression. You can do this by continuously removing code from your application until you cannot remove anything without the regression vanishing.
I've done that, you can see it here: https://github.com/tronje/adc-read-compiler-bug
I'm guessing it'd be possible to minimize it even further. For my setup, I have to set up a GPIO expander and pull two of its outputs for my pin of interest to report correct values. I'd guess any other pin should exhibit the same bugged behavior. Let me know if this is anything to go on, or if I should try to find an even simpler example that eliminates the need for the custom hardware I have here...
Hi, thanks for pinging! I'll check it tomorrow 🙂
sorry for always bugging you with these topics!
No worries, glad to be helpful 😄
Alright, I've been comparing the disassembly between the working and broken firmware, and one important change that stands out is the inlining of MCP23017::update_register_bit()
(i.e. it's inlined in the broken firmware).
Now, while on its own inlining shouldn't cause any issues (or rather, if inlining a function breaks the binary, then it's a bug in the codegen), narrowing the scope down is always a step forward.
So, @tronje would you mind compiling your code with this updated config.toml
and reporting back whether it works on the real hardware?
[build]
target = "avr-specs/avr-atmega164pa.json"
rustflags = ["-Zcross-crate-inline-threshold=0"]
[unstable]
build-std = ["core"]
build-std-features = ["compiler-builtins-mangled-names"]
Hey @Patryk27, thanks a lot for taking the time to look into this!
Unfortunately, your suggested change made no difference. I get the correct values with the old toolchain, and mostly 0's with the newer toolchain, just like before.
@tronje does the issue still occur if you compare the values instead of printing them? Like:
loop {
let val = pin.analog_read(&mut adc);
if val >= 128 {
ufmt::uwriteln!(serial, "high").ok();
} else {
ufmt::uwriteln!(serial, "low").ok();
}
delay_ms(100);
}
(trying to narrow it down as to whether it's the measurement that's wrong or rather the printing part)
Your example always prints low
on the new toolchain. I also tried this:
loop {
let val = pin.analog_read(&mut adc);
if val == 0 {
ufmt::uwriteln!(serial, "zero\r").ok();
} else if val < 5 {
ufmt::uwriteln!(serial, "small\r").ok();
} else {
ufmt::uwriteln!(serial, "large\r").ok();
}
delay_ms(100);
}
Which mostly prints zero
, soemtimes small
on the new toolchain, and always large
on the older toolchain.
I see, thanks for checking!
I'll try reproducing the issue in simavr and report back.
I see, thanks for checking!
No, thank you for taking all this time to help me out! I really, really appreciate it.
Alright, I've been testing the firmware under simavr and I can't reproduce the problem there (i.e. everything behaves correctly).
Let's try another angle:
let pin = pins.pa0.into_analog_input(&mut adc);
?* Does the issue still occur if you treat the pin as digital?
Looks that way; is_high
always returns false
on the new toolchain, and always true
on the older one. Interesting.
I've simplified my example and pushed it. I guess the ADC is not involved in the issue then.
* Just a random guess but - could you try with `let pin = pins.pa0.into_analog_input(&mut adc);`?
Doesn't make a difference.
I looked at the assembly, but couldn't find anything obvious. The instruction sequences for setting up and reading from the ADC are the same and the rest of the assembly also looks about the same.
Could you upload your verified working and broken versions in case it's system dependent?
EDIT: that was on the original version, didn't see that you simplified it further
EDIT: that was on the original version, didn't see that you simplified it further
Yep. Anyway, I've dumped the binaries using the following commands:
$ avr-objcopy target/avr-atmega164pa/release/adc-read-compiler-bug.elf -O ihex -j .data -j .text firmware.hex
$ avr-objdump -d -s -m avr5 -j .sec1 firmware.hex
and attached them here.
built-with-nightly-2023-12-28.txt built-with-nightly-2024-03-22.txt
I compared those dumps with mine and they're bit for bit identical.
Status: since there were a couple of codegen bugs uncovered and fixed recently, and since I can't reproduce the issue in simavr, I think it makes sense to wait until https://github.com/Rahix/avr-hal/issues/573 is resolved (which is stalled on https://github.com/llvm/llvm-project/pull/106993) and see on the newest rustc.
@tronje would you mind rechecking on the newest toolchain? (e.g. from today)
Lots of issues got fixed in the meantime and while I can't reproduce the issue locally, my hopes are that maybe one of those fixes happened to cover what was going on here.
Looks like it's still broken. Digital reads still always return false
, and analog reads mostly 0
, sometimes 1
or 2
. The nightly-2023-12-28
still gives correct-looking values in the 800s (analog read) and always true
(digital read) on my hardware here.
$ rustc --version
rustc 1.84.0-nightly (27861c429 2024-10-13)
I did a little binary search and found the first toolchain that introduces the bug. The last working toolchain is nightly-2024-01-05
, and the first bugged toolchain is the one from the day after that, nightly-2024-01-06
.
I'll start comparing assembly outputs and changes between the two toolchains.
Small correction: the channel
parameter in rust-toolchain.toml
doesn't seem to match the output of rustc --version
.
So rustc --version
yields:
rustc 1.77.0-nightly (f688dd684 2024-01-04)
rustc 1.77.0-nightly (595bc6f00 2024-01-05)
edit (don't want to spam with too many comments): both versions use the same LLVM version, 17.0.6. The git log in the rust
project between the two versions also does not show a change to the llvm-project
submodule. So I'm guessing it's fair to conclude that this is somehow caused by a change in rust
itself.
I did a git bisect
on the rust repository, between the two toolchains I mentioned above. Here are the results:
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
8e64fc94d895b570e80db8edc3e3ef265e19db8c
3cd6cde0be33faf0296ead23141e06b2d0740894
c2fd26a115645c92537719b1a04270e1ba727cbf
432fffa8afb8fcfe658e6548e5e8f10ad2001329
We cannot bisect more!
I had to git bisect skip
those commits, as the resulting toolchains were unable to build my example program. They all generated this error (or something similar), which looks pretty fun:
error: expected identifier, found `Some.0`
--> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/mem/mod.rs:1403:62
|
1403 | crate::hint::must_use({builtin # offset_of($Container, $($fields)+)})
| ^^^^^^^ expected identifier
|
::: /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/iter/adapters/filter_map.rs:102:31
|
102 | .byte_add(core::mem::offset_of!(Option<B>, Some.0))
| ---------------------------------------- in this macro invocation
|
= note: this error originates in the macro `core::mem::offset_of` (in Nightly builds, run with -Z macro-backtrace for more info)
error: expected identifier, found `Some.0`
--> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/mem/mod.rs:1403:62
|
1403 | crate::hint::must_use({builtin # offset_of($Container, $($fields)+)})
| ^^^^^^^ expected identifier
|
::: /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/option.rs:783:48
|
783 | (self as *const Self).byte_add(core::mem::offset_of!(Self, Some.0)).cast(),
| ----------------------------------- in this macro invocation
|
= note: this error originates in the macro `core::mem::offset_of` (in Nightly builds, run with -Z macro-backtrace for more info)
error: expected identifier, found `Some.0`
--> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/mem/mod.rs:1403:62
|
1403 | crate::hint::must_use({builtin # offset_of($Container, $($fields)+)})
| ^^^^^^^ expected identifier
|
::: /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/option.rs:839:46
|
839 | (self as *mut Self).byte_add(core::mem::offset_of!(Self, Some.0)).cast(),
| ----------------------------------- in this macro invocation
|
= note: this error originates in the macro `core::mem::offset_of` (in Nightly builds, run with -Z macro-backtrace for more info)
Compiling quote v1.0.32
Compiling avr-hal-generic v0.1.0 (https://github.com/rahix/avr-hal?rev=afdadebd2c8bc0969bd2025f94617b64ca101014#afdadebd)
error[E0522]: definition of an unknown language item: `async_iterator_poll_next`
--> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/async_iter/async_iter.rs:50:32
|
50 | #[cfg_attr(not(bootstrap), lang = "async_iterator_poll_next")]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ definition of unknown language item `async_iterator_poll_next`
error[E0522]: definition of an unknown language item: `into_async_iter_into_iter`
--> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/async_iter/async_iter.rs:148:32
|
148 | #[cfg_attr(not(bootstrap), lang = "into_async_iter_into_iter")]
|
Anyway, about the possible culprit-commits. They all seem to be from the same merge request. 432fff is the merge-commit that merges them, c2fd26 seems to do the primary work, and the other two are smaller fixes.
I'll admit, I have no real idea what's going on here, but I'm guessing it may be time to take this to the main rust project?
For your convenience, here are the commits I mentioned as GitHub links:
edit: this is the pull request were these changes were introduced: https://github.com/rust-lang/rust/pull/118991
So I'm not a compiler guy at all, but I built the current master
state of rust with the following change (this partly reverts the pull request I mentioned above), and the bug seems to be fixed:
diff --git a/compiler/rustc_codegen_llvm/src/abi.rs b/compiler/rustc_codegen_llvm/src/abi.rs
index 2fe5ed32daa..031daa843be 100644
--- a/compiler/rustc_codegen_llvm/src/abi.rs
+++ b/compiler/rustc_codegen_llvm/src/abi.rs
@@ -5,7 +5,7 @@
use rustc_abi::Primitive::Int;
use rustc_abi::{HasDataLayout, Size};
use rustc_codegen_ssa::MemFlags;
-use rustc_codegen_ssa::mir::operand::{OperandRef, OperandValue};
+use rustc_codegen_ssa::mir::operand::OperandValue;
use rustc_codegen_ssa::mir::place::{PlaceRef, PlaceValue};
use rustc_codegen_ssa::traits::*;
use rustc_middle::ty::Ty;
@@ -248,7 +248,7 @@ fn store(
bx.lifetime_end(llscratch, scratch_size);
}
_ => {
- OperandRef::from_immediate_or_packed_pair(bx, val, self.layout).val.store(bx, dst);
+ OperandValue::Immediate(val).store(bx, dst);
}
}
}
That's on top if this state: https://github.com/rust-lang/rust/commit/55b7f8e800a6a43657a8582450323d546297c950
I was just grasping at straws, basically, and got lucky. Can't really say I understand what the change does, though :grimacing:
Thanks for the thorough report!
I'm going to NixCon tomorrow, so I won't be able to take a look at it over the weekend, but I'll try to take a stab the next week. Having all this here should be sufficient to reproduce the issue locally (or at least be able to compare the differences in generated assembly).
If you find anything else, of course please do write it down 😄
Thank you! Have fun at NixCon :slightly_smiling_face:
I have pushed a handful of commits to my example program.
We also observed a few differences in the generated assembly. The two toolchains we compared were both based on https://github.com/rust-lang/rust/commit/55b7f8e800a6a43657a8582450323d546297c950, one with the diff I mentioned above, and one without. So, very similar compilers.
The one with the fix generates a few extra instructions in the code for digital_write
and read_register
(both functions from the mcp23017 crate). At the start of the function:
push r28
push r29
in r28, 0x3d ; 61
in r29, 0x3e ; 62
sbiw r28, 0x01 ; 1
in r0, 0x3f ; 63
cli
out 0x3e, r29 ; 62
out 0x3f, r0 ; 63
out 0x3d, r28 ; 61
And towards the end, seemingly the reverse:
push r28
push r29
in r28, 0x3d ; 61
in r29, 0x3e ; 62
sbiw r28, 0x01 ; 1
in r0, 0x3f ; 63
cli
out 0x3e, r29 ; 62
out 0x3f, r0 ; 63
out 0x3d, r28 ; 61
The toolchain without the fix, so just the clean git state, does not generate this code, but the functions look almost identical otherwise.
The other thing we noticed is that the update_register_bit
(again from the mcp23017 crate) is NOT inlined in the woring toolchain with my fix applied, but IS inlined in the clean toolchain without the fix.
Maybe that's something to go on, no idea.
I'll attached my assembly dumps here. self-built-toolchain-with-fix.txt self-built-toolchain-WITHOUT-fix.txt
While that difference is interesting, I don't think that's the cause of the bug, for 2 reasons:
digital_write
functions are identical in both of those.std Y+1, r24
instruction and never read back. In other words, I think this is just an artifact of a missed optimization.Ah, too bad. Thanks for the insight! :slightly_smiling_face:
Seems like the inlining of update_register_bit
out of the mcp23017
crate is somehow at fault. If I use a local checkout of that crate, and annotate the function with #[inline(never)]
, the bug is fixed on the latest nightly toolchain.
Alright, I'm leaning towards a hardware quirk here - which I know sounds a bit crazy, but hear me out 😄
I've been looking at the firmware, analyzing the assembly by hand - and while the output is a bit different between toolchain versions, that's mostly because LLVM does a better job at inlining constant expressions within update_register_bit()
now (the calls to register_for_pin()
, bit_for_pin()
etc. get computed at compile-time).
Overall the newest binary looks correct, I'd say.
Since static eye-analysis can only get one so far, I've took a moment to add TWI (aka I2C) support to avr-tester
, so that I could simulate the firmware together with a basic, fake mcp23017
(which does basically nothing here besides responding to the register reads and writes):
use avr_tester::AvrTester;
#[test]
fn test() {
let mut avr = AvrTester::atmega164pa()
.with_clock(10_000_000)
.load("/home/pwy/t/projects/adc-read-compiler-bug/target/avr-atmega164pa/release/adc-read-compiler-bug.elf");
avr.twi0().attach_slave_fn({
let mut reg = None;
let mut regs = [0; 256];
move |packet| {
if packet.is_start() {
return Some(packet.respond_ack());
}
if packet.is_stop() {
return Some(packet.respond_ack());
}
if packet.is_write() {
if let Some(dev_reg) = reg.take() {
println!("write_register({}, {})", dev_reg, packet.data);
regs[dev_reg] = packet.data;
} else {
reg = Some(packet.data as usize);
}
return Some(packet.respond_ack());
}
if packet.is_read() {
let reg = reg.take().expect("No register picked");
let reg_val = regs[reg];
println!("read_register({}) = {}", reg, reg_val);
return Some(packet.respond_data(reg_val));
}
None
}
});
avr.run_for_ms(1500);
avr.pins().pa0().set_high();
avr.run_for_ms(1500);
panic!("{}", avr.uart0().read::<String>());
}
... and the output looks exactly the same on both the older and the newer firmware:
test:
write_register(0, 255)
write_register(1, 255)
read_register(0) = 255
write_register(0, 253)
read_register(20) = 0
write_register(18, 2)
read_register(0) = 253
write_register(0, 189)
read_register(12) = 0
write_register(12, 64)
read_register(20) = 0
write_register(18, 64)
uart:
BUG
BUG
BUG
OK
OK
Now, of course it's not a definite test - simavr could be emulating something differently (I know of https://github.com/buserror/simavr/issues/453, for which I'm applying a workaround within avr-tester
), but I suspect that maybe the culprit lies within... timing?
My guess is that because the newer code uses substantially less instructions between the calls to MCP23017::read_register()
and MCP23017::write_register()
, maybe that device doesn't have enough time to retrieve/process the data, which in turn causes bus errors/whatever, which happen to affect the PA0 pin? (not sure how your board looks like)
Another totally random observation is in regards to I2c::write_read()
, called internally inside MCP23017
- it performs write+read in one transaction:
... which I guess makes sense, but for instance Arduino implementations usually (seem to) use two separate transactions:
Maybe it would be worthwhile to check whether we shouldn't be doing two txs as well, like:
fn write_read(
&mut self,
address: u8,
bytes: &[u8],
buffer: &mut [u8],
) -> Result<(), Self::Error> {
self.p.raw_start(address, Direction::Write)?;
self.p.raw_write(bytes)?;
self.p.raw_stop()?;
self.p.raw_start(address, Direction::Read)?;
self.p.raw_read(buffer)?;
self.p.raw_stop()?;
Ok(())
}
That all sounds pretty plausible. Thank you so much for all that effort! I will try to play around with the I2C, I suppose. Will start with your suggestion, then maybe check if I can't get our logic analyzer connected to the hardware. I think the earliest I can look at this will be next week, though.
I don't think the write_read implementation should create any problems as that is explicitly supported. I also don't see why writing immediately after reading should cause any problems, especially because you're only running it at 100kHz while the MCP supports up to 1.7MHz.
So even if the issue is resolved by changing the I2C code, I think it's a good idea to check whether the root cause might be something else (similar to how inlining prevents/causes the issue).
Hi! We're using avr-hal and avr-device with an atmega164pa uC.
We tried bumping our Rust toolchain, as you have done with 2eb28fa3024ecbf9898c8e900e744d1201ceffca. This causes incorrect ADC readings. We've literally changed nothing else, and suddenly an
analog_read
onPin<Analog, PA0>
returns usually0
, sometimes1
, but never a value in the hundreds, as would be expected in our situation.I guess it could be due to some
unsafe
weirdness in our codebase, but I mean the ADC is set up normally, and then it's just a call toanalog_read
, really. Not sure how much of our codebase I can share; I could probably build a minimal example to show the behavior. Just figured I'd ask first if anyone has any ideas about this. Works just fine onnightly-2023-12-28
.Thanks!