Incorrect ADC readings after toolchain bump

tronje commented 4 months ago

Hi! We're using avr-hal and avr-device with an atmega164pa uC.

We tried bumping our Rust toolchain, as you have done with 2eb28fa3024ecbf9898c8e900e744d1201ceffca. This causes incorrect ADC readings. We've literally changed nothing else, and suddenly an analog_read on Pin<Analog, PA0> returns usually 0, sometimes 1, but never a value in the hundreds, as would be expected in our situation.

I guess it could be due to some unsafe weirdness in our codebase, but I mean the ADC is set up normally, and then it's just a call to analog_read, really. Not sure how much of our codebase I can share; I could probably build a minimal example to show the behavior. Just figured I'd ask first if anyone has any ideas about this. Works just fine on nightly-2023-12-28.

Thanks!

Rahix commented 4 months ago

Paging @Patryk27 (sorry for always bugging you with these topics!)

@Tronje, your best bet is building a minimal reproducer example that still triggers this compiler regression. You can do this by continuously removing code from your application until you cannot remove anything without the regression vanishing.

Then people here can investigate where this problem originates from.

tronje commented 3 months ago

Hi Rahix, thanks for your response.

@tronje, your best bet is building a minimal reproducer example that still triggers this compiler regression. You can do this by continuously removing code from your application until you cannot remove anything without the regression vanishing.

I've done that, you can see it here: https://github.com/tronje/adc-read-compiler-bug

I'm guessing it'd be possible to minimize it even further. For my setup, I have to set up a GPIO expander and pull two of its outputs for my pin of interest to report correct values. I'd guess any other pin should exhibit the same bugged behavior. Let me know if this is anything to go on, or if I should try to find an even simpler example that eliminates the need for the custom hardware I have here...

Patryk27 commented 3 months ago

Hi, thanks for pinging! I'll check it tomorrow 🙂

sorry for always bugging you with these topics!

No worries, glad to be helpful 😄

Patryk27 commented 3 months ago

Alright, I've been comparing the disassembly between the working and broken firmware, and one important change that stands out is the inlining of MCP23017::update_register_bit() (i.e. it's inlined in the broken firmware).

Now, while on its own inlining shouldn't cause any issues (or rather, if inlining a function breaks the binary, then it's a bug in the codegen), narrowing the scope down is always a step forward.

So, @tronje would you mind compiling your code with this updated config.toml and reporting back whether it works on the real hardware?

[build]
target = "avr-specs/avr-atmega164pa.json"
rustflags = ["-Zcross-crate-inline-threshold=0"]

[unstable]
build-std = ["core"]
build-std-features = ["compiler-builtins-mangled-names"]

tronje commented 3 months ago

Hey @Patryk27, thanks a lot for taking the time to look into this!

Unfortunately, your suggested change made no difference. I get the correct values with the old toolchain, and mostly 0's with the newer toolchain, just like before.

Patryk27 commented 3 months ago

@tronje does the issue still occur if you compare the values instead of printing them? Like:

loop {
    let val = pin.analog_read(&mut adc);

    if val >= 128 {
        ufmt::uwriteln!(serial, "high").ok();
    } else {
        ufmt::uwriteln!(serial, "low").ok();
    }

    delay_ms(100);
}

(trying to narrow it down as to whether it's the measurement that's wrong or rather the printing part)

tronje commented 3 months ago

Your example always prints low on the new toolchain. I also tried this:

    loop {
        let val = pin.analog_read(&mut adc);

        if val == 0 {
            ufmt::uwriteln!(serial, "zero\r").ok();
        } else if val < 5 {
            ufmt::uwriteln!(serial, "small\r").ok();
        } else {
            ufmt::uwriteln!(serial, "large\r").ok();
        }

        delay_ms(100);
    }

Which mostly prints zero, soemtimes small on the new toolchain, and always large on the older toolchain.

Patryk27 commented 3 months ago

I see, thanks for checking!

I'll try reproducing the issue in simavr and report back.

tronje commented 3 months ago

I see, thanks for checking!

No, thank you for taking all this time to help me out! I really, really appreciate it.

Patryk27 commented 3 months ago

Alright, I've been testing the firmware under simavr and I can't reproduce the problem there (i.e. everything behaves correctly).

Let's try another angle:

Does the issue still occur if you treat the pin as digital?
Just a random guess but - could you try with let pin = pins.pa0.into_analog_input(&mut adc);?

tronje commented 3 months ago

* Does the issue still occur if you treat the pin as digital?

Looks that way; is_high always returns false on the new toolchain, and always true on the older one. Interesting. I've simplified my example and pushed it. I guess the ADC is not involved in the issue then.

* Just a random guess but - could you try with `let pin = pins.pa0.into_analog_input(&mut adc);`?

Doesn't make a difference.

AsmPrgmC3 commented 3 months ago

I looked at the assembly, but couldn't find anything obvious. The instruction sequences for setting up and reading from the ADC are the same and the rest of the assembly also looks about the same.

Could you upload your verified working and broken versions in case it's system dependent?

EDIT: that was on the original version, didn't see that you simplified it further

tronje commented 3 months ago

EDIT: that was on the original version, didn't see that you simplified it further

Yep. Anyway, I've dumped the binaries using the following commands:

$ avr-objcopy target/avr-atmega164pa/release/adc-read-compiler-bug.elf -O ihex -j .data -j .text firmware.hex
$ avr-objdump -d -s -m avr5 -j .sec1 firmware.hex

and attached them here.

built-with-nightly-2023-12-28.txt built-with-nightly-2024-03-22.txt

AsmPrgmC3 commented 3 months ago

I compared those dumps with mine and they're bit for bit identical.

Patryk27 commented 2 months ago

Status: since there were a couple of codegen bugs uncovered and fixed recently, and since I can't reproduce the issue in simavr, I think it makes sense to wait until https://github.com/Rahix/avr-hal/issues/573 is resolved (which is stalled on https://github.com/llvm/llvm-project/pull/106993) and see on the newest rustc.

Patryk27 commented 1 month ago

@tronje would you mind rechecking on the newest toolchain? (e.g. from today)

Lots of issues got fixed in the meantime and while I can't reproduce the issue locally, my hopes are that maybe one of those fixes happened to cover what was going on here.

tronje commented 1 month ago

Looks like it's still broken. Digital reads still always return false, and analog reads mostly 0, sometimes 1 or 2. The nightly-2023-12-28 still gives correct-looking values in the 800s (analog read) and always true (digital read) on my hardware here.

$ rustc --version
rustc 1.84.0-nightly (27861c429 2024-10-13)

tronje commented 1 month ago

I did a little binary search and found the first toolchain that introduces the bug. The last working toolchain is nightly-2024-01-05, and the first bugged toolchain is the one from the day after that, nightly-2024-01-06.

I'll start comparing assembly outputs and changes between the two toolchains.

tronje commented 1 month ago

Small correction: the channel parameter in rust-toolchain.toml doesn't seem to match the output of rustc --version.

So rustc --version yields:

working toolchain: rustc 1.77.0-nightly (f688dd684 2024-01-04)
bugged toolchain: rustc 1.77.0-nightly (595bc6f00 2024-01-05)

edit (don't want to spam with too many comments): both versions use the same LLVM version, 17.0.6. The git log in the rust project between the two versions also does not show a change to the llvm-project submodule. So I'm guessing it's fair to conclude that this is somehow caused by a change in rust itself.

tronje commented 1 month ago

I did a git bisect on the rust repository, between the two toolchains I mentioned above. Here are the results:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
8e64fc94d895b570e80db8edc3e3ef265e19db8c
3cd6cde0be33faf0296ead23141e06b2d0740894
c2fd26a115645c92537719b1a04270e1ba727cbf
432fffa8afb8fcfe658e6548e5e8f10ad2001329
We cannot bisect more!

I had to git bisect skip those commits, as the resulting toolchains were unable to build my example program. They all generated this error (or something similar), which looks pretty fun:

error: expected identifier, found `Some.0`
    --> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/mem/mod.rs:1403:62
     |
1403 |     crate::hint::must_use({builtin # offset_of($Container, $($fields)+)})
     |                                                              ^^^^^^^ expected identifier
     |
    ::: /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/iter/adapters/filter_map.rs:102:31
     |
102  |                     .byte_add(core::mem::offset_of!(Option<B>, Some.0))
     |                               ---------------------------------------- in this macro invocation
     |
     = note: this error originates in the macro `core::mem::offset_of` (in Nightly builds, run with -Z macro-backtrace for more info)

error: expected identifier, found `Some.0`
    --> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/mem/mod.rs:1403:62
     |
1403 |     crate::hint::must_use({builtin # offset_of($Container, $($fields)+)})
     |                                                              ^^^^^^^ expected identifier
     |
    ::: /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/option.rs:783:48
     |
783  |                 (self as *const Self).byte_add(core::mem::offset_of!(Self, Some.0)).cast(),
     |                                                ----------------------------------- in this macro invocation
     |
     = note: this error originates in the macro `core::mem::offset_of` (in Nightly builds, run with -Z macro-backtrace for more info)

error: expected identifier, found `Some.0`
    --> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/mem/mod.rs:1403:62
     |
1403 |     crate::hint::must_use({builtin # offset_of($Container, $($fields)+)})
     |                                                              ^^^^^^^ expected identifier
     |
    ::: /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/option.rs:839:46
     |
839  |                 (self as *mut Self).byte_add(core::mem::offset_of!(Self, Some.0)).cast(),
     |                                              ----------------------------------- in this macro invocation
     |
     = note: this error originates in the macro `core::mem::offset_of` (in Nightly builds, run with -Z macro-backtrace for more info)

   Compiling quote v1.0.32
   Compiling avr-hal-generic v0.1.0 (https://github.com/rahix/avr-hal?rev=afdadebd2c8bc0969bd2025f94617b64ca101014#afdadebd)
error[E0522]: definition of an unknown language item: `async_iterator_poll_next`
  --> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/async_iter/async_iter.rs:50:32
   |
50 |     #[cfg_attr(not(bootstrap), lang = "async_iterator_poll_next")]
   |                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ definition of unknown language item `async_iterator_poll_next`

error[E0522]: definition of an unknown language item: `into_async_iter_into_iter`
   --> /home/tronje/workspace/rustlang/rust-avr-bugged/INSTALL/usr/local/lib/rustlib/src/rust/library/core/src/async_iter/async_iter.rs:148:32
    |
148 |     #[cfg_attr(not(bootstrap), lang = "into_async_iter_into_iter")]
    |

Anyway, about the possible culprit-commits. They all seem to be from the same merge request. 432fff is the merge-commit that merges them, c2fd26 seems to do the primary work, and the other two are smaller fixes.

I'll admit, I have no real idea what's going on here, but I'm guessing it may be time to take this to the main rust project?

For your convenience, here are the commits I mentioned as GitHub links:

main work: https://github.com/rust-lang/rust/commit/c2fd26a115645c92537719b1a04270e1ba727cbf
mergre commit: https://github.com/rust-lang/rust/commit/432fffa8afb8fcfe658e6548e5e8f10ad2001329
first smaller fix: https://github.com/rust-lang/rust/commit/8e64fc94d895b570e80db8edc3e3ef265e19db8c
second smaller fix: https://github.com/rust-lang/rust/commit/3cd6cde0be33faf0296ead23141e06b2d0740894

edit: this is the pull request were these changes were introduced: https://github.com/rust-lang/rust/pull/118991

tronje commented 4 weeks ago

So I'm not a compiler guy at all, but I built the current master state of rust with the following change (this partly reverts the pull request I mentioned above), and the bug seems to be fixed:

diff --git a/compiler/rustc_codegen_llvm/src/abi.rs b/compiler/rustc_codegen_llvm/src/abi.rs
index 2fe5ed32daa..031daa843be 100644
--- a/compiler/rustc_codegen_llvm/src/abi.rs
+++ b/compiler/rustc_codegen_llvm/src/abi.rs
@@ -5,7 +5,7 @@
 use rustc_abi::Primitive::Int;
 use rustc_abi::{HasDataLayout, Size};
 use rustc_codegen_ssa::MemFlags;
-use rustc_codegen_ssa::mir::operand::{OperandRef, OperandValue};
+use rustc_codegen_ssa::mir::operand::OperandValue;
 use rustc_codegen_ssa::mir::place::{PlaceRef, PlaceValue};
 use rustc_codegen_ssa::traits::*;
 use rustc_middle::ty::Ty;
@@ -248,7 +248,7 @@ fn store(
                 bx.lifetime_end(llscratch, scratch_size);
             }
             _ => {
-                OperandRef::from_immediate_or_packed_pair(bx, val, self.layout).val.store(bx, dst);
+                OperandValue::Immediate(val).store(bx, dst);
             }
         }
     }

That's on top if this state: https://github.com/rust-lang/rust/commit/55b7f8e800a6a43657a8582450323d546297c950

I was just grasping at straws, basically, and got lucky. Can't really say I understand what the change does, though :grimacing:

Patryk27 commented 4 weeks ago

Thanks for the thorough report!

I'm going to NixCon tomorrow, so I won't be able to take a look at it over the weekend, but I'll try to take a stab the next week. Having all this here should be sufficient to reproduce the issue locally (or at least be able to compare the differences in generated assembly).

If you find anything else, of course please do write it down 😄

tronje commented 4 weeks ago

Thank you! Have fun at NixCon :slightly_smiling_face:

I have pushed a handful of commits to my example program.

We also observed a few differences in the generated assembly. The two toolchains we compared were both based on https://github.com/rust-lang/rust/commit/55b7f8e800a6a43657a8582450323d546297c950, one with the diff I mentioned above, and one without. So, very similar compilers.

The one with the fix generates a few extra instructions in the code for digital_write and read_register (both functions from the mcp23017 crate). At the start of the function:

push    r28
push    r29
in  r28, 0x3d   ; 61
in  r29, 0x3e   ; 62
sbiw    r28, 0x01   ; 1
in  r0, 0x3f    ; 63
cli
out 0x3e, r29   ; 62
out 0x3f, r0    ; 63
out 0x3d, r28   ; 61

And towards the end, seemingly the reverse:

push    r28
push    r29
in  r28, 0x3d   ; 61
in  r29, 0x3e   ; 62
sbiw    r28, 0x01   ; 1
in  r0, 0x3f    ; 63
cli
out 0x3e, r29   ; 62
out 0x3f, r0    ; 63
out 0x3d, r28   ; 61

The toolchain without the fix, so just the clean git state, does not generate this code, but the functions look almost identical otherwise.

The other thing we noticed is that the update_register_bit (again from the mcp23017 crate) is NOT inlined in the woring toolchain with my fix applied, but IS inlined in the clean toolchain without the fix.

Maybe that's something to go on, no idea.

I'll attached my assembly dumps here. self-built-toolchain-with-fix.txt self-built-toolchain-WITHOUT-fix.txt

AsmPrgmC3 commented 4 weeks ago

While that difference is interesting, I don't think that's the cause of the bug, for 2 reasons:

That prologue (and matching epilogue) doesn't exist in either of the previous versions (including those you posted previously). In fact, the digital_write functions are identical in both of those.
That looks to be prologue moving the stack by 1 byte. I think this one byte is later written to by an std Y+1, r24 instruction and never read back. In other words, I think this is just an artifact of a missed optimization.

tronje commented 4 weeks ago

Ah, too bad. Thanks for the insight! :slightly_smiling_face:

tronje commented 4 weeks ago

Seems like the inlining of update_register_bit out of the mcp23017 crate is somehow at fault. If I use a local checkout of that crate, and annotate the function with #[inline(never)], the bug is fixed on the latest nightly toolchain.

Patryk27 commented 3 weeks ago

Alright, I'm leaning towards a hardware quirk here - which I know sounds a bit crazy, but hear me out 😄

I've been looking at the firmware, analyzing the assembly by hand - and while the output is a bit different between toolchain versions, that's mostly because LLVM does a better job at inlining constant expressions within update_register_bit() now (the calls to register_for_pin(), bit_for_pin() etc. get computed at compile-time).

Overall the newest binary looks correct, I'd say.

Since static eye-analysis can only get one so far, I've took a moment to add TWI (aka I2C) support to avr-tester, so that I could simulate the firmware together with a basic, fake mcp23017 (which does basically nothing here besides responding to the register reads and writes):

use avr_tester::AvrTester;

#[test]
fn test() {
    let mut avr = AvrTester::atmega164pa()
        .with_clock(10_000_000)
        .load("/home/pwy/t/projects/adc-read-compiler-bug/target/avr-atmega164pa/release/adc-read-compiler-bug.elf");

    avr.twi0().attach_slave_fn({
        let mut reg = None;
        let mut regs = [0; 256];

        move |packet| {
            if packet.is_start() {
                return Some(packet.respond_ack());
            }

            if packet.is_stop() {
                return Some(packet.respond_ack());
            }

            if packet.is_write() {
                if let Some(dev_reg) = reg.take() {
                    println!("write_register({}, {})", dev_reg, packet.data);

                    regs[dev_reg] = packet.data;
                } else {
                    reg = Some(packet.data as usize);
                }

                return Some(packet.respond_ack());
            }

            if packet.is_read() {
                let reg = reg.take().expect("No register picked");
                let reg_val = regs[reg];

                println!("read_register({}) = {}", reg, reg_val);

                return Some(packet.respond_data(reg_val));
            }

            None
        }
    });

    avr.run_for_ms(1500);
    avr.pins().pa0().set_high();
    avr.run_for_ms(1500);

    panic!("{}", avr.uart0().read::<String>());
}

... and the output looks exactly the same on both the older and the newer firmware:

test:
write_register(0, 255)
write_register(1, 255)
read_register(0) = 255
write_register(0, 253)
read_register(20) = 0
write_register(18, 2)
read_register(0) = 253
write_register(0, 189)
read_register(12) = 0
write_register(12, 64)
read_register(20) = 0
write_register(18, 64)

uart:
BUG
BUG
BUG
OK
OK

Now, of course it's not a definite test - simavr could be emulating something differently (I know of https://github.com/buserror/simavr/issues/453, for which I'm applying a workaround within avr-tester), but I suspect that maybe the culprit lies within... timing?

My guess is that because the newer code uses substantially less instructions between the calls to MCP23017::read_register() and MCP23017::write_register(), maybe that device doesn't have enough time to retrieve/process the data, which in turn causes bus errors/whatever, which happen to affect the PA0 pin? (not sure how your board looks like)

Another totally random observation is in regards to I2c::write_read(), called internally inside MCP23017 - it performs write+read in one transaction:

https://github.com/Rahix/avr-hal/blob/15080aa41e726bc720adf36ce0fc5a030390c822/avr-hal-generic/src/i2c.rs#L400

... which I guess makes sense, but for instance Arduino implementations usually (seem to) use two separate transactions:

https://github.com/blemasle/arduino-mcp23017/blob/02076b387157db25d3ca30b7e8b089c6da52c563/src/MCP23017.cpp#L153

Maybe it would be worthwhile to check whether we shouldn't be doing two txs as well, like:

fn write_read(
    &mut self,
    address: u8,
    bytes: &[u8],
    buffer: &mut [u8],
) -> Result<(), Self::Error> {
    self.p.raw_start(address, Direction::Write)?;
    self.p.raw_write(bytes)?;
    self.p.raw_stop()?;
    self.p.raw_start(address, Direction::Read)?;
    self.p.raw_read(buffer)?;
    self.p.raw_stop()?;
    Ok(())
}

tronje commented 3 weeks ago

That all sounds pretty plausible. Thank you so much for all that effort! I will try to play around with the I2C, I suppose. Will start with your suggestion, then maybe check if I can't get our logic analyzer connected to the hardware. I think the earliest I can look at this will be next week, though.

AsmPrgmC3 commented 3 weeks ago

I don't think the write_read implementation should create any problems as that is explicitly supported. I also don't see why writing immediately after reading should cause any problems, especially because you're only running it at 100kHz while the MCP supports up to 1.7MHz.

So even if the issue is resolved by changing the I2C code, I think it's a good idea to check whether the root cause might be something else (similar to how inlining prevents/causes the issue).

Rahix / avr-hal

Incorrect ADC readings after toolchain bump #567