SpenceKonde / DxCore

Arduino core for AVR DA, DB, DD, EA and future DU-series parts - Microchip's latest and greatest AVRs. Library maintainers: Porting help and adviccee is available.
Other
187 stars 49 forks source link

i2c slave corruption in sleep modes other than SLEEP_MODE_IDLE #322

Closed ObviousInRetrospect closed 1 year ago

ObviousInRetrospect commented 2 years ago

I've noticed unreliable slave behavior in SLEEP_MODE_STANDBY (and SLEEP_MODE_POWER_DOWN). This seems to be more of an issue when transferring larger buffers. I thought guarding the sleep_cpu() call with !Wire.slaveTransactionOpen() was sufficient. The twi supports wake from STANDBY and even POWER_DOWN but every i2c slave I have written with DxCore and megaTinyCore has been flakey unless set for SLEEP_MODE_IDLE.

The code here: https://github.com/ObviousInRetrospect/DualMode/tree/main/DualModeExample

works as expected

but if SLEEP_MODE_IDLE is changed to SLEEP_MODE_STANDBY the slave starts corrupting data: https://github.com/ObviousInRetrospect/DualMode/blob/10dfca6b5bc6c8fedf1605d2d8c5981e2e4aca34/DualModeExample/DualModeExample.ino#L319

this is most easily seen (especially without an ina3221) by uncommenting the test pattern data fill (L184-188): https://github.com/ObviousInRetrospect/DualMode/blob/10dfca6b5bc6c8fedf1605d2d8c5981e2e4aca34/DualModeExample/DualModeExample.ino#L184

When I raised this in https://github.com/SpenceKonde/DxCore/discussions/316 @MX682X suggested opening a separate issue with more details.

The output on the master https://github.com/ObviousInRetrospect/DualMode/tree/main/DualModeExampleClient in SLEEP_MODE_IDLE is a long string of correct reads:

(the master does a 32-byte read follow by a 25-byte read. the last 4 bytes are a crc32 covering the entire data)

0000000000000000000000000C000000
000033B9A8020000000000000C000000
000033B9000B01000C0000000C0033B9
010033B90078CF670F
ch1 bv:0 sv:0 ua:0 acc0:0/12=0.00mah acc1:0/47411=0.00mah
ch2 bv:680 sv:0 ua:0 acc0:0/12=0.00mah acc1:0/47411=0.00mah
ch3 bv:2816 sv:1 ua:400 acc0:12/12=0.00mah acc1:112947/47411=5.16mah

0000000000000000000000000C000000
00003FB9A0020000000000000C000000
00003FB9000B01000C0000000C003FB9
01003FB900713F2E0F
ch1 bv:0 sv:0 ua:0 acc0:0/12=0.00mah acc1:0/47423=0.00mah
ch2 bv:672 sv:0 ua:0 acc0:0/12=0.00mah acc1:0/47423=0.00mah
ch3 bv:2816 sv:1 ua:400 acc0:12/12=0.00mah acc1:112959/47423=5.16mah

omitting thousands of similar lines with very few errors. In an overnight run ran this overnight, 99.2% of transfers were clean, 0.8% corruption due to race condition around crc calculation where data looks fine but the data changed during crc calculation.

when the slave is changed to SLEEP_MODE_STANDBY, the output changes to:

[content in square brackets are comments I added]

[this is the slave being updi programmed and not responding]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
ch1 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch2 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch3 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah

[the first read succeeds]
00000000000000000000000006000000
00000600F00200000000000006000000
00000600F80A01000600000006000600
0000060000CEC895DF
ch1 bv:0 sv:0 ua:0 acc0:0/6=0.00mah acc1:0/6=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/6=0.00mah acc1:0/6=0.00mah
ch3 bv:2808 sv:1 ua:400 acc0:6/6=0.00mah acc1:6/6=0.00mah

[subsequent read starts failing. the second line is pretty clearly a fourth line given the CRC looking thing in the middle followed by FFFFFs which are an underrun]
rcv_crc:38ACA57E calc:A3E7FCC4
bad crc, will retry 2 times
00001300F80A01001300000013001300
00001300007EA5AC38FFFFFFFFFFFFFF
00001300F80A01001300000013001300
00001300007EA5AC38
rcv_crc:38ACA57E calc:A3E7FCC4
bad crc, will retry 1 times
00001300F80A01001300000013001300
00001300007EA5AC38FFFFFFFFFFFFFF
00001300F80A01001300000013001300
00001300007EA5AC38

00001300F80A01001300000013001300
00001300007EA5AC38FFFFFFFFFFFFFF
00001300F80A01001300000013001300
00001300007EA5AC38
ch1 bv:2808 sv:1 ua:400 acc0:19/19=0.00mah acc1:19/19=0.00mah
ch2 bv:32256 sv:-21339 ua:-15920 acc0:-200/65535=-0.01mah acc1:65535/19=2.99mah
ch3 bv:2808 sv:1 ua:400 acc0:19/19=0.00mah acc1:19/19=0.00mah
rcv_crc:9B14924C calc:2ED9D8AC
bad crc, will retry 2 times
00001F00F80A01001F0000001F001F00
00001F00004C92149BFFFFFFFFFFFFFF
00001F00F80A01001F0000001F001F00
00001F00004C92149B
rcv_crc:9B14924C calc:2ED9D8AC
bad crc, will retry 1 times
00001F00F80A01001F0000001F001F00
00001F00004C92149BFFFFFFFFFFFFFF
00001F00F80A01001F0000001F001F00
00001F00004C92149B

00001F00F80A01001F0000001F001F00
00001F00004C92149BFFFFFFFFFFFFFF
00001F00F80A01001F0000001F001F00
00001F00004C92149B
ch1 bv:2808 sv:1 ua:400 acc0:31/31=0.00mah acc1:31/31=0.00mah
ch2 bv:19456 sv:5266 ua:9248 acc0:-101/65535=-0.00mah acc1:65535/31=2.99mah
ch3 bv:2808 sv:1 ua:400 acc0:31/31=0.00mah acc1:31/31=0.00mah
rcv_crc:E2A3E693 calc:64CCCBD4
bad crc, will retry 2 times
00002A00F80A01002A0000002A002A00
00002A000093E6A3E2FFFFFFFFFFFFFF
00002A00F80A01002A0000002A002A00
00002A000093E6A3E2
rcv_crc:E2A3E693 calc:64CCCBD4
bad crc, will retry 1 times
00002A00F80A01002A0000002A002A00
00002A000093E6A3E2FFFFFFFFFFFFFF
00002A00F80A01002A0000002A002A00
00002A000093E6A3E2

00002A00F80A01002A0000002A002A00
00002A000093E6A3E2FFFFFFFFFFFFFF
00002A00F80A01002A0000002A002A00
00002A000093E6A3E2
ch1 bv:2808 sv:1 ua:400 acc0:42/42=0.00mah acc1:42/42=0.00mah
ch2 bv:37632 sv:-23578 ua:5984 acc0:-30/65535=-0.00mah acc1:65535/42=2.99mah
ch3 bv:2808 sv:1 ua:400 acc0:42/42=0.00mah acc1:42/42=0.00mah
rcv_crc:36000000 calc:E4E62545
bad crc, will retry 2 times
0B0036000000360000AAD47928FFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
00000C00000000003600F00200000000
000036000000000036
rcv_crc:00000002 calc:CBBEEF8E
bad crc, will retry 1 times
FFFFFFFF00000000FFFF00000D000000
00003700F00200000000000037000000
FFFFFFFF00000000FFFF00000D000000
00003700F002000000

FFFFFFFF00000000FFFF00000D000000
00003700F00200000000000037000000
FFFFFFFF00000000FFFF00000D000000
00003700F002000000
ch1 bv:0 sv:0 ua:0 acc0:65535/13=2.99mah acc1:0/55=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/55=0.00mah acc1:-65536/65535=-2.99mah
ch3 bv:0 sv:0 ua:0 acc0:65535/13=2.99mah acc1:0/55=0.00mah

FFFFFFFF00000000FFFF000018000000
00004200F00200000000000042000000
00004200F80A01001700000017004200
000042000060EC0D24
ch1 bv:0 sv:0 ua:0 acc0:65535/24=2.99mah acc1:0/66=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/66=0.00mah acc1:0/66=0.00mah
ch3 bv:2808 sv:1 ua:400 acc0:23/23=0.00mah acc1:66/66=0.00mah
rcv_crc:149A5594 calc:969842F3
bad crc, will retry 2 times
FFFFFFFF00000000FFFF000024000000
00004E00F0020000000000004E000000
00004F00F80A01002400000024004F00
00004F000094559A14
rcv_crc:149A5594 calc:464D6721
bad crc, will retry 1 times
00004F00F80A01002400000024004F00
00004F000094559A14FFFFFFFFFFFFFF
00004F00F80A01002400000024004F00
00004F000094559A14

00004F00F80A01002400000024004F00
00004F000094559A14FFFFFFFFFFFFFF
00004F00F80A01002400000024004F00
00004F000094559A14
ch1 bv:2808 sv:1 ua:400 acc0:36/36=0.00mah acc1:79/79=0.00mah
ch2 bv:37888 sv:-26027 ua:9424 acc0:-236/65535=-0.01mah acc1:65535/79=2.99mah
ch3 bv:2808 sv:1 ua:400 acc0:36/36=0.00mah acc1:79/79=0.00mah

This is with test pattern data in which each byte contains its own address

[expected, device being programmed and not responding]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
ch1 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch2 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch3 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah

[3 correct transmissions, varies]
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

[corruption starts, this one isn't an obvious pattern]
rcv_crc:2221201F calc:C272FEB8
bad crc, will retry 2 times
2C2D2E2F3031323334A5C3A244FFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0A0B0C0D0E0F10111213141516171819
1A1B1C1D1E1F202122

[]
rcv_crc:44A2C3A5 calc:3F00FCF8
bad crc, will retry 1 times
0A0B0C0D0E0F10111213141516171819
1A1B1C1D1E1F20212223242526272829
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

0A0B0C0D0E0F10111213141516171819
1A1B1C1D1E1F20212223242526272829
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:3854 sv:4368 ua:-22272 acc0:353637138/5910=16149.43mah acc1:454695192/7452=20764.41mah
ch2 bv:7966 sv:8480 ua:-15872 acc0:623125282/10022=28456.06mah acc1:555755816/8994=25379.52mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

[back to a couple correct ones]
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

[but now its aliasing the first transmission]
rcv_crc:18171615 calc:E8257576
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
000102030405060708090A0B0C0D0E0F
101112131415161718

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah
rcv_crc:18171615 calc:E8257576
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
000102030405060708090A0B0C0D0E0F
101112131415161718

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah
MX682X commented 2 years ago

I've started with debugging right now and I stumbled upon 3 things so far that might create problems:

  1. Do you make sure that Serial finishes before putting the CPU into sleep? - It's not impacting in IDLE, but will probably in the deeper sleep states, since USART can wake the MCU only on SOF. the MCU might go to sleep with bytes still left in the buffer or maybe a half-finished transmitted byte.
  2. I noticed that a BUSERR is thrown by the TWI SLAVE in the last byte of a transaction when in STANDBY (and I expect POWER_DOWN). I think this is kind of a silicon bug. When in Standby or Power Down, all High-Frequency clocks are disabled, and thus the clock for TWI MASTER. But the TWI MASTER has to be enabled for BUSERR to work. So, the BUSERR circuitry does not register the Start condition, but only a STOP condition and thus creates the BUSERR. This leads the Wire library to virtually throw away the whole transaction. I'll have to think about how to work around that, if someone has any suggestions, like a "this interrupt has woken up the CPU flag", if something like this exists, feel free to write a comment.
  3. I'm not clearing the BUSERR flag, so any consecutive transactions will be lost too
SpenceKonde commented 2 years ago

But the datasheet says it can wake on a i2c start condition from deeper modes, even classic avrs could


Spence Konde Azzy’S Electronics

New products! Check them out at tindie.com/stores/DrAzzy GitHub: github.com/SpenceKonde ATTinyCore: Arduino support for almost every ATTiny microcontroller Contact: @.***

On Fri, Aug 12, 2022, 15:48 MX682X @.***> wrote:

I've started with debugging right now and I stumbled upon 2 things so far that might create problems:

  1. Do you make sure that Serial finishes before putting the CPU into sleep? - It's not impacting in IDLE, but will probably in the deeper sleep states, since USART can wake the MCU only on SOF. the MCU might go to sleep with bytes still left in the buffer or maybe a half-finished transmitted byte.
  2. I noticed that a BUSERR is thrown by the TWI SLAVE in the last byte of a transaction when in STANDBY (and I expect POWER_DOWN). I think this is kind of a silicon bug. When in Standby or Power Down, all High-Frequency clocks are disabled, and thus the clock for TWI MASTER. But the TWI MASTER has to be enabled for BUSERR to work. So, the BUSERR circuitry does not register the Start condition, but only a STOP condition and thus creates the BUSERR. This leads the Wire library to virtually throw away the whole transaction. I'll have to think about how to work around that, if someone has any suggestions, like a "this interrupt has woken up the CPU flag", if something like this exists, feel free to write a comment

— Reply to this email directly, view it on GitHub https://github.com/SpenceKonde/DxCore/issues/322#issuecomment-1213460801, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTXEW6VU2ABOVBEAYCT2XLVY2TAJANCNFSM56E5IQBQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ObviousInRetrospect commented 2 years ago
  1. in my real code I have a serial.flush() prior to sleep; I should add it to the sample code.
  2. given that I am already checking Wire.slaveTransactionOpen() before sleep I have no problem either passing a flag warning that I am sleeping or similar or calling some new Wire.sleep() which waits until a slave transaction is not open and then sleeps (setting a flag). although that probably won't work with other wakeup sources. FWIW line 309 is
    TCB2.CTRLA |= 1 << TCB_RUNSTDBY_bp;

    which I think keeps the HF clock running in standby (and millis keeping time)

ObviousInRetrospect commented 2 years ago

should have said this up-front but I am doing my testing on 1.4.10 (latest release) against an avr64db32 (which had less errata than the 128's or the da's).

fwiw I am 80% sure the same issue exists on the latest megaTinyCore release (on a tiny1614), I can try to build a repro if it would be helpful.

in theory the chips will wake from sleep-mode-power-down on i2c slave address match. I've thought about trying to screw with the clock prescaler to make sleep mode idle suck less but I'd need to hack a revert-the-clock into all the ISRs

Appreciate you looking at it. These parts would make awesome slaves if they could do so from deep sleep. A tiny412 is cheaper than lots of special purpose i2c devices (want an extra dac? ...).

MX682X commented 2 years ago

Since I sometimes fail to convey my thoughts properly because I might leave out some knowledge I assume other people already have:

For clarification: a TWI transaction consists of

START, Address, optional Data and a STOP 
              ^(Addr IRQ)   ^(Data IRQ) ^(STOP IRQ)
              ^wake-up happens here, with a little clock stretch for the oscillator

But the datasheet says it can wake on a i2c start condition from deeper modes, even classic avrs could

It can and it does. (I only got to test STANDBY so far though.) The problem is a bit different. (No pun intended) Basically, it is just the BUSERR detection that behaves unintended. Since it gets no clock in STANDBY (my theory), it misses the START condition, and it's job is

The BUSERR flag indicates that an illegal bus operation has occurred. Illegal bus operation is detected if a protocol violating the Start (S), repeated Start (Sr), or Stop (P) conditions is detected on the TWI bus lines. A Start condition directly followed by a Stop condition is one example of a protocol violation.

But the library is checking for the BUSERR bit and cancels any transaction that occured if it is set. I think the most straightforward solution would be to check if the SMODE in the sleep register is not IDLE. In that case we ignore the BUSERR, but still clear it

which I think keeps the HF clock running in standby (and millis keeping time)

It does, but to conserve as much power as possible, only enabled peripherals get the clock. the TWI_MASTER has no RUNSTBY.

given that I am already checking Wire.slaveTransactionOpen() before sleep I have no problem either passing a flag warning that I am sleeping or similar or calling some new Wire.sleep() which waits until a slave transaction is not open and then sleeps (setting a flag). although that probably won't work with other wakeup sources.

I think I'll do it by checking the selected sleepmode

MX682X commented 2 years ago

twi.c.pdf (too lazy to archive, just remove the .pdf after downloading)

@ObviousInRetrospect Try to replace your twi.c with this one. It seems to work on my end, at least it is reading the 120 byte buffer I prepared correctly. On the 128DA48 at least. (didn't use your sketch for testing).

In the end I just decided to yeet the BUSERR check. There doesn't seem to be a case that would disturb the library even if there was a wrong order of Start/RepStart/Stop.

SpenceKonde commented 2 years ago

But the library is checking for the BUSERR bit and cancels any transaction that occured if it is set. I think the most straightforward solution would be to check if the SMODE in the sleep register is not IDLE. In that case we ignore the BUSERR, but still clear it

I agree 1000%

I came to the same conclusion as you did in the shower about what was causing it.

Dang, those were some substantial changes (merged)

ObviousInRetrospect commented 2 years ago

Still getting basically 100% corruption in SLEEP_MODE_STANDBY with the new twi.c. Tried deleting twi.c in the middle and verified I got a bunch of linker errors (until I dropped the new one in) so some evidence it is picking up the new code.

While testing this I was wondering if there is a way to make it error out if you call enableDualMode and have set master or slave in the menu - somehow my boards menu got reset which took me a while to track down. Perhaps just #ifdef that function if you would never call it in master or slave.

rcv_crc:00000002 calc:FA25D711
bad crc, will retry 2 times
00000000000000000000000008000000
00000800F00200000000000008000000
00000000000000000000000008000000
00000800F002000000
rcv_crc:00000002 calc:6E9EBE02
bad crc, will retry 1 times
00000000000000000000000009000000
00000900F00200000000000009000000
00000000000000000000000009000000
00000900F002000000

00000000000000000000000009000000
00000900F00200000000000009000000
00000000000000000000000009000000
00000900F002000000
ch1 bv:0 sv:0 ua:0 acc0:0/9=0.00mah acc1:0/9=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/9=0.00mah acc1:0/0=0.00mah
ch3 bv:0 sv:0 ua:0 acc0:0/9=0.00mah acc1:0/9=0.00mah
rcv_crc:00000002 calc:D2E7F7AF
bad crc, will retry 2 times
00000000000000000000000015000000
00001500F00200000000000015000000
00000000000000000000000015000000
00001500F002000000
rcv_crc:00000002 calc:D2E7F7AF
bad crc, will retry 1 times
00000000000000000000000015000000
00001500F00200000000000015000000
00000000000000000000000015000000
00001500F002000000

00000000000000000000000015000000
00001500F00200000000000015000000
00000000000000000000000015000000
00001500F002000000
ch1 bv:0 sv:0 ua:0 acc0:0/21=0.00mah acc1:0/21=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/21=0.00mah acc1:0/0=0.00mah
ch3 bv:0 sv:0 ua:0 acc0:0/21=0.00mah acc1:0/21=0.00mah
rcv_crc:00000002 calc:30E79125
bad crc, will retry 2 times
00000000000000000000000021000000
00002100F00200000000000021000000
00000000000000000000000021000000
00002100F002000000
rcv_crc:00000002 calc:30E79125
bad crc, will retry 1 times
00000000000000000000000021000000
00002100F00200000000000021000000
00000000000000000000000021000000
00002100F002000000

00000000000000000000000021000000
00002100F00200000000000021000000
00000000000000000000000021000000
00002100F002000000
ch1 bv:0 sv:0 ua:0 acc0:0/33=0.00mah acc1:0/33=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/33=0.00mah acc1:0/0=0.00mah
ch3 bv:0 sv:0 ua:0 acc0:0/33=0.00mah acc1:0/33=0.00mah
rcv_crc:00000002 calc:E5DF0BA7
bad crc, will retry 2 times
0000000000000000000000002C000000
00002C00F0020000000000002C000000
0000000000000000000000002C000000
00002C00F002000000
rcv_crc:00000002 calc:F36675D3
bad crc, will retry 1 times
0000000000000000000000002C000000
00002C00F0020000000000002C000000
0000000000000000000000002D000000
00002D00F002000000

0000000000000000000000002C000000
00002C00F0020000000000002C000000
0000000000000000000000002D000000
00002D00F002000000
ch1 bv:0 sv:0 ua:0 acc0:0/44=0.00mah acc1:0/44=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/44=0.00mah acc1:0/0=0.00mah
ch3 bv:0 sv:0 ua:0 acc0:0/45=0.00mah acc1:0/45=0.00mah
rcv_crc:00000002 calc:275B1F14
bad crc, will retry 2 times
00000000000000000000000038000000
00003800F00200000000000038000000
00000000000000000000000038000000
00003800F002000000
rcv_crc:00000002 calc:B3E07607
bad crc, will retry 1 times
00000000000000000000000039000000
00003900F00200000000000039000000
00000000000000000000000039000000
00003900F002000000

00000000000000000000000039000000
00003900F00200000000000039000000
00000000000000000000000039000000
00003900F002000000
ch1 bv:0 sv:0 ua:0 acc0:0/57=0.00mah acc1:0/57=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/57=0.00mah acc1:0/0=0.00mah
ch3 bv:0 sv:0 ua:0 acc0:0/57=0.00mah acc1:0/57=0.00mah
rcv_crc:00000002 calc:6E15A9E1
bad crc, will retry 2 times
00000000000000000000000045000000
00004500F00200000000000045000000
00000000000000000000000045000000
00004500F002000000
rcv_crc:00000002 calc:6E15A9E1
bad crc, will retry 1 times
00000000000000000000000045000000
00004500F00200000000000045000000
00000000000000000000000045000000
00004500F002000000

00000000000000000000000045000000
00004500F00200000000000045000000
00000000000000000000000045000000
00004500F002000000
ch1 bv:0 sv:0 ua:0 acc0:0/69=0.00mah acc1:0/69=0.00mah
ch2 bv:752 sv:0 ua:0 acc0:0/69=0.00mah acc1:0/0=0.00mah
ch3 bv:0 sv:0 ua:0 acc0:0/69=0.00mah acc1:0/69=0.00mah
ObviousInRetrospect commented 2 years ago

mind sharing the test code you have working?

I wonder if it is related to the 32-byte chunking I am doing

SpenceKonde commented 2 years ago

Oh shit the code I merged didn't work?

ObviousInRetrospect commented 2 years ago

In my testing it didn't resolve the problem I am seeing. It seems to work similarly to the prior version. It probably fixed at least part of the problem in @MX682X 's testing so merging it might not be a mistake. I've generally found twi slave via wire to be mega-flakey when standby+ modes are used so there might be more than one issue.

MX682X commented 2 years ago

sender: attiny1614 with modded TWI buffer size:

#include "Wire.h"

void setup() {
  pinMode(PIN_PA3, INPUT_PULLUP);
  Wire.usePullups();
  Wire.swap(0);
  Wire.begin();

}

void loop() {
  if (digitalRead(PIN_PA3) == LOW) {
    Wire.beginTransmission('A');
    for (uint8_t i = 0; i < 120; i++) {
      Wire.write(i);
    }
    Wire.endTransmission(true);

    while (digitalRead(PIN_PA3) == LOW);
  }
  delay(100);
}

Receiver: (AVR128DA48 Curiosity nano):

 #include "Wire.h"
#include <avr/sleep.h>

volatile uint8_t numRx = 0;
volatile uint8_t rxBuffer[128];
volatile uint8_t sleep_en = 0;

void setup() {
  TWI0.DBGCTRL = 1;
  USART1.DBGCTRL = 1;
  //set_sleep_mode(SLEEP_MODE_PWR_DOWN);
  //set_sleep_mode(SLEEP_MODE_IDLE);
  set_sleep_mode(SLEEP_MODE_STANDBY);

  sleep_enable();

  pinMode(PIN_PC6, OUTPUT);
  digitalWrite(PIN_PC6, HIGH);

  Wire.usePullups();

  Wire.onReceive(srmw);
  Wire.begin('A');

  Serial1.begin(500000);
  while (Serial1.read() == -1) {};
  Serial1.println("Hello World");
}

void loop() {
  // put your main code here, to run repeatedly:
  while (Wire.slaveTransactionOpen() != 0) {}

  if (numRx > 0) {
    Serial1.print("Rx: ");
    for (uint8_t i = 0; i < numRx; i++) {
      Serial1.printHex(rxBuffer[i]);
    }
    Serial1.println();
    numRx = 0;
  }

  Serial1.flush();

  if (sleep_en) {
    digitalWrite(PIN_PC6, LOW);
    sleep_cpu();
  }

  digitalWrite(PIN_PC6, HIGH);
}

void srmw(int num) {
  Serial1.print("IRQ ");
  for (uint8_t i = 0; i < num; i++) {
    rxBuffer[i] = Wire.read();
  }
  numRx = num;
}

I was also monitoring the PIN_PC6 on my Oscilloscope, so I knew when the chip was asleep. Used a debugger to change sleep_en on runtime

MX682X commented 2 years ago

make it error out if you call enableDualMode and have set master or slave in the menu

This would be a misleading error, since you can use the dual mode with the slave only.

ObviousInRetrospect commented 2 years ago

so fwiw it is possible that doing a single write pattern (filling the modified-size-buffer) instead of sequential reads and writes of different is masking the problem.

using my code with the test pattern data (which you can totally run without an ina3221 - the ina3221 code will silently read 0xFFs one of the most common corruption patterns I see is the read of the second half returns a duplicate set of first half data when sleeping in standby:

bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
000102030405060708090A0B0C0D0E0F
101112131415161718
rcv_crc:18171615 calc:E8257576
bad crc, will retry 1 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
000102030405060708090A0B0C0D0E0F
101112131415161718

whereas in idle it consistently gets:

000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
MX682X commented 2 years ago

Alright, so seems like I've introduced a bug that automatically called the onReceive function even though no data was available. I should add that I wanted to go step by step and fiirst figure out onReceive, then onRequest. That's where the problem above might have come from. Could you try this version out? should have the other parts fixed aswell as the getBytesRead() function src.zip P.S.: @ Spence this is a prototype, do not merge it yet. (Edit: seems like I accidentelly @ somone unrelated, if you see this message, SORRY)

ObviousInRetrospect commented 2 years ago

[edit: not sure what happened in the testing from the prior version of this post]. Looks like it is working.

standby:

CRC valid
fail: 0 pass: 1310
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ObviousInRetrospect commented 2 years ago

Did more testing. For these tests I sped up the reads to read at 10hz instead of 0.2hz. This is with test pattern data for which I took out the recalculation race condition.

sleep_mode_standby: (prev >50% corruption rate)

CRC valid
fail: 0 pass: 10576
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

also seeing perfect behavior in SLEEP_MODE_PWR_DOWN:

CRC valid
fail: 0 pass: 2506
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

will do additional testing removing some of the various delay()'s that I used to use to make it behave slightly better.

ObviousInRetrospect commented 2 years ago

so first off, THANK YOU, this is hugely better. The failure rate is down by a factor of at least 2500x.

Testing in prior post and this post was conducted with this version of the code: https://github.com/ObviousInRetrospect/DualMode/commit/0499bbe852f923fd49f5aff3fac51d8ff9a7a92f

As you can see most of the competing code is IFDEF'ed out.

Slave was running on an avr64dd32 on pins c2/c3 (10mhz) Master was running on a tiny3224 (4mhz) - not built with your new library, not entirely on purpose just didn't drop it into megaTinyCore yet. Both boards are well-decoupled 2 layer pcbs (running both 0.1 and 1uf on all power pins on the slave). Each board has 4.7k pull-ups to 2.8v vcc, so 2.35k.

A longer test of sleep_mode_pwr_down hit two types of problem. After 2770 successes the slave stopped responding until the reset button was hit, yielding 681 failures. In this failure mode the slave's main look keeps running. (The counts were not reset at this point). After 8179 successes another failure was hit in which the first transmission had incorrect bytes at the front and 0xFFs for about half.

This appears to be reproducible although the number of transactions in that the crash happens is variable.

CRC valid
fail: 0 pass: 2770
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah
rcv_crc:FFFFFFFF calc:C530B076
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
fail: 2 pass: 2770
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
... <omitting ~679 duplicate failures>
fail: 680 pass: 2770
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
ch1 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch2 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch3 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

<RESET SLAVE via RST pin>
CRC valid
fail: 681 pass: 2771
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

...

rcv_crc:44A2C3A5 calc:A55704A9
bad crc, will retry 2 times
2E2F3031323334A5C3A244FFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
CRC valid
fail: 682 pass: 8179
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
...
<note the bad run is the end of a 2nd chunk + 
buffer underrun after which it proceeded normally>

fail: 682 pass: 14496

below this line I changed CHUNK_SZ on the MASTER ONLY to 16. That is because the exactly matched reads and writes is less abusive to the library than real-world. The counters reset when the slave is reprogrammed.

in this run I had 2 corrupt messages and 2 crashes of the slave

CRC valid
fail: 0 pass: 989
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah
rcv_crc:44A2C3A5 calc:FF2E1BCB
bad crc, will retry 2 times
0102030405060708090A0B0C0D0E0F10
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
CRC valid
fail: 1 pass: 990
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

<the error was the first chunk read addresses 1-16 
instead of 0-15, that is relevant because it isn't a single 
bit-flip which could be noise or hardware. the two 0x01's 
in a row mean the problem is with the slave not the master.>

...
rcv_crc:FFFFFFFF calc:4BFBA3F3
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
fail: 3 pass: 5546

<reset sw>

fail: 69 pass: 7019
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah
rcv_crc:44A2C3A5 calc:FF2E1BCB
bad crc, will retry 2 times
0102030405060708090A0B0C0D0E0F10
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
CRC valid
fail: 70 pass: 7020
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

<this is the same corruption pattern seen earlier>

fail: 70 pass: 9662
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
ch1 bv:1284 sv:1798 ua:-1696 acc0:185207048/3340=8457.79mah acc1:286265102/4882=13072.77mah
ch2 bv:5396 sv:5910 ua:4704 acc0:454695192/7452=20764.41mah acc1:555753246/8994=25379.40mah
ch3 bv:9508 sv:10022 ua:11104 acc0:724183336/11564=33071.04mah acc1:825241390/13106=37686.02mah
rcv_crc:FFFFFFFF calc:9E7FF7D1
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
fail: 72 pass: 9662

<this is another slave crash + reset>

fail: 112 pass: 9662
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
ch1 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch2 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
ch3 bv:65535 sv:-1 ua:-400 acc0:-1/65535=-0.00mah acc1:-1/65535=-0.00mah
CRC valid
fail: 112 pass: 9663
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

...

CRC valid
fail: 112 pass: 23895
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

<reprogram to raise clock of slave to 10mhz, 
clean up the output, and take out the delay()>

CRC valid
fail: 0 pass: 4297
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:4BFBA3F3
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
fail: 2 pass: 4297

<slave is not responding over i2c. 
slave main loop is still running>
ObviousInRetrospect commented 2 years ago

running the master at 10x the speed of the slave (10mhz/1mhz), my perception is the crash rate goes up: (this is still 16 chunk read on the master and 32 chunk write on the slave)

CRC valid
fail: 0 pass: 3199
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:0493FDE6
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
20FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

...
<reset>

CRC valid
fail: 284 pass: 3441
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

<note this was only ~200 good>

CRC valid
fail: 538 pass: 11336
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE

<but this was ~8k>

CRC valid
fail: 592 pass: 14374
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

<another reasonably fast failure>

CRC valid
fail: 795 pass: 15507
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:0493FDE6
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
20FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
<reset>
...
RC valid
fail: 873 pass: 19928
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:0493FDE6
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
20FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
<reset>
...

CRC valid
fail: 932 pass: 21103
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
...
<reset>
CRC valid
fail: 1019 pass: 25165
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

CRC valid
fail: 1019 pass: 26367
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

rcv_crc:FFFFFFFF calc:0493FDE6
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
20FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
ObviousInRetrospect commented 2 years ago

And back to sleep_mode_standby, same slow (1mhz) slave fast (10mhz) master, no errors in 65k operations!!! that means this is totally usable in standby.

code version: https://github.com/ObviousInRetrospect/DualMode/commit/5f6f3e75b4ad620d8185442521d569b4a2592c6e:

CRC valid
fail: 0 pass: 65060
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244
SpenceKonde commented 2 years ago

You got this shit sorted?! You da man!!!I wouldn have had no idea how to deal with and "won't fix'ed" it because i didn't have a clue where to begin.

I'm sitting here futzing up my python code.

Oh byway panic major verion issue in DxCore is likely 1.5.1 not 1.5.0, to piss on everyone's parade

ObviousInRetrospect commented 2 years ago

@MX682X got it sorted, I'm just testing. Now to write an example that turns a t412 into an i2c dac while making correct use of sleep.

@SpenceKonde this fix should get a mention in the release notes after its merged ("problems with i2c slave mode in sleep_mode_standby resolved", unless @MX682X has a better note). I had given up on i2c slaves being able to sleep below idle as causing flakey behavior and stuck i2c buses, I am probably not the only one so calling out the change would be helpful.

I don't critically need sleep_mode_pwr_down if you don't feel like debugging that mode (in which case we should put a note in the documentation recommending against using that mode with i2c slaves) but I am happy to help run it down if you want to.

meanwhile the test hit 107k without error in sleep_mode_standby:

CRC valid
fail: 0 pass: 107039
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
3031323334A5C3A244

I'll stop it for now but try to run it overnight.

SpenceKonde commented 2 years ago

Thank you, I was clueless here.

Does this also need to bported to mTC?

There is generally insane demand for low power from users, with that being the number one complaint (lack of powermanagmentnt library Which Ill write once I fix my client's product iimplement the boards.txt generation fullly, release mTC 2.6, DxC1.5, design 6 more PCBs. amd get tje lights working to impress this lovely individual who'se good at input (while my lights struggle with input but will scream at output (I have a way to output WS2812 signals witout bitbanginging, I think)

ObviousInRetrospect commented 2 years ago

yes this also applies to mTC.

I'm not totally sold on the grand unified power management library. I think some examples and docs might cover quite a bit of it.

the biggest ROI for a library imo is a sleepy version of delay().

FWIW I also think Serial works less well in sleep mode standby than it should but I haven't characterized the problem yet. In theory it should support wake on serial but get a lot of corruption

MX682X commented 2 years ago

Yes, most definetely impacts mTC. Spence, you mentioned once you know someone from Microchip. Maybe it's worth pointing out that out so they might add it to their errata? That when waking up from sleep might have the BUSERR flag set when receiving the STOP condition.

A longer test of sleep_mode_pwr_down hit two types of problem. After 2770 successes the slave stopped responding until the reset button was hit, yielding 681 failures. In this failure mode the slave's main look keeps running. (The counts were not reset at this point). After 8179 successes another failure was hit in which the first transmission had incorrect bytes at the front and 0xFFs for about half.

Are this problems reproducable? Because if they aren't it will be really hard to find them

ObviousInRetrospect commented 2 years ago

Both problems are reproducible but variable time to reproduce. There are a few test traces above. always seems to happen within 10k cycles, usually much faster.

Of the two, the slave stopping responding entirely until reset is by far the worse bug and happens more often. Occasional data corruption can be worked around with a crc.

if you run my test code it will trigger both failures. no need for an ina3221. that code path is now mostly ignored if you #define TEST_PATTERN and the parts that aren't ignored fail harmlessly.

Meanwhile in overnight testing:

CRC valid
fail: 0 pass: 3476989

Re the errata, I have had better luck with Microchip support than I expected (as a non-entity). If you open a documentation case asking where the behavior is documented you might actually get somewhere.

MX682X commented 2 years ago

Well... I've let your code run on a AVR128DA48 on Curiosity Nano and a Attiny1614, also connected to a debugger, and had no problems so far with SLEEP_MODE_POWER_DOWN. I tried the maximum CPU Frequency aswell as 10MHz on DA, 4 MHz on 1614. It's 100k+ passed CRCs. I had more problems getting the correct UART BAUD rate on the 1614 at 4MHz then anything else. To be fair, I was using a slightly modified version of the Wire library. Let's hope that's the reason why I can't reproduce your error, because otherwise, I don't think I will be able to find the error.

src.zip

SpenceKonde commented 2 years ago

Is that a version of the wire library that I should pull into the core?

ObviousInRetrospect commented 2 years ago

reverse the frequency: run the slave (DA) at 1mhz and the master (1614) at 10mhz

ObviousInRetrospect commented 2 years ago

New code give me:

Users/user/Library/Arduino15/packages/DxCore/hardware/megaavr/1.4.10/libraries/Wire/src/twi_pins.c:468:68: error: 'PORTB' undeclared (first use in this function); did you mean 'PORT0'?
     PORT_t *port = (PORT_t *)((portmux == PORTMUX_TWI1_ALT2_gc) ? &PORTB : &PORTF);
                                                                    ^~~~~
                                                                    PORT0
/Users/user/Library/Arduino15/packages/DxCore/hardware/megaavr/1.4.10/libraries/Wire/src/twi_pins.c:468:68: note: each undeclared identifier is reported only once for each function it appears in

So repro attempt 1 (old code):

CRC valid
fail: 0 pass: 199

CRC valid
fail: 0 pass: 200

rcv_crc:FFFFFFFF calc:20BEDCF9
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F
30FFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

repro attempt 2:

CRC valid
fail: 0 pass: 1447

CRC valid
fail: 0 pass: 1448

CRC valid
fail: 0 pass: 1449

rcv_crc:FFFFFFFF calc:C530B076
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
fail: 2 pass: 1449
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

these were running both sides at 10mhz

ObviousInRetrospect commented 2 years ago

and with the new code with the following patch applied:

--- /Users/user/Downloads/src-2/twi_pins.c  2022-08-17 15:34:06.000000000 -0700
+++ /Users/user/Library/Arduino15/packages/DxCore/hardware/megaavr/1.4.10/libraries/Wire/src/twi_pins.c 2022-08-17 19:43:12.000000000 -0700
@@ -464,9 +464,10 @@

 void TWI1_usePullups() {
   #if defined(TWI1)
-    uint8_t portmux = PORTMUX.TWIROUTEA & PORTMUX_TWI1_gm;
-    PORT_t *port = (PORT_t *)((portmux == PORTMUX_TWI1_ALT2_gc) ? &PORTB : &PORTF);
-
+    //uint8_t portmux = PORTMUX.TWIROUTEA & PORTMUX_TWI1_gm;
+    //PORT_t *port = (PORT_t *)((portmux == PORTMUX_TWI1_ALT2_gc) ? &PORTB : &PORTF);
+    uint8_t temp = PORTMUX.TWIROUTEA & PORTMUX_TWI1_gm;
+    PORT_t *port = portToPortStruct(temp == 8?PB:PF);
     port->OUTCLR = 0x0C;  // bits 2 and 3
     port->PIN2CTRL |= PORT_PULLUPEN_bm;
     port->PIN3CTRL |= PORT_PULLUPEN_bm;

try 1:

CRC valid
fail: 0 pass: 2002

CRC valid
fail: 0 pass: 2003

CRC valid
fail: 0 pass: 2004

rcv_crc:FFFFFFFF calc:FEA3EF34
bad crc, will retry 2 times
000102030405060708FFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

try 2:

CRC valid
fail: 0 pass: 5448

CRC valid
fail: 0 pass: 5449

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

try 3: (timed at about 4 seconds)

CRC valid
fail: 0 pass: 481

rcv_crc:FFFFFFFF calc:8088E2CE
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1BFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
ObviousInRetrospect commented 2 years ago

just to verify: you applied this patch to DualModeExample.ino:

@ -326,7 +326,7 @@ void setup() {
  pinMode(P_LED, OUTPUT);
  digitalWriteFast(P_LED, 1);
  RTC_init();
-  set_sleep_mode(SLEEP_MODE_STANDBY);
+  set_sleep_mode(SLEEP_MODE_PWR_DOWN);
  sleep_enable();
#ifdef ENABLE_SCAN     // i2c scan
  byte error, address; // variable for error and I2C address

And are running the slave (DualModeExample.ino) in Master & Slave (dual mode) on the DA using DxCore 1.4.10 and a master on a t1614?

Have you tried it without the debugger enabled?

Is the Curiosity Nano a DM164151? (https://www.digikey.com/en/products/detail/microchip-technology/DM164151/12177481) - if so I will pick up a copy. I'm sure the 1614 vs 3224 doesn't matter but have 1614s lying around (ironically I just desoldered the 1614 from the board acting as a master and replaced it with a 3224).

MX682X commented 2 years ago

I did some changes in the setup function :

void setup() {
  memset(reg.r, 0, sizeof(ewdt_regs_t));
  memset(bak.r, 0, sizeof(ewdt_regs_t));
  TCB2.CTRLA |= 1 << TCB_RUNSTDBY_bp;

  Wire.usePullups();
  Wire.enableDualMode(false);

  Wire.onReceive(receiveHandler);
  Wire.onRequest(requestHandler);
  Wire.pins(PIN_PA2, PIN_PA3);
  Wire.begin();
  Wire.begin(0x42);

  Serial1.begin(115200);
  Serial1.println("boot");

  ina_reset();
  ina_config();

  pinMode(P_LED, OUTPUT);
  pinMode(PIN_PC6, OUTPUT);  (Built in LED)
  digitalWriteFast(P_LED, 1);
  digitalWriteFast(PIN_PC6, LOW);

  RTC_init();
  set_sleep_mode(SLEEP_MODE_PWR_DOWN);
  sleep_enable();

#ifdef ENABLE_SCAN     // i2c scan
  ... is the same
#endif

  #ifdef TEST_PATTERN
    rd_ina3221();
  #endif
}

also:

while (!wake && !Wire.slaveTransactionOpen()) {
    Serial1.flush();
    digitalWriteFast(PIN_PC6, LOW);
    sleep_cpu();
    digitalWriteFast(PIN_PC6, HIGH);
    //delay(1);
  }

then in the DualModeExampleClient

void setup() {
  // put your setup code here, to run once:
 Wire.usePullups();
  Wire.swap(0);
  Wire.begin();

  Serial.begin(115200);

  byte error, address; // variable for error and I2C address
  int nDevices;

  do{
  //rest is same

Yes, I'm using two DM164151 to debug both chips. (You'll need to change some settings in Microchip studio to debug any other chips). DA is running at 4MHz. Core didn't want to work at 1MHz - a bug? 1614 is running at 10MHz (didn't check the actual frequency, I trust the Board settings)

Also I didn't expect any chip with TWI1 to not have a PORTB, so that's where the error came from,

Without a debugger running, I only got CRC errors on startup, then it run error free 100k+ times. (I've used the Debugger to reset them before, didn't bother to add a pin IRQ driven software Reset.

ObviousInRetrospect commented 2 years ago

I have a DM164151 coming tomorrow mostly to see if it reproduces on a 128DA48 - wouldn't shock me if there were differences (and I have a couple ICE3s lying around if I need to watch more than one MCU). I was able to repro it with a debugger attached although I am have not a clue what I should be looking for. On AVR I mostly do printf debugging because I am usually on a Mac and mplab-x is horribly slow and crashes too often for me to use it. But if there is something you want me to inspect I can.

MX682X commented 2 years ago

I was able to repro it with a debugger attached although I am have not a clue what I should be looking for.

I had my Oscilloscope connected to the data lines and a breakpoint when the CRC did not match, so I would see what was really transmitted (best way to see where the corruption originated). If you don't have a way to monitor the datalines, the best way would be to have both chips hit a break point when the CRC didn't match and compare the buffers.

ObviousInRetrospect commented 2 years ago

So I just repro'ed it 3x on a DM164151 very quickly. So much for my theory this is DB only or something wrong with my hardware.

Have you tried running the code as-is (set to PWR_DOWN and with #define TEST_PATTERN without the debugger and without the gpio modifications just to see if you can repro the failure?

I am using dual mode and 2x master and slave (C2/C3 is where the slave that is failing is).

I could rig up a logic analyzer but I was assuming that the slave is just not responding and the master is thus clocking in 0xFF's as it would when talking to a non-existent slave. I just went ahead and set a flag in request handler and receive handler that triggers a print in the loop and can confirm that neither handler is being invoked once the failure happens.

I am not sure how you are setting a breakpoint on the slave when the master detects corruption.

CRC valid
fail: 0 pass: 226

rcv_crc:FFFFFFFF calc:6B399483
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
101112131415161718191A1B1C1D1E1F
202122232425262728292AFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
fail: 2 pass: 226
CRC valid
fail: 0 pass: 257

rcv_crc:FFFFFFFF calc:9E7FF7D1
bad crc, will retry 2 times
000102030405060708090A0B0C0D0E0F
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

CRC valid
fail: 1936 pass: 2056

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE

I assume you are 100% sure you are going into SLEEP_MODE_PWR_DOWN and something about the debugger setup isn't keeping it from sleeping?

ObviousInRetrospect commented 2 years ago

DxCore.megaavr.avrda.zip is a built copy for the DM164151 - that might be useful to figure out why mine crashes constantly and yours doesn't - if you write that does it crash? This is set to 10mhz and the slave is on PC2/PC3. Pull-ups are not enabled.

DualModeExample.zip is the exact inputs that generated it

MX682X commented 2 years ago

I think I found the problem. Could you confirm it by uncomenting TWI_TIMEOUT_ENABLE in twi.h for the device that runs the DualModeExampleClient sketch and tell me if you still have the problems

MX682X commented 2 years ago

The problem was actually in your code. EDIT: just double checked, the reason the code on my end worked and not on yours, was because you had

if(printflag){
    Serial.print("+");
    printflag=0;
  }

But it was missing in the github version

This created a race condition with

while (!wake && !Wire.slaveTransactionOpen()) { 
    Serial.flush(); 
    sleep_cpu();
    //delay(1);
  }

RigolDS3

Basically, your code prepares to sleep before onRequest is started, and sleeps after it was started. You have to move Serial.flush out of this while loop. Or Increase the speed of Serial.

MX682X commented 2 years ago

furthermore, I recommend to do check, if the expected amount of bytes were received, like this: if ( len != Wire.requestFrom(devaddr, (size_t)len) )

ObviousInRetrospect commented 2 years ago

Really appreciate all the help with this issue. I think you found the root cause but I don't think we have a working fix.

I actually added the code you flagged:

if(printflag){
    Serial.print("+");
    printflag=0;
  }

Between the second and third repro (see my comment above "I just went ahead and set a flag in request handler and receive handler that triggers a print in the loop and can confirm that neither handler is being invoked once the failure happens."), it wasn't required for the problem or any of the earlier traces reported on previous days (but did make it happen much faster). It was added to confirm that once the slave stopped responding the handlers also were no longer being invoked.

Unfortunately, the change you proposed:

  while (!wake && Wire.slaveTransactionOpen()) {
    delay(1);
  }
  if(!wake){
    Serial.flush();
  }
  while (!wake && !Wire.slaveTransactionOpen()) {
    sleep_cpu();
    //delay(1);
  }

Still lets me repro the problem, although it cuts the incidence rate by many orders of magnitude. I would have called it fixed if I hadn't gotten (un?)lucky with the first failure happening very quickly:

CRC valid
fail: 0 pass: 2957

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF

because the second failure took 144k cycles to happen:

CRC valid
fail: 3840 pass: 144212

CRC valid
fail: 3840 pass: 144213

rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 2 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
rcv_crc:FFFFFFFF calc:33E718CE
bad crc, will retry 1 times
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
fail: 3842 pass: 144213

I'll note that this new version still has a much narrower version of the same race condition in that an interrupt can fire between !Wire.slaveTransactionOpen() and the sleep_cpu() call. That was sort of bugging me when I wrote it but I have no clue how to solve that issue as it seems like the usual solution of adding cli()/sei() would just cause the mcu to sleep forever. Based on what you wrote above this is probably the source of the problem - the slave stops responding forever in this case.

Is there a safe way to write this check? If not, would adding Wire.sleepIfSlaveTransactionNotOpen() at least shrink the race condition further? (by removing the function call return overhead between the check and the sleep). I assume an assembly version of this that did a skip-if-bit-set followed by a sleep is about as good as we can get.

If there isn't a way to remove this race condition, is there a way on the slave side to determine this has happened and fix things? I assume that any fix would require a second wake source beyond the now-defunct twi interrupt, probably PIT, which will greatly increase power use for some applications.

I think the other fixes you are proposing are on the master side and would cause the master to either detect the slave not responding or hang and don't solve the issue of once the slave hits this case it dies.

Including another zip so exactly what I am running is clear. DualModeExample.zip

Thanks again!

ObviousInRetrospect commented 2 years ago

Edit: Currently testing

  while (!wake && !Wire.slaveTransactionOpen()) {
    cli();
    if(!Wire.slaveTransactionOpen()){
      sei();
      sleep_cpu();
    }
    else{
      sei();
    }
    //delay(1);
  }

http://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf says the instruction following SEI will be executed before any pending interrupts so assuming the compiler generated reasonable code this should actually be a fix!

For clarity I have not made the proposed changes to DualModeExampleClient in the above failures - I agree they would stop it from detecting corruption and instead cause it to hang or detect no data, that doing so might be better, and that DualModeExample would still rarely go from being an i2c slave to a paperweight when the race condition is triggered if run in SLEEP_MODE_PWR_DOWN.

Do you have a version of the library that can be linked to from the docs (that handles the TWI1 without PORTB issue)? I kind of want to do a documentation update saying that in versions prior to 2.6/1.5 either use sleep_mode_idle or upgrade your library to ?

MX682X commented 2 years ago

if(!wake){ try it without this. always make sure that there is no ongoing serial transaction, then make sure no flag is set.

If that won't work, you could try to replace the function with this:

uint8_t TwoWire::slaveTransactionOpen() {
   if (vars._module->SSTATUS & (TWI_DIR_bm | TWI_AP_bm) return 2;
   if (vars._module->SSTATUS & TWI_AP_bm) return 1;
   return 0;
}
ObviousInRetrospect commented 2 years ago

actually trying:

  while (!wake && !Wire.slaveTransactionOpen()) {
    cli();
    if(!Wire.slaveTransactionOpen()){
      sei();
      sleep_cpu();
    }
    else{
      sei();
    }
    //delay(1);
  }

http://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf says the instruction following SEI will be executed before any pending interrupts so assuming the compiler generated reasonable code this should actually be a fix!

ObviousInRetrospect commented 2 years ago

Rather than assuming anything about avr-gcc I inspected the assembly. It looks right. This is actually a fix.

    171e:   f8 94           cli
    1720:   0e 94 81 01     call    0x302   ; 0x302 <TwoWire::slaveTransactionOpen() [clone .constprop.6]>
    1724:   81 11           cpse    r24, r1
/Users/user/Documents/GitHub/DualMode/DualModeExample/DualModeExample.ino:392
    cli();
    if(!Wire.slaveTransactionOpen()){
    1726:   03 c0           rjmp    .+6         ; 0x172e <main+0x6e8>
    1728:   78 94           sei
    172a:   88 95           sleep
    172c:   fb ce           rjmp    .-522       ; 0x1524 <main+0x4de>

I'm certainly happy to throw together an example and some documentation but I think it would be better if Wire provided an easy to use sleepUnlessSlaveTransactionOpen() function that encapsulated this ugliness. I2C slaves are often sleepy. They also are sufficiently Wire-centric that it doesn't feel wrong to delegate the sleep call to Wire.

MX682X commented 2 years ago

weird. My compiler does not generate a call instruction... It inlines everything. Do you use 1x Wire or 2x Wire?

ObviousInRetrospect commented 2 years ago

2x at the moment. Might be why I was having an easier time reproducing it.

ObviousInRetrospect commented 2 years ago

so far I'm around 300k without a failure with the cli/sei added, I think its a fix but will obviously run it for many hours / overnight.

ObviousInRetrospect commented 2 years ago

Do you have a version of the library that can be linked to from the docs (that handles the TWI1 without PORTB issue)? I kind of want to do a documentation update saying that in versions prior to 2.6/1.5 either use sleep_mode_idle or upgrade your library to (link here)

MX682X commented 2 years ago

Nope. I don't have a standalone library, but the TWI1/PORTB issue is already fixed locally.