ermtl / Open-Source-Ventilator

Complete control software for an emergency medical ventilator.
GNU General Public License v3.0

Program / EEPROM memory integrity check library #6

Open ermtl opened 4 years ago

ermtl commented 4 years ago

Sometimes, mostly after a few years (or sooner in the case of a hardware defect), a processor's program memory or its EEPROM starts to lose its content, sometimes in elusive, voltage- and/or temperature-dependent ways. When that happens, the program will behave erratically, hang, or fail to perform its intended function. The same can happen if the EEPROM is corrupted.

In a project where the machine's proper functioning is critical, this can't be allowed to happen.

Newer AVR processors (and others) offer hardware program / EEPROM memory integrity check.

A complete description can be found in the ATMEGA4809 datasheet starting at page 370.

For processors that don't have it, the check can be implemented in software. The basic idea is to scan the whole program memory and calculate a CRC. The computed CRC is compared to several copies of the previously calculated value stored in multiple locations in the EEPROM. If the check fails, the program acts accordingly: generally it should just write an error code to EEPROM and stop. Upon the next start, an error message should be displayed to alert the user that the machine could be unreliable. The same can be done for the EEPROM (as a separate check, since its CRC has to be recomputed after each EEPROM update).
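A minimal sketch of the software check described above, written in portable C so it can be tried on a host first. The names `crc32_buf` and `image_is_intact` are hypothetical, and accepting the image when a majority of the stored copies agree is an assumption; the text only says the CRC is compared to several stored copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, the zlib/Ethernet variant).
   Slow but table-free, which keeps the flash footprint small. */
uint32_t crc32_buf(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Returns 1 when the CRC of the image matches a strict majority of the
   stored reference copies, 0 otherwise (assumed policy, see lead-in). */
int image_is_intact(const uint8_t *image, size_t len,
                    const uint32_t *stored, size_t n_copies)
{
    assert(stored != NULL || n_copies == 0);
    uint32_t crc = crc32_buf(image, len);
    size_t matches = 0;
    for (size_t i = 0; i < n_copies; i++) {
        if (stored[i] == crc) {
            matches++;
        }
    }
    return matches * 2 > n_copies;
}
```

On a failed check, the caller would then write the error code and stop, as described above. A stricter policy could also treat any disagreement among the stored copies as an error worth reporting.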

I could not find a library that implements this. Would anyone be interested in programming one (as a separate, independent project)?

ZakCodes commented 4 years ago

EEPROM corruption is probably the most likely failure to happen, since the MCU could shut down while writing to the EEPROM and corrupt it.

However, here's what the ATMEGA328P's datasheet says about data retention:

Reliability qualification results show that the projected data retention failure rate is much less than 1 PPM over 20 years at 85°C or 100 years at 25°C.

Since this project is meant for emergency use and the MCU shouldn't be in service for 20 years, we can expect even fewer than 1 PPM to fail. It would have been nice of Atmel/Microchip to give us a graph of the failure rate in PPM as a function of age and temperature.

Blimpyway commented 4 years ago

I recall discussions on the Arduino forum where people tested the Arduino's EEPROM and found it failing only after as many as 1M write cycles. The spec is 100k cycles. So I wouldn't worry too much. If power fails, there is a more immediate problem than what happens when the Arduino reboots.

Blimpyway commented 4 years ago

A blank EEPROM is all 0xFF, I think. A power failure exactly in the sub-millisecond window the Arduino needs to update a value is very unlikely.

ermtl commented 4 years ago

Most Arduino projects are controlling gizmos that blink. This one is about people's lives. A bug that results in overpressure in the lungs can kill the patient or damage their lungs and leave them short of breath for life. If you were the patient, would you take "I wouldn't worry too much" for an answer? That's why automotive and medical designs require these kinds of checks.

ZakCodes commented 4 years ago

@Blimpyway I agree. It is really unlikely, especially considering that data is only written to the EEPROM when a user changes the configuration. However, it shouldn't be too hard to compute a CRC of the EEPROM every time we write to it, so I think it's worth it if it can prevent an accident.

Blimpyway commented 4 years ago

OK. Simply XOR-ing (or even adding) 255 32-bit int chunks and writing the result in the 256th should be more than enough. On reboot, XOR again; if the sum does not match, just beep an alarm and do nothing. Whoever operates it will have to dial in all the settings again. If that is OK with you, I'll write it.

ermtl commented 4 years ago

There are reasons why CRCs exist: they are much more robust than a basic XOR. When memory corruption errors occur, they often come in clusters. If 2 bytes are affected, a simple XOR check may not see it, because the two errors can cancel each other out. That's why CRC16 or CRC32 is needed, and the check has to be done on the program memory as well as the EEPROM.
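To make the difference concrete, here is a small host-side sketch. The function names are hypothetical, and the CRC-32 variant (reflected polynomial 0xEDB88320) is an assumption, since the thread does not name a specific polynomial. Flipping the same bit in two different 32-bit words leaves the XOR sum unchanged, while the CRC-32 changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain XOR of all 32-bit words: two identical bit flips cancel out. */
uint32_t xor_sum32(const uint32_t *words, size_t n)
{
    assert(words != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum ^= words[i];
    }
    return sum;
}

/* Bitwise CRC-32 over a byte buffer (reflected polynomial 0xEDB88320). */
uint32_t crc32_buf(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u); /* all ones if low bit set */
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Returns 1 if corrupting bit 5 of words[1] and words[2] goes unnoticed
   by the XOR sum but is caught by the CRC. Expects at least 3 words.
   The buffer is restored before returning. */
int xor_misses_crc_catches(uint32_t *words, size_t n)
{
    assert(words != NULL && n >= 3);
    uint32_t xor_before = xor_sum32(words, n);
    uint32_t crc_before = crc32_buf((const uint8_t *)words, n * sizeof words[0]);

    words[1] ^= 0x20u;
    words[2] ^= 0x20u;
    int xor_same = (xor_sum32(words, n) == xor_before);
    int crc_diff = (crc32_buf((const uint8_t *)words, n * sizeof words[0]) != crc_before);
    words[1] ^= 0x20u;
    words[2] ^= 0x20u;

    return xor_same && crc_diff;
}
```

An additive checksum has similar blind spots, for example when one word increases by the amount another decreases.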

The mindset when developing safety critical devices is very different from the maker mindset where the job is considered done as soon as the main function kind of works. Here, we must be sure that it will always work as intended.

https://en.wikipedia.org/wiki/Cyclic_redundancy_check
https://en.wikipedia.org/wiki/Hamming_distance

Blimpyway commented 4 years ago

failtest.zip

Here's a zip file with

And yes, it's using a slightly modified CRC32 sum I picked from the arduino.cc examples. The change just skips the EEPROM memory slot where the CRC itself needs to be saved. This address can be redefined in ee_failsafe.ino, e.g.

#define SKIP_ADDRESS 16
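The contents of failtest.zip are not reproduced here, so the following is only a sketch of the skip-the-slot idea, not the attached code. It runs against a plain byte array standing in for the EEPROM. `SKIP_ADDRESS` comes from the comment above; `CRC_SIZE`, the little-endian storage order, and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKIP_ADDRESS 16u /* slot holding the CRC, value taken from the thread */
#define CRC_SIZE      4u /* assumption: the CRC occupies 4 consecutive bytes */

/* CRC-32 over the whole array except the bytes that hold the CRC itself. */
uint32_t eeprom_crc(const uint8_t *eeprom, size_t len)
{
    assert(eeprom != NULL);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        if (i >= SKIP_ADDRESS && i < SKIP_ADDRESS + CRC_SIZE) {
            continue; /* do not feed the stored CRC into the CRC */
        }
        crc ^= eeprom[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Recompute the CRC and store it (little-endian) in the reserved slot.
   Meant to be called after every settings update. */
void eeprom_seal(uint8_t *eeprom, size_t len)
{
    assert(len >= SKIP_ADDRESS + CRC_SIZE);
    uint32_t crc = eeprom_crc(eeprom, len);
    for (unsigned k = 0; k < CRC_SIZE; k++) {
        eeprom[SKIP_ADDRESS + k] = (uint8_t)(crc >> (8u * k));
    }
}

/* Returns 1 when the stored CRC matches the recomputed one. */
int eeprom_is_valid(const uint8_t *eeprom, size_t len)
{
    if (len < SKIP_ADDRESS + CRC_SIZE) {
        return 0;
    }
    uint32_t stored = 0;
    for (unsigned k = 0; k < CRC_SIZE; k++) {
        stored |= (uint32_t)eeprom[SKIP_ADDRESS + k] << (8u * k);
    }
    return stored == eeprom_crc(eeprom, len);
}
```

On the real device the array accesses would be replaced by EEPROM reads and writes; that part is left out because it depends on the board support library in use.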

Blimpyway commented 4 years ago

Feel free to ask me if anything isn't clear.

Blimpyway commented 4 years ago

If we're to share thoughts about critical systems guidelines, then people's lives should not depend on software/mechanics developed in <4 weeks, no matter how good the designers are.

ermtl commented 4 years ago

That was fast! I'll test it tonight. Can you also add the ability to scan the program memory? The scan should only run up to the end of the program; otherwise, leftovers from previously written programs will appear as random garbage and the value will not be the same from one device to the next.
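A rough sketch of why the scan has to stop at the end of the program. Flash is simulated here with a plain array (`sim_flash`, `sim_read_byte` and `program_crc` are all hypothetical names). On avr-gcc the reader callback would presumably wrap `pgm_read_byte` from `<avr/pgmspace.h>`, and `program_end` would have to come from the build (for example a linker symbol or the reported program size); both are assumptions that have not been checked against this project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads one byte of program memory at the given address. */
typedef uint8_t (*read_byte_fn)(uint32_t addr);

/* CRC-32 (reflected polynomial 0xEDB88320) over addresses
   0 .. program_end - 1 only; anything beyond is ignored. */
uint32_t program_crc(read_byte_fn read_byte, uint32_t program_end)
{
    assert(read_byte != NULL);
    uint32_t crc = 0xFFFFFFFFu;
    for (uint32_t addr = 0; addr < program_end; addr++) {
        crc ^= read_byte(addr);
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Host-side stand-in for the flash array, for experimentation only. */
#define SIM_FLASH_SIZE 64u
uint8_t sim_flash[SIM_FLASH_SIZE];

uint8_t sim_read_byte(uint32_t addr)
{
    return sim_flash[addr % SIM_FLASH_SIZE];
}
```

With the scan bounded this way, two devices running the same program produce the same CRC even if the unused flash behind the program holds different leftovers.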

I agree with you about the short time. Unfortunately, we don't get to decide about that; that's why it's an emergency. Normally such stuff takes months, if not years, for a whole team to develop.

Blimpyway commented 4 years ago

If you really want it to be safe, approach it functionally. I mean have a function that checks that correct pressure swings are registered, and reset the WDT in the same place. Medics already want to be alerted if pressure gets too low or too high. Add an alert for pressure not changing. This covers pressure sensor / I2C failure, air leakage, a burst bag, an unplugged stepper and whatever other part of the code is not going right. If pressure swings are recorded at the mouth of the patient, then chances are s/he is not dead yet. If sensor contamination is acceptable (it might be autoclavable since it is made to resist soldering), it can even sense exhalation temperature or, at least, it can tell that the patient isn't entirely cool.
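One way the functional check above could look, as a sketch only. All names, the unit (tenths of cmH2O) and the limit values used in any example are assumptions, and the actual watchdog reset (for instance `wdt_reset()` from avr-libc's `<avr/wdt.h>`) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    PRESSURE_OK,
    ALARM_TOO_LOW,
    ALARM_TOO_HIGH,
    ALARM_NO_SWING
} pressure_status_t;

/* Limits in tenths of cmH2O (assumed unit). */
typedef struct {
    int16_t min_allowed; /* lowest acceptable reading in a cycle */
    int16_t max_allowed; /* highest acceptable reading in a cycle */
    int16_t min_swing;   /* smallest max-min difference that counts as breathing */
} pressure_limits_t;

/* Running min/max of the readings seen during one breath cycle. */
typedef struct {
    int16_t  min_seen;
    int16_t  max_seen;
    uint16_t samples;
} swing_monitor_t;

void monitor_reset(swing_monitor_t *m)
{
    assert(m != NULL);
    m->min_seen = INT16_MAX;
    m->max_seen = INT16_MIN;
    m->samples = 0;
}

void monitor_add(swing_monitor_t *m, int16_t pressure)
{
    assert(m != NULL);
    if (pressure < m->min_seen) m->min_seen = pressure;
    if (pressure > m->max_seen) m->max_seen = pressure;
    if (m->samples < UINT16_MAX) m->samples++;
}

/* Evaluate one completed cycle. No samples at all is treated as "no swing". */
pressure_status_t monitor_evaluate(const swing_monitor_t *m,
                                   const pressure_limits_t *lim)
{
    assert(m != NULL && lim != NULL);
    if (m->samples == 0) return ALARM_NO_SWING;
    if (m->max_seen > lim->max_allowed) return ALARM_TOO_HIGH;
    if (m->min_seen < lim->min_allowed) return ALARM_TOO_LOW;
    if ((int32_t)m->max_seen - m->min_seen < lim->min_swing) return ALARM_NO_SWING;
    return PRESSURE_OK;
}

/* The watchdog is only allowed to be reset after a healthy cycle. */
int may_reset_watchdog(pressure_status_t status)
{
    return status == PRESSURE_OK;
}
```

Tying the watchdog reset to `may_reset_watchdog` means a stuck sensor or a hung control loop ends in a reset or an alarm rather than silent operation.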

mawaisbadri commented 4 years ago

One solution to the EEPROM write problem is a soft shutdown circuit, so that the MCU itself can shut down the whole system.

A 2nd precaution is to save EEPROM values in multiple locations at the same time. That means each value is saved in three EEPROM locations. On reading, read all three locations and check whether they are the same.

A 3rd one is to add separate safeguard circuits for critical things. Hope it helps.
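The triple-copy idea from the second point could be sketched like this. The comment only asks whether the three copies are the same; falling back to a two-out-of-three majority is an added assumption, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    VOTE_UNANIMOUS, /* all three copies agree */
    VOTE_MAJORITY,  /* two agree; the third copy should be rewritten */
    VOTE_FAILED     /* all three differ; the value cannot be trusted */
} vote_result_t;

/* Compare three stored copies of one byte. On VOTE_UNANIMOUS or
   VOTE_MAJORITY the agreed value is written to *out; on VOTE_FAILED
   *out is left untouched. */
vote_result_t vote3(uint8_t a, uint8_t b, uint8_t c, uint8_t *out)
{
    assert(out != NULL);
    if (a == b && b == c) {
        *out = a;
        return VOTE_UNANIMOUS;
    }
    if (a == b || a == c) {
        *out = a;
        return VOTE_MAJORITY;
    }
    if (b == c) {
        *out = b;
        return VOTE_MAJORITY;
    }
    return VOTE_FAILED;
}
```

A stricter design could treat `VOTE_MAJORITY` as an alarm condition as well, since it already indicates that one location has gone bad.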