Safecast / onyxfirmware

20 stars 8 forks source link

Serial race condition? #16

Open LucidOne opened 11 years ago

LucidOne commented 11 years ago

I am looking at adding support for accessing cpm over serial 1

void cmd_cpm(char *line) {
  char str[16];
  sprintf(str,"%.3f",system_geiger->get_cpm());
  serial_write_string(str);
}

If I run this command multiple times the device will lock up. KEYVALDUMP also as the same problem, if I run it over and over again the device locks up.

Any ideas where to start looking to fix this?

new299 commented 11 years ago

I don't think @IHeartRobotics pull request was supposed to fix this bug, so I'm reopening for the moment until we can verify that it's fixed.

elafargue commented 11 years ago

This is not fixed as far as I can tell: I just added "live view" in my Onyx viewer app, and the device locks up consistently after a minute or two of doing "GETCPM" once a second.

LucidOne commented 11 years ago

I wonder if it is locking when the geiger receives a count interrupt at the same time a read is happening.

elafargue commented 11 years ago

I did further tests on the latest dev branch: the counter consistently locks up within 2 minutes when sending GETCPM once per second, and sooner if I send the command faster.

IMHO this needs to be investigated pretty quick, since it influences reliability both on the CLI front, and also logging reliability.

new299 commented 11 years ago

Trying to track this down now.

new299 commented 11 years ago

I think the issue is with calls to the mpr121 (captouch controller) over i2c. These appeared to /sometimes/ hanging when the serial port was put under load, I'm not sure of the exact cause.

I've removed the non-interrupt driven calls to the mpr121, they were only required for faulty captouch devices we had in a couple of prototypes. I've tested GETCPM once a second and it no longer seems to hang on my device (with about 15mins testing currently).

Let me know if this resolves the issue for you guys.

elafargue commented 11 years ago

Just tested, it does seem to do the trick for me! Thanks Nava.

elafargue commented 11 years ago

After further tests: it looks like this change did something to the button on the back of the counter too, it does not seem to work properly on that version. The counter battery just starts to drain, the screen is dark, and the counter does not go into sleep mode properly. Reverting to the "12.12-devel" version solves this issue.

Does that sound like something that could have been introduced by this change?

new299 commented 11 years ago

@elafargue after some debugging it looks like this might be an issue with the changes in commit 071314bdfdf68a242b75c7a8ed0b547ef8e31003 "improve log format: Hall-effect sensor, initial fix". It could be the read mag sensor is initialized differently in logging mode, or the change in log sizes causes an issue. I've commented out the relevant changes and will revisit this later.

elafargue commented 10 years ago

This issue is now fixed as far as I can tell? No lockups for a long time on my device. Any other tester can confirm?