fesch / CanZE

Take a closer look at your ZE car
http://canze.fisch.lu

Dongle sometimes stops responding, usually with a fast blinking or not at all blinking Host LED #17

Closed yoh-there closed 9 years ago

yoh-there commented 9 years ago

On my S3, when I run the driving screen, things freeze (BT LED not flashing on the dongle for quite a while, at least 10 seconds or more). Could this be a timeout? I have a feeling we need a timeout value defined for every filter, as some raw packets flow very fast and others very slowly.

I could do the analysis per frame type if we agree?

ISO-TP sequences should have a time out in the order of 100 ms.

fesch commented 9 years ago

The ELM327 class has a read-timeout of 500ms. The method "sendAndWaitForAnswer" implements this.

Maybe, and this will be harder to debug while driving around, we should check that when a timeout occurs, the next field is queried (fieldIndex is increased). I will try to move the corresponding code into a "finally" section ...
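
A minimal sketch of that "finally" idea, not the actual CanZE code. The `FieldPoller` class, the `Dongle` interface and the signature given to `sendAndWaitForAnswer` are assumptions made for illustration; only the method name, `fieldIndex` and the 500 ms value come from the comments above.

```java
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a polling loop; not the actual CanZE implementation. */
class FieldPoller {
    /** Hypothetical stand-in for the dongle query; the signature is an assumption. */
    interface Dongle {
        String sendAndWaitForAnswer(String command, int timeoutMs) throws TimeoutException;
    }

    private final List<String> commands;
    private int fieldIndex = 0;

    FieldPoller(List<String> commands) {
        this.commands = commands;
    }

    /** Queries the current field and always moves on to the next one. */
    String pollNext(Dongle dongle) {
        String command = commands.get(fieldIndex);
        try {
            return dongle.sendAndWaitForAnswer(command, 500);
        } catch (TimeoutException e) {
            return null; // no answer within the timeout
        } finally {
            // Runs on success and on timeout alike, so a silent field cannot stall the loop.
            fieldIndex = (fieldIndex + 1) % commands.size();
        }
    }

    int currentIndex() {
        return fieldIndex;
    }
}
```

With this shape, a field that never answers costs one timeout per cycle instead of blocking every field behind it.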

yoh-there commented 9 years ago

Oh yes, the debugging! 500 ms does not correspond at all with what I am seeing. Is this also for the free frames? Or are you simply waiting for something to come along? Let me know if you think it is needed to analyse the expected interval per frame type (for free frames) please.

fesch commented 9 years ago

Actually the method does this:

I used 500ms to be on the safe side. I once monitored the elapsed time a response needs to come along the serial line and yes, that was never more than 50ms or 60ms ...

yoh-there commented 9 years ago

Got it, thanks for the explanation. I will add some debugging to see if one specific frame type is the culprit.

yoh-there commented 9 years ago

The driving screen is very problematic, but I only pull data, just like any other field and just like the charging screen, which works pretty ok. I don't get it. Maybe you can run it for a short while? After a few cycles either the traffic LED goes off (while the App is still running, see the debug line, the last line of the screen, there is no activity), or there is what seems like very long traffic. Almost as if the ELM is pushed over the edge.

What is your method of debugging this? In Studio I see so insanely much debugging also from the phone itself, I must overlook some features I guess.

In the C-code, I implemented "restore order". When the code sees things going wrong, it resets the ELM through a fast reboot (atz is the slowest as it includes LED testing). atws is faster, it is an atz without the LED test. And then there is atd, which is somewhat like a reset but without going to the default settings.
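
A rough Java sketch of such a "restore order" escalation, trying the three commands named above from fastest to slowest. This is not the C code referred to; the `ElmRecovery` class, the `send` transport function, and the check for "OK" or "ELM327" in the reply are assumptions made for illustration.

```java
import java.util.function.Function;

/** Hypothetical sketch of a reset escalation; not taken from the C code or from CanZE. */
class ElmRecovery {
    /** Reset commands from the comment above, ordered from fastest to slowest. */
    static final String[] RESET_COMMANDS = {"atd", "atws", "atz"};

    /**
     * Tries each reset in turn until the dongle gives a plausible answer.
     *
     * @param send hypothetical transport: sends a command and returns the raw reply,
     *             or null when nothing came back in time
     * @return the command that restored order, or null if none did
     */
    static String restoreOrder(Function<String, String> send) {
        for (String command : RESET_COMMANDS) {
            String reply = send.apply(command);
            // Assumption: a healthy dongle answers with "OK" or with its "ELM327 ..." banner.
            if (reply != null && (reply.contains("OK") || reply.contains("ELM327"))) {
                return command;
            }
        }
        return null;
    }
}
```

Returning the command that worked makes it easy to log how deep the escalation had to go.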

yoh-there commented 9 years ago

Not entirely sure, but it seems the problems went away when I removed the queries to Pedal and ccPedal. Since we have never had an issue with Pedal, it's either how I use the progressBars (tbh, I doubt that, seems pretty straightforward to me?) or the query to ccPedal. If I were a betting man I'd put my money on the latter. I will do some more testing.

Best practice for debugging would of course still be appreciated.

yoh-there commented 9 years ago

I am painstakingly removing and adding filters to the driving screen to crack this. It doesn't seem to be the App, the debug line . While charging is fairly stable, driving keeps giving me a headache. It seems one of the filters is pushing the ELM over the edge and I suspect it has to do with timing and/or a specific order in which the filter loop is run. Stay tuned.

solmoller commented 9 years ago

Hi Jeroen,

Fyi My debug line spends considerable time on xxx206.24.. (...)

yoh-there commented 9 years ago

It seems it is the dongle itself that hangs. Order is not restored properly. Still not under control, removing listeners and adding more and more debug lines :-(

Am I the only one? Has any of you tried driving around for a few minutes with the "driving" screen? Curious if it is my specific dongle.

solmoller commented 9 years ago

I've done testing on a two hour commute, but that was last week :-) I had no issues apart from odometer looking strange now and then

Yesterday I tested for half an hour with no major problems

I'll probably test again Wednesday or Thursday

Henrik

yoh-there commented 9 years ago

Thank you Henrik! That makes me worry less. It is probably either the dongle, or the Zoe overloading it more than the Fluence.

fesch commented 9 years ago

I had the same problem today: no more data from the dongle. I restarted CanZE to force a reinitialisation, but no response until I pulled out the dongle and connected it back. Afterwards, using the same screen, I did not have any problems ...

yoh-there commented 9 years ago

Idea to test: In the optimization, the delay time between frames for multi-frame ISO-TP frames was at some point changed from an (arbitrary) 32 ms to 0 ms. Since the ELM has very limited RAM resources and can easily be flooded on the output side, I will put this back in and see if things improve. If they do improve, I will hunt for a reasonable optimum.

fesch commented 9 years ago

Correct me if I am wrong: with ISO-TP frames there is no flooding, as there will be no more answer frames than the answer to the request is long ... unlike with the atma command for free frames, where they will stream in until stopped.

yoh-there commented 9 years ago

Only true if flow control is performed on a per frame basis. Which is impossible to do with the ELM327 unless we figure out a way to avoid at fc xxx, .

Note that a really long answer, e.g. cell voltages, is simply pushed out at full speed by the ECU after the 300000 reply to the FIRST frame. So, I think we may need to throttle the NEXT frames a little bit. I will investigate today whether this solves the problem.
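
For reference, the 300000 reply is an ISO-TP flow control frame: the first byte (0x30) means "continue to send", the second is the block size, and the third is STmin, the minimum gap the sender should leave between consecutive frames (values 0x00 to 0x7F are milliseconds). A minimal sketch of building such a payload follows. The `FlowControl` helper is hypothetical, and whether the ECUs in the car actually honour a non-zero STmin is an assumption that would need testing.

```java
/** Hypothetical helper for building ISO-TP flow control payloads; not part of CanZE. */
class FlowControl {
    /**
     * Builds the 3-byte flow control payload as a hex string.
     *
     * @param blockSize frames allowed before another flow control is needed (0 = send all)
     * @param stMinMs   requested minimum gap between consecutive frames, 0..127 ms
     */
    static String build(int blockSize, int stMinMs) {
        if (blockSize < 0 || blockSize > 255) {
            throw new IllegalArgumentException("blockSize must be 0..255");
        }
        if (stMinMs < 0 || stMinMs > 127) {
            throw new IllegalArgumentException("stMinMs must be 0..127");
        }
        // 0x30 = flow control frame with flow status "continue to send"
        return String.format("30%02X%02X", blockSize, stMinMs);
    }
}
```

`FlowControl.build(0, 0)` gives the 300000 used above, and `FlowControl.build(0, 32)` gives 300020, which asks the sender for a 32 ms gap between frames.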

I am changing the subject title as I am now reasonably convinced it is the dongle and not the app going berserk.

yoh-there commented 9 years ago

Did a very quick test after merging the ELM optimizations, and it seems that the somewhat more lenient issuing of the NEXT frames gives the dongle indeed enough breathing room to continue. Too early to tell, but I am hopeful, as earlier, the dongle quit on my "driving" activity after less than a minute, even when parked. I will add the progress bars back and drive a bit longer tonight.

yoh-there commented 9 years ago

Making this change and doing ISO-TP queries exclusively improved stability, but not enough. I think this will hamper us in the long run so I have a request to all: am I the only one? In my case it seems to be prevalent in the driving screen, so that is the best one to "test". That might be coincidence though.

Further, I need to go back to the drawing board. Debugging this on my phone is really too tedious, so the coming days I will probably experiment more with the laptop, re-implementing what optimizations we did, and replaying the exact queries that seem to create the problem. At least ONE time I saw the dongle spewing ATMA data and the software not recognizing that and thus unable to restore order.

While I write this down I realize a few things:

solmoller commented 9 years ago

I just pulled and compiled and I can confirm that I also get minute-long hangs of the queries.

In addition Tacho and X10 screens crash immediately

Henrik

fesch commented 9 years ago

X10 screen is intended to only work if you select X10 as car. I use this for testing purposes to not interfere with the other fields.

BTW:

fesch commented 9 years ago

Ok, two things:

yoh-there commented 9 years ago

Great, will try that. If it is indeed the culprit, I think we need to make the timeout a Field property, since we know now per frame what it should be.

Edit: Duh, I completely misread the code. The wait is a WAIT BEFORE sending the command and of course that should not be 1500. So thank you for that catch!

The RESPONSE-timeout is a constant. There I disagree, but as it is a performance issue I will open a new issue for that.
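
A small sketch of what a per-field timeout could look like, with the constant as a fallback. The `TimedField` class is hypothetical and does not mirror CanZE's actual Field class; the 500 ms default is the value mentioned earlier in this thread.

```java
/** Hypothetical field carrying its own response timeout; not CanZE's Field class. */
class TimedField {
    /** Fallback matching the 500 ms read timeout mentioned earlier in this thread. */
    static final int DEFAULT_TIMEOUT_MS = 500;

    final String id;
    private final int timeoutMs; // 0 means "not specified"

    TimedField(String id, int timeoutMs) {
        if (timeoutMs < 0) {
            throw new IllegalArgumentException("timeoutMs must be >= 0");
        }
        this.id = id;
        this.timeoutMs = timeoutMs;
    }

    /** Returns the field's own timeout, or the default when none was given. */
    int effectiveTimeoutMs() {
        return timeoutMs > 0 ? timeoutMs : DEFAULT_TIMEOUT_MS;
    }
}
```

Fields without a known interval keep the current behaviour, so values could be filled in frame type by frame type as they are measured.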

More testing!

fesch commented 9 years ago

I wait for some testing results before closing this issue ...

yoh-there commented 9 years ago

Not solved unfortunately. The new braking screen (technically very straightforward) locked up quite often. What was interesting is that the debug line paused for a very long (several seconds) time too.

solmoller commented 9 years ago

Did some long term testing today, got stuck twice, at 7bb.6104.128 and 7ec.623206.24

I might have used the phone during the test - but doing that while driving the car would kinda have me not looking at the app meanwhile.

Strangely, on the way back I used the text based driving info and had no issues for the half hour I tested.

yoh-there commented 9 years ago

Thanks @solmoller , good info.

FYI, the debug info shows the last successful query, so we usually need to look to what is supposed to come after it. It drives me nuts!

Seems like we need to work towards a strategy of better detecting when things go wrong and then trying to reset as much as we can.

yoh-there commented 9 years ago

I just might have fixed this issue. Short explanation: when building the firmware screen, I noticed every time a single ECU did not respond (and 3 ECU's we have the addresses of NEVER respond at all; they are either firewalled or they don't exist), CanZE quit. One could see the dongle was still being queried, but no data was processed and it never reached the screen's activity.

I added a lot of error checking in the ELM module and now it resets not only the dongle but also downstream processing when things go wrong (either time out or unexpected data). See this commit for details https://github.com/yoh-there/CanZE/commit/91ebca311637294c429d7722e7a4808febd960c5

Please do check on all your dongles. It works on my KV902, but..... you never know what I broke.

yoh-there commented 9 years ago

I think I found another possible instability. Please check my thinking. I noticed when switching screens or exiting the app, sometimes the host LED kept flashing fast, even when the bluetooth link was disconnected (blue LED off). This leads me to believe the Bluetooth thread is simply killed. If this thinking is correct, an improvement might be to post a "stop" signal to the ELM device class instead, which would break the query loop. The device class could report back a "ok, I am done". In other words, once an ISO-TP or free frame sequence is started, it should be finished before killing the BT.

It is my impression the improved resetting already mitigates the issues, but it is a crude way of doing things.
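
A minimal sketch of the "stop signal" idea, assuming the query loop runs on its own thread. The `QueryLoop` class and its method names are hypothetical and not taken from CanZE. The loop only checks the flag between queries, so a sequence that has started always finishes before the loop reports that it is done.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical query loop that stops cooperatively instead of being killed. */
class QueryLoop implements Runnable {
    private volatile boolean stopRequested = false;
    private final CountDownLatch done = new CountDownLatch(1);
    private final Runnable oneQuery;

    /** @param oneQuery one complete ISO-TP or free frame sequence */
    QueryLoop(Runnable oneQuery) {
        this.oneQuery = oneQuery;
    }

    @Override
    public void run() {
        try {
            // The flag is only checked between queries, never in the middle of one.
            while (!stopRequested) {
                oneQuery.run();
            }
        } finally {
            done.countDown(); // the "ok, I am done" report
        }
    }

    /** Posts the stop signal; the query in flight is allowed to finish. */
    void requestStop() {
        stopRequested = true;
    }

    /** Waits for the loop to report back; returns false on timeout or interruption. */
    boolean awaitDone(long timeoutMs) {
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The caller would call `requestStop()`, then `awaitDone(...)`, and only close the Bluetooth stream once that returns true.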

fesch commented 9 years ago

When an activity closes, the following is happening:

If the device reading thread is inside a query at that moment, it will finish it. The worst that could happen is that no listeners will be fired and the answers will not be considered at that moment. In fact, the BT loop will and should be active as long as CanZE is running. Switching between different activities should not stop it in any way. That's the reason why "fields" is synchronized and each access to that field should respect it.

Another thing: When I was last hit by the "dongle stuck" problem, I did not switch between activities at all. Just started the app, opened a given activity, waited a bit and Bang! got a hang with a strange result. I retried yesterday, but was not able to reproduce it.

yoh-there commented 9 years ago

I understand. Are you sure the same "safe stopping" happens when starting "settings" and exiting CanZE? I did end up outside the App with ATMA still blurping data to a now defunct Bluetooth link. It happened once before, and at that time I was smart enough to connect with a Bluetooth terminal. Indeed, the LED flickering was ATMA.

I can understand how the queryNextFilter would finish when an activity is closing. Where I am not sure is those occasions where the Bluetooth connection is stopped (at this point, that should only be when opening settings or closing CanZE). Since that would always happen from the main activity, it would always be doing a speed query (5d7.0), using ATMA. Is that truly being allowed to finish before the actual Bluetooth stream is destroyed?

On the bright side, I drove 90 km today, mostly with the driving screen on, and that was really stable. Less stable with the braking screen, so I will have another look at that code to see whether I did things differently there (I don't think so?)

fesch commented 9 years ago

Okay, I see what you mean. I have something on the tips of my fingers and will test later on, as soon as I could get some time to go to the car ... (an ELM simulator would be nice ;-))

yoh-there commented 9 years ago

How about a Zoe Canbus emulator? ;-) We could probably make a Due do that :-P

solmoller commented 9 years ago

Nerd index: over 9000 and increasing

fesch commented 9 years ago

Is this still an issue?

yoh-there commented 9 years ago

Even if not, I am closing this item. Too many things have changed and the original issue I tried to describe above has been addressed in another issue.