Closed aaeegg closed 1 year ago
hey @aaeegg , long time no see. Do you remember what the reported failure was ?
Here's the log:
scantool/850> connect aw50
diag_os_unix.c:438: WARNING: Not running as superuser; things may not work correctly
diag_os_unix.c:451: sched_setscheduler failed: Operation not permitted.
clock_gettime(): using CLOCK_MONOTONIC_RAW
clock_nanosleep(): using CLOCK_MONOTONIC
diag_os_gethrt() resolution <= 2us, avg ~1us
diag_os_getms() resolution <= ~1ms, avg ~1ms
testing diag_os_millisleep(), this will take a moment...
diag_os_chronoms() : initial time 3358872304ms; resolution: ~1ms
diag_l0_elm.c:410: Sending ATI to ELM32x at 38400...
diag_l0_elm.c:410: Sending ATI to ELM32x at 9600...
diag_l0_elm.c:410: Sending ATI to ELM32x at 115200...
Official ELM found, v1.3a
Connected to AW50-42 transmission.
scantool/850> dtc
Stored DTCs:
AT-??? (08)
scantool/850> cleardtc
Are you sure you wish to clear the Diagnostic Trouble Codes (y/n) ? y
Failed
scantool/850> dtc
Couldn't retrieve DTCs.
scantool/850> disconnect
diag_l0_elm.c:335: ELM returned error : ?
diag_l0_elm.c:336: Unspecified Error.
Disconnected from AW50-42 transmission.
scantool/850> connect aw50
diag_l0_elm.c:410: Sending ATI to ELM32x at 38400...
diag_l0_elm.c:410: Sending ATI to ELM32x at 9600...
diag_l0_elm.c:410: Sending ATI to ELM32x at 115200...
Official ELM found, v1.3a
Connected to AW50-42 transmission.
scantool/850> dtc
No stored DTCs.
you should know that code better than me since you wrote it P) Looks like the error is coming from here : https://github.com/fenugrec/freediag/blob/master/scantool/scantool_850.c#L1420
Suggest adding some DIAG_DBGM(...
lines around that area and the last function it calls.
Agreed timeouts may need to be higher since this is an ELM, but without more debug info or a bus trace it will be hard to guess.
In response to the clearDiagnosticInformation request (AF 01), the TCM sends a series of routineNotCompleteOrServiceInProgress responses (7E xx 23) at ~52ms intervals, then after a total of nearly a full second, a success response (EF 01).
Raw communication as seen when connecting to the interface (OBDLink SX) with a terminal program:
>AF 01
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
84 13 6E 7E AF 23 55
83 13 6E EF 01 F4
>
Scope shot of overall timing (readDiagnosticTroubleCodes request at G1, response at G2, immediately followed by clearDiagnosticInformation request at x, final response at o):
Scope shot of routineNotCompleteOrServiceInProgress repetition rate:
The behavior of the TCM is in contrast to the fuel management ECU, which immediately sends a success response. scantool handles the latter correctly. But when faced with the TCM's series of responses, scantool only looks at the first routineNotCompleteOrServiceInProgress response and doesn't wait for the interface to send the rest of the responses and the > prompt. It interprets the routineNotCompleteOrServiceInProgress response as a failure and ignores the rest of the responses.
scantool output with debugging turned on:
freediag/850> cleardtc
Are you sure you wish to clear the Diagnostic Trouble Codes (y/n) ? y
You haven't read the DTCs yet. Are you sure you wish to clear them (y/n) ? y
diag_l2.c:523: _request dl2c=0x55ea1f51e220 msg=0x7ffe4dac6e10 called
diag_l2.c:499: diag_l2_send 0x55ea1f51e220 msg 0x7ffe4dac6e10 msglen 2 called
diag_l0_elm.c:1192: getflags link 0x55ea1f51d9f0 proto 1 flags 0x146A1
diag_l1.c:151: _send: len=5 P4=5 l0flags=0x146A1; 0x83 0x6E 0x13 0xAE 0x01
diag_l0_elm.c:943: ELM: sending 5 bytes
diag_l0_elm.c:288: elm_sendcmd: ATSH 83 6E 13
diag_l0_elm.c:310: received 5 bytes (OK
>
); hex: 0x4F 0x4B 0x0A 0x0A 0x3E
diag_l0_elm.c:983: ELM: (sending string 836E13AE01^M)
diag_l1.c:247: _recv request len=66, timeout=1100;diag_l0_elm.c:1040: Expecting 3*66 bytes from ELM, 1100 ms timeout(+400)...got 7 bytes; 0x84 0x13 0x6E 0xEE 0x01 0x39 0x2D
diag_l2.c:530: _request returns 0x55ea1f51ef80, err 0
diag_l2.c:523: _request dl2c=0x55ea1f51e220 msg=0x7ffe4dac6e90 called
diag_l2.c:499: diag_l2_send 0x55ea1f51e220 msg 0x7ffe4dac6e90 msglen 2 called
diag_l0_elm.c:1192: getflags link 0x55ea1f51d9f0 proto 1 flags 0x146A1
diag_l1.c:151: _send: len=5 P4=5 l0flags=0x146A1; 0x83 0x6E 0x13 0xAF 0x01
diag_l0_elm.c:943: ELM: sending 5 bytes
diag_l0_elm.c:983: ELM: (sending string 836E13AF01^M)
diag_l1.c:247: _recv request len=66, timeout=1100;diag_l0_elm.c:1040: Expecting 3*66 bytes from ELM, 1100 ms timeout(+400)...got 7 bytes; 0x84 0x13 0x6E 0x7E 0xAF 0x23 0x55
diag_l2.c:530: _request returns 0x55ea1f51ef20, err 0
Failed
freediag/850>
Contrary to my original comment, scantool is still communicating with the TCM at this point. However, a subsequent disconnect command fails. scantool sends a stopDiagnosticSession request and gets a success response, then it tries to send an ATPC command to disconnect the interface. Normally, by the time scantool sends the ATPC command, the interface has already returned to command mode. But in this case, the previous series of routineNotCompleteOrServiceInProgress responses caused the interface's adaptive timing algorithm to increase the response timeout. The interface is still listening for more responses and hasn't yet sent the > prompt, but scantool considered the stopDiagnosticSession request/response to be done and didn't wait for the > prompt before sending the ATPC command. This caused the interface to abort the still-in-progress receive and send a STOPPED error.
scantool output with debugging turned on:
freediag/850> disconnect
diag_l2.c:499: diag_l2_send 0x55f54e1fc220 msg 0x7ffdcd422400 msglen 1 called
diag_l0_elm.c:1192: getflags link 0x55f54e1fb9f0 proto 1 flags 0x146A1
diag_l1.c:151: _send: len=4 P4=5 l0flags=0x146A1; 0x82 0x6E 0x13 0xA0
diag_l0_elm.c:943: ELM: sending 4 bytes
diag_l0_elm.c:288: elm_sendcmd: ATSH 82 6E 13
diag_l0_elm.c:310: received 5 bytes (OK
>
); hex: 0x4F 0x4B 0x0A 0x0A 0x3E
diag_l0_elm.c:983: ELM: (sending string 826E13A0^M)
diag_l1.c:247: _recv request len=66, timeout=1100;diag_l0_elm.c:1040: Expecting 3*66 bytes from ELM, 1100 ms timeout(+400)...got 5 bytes; 0x82 0x13 0x6E 0xE0 0xE3
diag_l2.c:291: Entered diag_l2_close for dl0d=0x55f54e1fb9f0;
closing dl2link 0x55f54e1fc1f0.
diag_l2.c:199: l2_closelink 0x55f54e1fc1f0 called
diag_l1.c:112: entering diag_l1_close: dl0d=0x55f54e1fb9f0
diag_l0_elm.c:288: elm_sendcmd: ATPC
diag_l0_elm.c:310: received 14 bytes (STOPPED
>?
>
); hex: 0x53 0x54 0x4F 0x50 0x50 0x45 0x44 0x0A 0x0A 0x3E 0x3F 0x0A 0x0A 0x3E
diag_l0_elm.c:334: ELM returned error : ?
diag_l0_elm.c:335: Unspecified Error.
diag_l0_elm.c:205: link 0x55f54e1fb9f0 closing
Disconnected from AW50-42 transmission.
freediag/850>
According to Richard H. Jones's document on error codes, in addition to the AW50-42 TCM sending routineNotCompleteOrServiceInProgress while clearing DTCs, the airbag module does the same thing. Additionally, when a readDataByLocalIdentifier request is sent with a final byte of 02 rather than the usual 01 (scantool never does this), the ECU will send a continuous data stream alternating between routineNotCompleteOrServiceInProgress and readDataByLocalIdentifier responses.
So there are two problems here:
A possible side issue is that the diag_l1 timeouts are totally uncoordinated with the interface's timeouts. So the upper layer code might know how long a response should take and use a specific timeout, and that's the timeout used by scantool to communicate with the interface, but the timeout used by the interface to communicate with the ECU is totally different. Also, if we know what the protocol's timing constraints are, it would probably be better to set the interface to fixed timeout (ATAT 0, ATST hh) instead of letting it use adaptive timing.
wow, excellent analysis. ELM "user-friendliness" strikes again. I wonder if the cheap elm327 clones support ATAT / ATST at all ? I get the feeling that they don't, but I don't have an ELM anymore and not at all interested in spending time on them.
Do I understand that for this issue, the only clean way to deal with it is to disable adaptive timing ? I can think of a few approaches :
I think this could break on a fast enough computer even if adaptive timing is disabled. We should be setting a flag when an operation is in progress, and clear it when we see a > character. On entry to elm_send and elm_sendcmd, if the flag is set and there's not already a > in the input buffer, we should send a space character to abort the operation, then wait for the >. Note that we should only wait for the > and not explicitly check for the STOPPED response, because there's a race condition if the operation completes just as we're trying to abort it. Also note that we send a space rather than a CR, because if a CR arrives after the operation completes, the interface will interpret this as a request to repeat the last command. I tested leading space characters on the OBDLink SX and they're safe. They should also be OK on original ELM327 according to the datasheet. I don't know what other clones will do, but even if they don't handle it correctly, we won't be hosed any worse than if we had done nothing.
I think this could break on a fast enough computer even if adaptive timing is disabled.
Why would it break ? When you request that sendcmd from D2 code, I understand there's a good notion of what kind of timeout is appropriate, so it should be a simple matter of passing the request with that timeout ? I don't know if I want to change the ELM layer into some kind of partially async code.
The problem is, you send your request, elm_send returns immediately. The car responds, you call elm_recv with whatever timeout, elm_recv gets the response and returns. But the interface is still waiting for more potential responses from the car. If you actually expect multiple responses, you call elm_recv again and get the next response (or a timeout), etc. If you've finished with this transaction and proceed to send another request or command - that's when it breaks.
I split off the ELM interface problem into #87 because it's a separate issue from the misinterpretation of the TCM's response to clearDiagnosticInformation.
When clearing DTCs on AW50-42 transmission using D2 protocol, DTCs are cleared, but scantool reports "Failed" and loses connection to the TCM. Maybe a timeout needs to be increased.