eclipse-cdt-cloud / cdt-gdb-adapter

CDT GDB Debug Adapter
Eclipse Public License 2.0
27 stars 40 forks source link

Ensure 'interrupt' signal is sent before 'disconnect' signal. #292

Closed XingMicrochip closed 10 months ago

XingMicrochip commented 10 months ago

Hi @jonahgraham,

We tested @parisa-mchp's recent change (PR #289) by pulling version 0.0.102 from open-vsx.org, but we still encountered issues: Stopping the debugger while target is running doesn't disconnect the debugger properly. We found that this chunk of code is buggy. This happens in file GDBTargetDebugSession.ts, within method disconnectRequest:

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    this.gdb.sendCommand('interrupt');
                await this.gdb.sendCommand('disconnect');
            }

In the case where this.gdb.getAsyncMode() && this.isRunning === true, the 'disconnect' signal always gets sent before the 'interrupt' signal. The change in this PR is required to fix this issue.

jonahgraham commented 10 months ago

Thanks for the PR.

I don't think this solves the problem as the promise of interrupt resolves once it is issued, not once it suspends. You may need to actually wait for the interrupt to be successful as mentioned in https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/287#issuecomment-1692616953

XingMicrochip commented 10 months ago

I compiled this PR change and tested within our IDE. It did fix the issue. But is there a solution you were thinking?

jonahgraham commented 10 months ago

Please see the referenced comment as @AdhamRagabMCHP was facing the same issue. A slight delay was needed and the linked code in that comment shows how to wait for the interrupt to complete.

I don't mind the change you provide and I'll merge it once you consider the previous discussion on this issue (but I would prefer it written as await disconnect rather than using then)

XingMicrochip commented 10 months ago

Using await was actually my initial solution. So the source code would look like this:

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    await this.gdb.sendCommand('interrupt');
                await this.gdb.sendCommand('disconnect');
            }

But after running "yarn" to compile the code, the compiled JavaScript code uses "yield":

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    yield this.gdb.sendCommand('interrupt');
                yield this.gdb.sendCommand('disconnect');
            }

When testing within our IDE, this doesn't actually guarantee that 'interrupt' is sent before 'disconnect' - about 1 out of 4 runs, 'interrupt' signal isn't sent before 'disconnect'. That was why I changed to using an explicit "then" clause. I can't explain why this happens. I would expect using two "await" guarantees the first "await" is finished before the second "await". By using "then" clause did work consistently so I'd like to merge this change.

jonahgraham commented 10 months ago

Well that is certainly very unexpected, I don't understand how the interrupt is sent to gdb before disconnect with await. The implication of that is serious as the assumption is throughout the code that the first await completes and behaves the same as then would.

I also didn't understand initially what the issue was, my confusion was probably between signal vs. command. My previous comments were about waiting to get interrupt signal from gdb before sending disconnect command.

XingMicrochip commented 10 months ago

I can provide more context. At the end of @parisa-mchp's change, the code looks like this:

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    this.gdb.sendCommand('interrupt');
                await this.gdb.sendCommand('disconnect');
            }

GDBBackend.sendCommand method is called twice if the remote target is still running (once otherwise): first time to send an 'interrupt' signal, second time to send a 'disconnect' signal. This change was needed because if we were to just send the 'disconnect' signal without the 'interrupt' signal first, we receive an error "Cannot execute this command while the target is running. Use the 'interrupt' command to stop the target and then try again.". But when we pulled @parisa-mchp's change from open-vsx, it didn't work as expected. This PR is a follow-up fix. Do you still have other concerns/questions?

jonahgraham commented 10 months ago

Thanks no concerns on the PR, I don't understand why then works and await doesn't as that implies something else fundamentally wrong.

Do you have anything else, or are you ready for me to publish it once merged?

XingMicrochip commented 10 months ago

This is the only change. You can publish it once merged. Thanks.

XingMicrochip commented 10 months ago

Hi @jonahgraham. Do you have a timeline in mind on when this PR can be merged and a new version published?

jonahgraham commented 10 months ago

It should be today, I didn't have time Friday afternoon to do it

XingMicrochip commented 10 months ago

"await" was introduced in ES2017. In tsconfig.json, it targets ES2015. I'm testing if that was the issue that caused "await" to not be compiled correctly. I'll test this and update here. UPDATE: This fix didn't work. I'll keep my original change.

jonahgraham commented 10 months ago

I'm going to hold off merging this until your conclusion - it sounds like we have some errors in our setup that may affect more than just this case.

jonahgraham commented 10 months ago

@colin-grant-work - FYI on https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/292#issuecomment-1723581289 perhaps you have some extra insights on this as you have the memory inspector set to ES2020

jonahgraham commented 10 months ago

@XingMicrochip An additional thing, we need you to sign the ECA (as @parisa-mchp did in response to https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/289#issuecomment-1701173962)

colin-grant-work commented 10 months ago

@colin-grant-work - FYI on #292 (comment) perhaps you have some extra insights on this as you have the memory inspector set to ES2020

I haven't checked the sequences in question, but on my understanding of JavaScript, any of these three should all result in sending the interrupt command before the disconnect command:

this.gdb.sendCommand('interrupt');
this.gdb.sendCommand('disconnect');
await this.gdb.sendCommand('interrupt');
await this.gdb.sendCommand('disconnect'); // or without await
this.gdb.sendCommand('interrupt').then(() => this.gdb.sendCommand('disconnect'));

because there is no asynchronous code in sendCommand, so we should reach this line

https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/blob/c5f6d40a366ac6f7edd54478cd4138b1c8080fc9/src/GDBBackend.ts#L245

in the order we call sendCommand whether we await the result or not, and that should mean that GDB receives those commands in order. Whether GDB has handled the interrupt and is ready to handle the disconnect command when it receives it is an entirely separate question. We could make even surer of that by moving this.out.write outside of the Promise callback, but it should work as expected anyway.

If we're really seeing the command issued in the wrong order, then it does potentially mean a much bigger problem. If it only happens when compiling to ES2015 and the TS compiler's funky generator rewrite of asynch code, then the problem is in the TS compiler; if not, then we're in deeper trouble.

@XingMicrochip

I would suggest trying to rewrite the sendCommand function like this:

    public sendCommand<T>(command: string): Promise<T> {
        if (!this.out) { return Promise.reject(new Error('GDB is not running.')); }
        const token = this.nextToken();
        logger.verbose(`GDB command: ${token} ${command}`);
        const result = new Promise<T>((resolve, reject) => {
            this.parser.queueCommand(token, (resultClass, resultData) => {
                switch (resultClass) {
                    case 'done':
                    case 'running':
                    case 'connected':
                    case 'exit':
                        resolve(resultData);
                        break;
                    case 'error':
                        reject(new Error(resultData.msg));
                        break;
                    default:
                        reject(
                            new Error(
                                `Unknown response ${resultClass}: ${JSON.stringify(
                                    resultData
                                )}`
                            )
                        );
                }
            });
        });
        this.out.write(`${token}${command}\n`);
        return result;
    }

That should make extra super sure that the write call happens in the tick we think it should. If you continue to run into timing problems, put breakpoints in the transpiled source rather than the TypeScript source and confirm that these statements are being hit in the order you expect

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    yield this.gdb.sendCommand('interrupt');
                yield this.gdb.sendCommand('disconnect');
            }

If they aren't being hit as you expect, then the problem is TypeScript (somehow).

And then if they are being hit in the order you expect, put breakpoints on the this.out.write call and make sure that it's being hit with the arguments you expect in the order you expect. If it isn't then V8 is wrong (somehow).

XingMicrochip commented 10 months ago

@jonahgraham In tsconfig.json, switching to targeting "es2017" doesn't make a difference for me. I'll let you decide whether you want to switch to using it.

I think at this point it is a TypeScript issue. I did some testing on different ways of writing the following code snippet. Here are the results.

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    this.gdb.sendCommand('interrupt');
                this.gdb.sendCommand('disconnect');
            }

Version 1: This works all the time. @colin-grant-work It seems you are right that sendCommand doesn't need an await.

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    this.gdb.sendCommand('interrupt');
                await this.gdb.sendCommand('disconnect');
            }

Version 2: This is the current master. This never works.

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning)
                    await this.gdb.sendCommand('interrupt');
                await this.gdb.sendCommand('disconnect');
            }

Version 3: This works 70% of the time.

            if (this.targetType === 'remote') {
                if (this.gdb.getAsyncMode() && this.isRunning) {
                    this.gdb.sendCommand('interrupt').then(() => this.gdb.sendCommand('disconnect'));
                } else {
                    await this.gdb.sendCommand('disconnect');
                }
            }

Version 4: This is my current PR change. This always works.

I am not really an expert in TypeScript, so I don't think it's worth the time to debug the issue. I think I can stick to version 4, or if you guys think version 1 is cleaner, I can switch to that as well. @jonahgraham @colin-grant-work

jonahgraham commented 10 months ago

Can you share (at least a snippet) of the log file when it does and doesn't work please?

XingMicrochip commented 10 months ago

There's nothing specific when there's no errors.

jonahgraham commented 10 months ago

Thanks @colin-grant-work for the thorough analysis.

@colin-grant-work / @XingMicrochip All these code sequences on paper produce the same behaviour (i.e. gdb receives "interrupt", "disconnect" and then gdb exit commands). So I don't mind merging this change, but I think all that is really happening is slight timing changes that lead to different behaviour. I continue to be very concerned that there is something else about this behaviour that is problematic. I would love to see in the logs that the disconnect is sent before interrupt as this seems likely to hit us again in the future (for example sendCommands just does a bunch of sendCommand back to back and if the order changes that would be bad)

jonahgraham commented 10 months ago

There's nothing specific when there's no errors. For code version 2 and 3,:

I mean turning on the full logs, set verbose = true and place everything in a file so it doesn't have to go via the DAP:

https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/blob/250881c64415e076d3b66068a3c7057ab9a3130f/src/GDBDebugSession.ts#L45-L46

jonahgraham commented 10 months ago

@XingMicrochip An additional thing, we need you to sign the ECA (as @parisa-mchp did in response to #289 (comment))

I tried revalidating and the ECA check still fails:

image

As for he Ubuntu test failures, if you rebase on main that should work as it was fixed in #294

XingMicrochip commented 10 months ago

I just signed the ECA.

XingMicrochip commented 10 months ago

debug-project-emulation.log Here's the log file.

jonahgraham commented 10 months ago

That log shows typescript is working correctly (i.e interrupt is sent before disconnect is sent) and that the issue is the adapter needs to wait for the interrupt to complete, here is the snippet of the log that is key:

[18:47:49.618 UTC] From client: disconnect({"restart":false})
[18:47:49.618 UTC] GDB command: 40 interrupt
[18:47:49.618 UTC] GDB command: 41 disconnect
[18:47:49.618 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"log","output":"interrupt\n"}}
[18:47:49.618 UTC] GDB result: 40 done
[18:47:49.618 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"log","output":"disconnect\n"}}
[18:47:49.618 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"log","output":"Cannot execute this command while the target is running.\n"}}
[18:47:49.618 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"log","output":"Use the \"interrupt\" command to stop the target\n"}}
[18:47:49.618 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"log","output":"and then try again.\n"}}
[18:47:49.618 UTC] GDB result: 41 error,msg="Cannot execute this command while the target is running.\nUse the \"interrupt\" command to stop the target\nand then try again."
[18:47:49.619 UTC] To client: {"seq":0,"type":"response","request_seq":16,"command":"disconnect","success":false,"message":"Cannot execute this command while the target is running.\nUse the \"interrupt\" command to stop the target\nand then try again.","body":{"error":{"id":1,"format":"Cannot execute this command while the target is running.\nUse the \"interrupt\" command to stop the target\nand then try again.","showUser":true}}}
[18:47:49.619 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"stdout","output":"\nProgram"}}
[18:47:49.620 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"stdout","output":" received signal SIGTRAP, Trace/breakpoint trap.\n"}}
[18:47:49.620 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"stdout","output":"_fini () at /home/xingl213/real_workspace/real_sw2/platform/miv_rv32_hal/miv_rv32_init.c:23\n"}}
[18:47:49.620 UTC] To client: {"seq":0,"type":"event","event":"output","body":{"category":"stdout","output":"23\t    while(1) { ; } // Keep the CPU in a loop to not run main() again\n"}}
[18:47:49.620 UTC] GDB exec async: stopped,reason="signal-received",signal-name="SIGTRAP",signal-meaning="Trace/breakpoint trap",frame={addr="0x800009a0",func="_fini",args=[],file="/home/xingl213/real_workspace/real_sw2/platform/miv_rv32_hal/miv_rv32_init.c",fullname="/home/xingl213/real_workspace/real_sw2/platform/miv_rv32_hal/miv_rv32_init.c",line="23",arch="riscv:rv32"},thread-id="1",stopped-threads="all"
[18:47:49.620 UTC] To client: {"seq":0,"type":"event","event":"stopped","body":{"reason":"SIGTRAP","allThreadsStopped":true,"threadId":1}}

If you have a log showing it working I would be interested to see - but the above log shows in this case it is taking only 2ms for the interrupt to take effect, so small changes to the timing can easily delay the disconnect randomly by that 2ms.

I ask that you please review the conversation in this comment linked where @AdhamRagabMCHP and I discussed this exact issue: https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/292#issuecomment-1721679590 - note that @AdhamRagabMCHP's version of the PR had a 5ms sleep between the interrupt and the disconnect for this purpose:

// Set a 5 ms timeout to give time for GDB to pause.
await new Promise((t) => setTimeout(t, 5));

Perhaps we should loop in @parisa-mchp as their PR (https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/289) dropped the pause.

What would help you in this case is if there was a consistent API that allowed code to pause the target and wait until it was paused. The current waitPausedNeeded code is a bit special purpose to the code that currently uses it.

XingMicrochip commented 10 months ago

Hi @jonahgraham. Thanks for looking at the logfile. I'd like to look at the logfile of the correct behavior too, but I don't have the hardware/server access for now - my coworkers unplugged the board. I'd also really like to merge the working change and test the new version hopefully long before our upcoming release cycle. I can provide the other logfile and look into why the other approaches didn't work at a later time.

jonahgraham commented 10 months ago

In the case where this.gdb.getAsyncMode() && this.isRunning === true, the 'disconnect' signal always gets sent before the 'interrupt' signal. The change in this PR is required to fix this issue.

This conclusion was found to be incorrect, the log file (see https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/292#issuecomment-1724257278) does indeed show that interrupt is sent before disconnect

jonahgraham commented 10 months ago

The latest version containing this change has been published to https://www.npmjs.com/package/cdt-gdb-adapter/v/0.0.28 and the vscode has been updated to consume this version in https://github.com/eclipse-cdt-cloud/cdt-gdb-vscode/commit/1859fa19108cbb5b882f6e52f9784ef987c57ea0

The version number of vscode extension has been updated https://github.com/eclipse-cdt-cloud/cdt-gdb-vscode/commit/7d975621dfc7306ffa07bda9075f4d6176f894f2 and that version should be auto-published to open vsx overnight.

I look forward to future contributions from you and your colleagues at Microchip!