Managing breakpoints in Escargot

zherczeg commented 4 years ago

I would like to create a mechanism which allows inserting breakpoints into the byte code in some way. It would be great if you could provide some feedback about the most efficient way to do it.

There are two ways to implement breakpoints:

1) Inserting DisabledBreakpoint instructions into the code, which can be turned into EnabledBreakpoint instructions when needed. Advantage: with these instructions, no need to check EnabledBreakpoints before each instruction, and implementing "step-in", "next" is easy since we just need to stop at the next Breakpoint instruction (regardless it is enabled or not). Disadvantage: executing DisabledBreakpoint instructions takes some time, which slows down the execution.

2) Maintaining a list of all available and enabled breakpoints. Advantage: when no breakpoints are enabled, and no need to perform a "step-in", "next" etc. command, the execution is as fast as normal execution. Disadvantage: when breakpoints are enabled, we need to check them before executing a byte code. This can slow down the execution.

Which one do you prefer? Or do you have a third suggestion?

Also it seems I cannot directly insert nodes into the statement lists, since parseStatementListItem returns with an AST node, and does not receive a statement list. Is it possible to access the list in some way? If no which one do you prefer: 1) Return with a Breakpoint node (a new node type), and the StatementNode should be the child node of Breakpoint node. This way we can extend any node with breakpoint information in the AST. 2) Extend StatementNode with breakpoint information. It seems to me that the "loc.index" information is often invalid for statement nodes, so I need to fix them or just save the whole ExtendedLoc.

Which one do you prefer?

clover2123 commented 4 years ago

Thank you for your great suggestions! First of all we have no experience about debugging tool development, so our opinion might not be the best solution. If you have any other idea, please feel free to discuss it with us.

Implement breakpoints We think that inserting Breakpoint instructions for every necessary code points would be desirable because developers usually switch or recompile to debug mode for debugging and performance slowdown in debug mode is not a critical issue. We can add additional build flag or enable breakpoints only for debug mode. Is there any issue other than performance? (We currently consider the debugging situation for standalone escargot or linked-with-node environment)
Insert additional nodes into the statement list parseStatementListItem parses and generates each StatementNode. StatementContainer holds the statement list (actually each StatementNode is linked with each other like linked-list). parseStatementListItem is repeatedly invoked and its result(StatementNode) is stored in StatementContainer. If breakpoint is supported on the basis of JS statement, we can add each breakpoint node for each parseStatementListItem call. And if breakpoint information should be inserted for each StatementNode, expanding the StatementNode would be more suitable. "loc.index" represents the code position from the start of the source code. ExtendedNodeLOC has additional location info such as line and column. I think we should choose one based on which location info debugger required for. BTW we have to fix "loc.index" if some nodes have incorrect index information.

zherczeg commented 4 years ago

Thank you for the reply @clover2123

I would also prefer to insert Breakpoint instructions. We also do this in JerryScript, it makes the implementation much easier. Code size increase is also smaller, since we don't need to add a check before each instruction (Escargot use computed gotos, so we directly jump to the next instruction, and there is no common loop header which is executed before a vm instruction).

Fixing "loc.index" seems difficult. Consider the following JS example:

a = 123;
b = 234;

After a = 123; is parsed, we get an ExpressionNode, and a new ExpressionStatementNode is created. The "loc.index" of the ExpressionStatementNode is set here:

https://github.com/Samsung/escargot/blob/master/src/parser/esprima_cpp/esprima.cpp#L3976

However, at this point the current startMarker points to the start of the b = 234;, so the ExpressionStatementNode gets this index. Since no byte code is generated for ExpressionStatementNode only for the expression inside, this invalid "loc.index" index has no effect. Fixing it would require to save the "loc.index" and update the node after finalize but it does not looks nice. And a lot of code would be need to refactored, which is beyond the scope of the debugger. I would just save the necessary information for now.

clover2123 commented 4 years ago

@zherczeg Thanks for the analysis. IMHO the incorrect "loc.index" example you mentioned above could be solved by calling this->createNode() at the start of parseLabelledStatement and using its result meta node for finalize. You can also confirm this by the latest esprima source code (it seems that the bug is fixed in the latest esprima). Having incorrect index info is definitely not a ideal case, so we should fix it. I will also search and fix this issue ASAP.

zherczeg commented 4 years ago

@clover2123 Hi, I have another issue. Due to the lazy instantiation, byte code for functions are generated the first time when they are executed. However, I think this is very inconvenient for debugging, since users cannot put a breakpoint into a function until it is executed. To fix this, I have experimenting with generating byte code when the source code is parsed. Here is my patch:

https://github.com/zherczeg/escargot/commit/936e2fd3a4fe84a8f80eeb24d7a6946b3fb290b5

Q1) Is it ok for you to generate byte code at script parsing time (this increase the memory consumption at short term)? Q2) Is my code ok?

Thank you for the help.

zherczeg commented 4 years ago

I have two more questions:

Q3) ScriptParser has a needByteCodeGeneration flag. When this flag is not set? Shall we generate byte code regardless of this flag when the debugger is available?

Q4) Eval passes eval input as file name. However eval input is a valid file name as well. Is this ok, or shall we change this constant to an invalid file name, e.g. *eval_input* since * is not allowed in any file name regardless of OS.

clover2123 commented 4 years ago

Here's my opinions. Q3) ScriptParser has a needByteCodeGeneration flag. When this flag is not set? Shall we generate byte code regardless of this flag when the debugger is available? => needByteCodeGeneration flag is disabled only for special cases, especially generating function object. When generating a new function object by Function builtin method, escargot first combines code pieces and then parses it without bytecode generation to check if there is any parsing error. For debugger case, it should always generate the bytecode with flag set.

let sum = new Function('a', 'b', 'return a + b');

Q4) Eval passes eval input as file name. However eval input is a valid file name as well. Is this ok, or shall we change this constant to an invalid file name, e.g. eval_input since is not allowed in any file name regardless of OS. => I think the default file name for eval is fine for now. (until now there is no issue related to file name). But if you want to avoid any possible conflict, change the name to what you want (add looks good). Thanks.

zherczeg commented 4 years ago

Thank you for the reply. Please check Q1 and Q2.

clover2123 commented 4 years ago

My opinion continued, Q1) Is it ok for you to generate byte code at script parsing time (this increase the memory consumption at short term)? => I agree with you. Yes, we should generate all bytecode for the whole code. It would rapidly increase the memory consumption, but as I said before, (IMO) performance and memory are not critical issue in debugger mode.

Q2) Is my code ok? => your code looks fine to me, but I have several suggestions. How about add a new initializeScript method that parses and generates bytecode for the whole source code to clearly seperate it from the base code? Or you can consider the existing CodeBlock traversing in ScriptParser::generateCodeBlockTreeFromASTWalkerPostProcess. To be specific, adding bytecode generation module at the end of ScriptParser::generateCodeBlockTreeFromASTWalkerPostProcess could simplify the code for recursive bytecode generation.

There are a few more things that you have to consider together.

stack limit : remained stack size is checked in each parsing process. Parsing the whole source code may exceed the limit.
flush ByteCodeBlock : VMInstance::gcEventCallback is called for every GC call event. When GC mark phase starts, it checks the total size of generated bytecode and if it exceeds the threshold then removes all ByteCodeBlock referencees which whould be collected by following GC sweep phase. So, debugger mode should keep the whole generated bytecode not to be removed by GC.

zherczeg commented 4 years ago

Thank you.

Regarding initializeScript, would you duplicate the current code and add a debugger specific variant, and the caller should decide which one should be called? Or the current initializeScript should call the other if the debugger is enabled?

Regarding stack limit, it seems it only exists when ExecutionState is present. When the main source code is parsed in the Shell.cpp, the VM is not created so no ExecutionState is available. Hence initializeScript has no stack limit, and correct me if I wrong, but it seems this rule applies for parsing evals (they ignore stack limit). As for debugger, if I generate code in initializeScript, I have no stack limit info as well. How should it work?

I couldn't find a Context pointer to VMInstance, so I cannot check whether a debugger is attached in gcEventCallback. Shall I add a Debugger pointer to VMInstance as well? The VMInstance needs to know that the debugger is available and the link between the debugger and debugger client still exists.

clover2123 commented 4 years ago

implement initializeScript : feel free to choose any one that you prefer more. I recommended a new initializeScript method just to make the code look clear.
stack limit : actually VM instance is created by VMInstanceRef::create, so is Context in main function. IMO the stack should be unlimited by setting the stackSizeRemain as SIZE_MAX for debugger.
VMInstance : Context has a VMInstance pointer, but VMInstance doesn't have one. So, VMInstance needs to know the debugger info. How about just adding a boolean member that represents debugger status? Thanks.

zherczeg commented 4 years ago

Hi @clover2123,

I replaced the sha1 algorithm with a new one. This code has no conflicting license.

Now I try to solve the wrong line info problem. I found two issues, and I hope you could help me:

a) lineNumber(((length > 0) ? 1 : 0) + startLine) https://github.com/Samsung/escargot/blob/master/src/parser/Lexer.cpp#L675

This code increase the startLine for nested functions, so all lines are shifted down by 1, and debugger stops one line below the real statement.

I changed ExtendedNodeLOC startLoc = ExtendedNodeLOC(0, 0, 0)) to ExtendedNodeLOC startLoc = ExtendedNodeLOC(1, 0, 0)) here and removed the increase: https://github.com/Samsung/escargot/blob/master/src/parser/esprima_cpp/esprima.cpp#L202

Is this a good idea?

b) In the following code we set this->scanner->lineNumber to childBlock->functionStart().line; https://github.com/Samsung/escargot/blob/master/src/parser/esprima_cpp/esprima.cpp#L483

The problem is we should set this to the end location of the function, not the start location, which is stored m_bodyEndLOC but only in debug mode. How can I fix this?

An alternative would be to drop this "keeping the line info somewhere" and use the source code index to figure out the line. It seems something similar happens when backtrace is created. Would this be a better direction for resolving debugger line info? Can we use that line info search during byte code generation?

clover2123 commented 4 years ago

@zherczeg I'll take a look at your issues deeply. Please hold it for a while.

During the time, could you explain the overall operation or structure of the debugger briefly? For example, I have following questions.

How to link the debugger with existing tools such as VSCode?
I think the debugger should also support presenting information like call stack or variables. To support these features, any form of data structure (e.g. JSON) needs to be transferred and there is also any standard form of it?

zherczeg commented 4 years ago

The brief structure of the debugger is the following:

It has two major components: the debugger server inside Escargot, and a debugger client, which can be implemented in any language. The patch contains a simple python client implementation, but the client can be implemented in JavaScript (for VSCode) as well. The two components are connected through a communication channel. Currently WebSocket over TCP/IP is used (JavaScript natively supports it), but later it can be USB, or anything which supports reliable packet transfer.

The communication uses fixed packets, and longer data (e.g. source code) may be divided into multiple packets. The debugger also uses a state machine which limits allowed packet types. For example, when Escargot parses a source code, both the debugger server and client enter into a "parser mode", and only packets related to parsing are accepted in both directions. Another modes are "stopped" and "running". For example, backtrace will only be supported in execution "stopped" mode. Receiving source code can be another mode which we can implemented later.

Regarding VSCode, we will implement a debugger stub for VSCode, which will be pure JavaScript code. It uses WebSockets and Promises to communicate with both Escargot and VSCode respectively. This is the expected way to create a debugger support for VSCode.

In JerryScript we used simple binary messages to transmit call stack, or variable lists. They both converted to a list of strings first, and then the strings are sent through network. I would like to use a similar format for Escargot as well. Longer strings are split into multiple packets. Usually we use two opcodes for a string, one represents the last packet, and another represents an intermediate packets. When an intermediate packet is received, the state machine enters into "string receiving" mode, and only intermediate or last packets are accepted, so multiplexing various data is not possible. In practice we never need it.

clover2123 commented 4 years ago

@zherczeg here is my opinion.

a) lineNumber(((length > 0) ? 1 : 0) + startLine) I changed ExtendedNodeLOC startLoc = ExtendedNodeLOC(0, 0, 0)) to ExtendedNodeLOC startLoc = ExtendedNodeLOC(1, 0, 0)) here and removed the increase: => Yes, this approach looks great for me.

b) In the following code we set this->scanner->lineNumber to childBlock->functionStart().line; There are some solutions for this issue.

disable tryToSkipFunctionParsing module (simple) I think that this approach would be the simplest solution. tryToSkipFunctionParsing is invoked to skip the parsing of an inner function which we encounter during the parsing of a callee function. If this optimization is disabled, we can get a correct line info.
enable m_bodyEndLOC for debugger mode (simple) In our scenario, we do not consider the performance seriously for debugger mode. So, we can exploit m_bodyEndLOC info to calculate the correct line info by consuming some more memory.
calculate the line info from the index info (FYI) when an exception occurred backtrace info is printed. Each trace info is calculated using the ExecutionState::m_programCounter and ByteCodeBlock::m_locData. ExecutionState::m_programCounter holds the index of current ByteCode. ByteCodeBlock::m_locData contains the pair of each ByteCode index and its relevant code index. So, we can get a ByteCode index by ExecutionState::m_programCounter and then use it to find the code index from ByteCodeBlock::m_locData. The code index is finally passed to ByteCodeBlock::computeNodeLOC in which the exact location info is calculated from the source code. It seems that we can still resolve the line info during the bytecode generation with the above approach. But in this case, we may need to run the above process for each ByteCode generation which would be great overhead.

IMO I recommend you the second approach because it is simple for implementing with a little more memory consumption. Of course release mode will consistently have incorrect location info but there have been no critical issue so far. We may fix it later if there is another common solution.

zherczeg commented 4 years ago

I choose option 2, and added support for lots of statement types in the code. Now empty functions, empty scripts, variable initializations and other things work correctly.

zherczeg commented 4 years ago

Hi @clover2123

I would like to work on evaluating JavaScript commands by the debugger, and I would like to hear your valuable suggestions first.

1) My first question is about evaluating commands directly. Debugger commands are same as direct evals, and I suspect this affects the parser (I saw an is_eval flag). Shall I set this flag to true when the debugger is enabled? Is there anything else I need to do?

2) What is the right way to do this direct eval? Shall I do the same thing as in GlobalObject::eval?

2) How can I check if the eval is returned normally or an exception is thrown? A toString will be used on the returned value, but it is good if the debugger client knows this string represent a successful or failed code.

3) Shall I add a "throw" command besides "eval"? This would evaluate the command in the same way as "eval", but if the command is successful, it "throws" the returned value. How can I throw this value?

ksh8281 commented 4 years ago

Hi @zherczeg {1,2} ->IMO, making evaluating commands directly in debugger, we need to set every CodeBlock::m_hasEval as true if debugger is enabled. thus we can call GlobalObject::evalLocal function every where is think {3} -> you can wrap GlobalObject::evalLocal with c++ native try-catch for catch exception {4} -> should we throw exception with direct input on debugger? IMO we would not throw exception when user inputs direct eval

zherczeg commented 4 years ago

The next feature I need to work on is getting the key/value pairs of the scope chain for VSCode. It seems some environment records has IdentifierRecordVector. I suspect I can use this for getting the names of identifiers. However others don't have this vector. How can I get the names in that case? Eval is forced for all byte code, so some records might not be possible.

ksh8281 commented 4 years ago

In eval mode, You can use m_recordVector of {*EnvironmentRecordNotIndexed} for getting names on scopes. Indexed record could not be used on eval mode.

zherczeg commented 4 years ago

Thank you for all help.

Samsung / escargot

Managing breakpoints in Escargot #549