Lightweight debug mode - Githubissues

evandisoft commented 5 years ago

For bugs that cannot be reproduced, we can't currently retroactively turn on debugging.

Also, for sufficiently busy programs, doing IO on all opcodes is too slow to use even if you can reproduce the bug.

Solution: Have a queue that stores the last N opcodes. Don't save it to disk until user requests that. (Or if there is a crash of the vehicle) Have it save a "toString" of each opcode to disk if user types in "saveopcodes." and/or add a button to save it. Save the logs in a separate file so the user can upload/view it easily. (Path to KSP directory)/logs/kOS/opcode.log ?

Another benefit of this is that it can be used to make CPU.cs code simpler if you remove all profiling code from there and instead just add timestamps with the opcodes to the queue.

This may not effect execution speed. Enqueue is a fast operation, and it could replace a lot of always-on profiling-related things in "ExecuteInstruction"

The opcodes are going to exist for the life of the programcontext, so all it is storing is a reference per opcode and (optionally) a timestamp. However, it would be holding references inside old programcontexts. But since it's a queue, eventually all those references would be pushed out. How would this work with garbage collection? I'm sure that regardless, it would be better than IO.

The queue should probably hold all opcodes from all CPU's so that it can be easily automatically limited to a total of N opcodes. This means we may also want to store an id for each CPU with each instruction to disambiguate.

However, we could alternatively allow separate, on-the-fly enabling of this feature per CPU and make the N be stored per CPU that enabled debugging. Then have them store debugging information in a file that is named uniquely based on the part/vessel whenever the user clicks a "core dump" button.The button could also be on the part instead of the terminal. And, assuming the part blew up, have a separate button on the main kOS window to store all currently active debug information.

Optionally: Given that traces can show the source line of an explicit bug for a given instruction, could we process the entire OpcodeQueue and produce a list of all the kerboscript source lines that were executed, in order, as well as the values of all simple variables (numbers and small strings) next to instances of their use?

Example: Maybe it can show the following?

local a is 3.
...
print(a==3).

Optionally: Store timestamps along with the opcodes for profiling

Optionally: Process the queue prior to dumping it in the file. Add extra information, (e.g. "total execution time" from timestamps) or even construct a "scope executiontime" for each scope.

Optionally: remove a few fields like "ProfileTicksElapsed" and "ProfileExecutionCount" from "opcode" as that information could be reconstructed from the queue. remove all profiling logic from CPU.cs

Optionally you could just store the "toString" of the opcodes instead of the opcodes, but that would be slower.

Optional: Make it optional! Allow user to set the 'N'

Optional: Save references to programcontexts while the opcodes they contain are still in the OpcodeQueue so that more debug information can be reconstructed, when required.

evandisoft commented 5 years ago

I tested a basic version.

While flying there was no noticeable lag. After I got into orbit I triggered the "save" functionality and it hanged there for a few minutes for all the IO. Finally it printed out the opcodes to the log. I had N set to 1 million. The program only used 863,000 opcodes.

So I think this is pretty cool. It's a complete record of all the instructions that were executed in a complicated, long-runing program, and the significant overhead of storing debug info is optional and occurs after the program was finished.

This is a stripped down version of KSP with most parts not included. It was using 2,242 megabytes of ram. Seems normal.

ghost commented 5 years ago

I think that jump/branch table would be more than enough. See my tester. I created it for myself to understand how kOS works under the hood. That jumpBox shows every single jump/branch or call instruction, because you can fill up the blanks from the source code, so, if those instructions would have references to source code (file:line:column), you could track your code back and see what it was doing before crashing.

Call-stack would be more limited, but often helps quite a lot as well (chain of functions called), jump-table (including all if, else, until etc.) is better. You probably won't be able to store values (unless primitive types) of the calls, as that could be time and space consuming (slowing down, using memory).

Anyway, I support any attempt of improving debugging :)

evandisoft commented 5 years ago

The thing about my way is that it's bonehead simple to implement (just the basic part of storing and then displaying the opcodes), and I was looking for something like that. It helped me debug what I was working on.

But I see no reason why your way wouldn't be better. You could store a lot more information with the same sized queue.

However, I also eventually added the storing of a to-string of what was in the stack in a queue at each step, and I was able to fix bugs without even re-running the program. It's at least way better than storing in a text file. We can store lots of information in a queue and have it effect performance minimally compared to text logging. Possibly you could also limit the to-string to the last 5 items in the stack.

I tested this with a big program that loads libraries and gets to orbit and it did not slow things down considerably. I had size of queue at only about 1000 I think, and that's not going to increase memory much even with storing some stack contents.

(A lot of the motivation for this was that I'm not using Windows so I cannot run a debugger, so I needed more information about the state of the system to catch bugs. And adding a bunch of print statements everywhere and then removing them to commit is not a good idea).

evandisoft commented 5 years ago

In any case, we should have a better system for debugging and testing. Storing stuff directly in text does not work in any program that is running for more than a few seconds (slows everything down).

We can make it so that users can have debugging on all the time but only dump that to text when they encounter a problem. Then they can give that to us, and we'll at least have some data to work with even if we cannot replicate the bug.

This can be done for any sort of heavy logging that we're doing. The last part of the logs is generally by far the most useful.

evandisoft commented 5 years ago

@firda-cze I think we can store a lot of the argument stack values, and not just the jump information, because people are probably generally used to the default 200 instructions per update. That is not very fast. And so you can to-string a lot of information and store the last n lines of info without slowing the system much per each instruction (each instruction is already very slow because of the default 200 instructions per update limit).

I've tested this and I could smoothly run a program with lots of triggers getting to orbit and store a to-string of the entire stack (and you only really need the first few items on the stack). You could optimize this even more, not storing the stack again if it does not change and a number of other things, but I chose not to optimise this as the program was already running smoothly.

There could be different options for verbosity. Jump information is the minimum that still gives us a lot of information about what happened. Storing some argument values would be a step up that might not considerably slow down most programs, but would give you enough information to completely narrow down a bug a lot of the time.

I guess you might only have to store a value when something is being added to or taken from an argument stack.

My approach to this was to do things in brain-dead simple ways until I found I needed to optimize them more. Currently my system is working ok with literally storing a to-string of all stack information and each opcode in a queue of about 1000 (And a lot of other verbose logging in other queues). You would probably want something more efficient in the main branch, but it runs smoothly for me until the moment that information is triggered to be stored to disk (on an error or whenever).

evandisoft commented 5 years ago

@firda-cze Are you on the kos slack?

https://ksp-kos.slack.com

KSP-KOS / KOS

Lightweight debug mode #2359