dnSpyEx / dnSpy

Unofficial revival of the well known .NET debugger and assembly editor, dnSpy
GNU General Public License v3.0
6.95k stars 456 forks source link

User-supplied code comments / dnSpy project files #239

Open gsuberland opened 1 year ago

gsuberland commented 1 year ago

Problem Description

Right now there is no way for a user to add comments to the code within the tool. It would be helpful to have this feature to assist with reverse engineering efforts.

Proposal

The general outline of the feature would be as follows:

The primary challenge I foresee here is with the persistent storage. The principle details that would need to be resolved are:

  1. The general mechanics of storing comments somewhere (e.g. comment database format, file location, discovery, etc.)
  2. Devising a scheme for mapping a stored comment to the correct decompiled line upon loading, accounting for alterations to decompiler configuration.
  3. Handling of comment database files created with an older version of dnSpy, as decompilation results may change between versions.
  4. Handling of modifications to the executable.

For item 1, one option would be to add the notion of a project file. A project file could store the assembly list and additional metadata such as the comment database. This would offer further opportunities for persistent storage in other future features, too. This simplifies the file location and discovery issues, too, because the project file would be manually opened and saved by the user to any arbitrary location.

Mapping comments back to the correct decompiled lines is tricky, and probably not something I can strongly reason about with my current level of knowledge of dnSpy's internals. Line numbers alone are a poor choice because changes to decompiler configuration would change the line numbers. If there's an internal identifier for a decompiled code line that can be used, that might be a good starting point. I suspect that additional metadata (e.g. source data hashing) might be desirable to handle items 3 and 4.

The simplest approach to solving item 3 would be to warn users that opening an older project (or comment database) in a newer version may result in some loss of comments, then do a best effort re-mapping onto the decompiled results, e.g. by storing some sort of options-agnostic decompiled line hashcode with each comment and then only restoring comments where that hashcode still matches.

This same appraoch could be used for item 4. It would also be wise to warn users if they're trying to load comments onto a file that has been modified since the original database was created (e.g. store a hash and compare).

These implementation details are just high-level suggestions, of course, and I'd be open to other options and input from folks who are familiar with dnSpy's codebase.

Alternatives

One alternative would be to just not persist the comments, but that severely diminishes the usefulness of the feature and would be extremely annoying in the case of a crash.

mitchcapper commented 1 year ago

@gsuberland certainly a tricky problem, dnSpyEx actually has this support somewhat currently. There are bookmarks, you can give the bookmark both a name and a group label. You can easily jump to bookmarks from the bookmark pane, search them, import and export them. I think there may be a few places I would improve:

If I get some free time I may try to add some of those.

Otherwise you may want to look at: https://github.com/HoLLy-HaCKeR/dnSpy.Extension.HoLLy

it is a dnSpyEx extension that allows for some customization to aide in reversing that non-destructively stores all the changes in an XML file. It doesn't precisely do comments, but it would seem much easier to extend that project that might be more inlined with the goals mentioned.

ElektroKill commented 1 year ago

Hi,

A feature that allows users to add comments to the decompiled output in dnSpy is something that has been mentioned to me time and time again in various places and it is something that should be present in dnSpy. It's a feature that people expect from advanced reverse engineering suites.

I've considered the implementation of such a feature several times before but I'm mostly running into two roadblocks when it comes to implementing it, both of which are mentioned in your issue.

One of these roadblocks is figuring out a reliable way to map the comments to the actual decompiled code. We cannot use line numbers as those are volatile and can change easily with different decompiler settings. Currently, the best option I know of would be to use the IL code offset of the IL code responsible for a given expression/statement in the decompiler. This would also allow these comments to later be displayed in the IL/VB/etc. decompiler too. However, the IL offset approach comes with its own fair share of challenges to overcome. How do we handle bodies that have changes in the IL making the previous offset invalid (pointing to the middle of an instruction or maybe even out of bounds of the body) or pointing to a different expression? If we do end up using IL offset mapping (like bookmarks) we'd also have the problem of storing which method contains the comment. Storing such information about methods is also tricky. We cannot use the metadata tokens are they are too volatile and could change by just rewriting the assembly. Names would also work but then we are faced with the difficulties of obfuscated member names and handling them.

Another difficulty is designing a versatile yet not super complicated project format for dnSpy which would need to facilitate storing comments in method bodies, comments attached to just member definitions, and in the future features like non-invasive renaming of members and variables and anything else that might be required later on.

Actually implementing code to render these comments in the decompiled source code is probably the easiest part.

This feature is not something I've been really thinking a lot about in recent times due to lack of time and other priorities picking up but I'm open to discussing implementation details further for such a feature in the near future.

gsuberland commented 1 year ago

I've been thinking about this some more.

One of these roadblocks is figuring out a reliable way to map the comments to the actual decompiled code. We cannot use line numbers as those are volatile and can change easily with different decompiler settings. Currently, the best option I know of would be to use the IL code offset of the IL code responsible for a given expression/statement in the decompiler. This would also allow these comments to later be displayed in the IL/VB/etc. decompiler too.

IL offsets (or indices) would make a lot of sense.

However, the IL offset approach comes with its own fair share of challenges to overcome. How do we handle bodies that have changes in the IL making the previous offset invalid (pointing to the middle of an instruction or maybe even out of bounds of the body) or pointing to a different expression?

If the IL is changed externally and the file is reloaded, I think it's entirely justifiable that comments inside the changed method body will never be restorable. There's just too much that could change.

If the IL is changed internally, e.g. due to a user modifying a method, there are some options. For example, if the code is being edited (rather than manual IL tweaks) then the comments can be put into the code editor view and translated back to the decompiled code. For manual IL tweaks it would be trickier, but if comments are matched based on source IL range and a simple hash of the IL bytes (e.g. CRC32) then it would mostly work to re-match all comments that weren't affected by the tweak.

To improve UX, each comment could be stored with a copy of the source line that the comment was attached to. When a comment cannot be matched back to the correct position, it could be displayed as a "ghost" comment (e.g. simply at the top of the method body), with a little info doodad to show the original line of code that the comment was matched to (e.g. on mouse hover). This feature would also be handy when opening a project file from another version of dnSpyEx, because you could check if the decompilation resulted in identical source being generated, and render the comment as a "ghost" (but in the same position where possible) if it doesn't match, with an info doodad that says "this comment was made on a different version and the decompilation changed since; here's what the original line of code looked like".

We cannot use the metadata tokens are they are too volatile and could change by just rewriting the assembly. Names would also work but then we are faced with the difficulties of obfuscated member names and handling them.

That's easily solved by ordinal comparison; just treat the method name as a byte array and compare it. If the bytes are different, it's a different method.

Another difficulty is designing a versatile yet not super complicated project format for dnSpy which would need to facilitate storing comments in method bodies, comments attached to just member definitions, and in the future features like non-invasive renaming of members and variables and anything else that might be required later on.

If it were me, I'd probably go with something like BSON. It's binary-safe, so would be usable in contexts like storing obfuscated names verbatim, it's compact and lightweight, it can be translated back into JSON for human readability during debugging, and it's well-supported in all major languages and has serialisation support too (and it doesn't fall into arbitrary deserialisation vulnerabilities like BinaryFormatter did). It also compresses nicely for storage on disk. CBOR and MessagePack also serve similar roles and have essentially identical properties as BSON.

mitchcapper commented 1 year ago

If it were me, I'd probably go with something like BSON.

I would shy away from binary formats and stick to json or similar. If you look at VS projects they have moved to nearly everything in text format. If a name needs to be encoded any serializer would generally handle that anyway (or if you really cared one could always base64 encode names). This allows you to always directly access the raw data as desired if needed. I don't think this sort of thing would have perf concerns to make BSON worthwhile for that reason.