Interest in Krakatau 2?

Storyyeller commented 2 years ago

@KOLANICH @janmm14 @QwertyYtPl @samczsun @lab313ru @Dmunch04

I've been thinking about doing a complete ground-up redesign and modernization of Krakatau, but I'm not sure if there is enough interest to justify the effort, so I was curious if anyone would be interested in such a project. One particular problem is that I haven't been active in Java reverse engineering myself since 2015 or so, so I would be reliant on users to do all the testing. What do you think?

lab313ru commented 2 years ago

Honestly, I'm not sure about the future of this project... I have preferred to use CFR in recent years.

What would be interesting is a Kotlin specific decompiler (which uses metadata, deals with lambdas, etc.)

Thank you anyway for your awesome project!

Janmm14 commented 2 years ago

I have a bunch of obscure samples for potential testing. I think that, besides support for basic invokedynamic and better display of custom/complex invokedynamic, an option to emit unambiguous imports would help a lot with user-friendliness.

I don't know about Krakatau's internals, but I liked the peephole analysis/optimizations it did, as they're unique to Krakatau.

What is your goal for rewriting Krakatau? Better code? Python 3? Fun side project?

XenoAmess commented 2 years ago

I would help if you rewrite it in Java. If it's still Python, then I could do nearly nothing.

Storyyeller commented 2 years ago

If I'm going to write anything new, it will be in Rust. But I don't need help with the coding anyway. What I need help with is testing, and in particular identifying interesting samples of obfuscated applications, figuring out where the decompiler works well or not, highlighting features that would be useful to add, etc.

KOLANICH commented 2 years ago

@Storyyeller, Krakatau was the only Java decompiler I know of that reliably and correctly decompiled code that was not originally written in Java. I use it on Kaitai Struct, a project written in Scala: I wrote Python bindings of a sort to the compiler via the JVM (I cannot share them because KS is under the GPL), and since Scala method calls map to Java method calls in obscure ways, I had to decompile the binary to find out exactly which methods to call and how. Other decompilers usually either throw exceptions on such code or emit incorrect code. So yes, there is interest in Krakatau if one is interacting with Scala code from other languages.

CFR is cool but it often doesn't decompile Scala code correctly.

One particular problem is that I haven't been active in Java reverse engineering myself since 2015 or so, so I would be reliant on users to do all the testing.

Why not use the existing test suites?

Storyyeller commented 2 years ago

Status update and some questions for you all:

The new disassembler is mostly finished and in the testing and polish stage. I haven't started the assembler or decompiler yet. I expect the assembler to take a bit longer than the disassembler and the decompiler to take longer than both put together.

Question 1:

Currently, the plan is to support three output modes:

Zip/jar output - create a zip archive with a separate file per class
Single file output - output to file with the specified filename
Directory output - create a separate file per class under the specified output directory

Directory output is the currently recommended way to use Krakatau, but it is problematic because there is no guarantee that the class names will result in valid filenames. There may be errors trying to create files with the corresponding names, or even worse, output from one class might silently overwrite another on case insensitive filesystems, e.g. Windows.
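
To illustrate the overwrite risk described above, here is a minimal Python sketch (not Krakatau code) that groups class names which would map to the same path on a case-insensitive filesystem. The function name and the use of casefold() as a stand-in for the filesystem's comparison rule are assumptions made for illustration.

```python
from collections import defaultdict


def case_insensitive_collisions(class_names):
    """Group class names that differ only by case.

    Assumption: casefold() approximates how a case-insensitive
    filesystem compares names; real filesystems have their own rules.
    """
    groups = defaultdict(list)
    for name in class_names:
        groups[name.casefold()].append(name)
    # Keep only the groups where two or more classes would share a path.
    return {key: names for key, names in groups.items() if len(names) > 1}


print(case_insensitive_collisions(["a/Foo", "a/foo", "b/Bar"]))
```

A check like this could at least warn before one class's output silently replaces another's.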

Currently, Krakatau has complicated name mangling logic to try to work around this, but this has the downsides that a) it adds a lot of complexity, and b) there is no way for users to predict where the output for a given class will actually be written to anyway.

Therefore, my plan is to remove all the name mangling logic in v2 and just say that directory output is "use at your own risk" and recommend that you use zipfile output instead. Is this ok with everyone? @Morgon

Question 2

Deployment - I never bothered to set anything fancy up packaging-wise for v1, but now that I'm rewriting it in Rust, I figure it's a good time to try to find out what people think the best distribution strategy would be.

tagging @anthraxx, @MartinThoma, @toddATavail as well since you asked about packaging issues before

KOLANICH commented 2 years ago

a) it adds a lot of complexity

As long as this complexity is isolated, it is OK.

there is no way for users to predict where the output for a given class will actually be written to anyway.

One can just create a tsv file with the pairs mangled name <-> original name.
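
A rough Python sketch of that idea follows. The mangle() function is purely hypothetical (it is not Krakatau's actual mangling logic), and the column names are assumptions; the point is only that a side file can record where each class ended up.

```python
import csv
import hashlib
import io


def mangle(class_name):
    """Hypothetical mangling: ASCII-safe lowercase stem plus a short hash."""
    digest = hashlib.sha256(
        class_name.encode("utf-8", "surrogatepass")
    ).hexdigest()[:8]
    stem = "".join(
        c if c.isascii() and c.isalnum() else "_" for c in class_name
    )
    # The hash keeps names distinct even when the stems collide.
    return f"{stem.lower()}_{digest}.j"


def write_mapping(class_names, out):
    """Write one 'mangled<TAB>original' row per class to a text stream."""
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["mangled", "original"])
    for name in class_names:
        writer.writerow([mangle(name), name])


buf = io.StringIO()
write_mapping(["a/Foo", "a/foo"], buf)
print(buf.getvalue())
```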

Therefore, my plan is to remove all the name mangling logic in v2 and just say that directory output is "use at your own risk" and recommend that you use zipfile output instead. Is this ok with everyone?

I think it is extremely inconvenient. To analyse the source, I usually unpack the zip archives that decompilers emit, and it has always felt weird that they emit an archive rather than a directory. Thank you for clarifying that aspect. While it is likely that none of the binaries I have analysed had such paths, I think the potential inability to get decompilation results in the form of a file tree is not good.

I figure it's a good time to try to find out what people think the best distribution strategy would be.

I guess one could create a GitHub Actions pipeline that publishes to GitHub Pages, generating a repository that native package managers like apt and dnf can consume. I have already done something like that, but on GitLab (and for apt only). Since you are going to use Rust, https://github.com/mmstick/cargo-deb may be helpful.

Janmm14 commented 2 years ago

Krakatau is currently used in some GUIs for decompilation. They usually ask for decompilation/disassembly of a single class file. To make this easy for potentially badly named class files (there were multiple issues with that in the past), I'd suggest allowing a single class file as input together with an explicitly specified output file name, where the name of the input file is ignored.

Storyyeller commented 2 years ago

I'm already planning to do that. The question is whether directory output is also necessary.

xxDark commented 2 years ago

I just heard that Krakatau is being rewritten in Rust. Would it be possible to add JNI bindings for easier use? That would be very handy. Thank you for your work!

KOLANICH commented 2 years ago

@xxDark, you may want to try GraalVM with the GraalPython module. It lets you use not only Python from Java and Java from Python, but also other languages, like JS.

xxDark commented 2 years ago

@xxDark, you may want to try GraalVM with the GraalPython module. It lets you use not only Python from Java and Java from Python, but also other languages, like JS.

I never actually looked into how Graal itself works. I might give it a shot just to see how it goes, but ultimately we would gladly wait for Krakatau 2 :)

Storyyeller commented 2 years ago

I briefly looked at the Recaf repository, and it looks like it is mostly focused on disassembly and reassembly rather than decompilation, correct? I expect it to take much longer to rewrite the Krakatau decompiler than the assembler and disassembler, so I was wondering if you would be interested in trying it out once I finish the assembler, even without the decompiler being rewritten. It seems like just having access to the Krakatau assembler and disassembler would already be very useful for you.

As for JNI support, that's something I might consider later, but it's not an immediate priority. I think calling it as a subprocess would be easiest for now.

Storyyeller commented 2 years ago

By the way, one other question for you all - I've been thinking about removing the ACC_SUPER and strictfp flags when disassembling in non-roundtrip mode since those flags don't actually do anything in modern Java and just add visual noise to the disassembly which might make it harder to understand. What do you think?

KOLANICH commented 2 years ago

I don't have any strong opinion on that since I'm not familiar with those implementation details.

don't actually do anything in modern Java

I guess if the code is intended to be executed on the versions of Java where they do something (does the bytecode format have any mechanism to indicate that?), they should be kept.

Col-E commented 2 years ago

I briefly looked at the Recaf repository, and it looks like it is mostly focused on disassembly and reassembly rather than decompilation, correct? ... It seems like just having access to the Krakatau assembler and disassembler would already be very useful for you.

The current master branch contains the current 2X release source, which has a really crappy assembler in it. We're focusing our efforts on getting 3X ready for release, and on that branch we recently invested in making a new assembler. In addition to not being crap, it offers some quality-of-life features like in-line expressions and name-based variable access: stuff to make the bytecode a bit more accessible to new users. With that in mind, I don't think we would get a lot of value from a new assembler at the moment, especially if it were to require a layer of interop and not support the features our current model operates off of.

As everyone else has said thus far, we still look forward to whatever progress gets made on Krakatau 2 :)

KOLANICH commented 2 years ago

BTW for parsing java bytecode you can try to utilize Kaitai Struct

Storyyeller commented 2 years ago

The main advantage of Krakatau is full support for every part of the classfile format, as well as support for bytecode that makes use of a number of bugs and undocumented features in older versions of the JVM. Admittedly, that's not so relevant nowadays. It definitely prioritizes control over low level details over beginner-friendliness though - it's more aimed at bytecode hackers who are already familiar with how Java bytecode works.

Storyyeller commented 2 years ago

Status update: I started work on the assembler today.

Janmm14 commented 2 years ago

I briefly looked at the Recaf repository, and it looks like it is mostly focused on disassembly and reassembly rather than decompilation, correct? I expect it to take much longer to rewrite the Krakatau decompiler than the assembler and disassembler, so I was wondering if you would be interested in trying it out once I finish the assembler, even without the decompiler being rewritten. It seems like just having access to the Krakatau assembler and disassembler would already be very useful for you.

I use Recaf as a GUI for decompilers when the very old Helios 0.0.7 fails me. Recaf aims to read zip files the way the JVM does and uses CAFED00D to normalize bytecode.

Storyyeller commented 2 years ago

Do you have any samples handy that I can use to test the zipfile reading issue you mentioned?

Janmm14 commented 2 years ago

Do you have any samples handy that I can use to test the zipfile reading issue you mentioned?

Actually, I was wrong about that part: Recaf doesn't include such a feature, and I'm not sure whether I have come across such a jar so far, or whether the current JVM opens the initial jar any differently from its Java zip implementation at all. However, I have encountered some zip files that refuse to open in many non-Java programs. One such zip file is https://www.mediafire.com/file/3wybtz4uu152fk1/Origin_Realms.jar.src.zip/file (it doesn't contain class files, though); another one I remember is grafik, but you need an account to download it from the page that comes up when you google it.

Col-E commented 2 years ago

Actually I was wrong with that part, recaf doesn't include such a feature

What do you mean? The 3X branch does read zip files as the JVM does. Both 2X and 3X include bytecode normalization (Removing intentionally malformed attributes that aren't used at runtime in order to crash reverse engineering tools/libraries).

Do you have any samples handy

The library we made to read zip files as the JVM does has some samples in the test directory. See the *-trick jars. https://github.com/Col-E/LL-Java-Zip/tree/master/src/test/resources

The main point is that most zip parsers scan forward for section header signatures. The JVM instead looks for the "end of central directory" record by searching backwards, because that is optimal: the record is found at the end of the file. Now consider what happens if you use a hex editor to put two zip files together. Most tools will read/display the one at the beginning, but the JVM will read the one at the end. You can add some extra tricks to make for a confusing archive, but this is the gist.
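
The two-archives-glued-together case can be reproduced with a small Python sketch. This is an illustration under assumptions, not how the JVM or LL-Java-Zip are implemented: Python's zipfile module is used here only because it also locates the end-of-central-directory record from the end of the file, and first_entry_forward_scan is a deliberately naive, hypothetical forward scanner.

```python
import io
import struct
import zipfile


def make_zip(name, data):
    """Build a one-entry zip archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, data)
    return buf.getvalue()


def first_entry_forward_scan(blob):
    """Naive forward scan: read the name in the first local file header."""
    pos = blob.find(b"PK\x03\x04")
    if pos < 0:
        raise ValueError("no local file header found")
    # Name length and extra-field length sit at offset 26 of the header.
    name_len, _extra_len = struct.unpack_from("<HH", blob, pos + 26)
    return blob[pos + 30:pos + 30 + name_len].decode("utf-8")


def entries_from_end(blob):
    """List entries via the central directory found from the file's end."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.namelist()


glued = make_zip("first.txt", b"A") + make_zip("second.txt", b"B")
print(first_entry_forward_scan(glued))
print(entries_from_end(glued))
```

The forward scan reports the entry of the first archive, while the end-based reader only sees the second one, which is the kind of disagreement the *-trick jars exploit.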

Janmm14 commented 2 years ago

@Col-E I was just using GitHub's search, so it doesn't check branches. Also, he didn't ask for abnormal bytecode, so my answer was only about zip files.

fee1-dead commented 1 year ago

The CharMatcher$Invisible class from guava is interesting: https://github.com/google/guava/blob/c111c0150225739b3f5914d1739cd22fb692bce7/guava/src/com/google/common/base/CharMatcher.java#L1459-L1476

I was writing my own Rust library for parsing class files and hit this. That string contains an unpaired surrogate code point, so the decoded result is not valid UTF-8.
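
The problem can be shown in a few lines of Python. Class files store strings in a modified UTF-8 in which a lone surrogate is written as three bytes; the sketch below only covers that lone-surrogate case (it does not handle the other modified UTF-8 differences, such as the two-byte NUL or six-byte surrogate pairs), and the helper names are made up for illustration.

```python
# Three-byte encoding of the unpaired surrogate U+D800, as it can
# appear in a class file constant.
RAW = b"\xed\xa0\x80"


def decode_strict(data):
    """Strict UTF-8 decoding; returns None if the bytes are rejected."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_keep_surrogates(data):
    """Decode while letting lone surrogates through unchanged."""
    return data.decode("utf-8", errors="surrogatepass")


print(decode_strict(RAW))
print(repr(decode_keep_surrogates(RAW)))
```

A strict decoder rejects the bytes, so a parser that wants to round-trip such constants needs a string representation that can hold unpaired surrogates, or has to keep the raw bytes.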

Storyyeller commented 1 year ago

Status update: I finished the initial version of the new assembler and started testing it today.

Storyyeller commented 1 year ago

I have finished testing the assembler and disassembler and think they are ready for public testing now. Anyone interested in trying them out?

New features:

Much, much faster than before
Support for new-style Code attrs in pre-45.3 classfiles. For pre-45.3 classfiles, the disassembler will try disassembling the bytecode in both formats. If only one of them is valid, it will just use that. If both are valid (which can only happen for specially crafted classfiles like my Invisible Crackme), it will print out a warning and display the disassembly of the old-style parse. You can pass the new --no-short-code-attr option to force it to disassemble pre-45.3 classfiles using only the new style. As before, the assembler will output old-style Code attrs by default for pre-45.3 classfiles. However, you can use the new style instead by using the new "long" option after the ".code" directive. (Also backported to Krakatau 1)
Improved constant pool allocation algorithm fixes rare case where a specially crafted combination of ldc and wide raw .consts could cause Krakatau 1 assembler to fail even though there is a valid way to allocate the constants.
Various cases of invalid attributes are now supported that Krakatau 1 did not support. (e.g. Record attribute inside a Record attribute)
Support for bytecode features up to Java 18 (Also backported to Krakatau 1)
Miscellaneous bugfixes (Also backported to Krakatau 1)
Improved disassembler output - disassembler no longer writes out trailing whitespace at the end of each line.
Disassembler supports disassembling multiple classes to a single combined .j file (Also backported to Krakatau 1)

Backwards incompatible changes: I deliberately kept the assembler syntax as close as possible and tried to maintain as much backwards compatibility as possible. However, I decided to make a few extremely minor simplifications to the assembler syntax in order to simplify the implementation:

Subattributes of a Code attribute must now appear at the end of the attribute, following all bytecode, .catch, and .stack directives. This is already how the disassembler disassembles things, so only hand-written assembly files might be affected.
Float literals must now end in lowercase f (previously either case worked)
Long literals must now end in uppercase L (previously either case worked)
Hexadecimal float and double literals can no longer have a mantissa more than 64 bits long (for example, "0x10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000p-1474" was previously accepted but is now invalid.)
+/- infinity float/double literals must now be exactly "Infinity" (case sensitive). Previously any capitalization was accepted.
NaN float/double literals must now be exactly "NaN" (case sensitive). Previously any capitalization was accepted.
Escape sequences in string literals in the assembler are stricter. Only \\, \n, \r, \t, \', \", \u, \U, and \x are supported now. Previously, the assembler would support anything that the (extremely lax) Python string literal syntax would support. The assembler still supports both ' and " strings as well as binary strings.
Command line parameters in long form now take two dashes instead of one, as is the common practice (e.g. "--roundtrip" instead of "-roundtrip")
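
For anyone migrating hand-written assembly files, the stricter escape rule above could be checked with a small Python sketch like the following. It is not part of Krakatau; it only encodes the list of escapes given above, takes the body of a string literal without its quotes, and does not validate the digits that follow \u, \U, or \x.

```python
# Characters that may follow a backslash, per the list above.
ALLOWED_ESCAPES = set("\\nrt'\"uUx")


def unsupported_escapes(body):
    """Return (index, escape) pairs for escapes outside the allowed set.

    A trailing lone backslash is reported as well.
    """
    found = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            nxt = body[i + 1] if i + 1 < len(body) else ""
            if nxt not in ALLOWED_ESCAPES:
                found.append((i, "\\" + nxt))
            i += 2  # skip the backslash and the character it escapes
        else:
            i += 1
    return found


print(unsupported_escapes(r"a\nb\0c\q"))
```

Escapes such as \0 or \a, which Python string syntax accepts, would be flagged by this check.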

XenoAmess commented 1 year ago

Hi. You said you would use Rust, so is there any way to call your program from the Java side? Like a JNI jar or something?

Storyyeller commented 1 year ago

I haven't tried to implement JNI, but it shouldn't be hard to do, at least for a barebones API. It mostly depends on how much interest there is in it.

XenoAmess commented 1 year ago

@Storyyeller good, so why not open the krakatau2 repo yet?

Janmm14 commented 1 year ago

@XenoAmess I guess because things are still quite in flux, as the biggest task, the decompiler, is not done yet.

fee1-dead commented 1 year ago

I am interested in using the assembler as a compiler backend, so I would like to try it. Since I will be using it from Rust, I guess a builder API would be beneficial.

Storyyeller commented 1 year ago

Right now the assembler is designed to convert textual Krakatau assembly files to classfiles. I think building a programmatic classfile generation API would be a very different project. The API design would probably have to depend a lot on your particular needs as well.

Storyyeller commented 1 year ago

So is anyone interested in trying out the new assembler and disassembler?

Geolykt commented 1 year ago

I definitely am

XenoAmess commented 1 year ago

So is anyone interested in trying out the new assembler and disassembler?

I'm sort of interested.

branchmispredictor commented 1 year ago

I am interested, if it is possible to start writing some de-obfuscation plugins on top of the disassembler - I have a few ideas based on a recent project.

Janmm14 commented 1 year ago

I am interested, if it is possible to start writing some de-obfuscation plugins on top of the disassembler - I have a few ideas based on a recent project.

No, Krakatau does not want to do deobfuscation.

Krakatau is a (dis-)assembler and a decompiler. If you want to do deobfuscation go to java-deobfuscator or naruumi deobfuscator on github.

branchmispredictor commented 1 year ago

I am interested, if it is possible to start writing some de-obfuscation plugins on top of the disassembler - I have a few ideas based on a recent project.

No, Krakatau does not want to do deobfuscation.

Krakatau is a (dis-)assembler and a decompiler. If you want to do deobfuscation go to java-deobfuscator or naruumi deobfuscator on github.

I think you misunderstood - the deobfuscation is my own project where I would be using the disassembler as a front-end.

Storyyeller commented 1 year ago

The new assembler and disassembler are now ready for public testing! To try them out, check out the new v2 branch, then run cargo build --release. A binary will appear at "target/release/krak2". To use the assembler or disassembler, run krak2 asm and krak2 dis, respectively.
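
For tools that want to drive the binary as a subprocess, a minimal Python sketch might look like this. The binary path and the "dis" subcommand come from the instructions above and "--roundtrip" from the earlier notes; the "--out" option and the argument order are assumptions not stated in this thread, so check krak2's own help output before relying on them.

```python
import subprocess
from pathlib import Path

# Path produced by `cargo build --release`, per the instructions above.
KRAK2 = Path("target/release/krak2")


def build_dis_command(input_path, output_path, roundtrip=False, krak2=KRAK2):
    """Assemble the argument list for a disassembler run.

    Assumption: the output location is passed with --out.
    """
    cmd = [str(krak2), "dis", "--out", str(output_path)]
    if roundtrip:
        cmd.append("--roundtrip")
    cmd.append(str(input_path))
    return cmd


def disassemble(input_path, output_path, roundtrip=False, krak2=KRAK2):
    """Run the disassembler and raise CalledProcessError on failure."""
    cmd = build_dis_command(input_path, output_path, roundtrip, krak2)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


print(build_dis_command("Foo.class", "out.zip", roundtrip=True))
```

Keeping the command construction separate from the call makes it easy to log or test the exact invocation without running the binary.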

Please let me know what you think!

Geolykt commented 1 year ago

I did a quick roundtrip on one of the largest classes I have at my disposal and found that the constant pool does not have the same order after a roundtrip. I'll eventually look at it in more detail, but considering that the line count matches up, it shouldn't be anything too wild that is broken right now.

Another slight complaint I have is that the GPL is kind of restrictive if I were to make software that depends on Krakatau, which would make it somewhat unviable for what I had in mind.

Storyyeller commented 1 year ago

Does it roundtrip successfully with Krakatau 1?

Geolykt commented 1 year ago

Without the roundtrip flag on the disassembler, no

Storyyeller commented 1 year ago

Of course it won't roundtrip exactly without the roundtrip flag. That's the whole reason that flag even exists!

Geolykt commented 1 year ago

Yeah, I wasn't aware that the roundtrip flag was a thing in v2 too because for me it doesn't make much sense to not have it roundtrip. But I guess there is a reason for it being like it is.

Sorry about that!

Storyyeller commented 1 year ago

The reason is that outputting the information required to preserve constant pool ordering makes the code a lot less readable for people who don't care about the low level encoding details. Both modes have their uses.

Geolykt commented 1 year ago

Finally dedicated some time to looking a bit more deeply into Krakatau 2.

The .fieldattributes thing seems to be on the same line as the field declaration, so you get things like

.field private lineMappings Ljava/util/Map; .fieldattributes
    .signature Ljava/util/Map<Ljava/lang/String;[I>;
.end fieldattributes

Though I guess it isn't the end of the world considering that there does not seem to be an .end field, but is still a bit confusing at first. I guess .fieldattributes shouldn't be needed, but I am likely to be wrong on that front given my lack of expertise

Storyyeller commented 1 year ago

The syntax has been the same since Krakatau 1. I agree that it's not ideal, but I guess that's what happens when you need to cram an attribute list somehow into what was previously a one line directive (inherited from Jasmin syntax).

Geolykt commented 1 year ago

Figured as such

KOLANICH commented 1 year ago

I have tried disassembling the Scala code relevant to me with both the Python and Rust versions. Both worked without errors, but the Rust version emitted prettier code; unfortunately, we cannot directly compare them with diffs.

The Rust version is significantly faster, but not by orders of magnitude: 4.7s (optimized build, with almost no difference from the unoptimized one) vs 9.4s (CPython 3.9). The jar is small enough to fit into the FS cache: 1.5 MiB.

The Rust version is bloated. Rust as a language is incapable of proper dynamic linking and code reuse, so it fetches and rebuilds all the dependencies. When building this package, the most controversial dependencies it fetches are those of the zip crate that are likely never used within jars: zstd, aes and so on.

I guess the next goals can be:

Making the output of the Python and Rust versions of Krakatau comparable.
Allowing the Python decompiler to consume the zip archives created by the disassembler, and ensuring that it outputs the same code when fed archives produced by the Python and Rust versions.
Optimizing the single-threaded implementation so that "fast" Rust really works orders of magnitude faster than slow interpreted CPython 3.9.
Maybe creating a cffi API for the disassembler and integrating it into the Python decompiler.

.stack_size(256 * 1024 * 1024)

I own a machine from 2001 where 256 MiB used to be the whole physical RAM. I have upgraded it to 512 MiB and used the Python version of Krakatau on it successfully (from a graphical LXQt session, so as you can guess, quite some of the RAM was consumed by the GUI apps that are part of LXQt).

Storyyeller / Krakatau

Interest in Krakatau 2? #185
