TheDan64 / inkwell

It's a New Kind of Wrapper for Exposing LLVM (Safely)
https://thedan64.github.io/inkwell/
Apache License 2.0

View function disassembly or raw instructions? #184

Open novacrazy opened 4 years ago

novacrazy commented 4 years ago

I'm about to start using Inkwell for a highly-optimized JIT system, and it would be great if there were a way to view the resulting compiled code, or even just to get a pointer and length for where the code lives so I can read it directly.

I'm aware of the print_to_string/print_to_stderr methods on FunctionValue, but those only seem to print the raw LLVM IR.

Without access to horizontal vector ops, I'm hoping LLVM can autovectorize vector sums and products well enough, but there's no way to know without seeing the resulting instructions.

Please let me know if I'm missing something obvious! Also if you have any ideas for autovectorization or horizontal vector ops, I'd love to hear them.

Here is the kind of thing I plan to do:

```rust
fn simd_extract<'ctx>(
    cg: &CodeGen<'ctx>,
    ty: &types::JITTypes<'ctx>,
    x: VectorValue<'ctx>,
    lane: u64,
) -> FloatValue<'ctx> {
    cg.builder
        .build_extract_element(x, ty.i32_t.const_int(lane, false), &format!("lane_{}", lane))
        .into_float_value()
}

fn build_dot_product<'ctx>(
    cg: &CodeGen<'ctx>,
    ty: &types::JITTypes<'ctx>,
    a: VectorValue<'ctx>,
    b: VectorValue<'ctx>,
) -> FloatValue<'ctx> {
    let product = cg.builder.build_float_mul(a, b, "product");

    let x = simd_extract(cg, ty, product, 0);
    let y = simd_extract(cg, ty, product, 1);
    let z = simd_extract(cg, ty, product, 2);
    let w = simd_extract(cg, ty, product, 3);

    let xy = cg.builder.build_float_add(x, y, "xy");
    let xyz = cg.builder.build_float_add(xy, z, "xyz");

    cg.builder.build_float_add(xyz, w, "xyzw")
}
```

which results in this LLVM IR:

```llvm
define float @dot_product(<4 x float> %0, <4 x float> %1) {
entry:
  %product = fmul <4 x float> %0, %1
  %lane_0 = extractelement <4 x float> %product, i32 0
  %lane_1 = extractelement <4 x float> %product, i32 1
  %lane_2 = extractelement <4 x float> %product, i32 2
  %lane_3 = extractelement <4 x float> %product, i32 3
  %xy = fadd float %lane_0, %lane_1
  %xyz = fadd float %lane_2, %xy
  %xyzw = fadd float %lane_3, %xyz
  ret float %xyzw
}
```

This IR was printed after running the optimization passes shown in the Kaleidoscope demo, which didn't seem to change much. Adding the two "vectorize" passes didn't seem to do anything either.
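For reference, the pass setup being described would look something like this. This is a sketch only: the exact Kaleidoscope pass list is assumed, as are the `module` and `dot_product_fn` bindings.

```rust
use inkwell::passes::PassManager;

// Sketch: a Kaleidoscope-style function pass manager, plus the two
// vectorize passes mentioned above. `module` and `dot_product_fn` are
// assumed to come from the surrounding codegen.
let fpm = PassManager::create(&module);
fpm.add_instruction_combining_pass();
fpm.add_reassociate_pass();
fpm.add_gvn_pass();
fpm.add_cfg_simplification_pass();
fpm.add_slp_vectorize_pass();
fpm.add_loop_vectorize_pass();
fpm.initialize();

// Returns true if the passes actually changed the function.
fpm.run_on(&dot_product_fn);
```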
novacrazy commented 4 years ago

After more research, it seems horizontal ops are a bit weird in general, and even Rust's simd_reduce_add_unordered, which compiles to @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32, results in a simple add/shuffle/extract algorithm. I had been expecting hadd instructions or something.

nlewycky commented 4 years ago

hadd is slow on real hardware; it's only useful as a size optimization.

LLVM has the @llvm.experimental.vector.reduce intrinsics that you've identified, which are part of an effort to improve support for horizontal/reduce operations on vectors. Taking a quick look at the LLVM 10 source, I don't believe any optimization pass emits those intrinsics yet, but you can use them yourself. LLVM also exposes native instructions through target-specific intrinsics like @llvm.x86.sse3.hadd.ps, if you need them.
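For what it's worth, the experimental intrinsic can be declared and called from Inkwell by name like any other function. A minimal sketch, reusing the hypothetical `cg`/`ty` helpers from the snippet above; the `f32_t`/`f32x4_t` fields are assumptions, and the builder calls match the non-Result Inkwell API used earlier:

```rust
// Sketch: declare @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32 by name
// and call it. LLVM recognizes intrinsics purely by their mangled names, so
// no special declaration mechanism is needed.
fn build_reduce_fadd<'ctx>(
    cg: &CodeGen<'ctx>,
    ty: &types::JITTypes<'ctx>,
    v: VectorValue<'ctx>,
) -> FloatValue<'ctx> {
    let name = "llvm.experimental.vector.reduce.v2.fadd.f32.v4f32";

    // Reuse the declaration if it already exists in the module.
    let reduce = cg.module.get_function(name).unwrap_or_else(|| {
        // float (float, <4 x float>) -- the scalar operand is the start value.
        let fn_ty = ty.f32_t.fn_type(&[ty.f32_t.into(), ty.f32x4_t.into()], false);
        cg.module.add_function(name, fn_ty, None)
    });

    // Note: without fast-math flags this is an ordered reduction from 0.0.
    cg.builder
        .build_call(reduce, &[ty.f32_t.const_zero().into(), v.into()], "sum")
        .try_as_basic_value()
        .left()
        .unwrap()
        .into_float_value()
}
```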

I suggest writing the natural code in target-neutral LLVM IR, including the experimental vector reduce intrinsics, and once you have the resulting assembly, seeing if you can beat it. Previously I'd have recommended the Intel Architecture Code Analyzer for analyzing assembly performance, but its webpage now redirects to llvm-mca: https://llvm.org/docs/CommandGuide/llvm-mca.html

novacrazy commented 4 years ago

Well, the original point of the issue still stands. How would I go about viewing the generated machine code from Inkwell?

In fact, I'm also not sure how to inject arbitrary LLVM IR other than by creating an entirely new module out of it.

TheDan64 commented 4 years ago

I'm not certain you can do the former at the moment. For the latter, maybe ~~Module::parse_bitcode_from_buffer~~ Context::create_module_from_ir? I don't think you can just inject IR into the module other than creating it from scratch.

novacrazy commented 4 years ago

I'll have to experiment with that. Perhaps I'll handwrite a few modules for common ops and rely on link_in_module to combine them with the generated code.
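A minimal sketch of that experiment, assuming a made-up IR string and module names; the round trip goes MemoryBuffer → Context::create_module_from_ir → Module::link_in_module:

```rust
use inkwell::context::Context;
use inkwell::memory_buffer::MemoryBuffer;

// Sketch: parse handwritten IR into a fresh module, then link it into the
// main module. The IR string here is a placeholder.
let context = Context::create();
let main_module = context.create_module("main");

let ir = r#"
define float @one() {
entry:
  ret float 1.0
}
"#;

let buffer = MemoryBuffer::create_from_memory_range(ir.as_bytes(), "handwritten");
let extra = context
    .create_module_from_ir(buffer)
    .expect("failed to parse IR");

// After linking, @one is callable from code generated into main_module.
main_module
    .link_in_module(extra)
    .expect("failed to link modules");
```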

As for viewing the assembly, perhaps reinterpreting the raw function pointer in JitFunction as a slice of bytes and searching for a ret instruction could work to get a range, depending on what underlying code LLVM actually provides. I mean, it's still just raw bytes at that point, but it's a start. Nevermind, dumb idea.

novacrazy commented 4 years ago

Oh. Of course, this is already available.

```rust
use inkwell::targets::{
    CodeModel, FileType, InitializationConfig, RelocMode, Target, TargetMachine,
};
use inkwell::OptimizationLevel;

Target::initialize_native(&InitializationConfig::default())
    .expect("Failed to initialize native target");

let triple = TargetMachine::get_default_triple();
let cpu = TargetMachine::get_host_cpu_name().to_string();
let features = TargetMachine::get_host_cpu_features().to_string();

let target = Target::from_triple(&triple).unwrap();
let machine = target
    .create_target_machine(
        &triple,
        &cpu,
        &features,
        OptimizationLevel::Aggressive,
        RelocMode::Default,
        CodeModel::Default,
    )
    .unwrap();

// create a module and do JIT stuff

machine
    .write_to_file(&module, FileType::Assembly, "out.asm".as_ref())
    .unwrap();
```

So yeah, it took me a while to figure out how, but it does indeed save the whole assembly with labels, attributes, and so forth.

It also confirms that it's producing highly-optimized machine code just like I hoped.
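If writing a file out is undesirable, the same output can be captured in memory with TargetMachine::write_to_memory_buffer; a small sketch, reusing `machine` and `module` from above:

```rust
// Sketch: emit the assembly into an in-memory buffer instead of a file,
// reusing `machine` and `module` from the snippet above.
let buffer = machine
    .write_to_memory_buffer(&module, FileType::Assembly)
    .expect("failed to emit assembly");

let asm = std::str::from_utf8(buffer.as_slice()).expect("assembly was not UTF-8");
println!("{}", asm);
```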

However, better documentation around target machines would be very helpful. Is a TargetMachine stateful? Does it actually affect codegen? Other than exporting that module, it doesn't touch the JIT code, so its effect is unknown.

You're welcome to close this if this solution is acceptable, though my questions still stand.