dtolnay / cxx

Safe interop between Rust and C++
https://cxx.rs
Apache License 2.0

Generation of cxx::bridge from existing C++ code #235

Closed: adetaylor closed this issue 3 years ago

adetaylor commented 4 years ago

The cxx help says:

It would be reasonable to build a higher level bindgen-like tool on top of CXX which consumes a C++ header ... as source of truth and generates the cxx::bridge, eliminating the repetition while leveraging the static analysis safety guarantees of CXX.

It is looking to me like that's exactly the model we may need in order to get permission to use Rust within the large codebase in which I work.

Could we use this issue to speculate about how this might work? It's not something I'm about to start hacking together, but it's conceivable that we could dedicate significant effort in the medium term, if we can become convinced of a solid plan.

Here are my early thoughts.

Note that C++ remains the undisputed ruler in our codebase, so all of this is based around avoiding any C++ changes whatsoever. That might not be practical.
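
For concreteness, the hand-maintained bridge that such a tool would replace looks roughly like this today (a minimal sketch; the header path and names are invented, and the extern "C" block style follows the cxx syntax current at the time of this thread):

// Hand-written today; the proposed tool would instead generate this from the
// C++ header. Header path and item names are invented for illustration.
#[cxx::bridge]
mod ffi {
    extern "C" {
        include!("base/example.h");

        type ExampleClass;
        fn create_example() -> UniquePtr<ExampleClass>;
        fn count_items(example: &ExampleClass) -> usize;
    }
}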

Bits of cxx to keep unchanged

Bits where we'll need to do something different

Next steps our side

Meanwhile though I wanted to raise this to find out if this is the sort of thing you were thinking, when you wrote that comment in the docs!

dtolnay commented 4 years ago

You got it -- this is almost exactly the kind of thing I had in mind in that paragraph. The "almost" is because the idea of scanning Rust code to find which C++ functions need to be made callable is not something I had considered (but very interesting). In my organization we'd have been fine using a bindgen-style whitelist (this kind of thing, but in Buck).
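
(For readers unfamiliar with that style, here is a rough sketch of a bindgen allowlist driven from a standalone generator binary; the method names follow recent bindgen releases and the header/function names are invented. This is not the Buck integration itself.)

// Hedged sketch: restrict generation to an explicit allowlist of C++ items.
fn main() {
    let bindings = bindgen::Builder::default()
        .header("path/to/wrapper.h")      // invented path
        .allowlist_function("Function")   // regex over item names
        .generate()
        .expect("bindgen failed");
    bindings
        .write_to_file("out/bindings.rs")
        .expect("failed to write bindings");
}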

I'll loop in @abrown here since we have been poking at the concept of generation of cxx::bridge invocations recently in #228. One aspect of generation of cxx::bridge relevant to #228 is that we're free to explore making such a generator more customizable and/or opinionated than a straight one-to-one translation of C++ signatures to cxx::bridge signatures. For example it sounds like @abrown would be interested in a way to expose C++ constructors of possibly internal-pointer-y C++ types by automatically inserting UniquePtr in the right places to make them sensibly usable from Rust. I shared in that issue my considerations on not going for that kind of sugar in cxx directly, when this sort of generation of cxx::bridge is a possibility:

The core in cxx remains solely responsible for guaranteeing safety, without a high amount of complexity/sugary functionality. The higher level generator becomes responsible for ergonomics and capturing the idioms of different codebases (like #16) without risk of introducing unsafety because the safety is guaranteed by cxx::bridge as the lower level substrate.

I haven't addressed your whole post but wanted to get back to you quickly; I'll try to respond in more detail in the next couple days.

adetaylor commented 4 years ago

Great - glad I am not completely off in cloud cuckoo land. I'll look at #228 in detail as well.

When you get a chance to reply more thoroughly (no hurry) - can you comment on how you would practically expect such a higher-level translator to fit in with cxx in terms of crate and module arrangements? For example, were I to attempt something like an include_cpp! macro and/or a call_cpp! macro, which behind the scenes generated equivalent information to a cxx::bridge block, would you expect that to be done as a whole new crate separate from cxx? If so, to what extent does cxx need to be restructured to expose suitable APIs? Possibly "not at all" if the new crate macros could generate a cxx::bridge macro - is that what you were thinking?

dtolnay commented 4 years ago
  • Generation of C++ wrapper functions only for those which are actually used. [...] Call sites into C++ from Rust would need to be recognizable such that the tool could generate the minimum number of shims. (Procedural macro in statement position? cpp_call!?)

I'll share the technical constraints as far as I understand them.

By cpp_call! I understand you to mean this kind of thing:

let ret = cpp_call!(path::to::Function)(arg, arg);

// or maybe this; the difference isn't consequential
let ret = cpp_call!(path::to::Function(arg, arg));

where Function is a C++ function inside of namespace path::to.

One important constraint is that every procedural macro invocation is executed by Rust in isolation, being given access only to the input tokens of that call site and producing the output tokens for that call site. That means the various cpp_call invocations throughout a crate wouldn't be able to "collaborate" to produce one central #[cxx::bridge] module. That's not to say we won't want some kind of cpp_call macro to mark C++ call sites; I'll come back to this. It just means it wouldn't be the full story, even as a procedural macro. There would still need to be something else crawling the crate to find cpp_call call sites in order to have visibility into all of them together.

When it comes to "crawling the crate", today procedural macros are not powerful enough to do this. There is no such thing as a crate-level procedural macro (i.e. like #![mymacro] in the root module). Procedural macros operate at most on one "item" at a time, where an item is a top-level syntactic construct corresponding to one Rust keyword: for example a single fn, single struct, single impl block, single mod (inline or out-of-line; for an out-of-line module, the macro would only see mod m; as its input), etc. In the future there is a good chance that we'll have crate-level procedural macros, but I would count on this being at least 2-3 years out, likely more.
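
A hypothetical sketch of the constraint (names invented; this would live in a dedicated proc-macro crate): a function-like procedural macro only ever receives the tokens of its own invocation.

// Each cpp_call!(...) expansion runs in isolation; `input` is only the tokens
// written inside this one call site, so separate invocations have no way to
// pool their signatures into a single #[cxx::bridge] module.
use proc_macro::TokenStream;

#[proc_macro]
pub fn cpp_call(input: TokenStream) -> TokenStream {
    // A real macro would rewrite the call; passing the tokens through is
    // enough to show the shape of the API.
    input
}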

As such, the options are (1) don't crawl the crate, (2) crawl the crate using something that is not a procedural macro.


Option 1 could entail something like this:

// at the crate root

#[extern_c]
mod ffi {
    pub use path::to::Function;
    pub use some::namespace::*;
    ...
}

// in a submodule

let ret = crate::ffi::Function(arg, arg);  // or crate::ffi::path::to::Function?

The way this would operate is: #[extern_c] is a procedural macro that expands to include!(concat!(env!("OUT_DIR"), "/ffi.rs")); during macro expansion and that's it -- basically doing no work. Separately there would be an external code generator (not a procedural macro) responsible for loading the crate root (lib.rs or whatever), locating the #[extern_c] invocation in it, parsing the import paths out (which refer to C++ namespaces/items), parsing whatever C++ headers, extracting the requested signatures, converting them to a Rust #[cxx::bridge] syntax, and writing that out to a file ffi.rs in the right place.
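
To illustrate, the generated OUT_DIR/ffi.rs might contain something along these lines (a sketch; the signature is invented, and a real generator would read it out of the C++ header):

// Possible content of the generated ffi.rs, using cxx's bridge syntax as it
// stood at the time of this thread. Everything here is illustrative.
#[cxx::bridge(namespace = path::to)]
mod ffi {
    extern "C" {
        include!("path/to/header.h");

        fn Function(arg1: i32, arg2: i32) -> i32;
    }
}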

The rest is just like raw cxx. We'd run cxxbridge on the generated ffi.rs. From the user's perspective in Rust, it just looks like there's one special module mod ffi (tagged #[extern_c]) inside of which any pub use reexports refer to C++ namespaces/items rather than to the Rust module hierarchy. From outside the special module, we access those reexported C++ items normally i.e. without some cpp_call! macro involved.

The downside is that like cxx we require the human to identify in one place all the reexports they want, though unlike cxx they now write just the name of the thing, not the complete signature. The signature is kept up to date for them with the C++ header as source of truth.


Option 2 would be more like this:

// at the crate root

mod_ffi!();

// in a submodule

#[extern_c]
use path::to::Function;

let ret = Function(arg, arg);

// or without a `use`:
let ret = extern_c!(path::to::Function)(arg, arg);

In terms of how it expands, it would be quite similar to option 1. mod_ffi! becomes that same include! from before. The external code generation step, instead of looking at just the crate root, would look at all the files involved in the crate and scan them for #[extern_c] and/or extern_c! (or cpp_call! or whatever we make it) to find what signatures to include in the generated #[cxx::bridge]. During Rust compilation, extern_c would be a procedural macro which transforms use path::to::Function into use crate::ffi::path::to::Function and transforms extern_c!(path::to::Function) into crate::ffi::path::to::Function.
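
A hypothetical sketch of that #[extern_c] attribute macro (names invented; assumes a proc-macro crate with syn's "full" feature and quote as dependencies):

// Rewrites `use path::to::Function;` into `use crate::ffi::path::to::Function;`
// so the imported name resolves into the generated bridge module.
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, ItemUse};

#[proc_macro_attribute]
pub fn extern_c(_attr: TokenStream, item: TokenStream) -> TokenStream {
    let item = parse_macro_input!(item as ItemUse);
    let tree = &item.tree;
    // Re-root the original import path underneath `crate::ffi`.
    quote!(use crate::ffi::#tree;).into()
}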

From a user's perspective in Rust, they get to "import" any item directly from C++ by writing an ordinary use but with #[extern_c] on top, or by naming it inside of extern_c!(...).

In comparison to option 1, crawling the crate has the disadvantage that in the presence of macros and/or conditional compilation it isn't possible to have an accurate picture of what the exact set of input files is. Tools like srcfiles (https://github.com/Areredify/srcfiles) are able to provide an approximate answer but it's always best-effort. Look at https://github.com/rust-lang/rust/blob/438c59f010016a8a3a11fbcc4c18ae555d7adf94/library/std/src/sys/mod.rs#L25-L54 -- cfg_if is some random macro from a third party crate, and its behavior determines what paths become included in the crate, each of which likely wants a different set of C++ functions.
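
For example, something shaped like this (module names invented) is enough to make the file set depend on cfg evaluation inside a third-party macro:

// Which of these module files is part of the crate is decided by cfg_if, so a
// source scanner can only approximate the answer.
cfg_if::cfg_if! {
    if #[cfg(unix)] {
        mod ffi_unix;            // would want one set of C++ functions
        pub use ffi_unix::*;
    } else {
        mod ffi_fallback;        // would want a different set
        pub use ffi_fallback::*;
    }
}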

But depending on build system, this is a non-issue. I know in Buck we require the library author to declare what files constitute the crate, as a glob expression or as a list of files or as Python logic. It would be possible for us to use that same file list without attempting to trace the module hierarchy established by the source code.

adetaylor commented 4 years ago

Thanks, yes, that's what I was thinking. Thanks for taking the time to explain the current state of procedural macros; I appreciate it.

Option 2 seems more powerful, as it doesn't require any sort of pre-declarations at all. Since both options require an extra external code generator, it feels preferable to me to start with option 2, until or unless it's proven to be impossible?

The code generator would also need to know where to find C++ .h files, so there would also need to be an include_cpp! macro or similar. So how's about this sort of amended Option 2:

include_cpp!("base/feature_list.h")

let ret = extern_c!(base::feature_list::FeatureList::whatever)

This is beautifully similar to the way similar code would be written in C++ 👍

In the Rust build, include_cpp! expands to include!(concat!(env!("OUT_DIR"), "/base/feature_list/ffi.h.rs")) or somesuch. (I fear we might need to do something more complex around hashing the paths of all included header files and known #defines, and then load /generated_cxx/<hash-goes-here>.rs. We might need a multi-line include_cpp! macro to include multiple header files and maybe #defines too. Details...).

Then the extern_c call expands to - as you say - crate::ffi::base::feature_list::FeatureList::whatever or wherever the path ends up.

At the codegen stage, include_cpp! then knows the header files to search for the declarations of the things needed later in the extern_c call.
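
A sketch of the hashing idea above (my illustration only; DefaultHasher is deterministic within a build but not guaranteed stable across Rust releases, so a real implementation would want a stable hash shared by the macro expansion and the codegen step):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a set of included headers and #defines to the path of the generated
// bridge file, e.g. OUT_DIR/generated_cxx/<hash>.rs.
fn generated_bridge_path(out_dir: &str, headers: &[&str], defines: &[&str]) -> String {
    let mut hasher = DefaultHasher::new();
    headers.hash(&mut hasher);
    defines.hash(&mut hasher);
    format!("{}/generated_cxx/{:016x}.rs", out_dir, hasher.finish())
}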

But depending on build system, this is a non-issue. I know in Buck we require the library author to declare what files constitute the crate, as a glob expression or as a list of files or as Python logic. It would be possible for us to use that same file list without attempting to trace the module hierarchy established by the source code.

In our case, right now, we don't need to list the .rs files that are pure Rust code, but we already do need to list the files which contain FFI (so we can pass them to cxxbridge). So this is fine with us too.

How would you practically expect to structure this?

We are now imagining these build stages:

  1. Codegen stage one: generates the ffi.h.rs files. Steps 2 and 3 are blocked on this.
  2. Rust build.
  3. Codegen stage two: cxxbridge generates .cc and .h files. Step 4 is blocked on this.
  4. C++ build
  5. Linking, in some far off distant target

Is there an argument that cxxbridge be restructured to expose a Rust API such that it can be called directly from the stage 1 codegen, and thus we spit out the Rust, C++ and .h files all at once?

A few more questions/thoughts:

dtolnay commented 4 years ago
include_cpp!("base/feature_list.h")

let ret = extern_c!(base::feature_list::FeatureList::whatever);

LGTM


Is there an argument that cxxbridge be restructured to expose a Rust API such that it can be called directly from the stage 1 codegen, and thus we spit out the Rust, C++ and .h files all at once?

I am open to this. We provide something almost like this already, in the form of https://docs.rs/cxx-build/0.3.4/cxx_build/ (see https://github.com/dtolnay/cxx/tree/0.3.4#cargo-based-setup). It's a Rust entrypoint for our C++ code generator, as opposed to cxxbridge which is the command line entrypoint. But it's pretty specialized toward the Cargo build script use case. I would be open to splitting out another crate where just Rust source code + config goes in and C++ code comes out. Then the existing two entrypoints would be adjusted to depend on it.
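
For reference, the existing Cargo-oriented entrypoint looks like this in a build script ("src/ffi.rs" stands in for wherever the bridge module, hand-written or generated, ends up):

// build.rs
fn main() {
    cxx_build::bridge("src/ffi.rs")
        .flag_if_supported("-std=c++14")
        .compile("cxxbridge-demo");
}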

Alternatively, it might be reasonable for your step 1 code generator to also do the cxxbridge run from step 3 by spawning a process to run cxxbridge (by std::process::Command or whatever). They don't necessarily need to be distinct steps from the build system's perspective. It depends on the build system though, whether this makes sense. In Buck we wouldn't do this.

I would call out that having a Rust API for the cxx C++ code generator would only help if you plan to implement step 1 in Rust. I would say that's not an obvious choice to me since it's going to involve some elaborate use of libclang or libtooling which might be easier in a different language. If your step 1 code generator is not Rust, likely you'd want one of the previous two approaches: either distinct steps from the build system's perspective, or step 1 running cxxbridge as a shell command.
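
A sketch of that last route, step 1 shelling out to the cxxbridge CLI (the --header flag and the use of stdout follow the cxxbridge command-line usage as I understand it; output paths are invented):

use std::fs;
use std::process::Command;

fn run_cxxbridge(bridge_rs: &str, out_dir: &str) -> std::io::Result<()> {
    // `cxxbridge <file> --header` prints the generated C++ header to stdout;
    // without --header it prints the generated .cc implementation.
    let header = Command::new("cxxbridge").arg(bridge_rs).arg("--header").output()?;
    fs::write(format!("{}/ffi.rs.h", out_dir), header.stdout)?;

    let source = Command::new("cxxbridge").arg(bridge_rs).output()?;
    fs::write(format!("{}/ffi.rs.cc", out_dir), source.stdout)?;
    Ok(())
}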


We want to pull enums, consts and maybe even constant structures in from the C++ in the future. Again I don't see anything incompatible here but thought I'd mention.

This sounds fine to me. To the extent that any feature work would be required in cxx, it's stuff that we would want to implement even independent of this work.

adetaylor commented 4 years ago

I would call out that having a Rust API for the cxx C++ code generator would only help if you plan to implement step 1 in Rust

Interesting, I hadn't thought of that. Insofar as I'd thought about this at all, I was probably imagining forking `bindgen` (temporarily), then abstracting out [parts of the codegen side](https://github.com/rust-lang/rust-bindgen/blob/master/src/codegen/mod.rs) so that it could spit out either Rust code as it currently does, or `cxx-bridge` Rust code as we want. It looks like bindgen [supports roughly the set of features we need](https://rust-lang.github.io/rust-bindgen/cpp.html) but it's been an awfully long time since I've fiddled with it.
adetaylor commented 4 years ago

Another aspect to discuss.

Supposing (as you propose in #228) this higher-level code generator adds support for std::make_unique (which is also something we keenly want).

i.e. we want to be able to see C++ code like this:

class Foo {
public:
   Foo(std::string& label, uint32_t number);
   ...
};

and write Rust code like this:

let x = ffi::make_unique_Foo(CxxString::new("MyLabel"), 42);
// ideally, we could call this ffi::Foo::make_unique, but that's a detail

I believe the higher-level code generator would currently need to generate .cc as well as .rs code.

We'd need some C++ generated like this:

std::unique_ptr<Foo> make_unique_Foo(std::string a, uint32_t b) {
  return std::make_unique<Foo>(a,b);
}

and then we'd want this to be represented in the #[cxx::bridge] like this:

#[cxx::bridge]
mod ffi {
    extern "C" {
        type Foo;
        fn make_unique_Foo(label: CxxString, number: u32) -> cxx::UniquePtr<Foo>;
    }
}

There are three approaches. Which would you prefer?

  1. The higher-level code generator does indeed generate .cc as well as a .rs file containing the #[cxx::bridge]
  2. cxx gains (something like a) passthrough_cc!("bunch-of-c-plus-plus-code-goes-here") macro which is ignored by Rust, but picked up by cxxbridge to pass code directly through to the .cc file which it generates.
  3. cxx supports make_unique directly.
adetaylor commented 4 years ago

Other thoughts happening here as I continue to think this through:

emosenkis commented 4 years ago

Re: collecting a whitelist of C++ types/functions used in Rust code, I took a hacky, yet reasonably successful approach: try to compile the Rust crate without any of the C++ bindings included and parse the cannot find X in this scope error messages to find out which C++ items are expected to be available. This has the advantages of not requiring a macro or annotation at the call site and of making it possible to transparently replace a C++ function with a Rust implementation or vice versa. I believe the disadvantages are fairly obvious and I can't say for sure whether this approach would be robust enough to be practical in a large codebase.
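
A rough sketch of that trick, assuming the crate is checked with cargo check --message-format=json and that the relevant diagnostics contain the text "cannot find ... in this scope":

use std::process::Command;

fn main() {
    let out = Command::new("cargo")
        .args(["check", "--message-format=json"])
        .output()
        .expect("failed to run cargo");
    for line in String::from_utf8_lossy(&out.stdout).lines() {
        // Pull the missing identifier out of "cannot find value `X` in this scope".
        if let Some(pos) = line.find("cannot find ") {
            if let Some(name) = line[pos..].split('`').nth(1) {
                println!("candidate C++ item: {}", name);
            }
        }
    }
}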

adetaylor commented 4 years ago

OK, there's an early attempt at such a higher-level code generator here: https://github.com/google/autocxx. It currently depends on a (slight) fork of cxx, and a (gross, appalling, hacky, make-your-eyeballs-bleed) fork of bindgen. Comments most welcome! It's still at the stage where I'm throwing random commits at it, rather than having any kind of PR-based process, but if anyone wants to join in I can certainly grow up a bit.

adetaylor commented 3 years ago

Update for anyone reading along here.

autocxx now no longer requires a fork of either bindgen or cxx. It is still, in every other way, "toy" code. It has a large number of test cases for individual cases of interop, but I suspect everything breaks horribly when it is asked to deal with a real C++ codebase. I hope to find out in the next couple of weeks.

dtolnay commented 3 years ago

I'll close out this issue and we can keep the rest of the discussion on this topic in the autocxx repo. Thanks all!