dtolnay / cxx

Safe interop between Rust and C++
https://cxx.rs
Apache License 2.0

Generation of cxx::bridge from existing C++ code #235

Closed: adetaylor closed this issue 3 years ago

adetaylor commented 4 years ago

The cxx help says:

It would be reasonable to build a higher level bindgen-like tool on top of CXX which consumes a C++ header ... as source of truth and generates the cxx::bridge, eliminating the repetition while leveraging the static analysis safety guarantees of CXX.

It is looking to me like that's exactly the model we may need in order to get permission to use Rust within the large codebase in which I work.

Could we use this issue to speculate about how this might work? It's not something I'm about to start hacking together, but it's conceivable that we could dedicate significant effort in the medium term, if we can become convinced of a solid plan.

Here are my early thoughts.

Note that C++ remains the undisputed ruler in our codebase, so all of this is based around avoiding any C++ changes whatsoever. That might not be practical.
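
For concreteness, the hand-maintained bridge that such a tool would replace looks roughly like this today (a minimal sketch; the header path and names are invented, and the extern "C" block style follows the cxx syntax current at the time of this thread):

// Hand-written today; the proposed tool would instead generate this from the
// C++ header. Header path and item names are invented for illustration.
#[cxx::bridge]
mod ffi {
    extern "C" {
        include!("base/example.h");

        type ExampleClass;
        fn create_example() -> UniquePtr<ExampleClass>;
        fn count_items(example: &ExampleClass) -> usize;
    }
}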

Bits of cxx to keep unchanged

Bits where we'll need to do something different

Next steps our side

Meanwhile though I wanted to raise this to find out if this is the sort of thing you were thinking, when you wrote that comment in the docs!

dtolnay commented 4 years ago

You got it -- this is almost exactly the kind of thing I had in mind in that paragraph. The "almost" is because the idea of scanning Rust code to find which C++ functions need to be made callable is not something I had considered (but very interesting). In my organization we'd have been fine using a bindgen-style whitelist (this kind of thing, but in Buck).
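
(For readers unfamiliar with that style, here is a rough sketch of a bindgen allowlist driven from a standalone generator binary; the method names follow recent bindgen releases and the header/function names are invented. This is not the Buck integration itself.)

// Hedged sketch: restrict generation to an explicit allowlist of C++ items.
fn main() {
    let bindings = bindgen::Builder::default()
        .header("path/to/wrapper.h")      // invented path
        .allowlist_function("Function")   // regex over item names
        .generate()
        .expect("bindgen failed");
    bindings
        .write_to_file("out/bindings.rs")
        .expect("failed to write bindings");
}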

I'll loop in @abrown here since we have been poking at the concept of generation of cxx::bridge invocations recently in #228. One aspect of generation of cxx::bridge relevant to #228 is that we're free to explore making such a generator more customizable and/or opinionated than a straight one-to-one translation of C++ signatures to cxx::bridge signatures. For example it sounds like @abrown would be interested in a way to expose C++ constructors of possibly internal-pointer-y C++ types by automatically inserting UniquePtr in the right places to make them sensibly usable from Rust. I shared in that issue my considerations on not going for that kind of sugar in cxx directly, when this sort of generation of cxx::bridge is a possibility:

The core in cxx remains solely responsible for guaranteeing safety, without a high amount of complexity/sugary functionality. The higher level generator becomes responsible for ergonomics and capturing the idioms of different codebases (like #16) without risk of introducing unsafety because the safety is guaranteed by cxx::bridge as the lower level substrate.

I haven't addressed your whole post but wanted to get back to you quickly; I'll try to respond in more detail in the next couple days.

adetaylor commented 4 years ago

Great - glad I am not completely off in cloud cuckoo land. I'll look at #228 in detail as well.

When you get a chance to reply more thoroughly (no hurry) - can you comment on how you would practically expect such a higher-level translator to fit in with cxx in terms of crate and module arrangements? For example, were I to attempt something like an include_cpp! macro and/or a call_cpp! macro, which behind the scenes generated equivalent information to a cxx::bridge block, would you expect that to be done as a whole new crate separate from cxx? If so, to what extent does cxx need to be restructured to expose suitable APIs? Possibly "not at all" if the new crate macros could generate a cxx::bridge macro - is that what you were thinking?

dtolnay commented 4 years ago
  • Generation of C++ wrapper functions only for those which are actually used. [...] Call sites into C++ from Rust would need to be recognizable such that the tool could generate the minimum number of shims. (Procedural macro in statement position? cpp_call!?)

I'll share the technical constraints as far as I understand them.

By cpp_call! I understand you to mean this kind of thing:

let ret = cpp_call!(path::to::Function)(arg, arg);

// or maybe this; the difference isn't consequential
let ret = cpp_call!(path::to::Function(arg, arg));

where Function is a C++ function inside of namespace path::to.

One important constraint is that every procedural macro invocation is executed by Rust in isolation, being given access only to the input tokens of that call site and producing the output tokens for that call site. That means the various cpp_call invocations throughout a crate wouldn't be able to "collaborate" to produce one central #[cxx::bridge] module. That's not to say we won't want some kind of cpp_call macro to mark C++ call sites; I'll come back to this. It just means it wouldn't be the full story, even as a procedural macro. There would still need to be something else crawling the crate to find cpp_call call sites in order to have visibility into all of them together.

When it comes to "crawling the crate", today procedural macros are not powerful enough to do this. There is no such thing as a crate-level procedural macro (i.e. like #![mymacro] in the root module). Procedural macros operate at most on one "item" at a time, where an item is a top-level syntactic construct corresponding to one Rust keyword: for example a single fn, single struct, single impl block, single mod (inline or out-of-line; for an out-of-line module, the macro would only see mod m; as its input), etc. In the future there is a good chance that we'll have crate-level procedural macros, but I would count on this being at least 2-3 years out, likely more.
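
A hypothetical sketch of the constraint (names invented; this would live in a dedicated proc-macro crate): a function-like procedural macro only ever receives the tokens of its own invocation.

// Each cpp_call!(...) expansion runs in isolation; `input` is only the tokens
// written inside this one call site, so separate invocations have no way to
// pool their signatures into a single #[cxx::bridge] module.
use proc_macro::TokenStream;

#[proc_macro]
pub fn cpp_call(input: TokenStream) -> TokenStream {
    // A real macro would rewrite the call; passing the tokens through is
    // enough to show the shape of the API.
    input
}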

As such, the options are (1) don't crawl the crate, (2) crawl the crate using something that is not a procedural macro.


Option 1 could entail something like this:

// at the crate root

#[extern_c]
mod ffi {
    pub use path::to::Function;
    pub use some::namespace::*;
    ...
}

// in a submodule

let ret = crate::ffi::Function(arg, arg);  // or crate::ffi::path::to::Function?

The way this would operate is: #[extern_c] is a procedural macro that expands to include!(concat!(env!("OUT_DIR"), "/ffi.rs")); during macro expansion and that's it -- basically doing no work. Separately there would be an external code generator (not a procedural macro) responsible for loading the crate root (lib.rs or whatever), locating the #[extern_c] invocation in it, parsing the import paths out (which refer to C++ namespaces/items), parsing whatever C++ headers, extracting the requested signatures, converting them to a Rust #[cxx::bridge] syntax, and writing that out to a file ffi.rs in the right place.
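
To illustrate, the generated OUT_DIR/ffi.rs might contain something along these lines (a sketch; the signature is invented, and a real generator would read it out of the C++ header):

// Possible content of the generated ffi.rs, using cxx's bridge syntax as it
// stood at the time of this thread. Everything here is illustrative.
#[cxx::bridge(namespace = path::to)]
mod ffi {
    extern "C" {
        include!("path/to/header.h");

        fn Function(arg1: i32, arg2: i32) -> i32;
    }
}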

The rest is just like raw cxx. We'd run cxxbridge on the generated ffi.rs. From the user's perspective in Rust, it just looks like there's one special module mod ffi (tagged #[extern_c]) inside of which any pub use reexports refer to C++ namespaces/items rather than to the Rust module hierarchy. From outside the special module, we access those reexported C++ items normally i.e. without some cpp_call! macro involved.

The downside is that like cxx we require the human to identify in one place all the reexports they want, though unlike cxx they now write just the name of the thing, not the complete signature. The signature is kept up to date for them with the C++ header as source of truth.


Option 2 would be more like this:

// at the crate root

mod_ffi!();

// in a submodule

#[extern_c]
use path::to::Function;

let ret = Function(arg, arg);

// or without a `use`:
let ret = extern_c!(path::to::Function)(arg, arg);

In terms of how it expands, it would be quite similar to option 1. mod_ffi! becomes that same include! from before. The external code generation step, instead of looking at just the crate root, would look at all the files involved in the crate and scan them for #[extern_c] and/or extern_c! (or cpp_call! or whatever we make it) to find what signatures to include in the generated #[cxx::bridge]. During Rust compilation, extern_c would be a procedural macro which transforms use path::to::Function into use crate::ffi::path::to::Function and transforms extern_c!(path::to::Function) into crate::ffi::path::to::Function.
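
A hypothetical sketch of that #[extern_c] attribute macro (names invented; assumes a proc-macro crate with syn's "full" feature and quote as dependencies):

// Rewrites `use path::to::Function;` into `use crate::ffi::path::to::Function;`
// so the imported name resolves into the generated bridge module.
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, ItemUse};

#[proc_macro_attribute]
pub fn extern_c(_attr: TokenStream, item: TokenStream) -> TokenStream {
    let item = parse_macro_input!(item as ItemUse);
    let tree = &item.tree;
    // Re-root the original import path underneath `crate::ffi`.
    quote!(use crate::ffi::#tree;).into()
}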

From a user's perspective in Rust, they get to "import" any item directly from C++ by writing an ordinary use but with #[extern_c] on top, or by naming it inside of extern_c!(...).

In comparison to option 1, crawling the crate has the disadvantage that in the presence of macros and/or conditional compilation it isn't possible to have an accurate picture of what the exact set of input files is. Tools like srcfiles (https://github.com/Areredify/srcfiles) are able to provide an approximate answer but it's always best-effort. Look at https://github.com/rust-lang/rust/blob/438c59f010016a8a3a11fbcc4c18ae555d7adf94/library/std/src/sys/mod.rs#L25-L54 -- cfg_if is some random macro from a third party crate, and its behavior determines what paths become included in the crate, each of which likely wants a different set of C++ functions.
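
For example, something shaped like this (module names invented) is enough to make the file set depend on cfg evaluation inside a third-party macro:

// Which of these module files is part of the crate is decided by cfg_if, so a
// source scanner can only approximate the answer.
cfg_if::cfg_if! {
    if #[cfg(unix)] {
        mod ffi_unix;            // would want one set of C++ functions
        pub use ffi_unix::*;
    } else {
        mod ffi_fallback;        // would want a different set
        pub use ffi_fallback::*;
    }
}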

But depending on build system, this is a non-issue. I know in Buck we require the library author to declare what files constitute the crate, as a glob expression or as a list of files or as Python logic. It would be possible for us to use that same file list without attempting to trace the module hierarchy established by the source code.

adetaylor commented 4 years ago

Thanks, yes, that's what I was thinking. Thanks for taking the time to explain the current state of procedural macros; I appreciate it.

Option 2 seems more powerful, as it doesn't require any sort of pre-declarations at all. Since both options require an extra external code generator, it feels preferable to me to start with option 2, until or unless it's proven to be impossible?

The code generator would also need to know where to find C++ .h files, so there would also need to be an include_cpp! macro or similar. So how's about this sort of amended Option 2:

include_cpp!("base/feature_list.h")

let ret = extern_c!(base::feature_list::FeatureList::whatever)

This is beautifully similar to the way similar code would be written in C++ 👍

In the Rust build, include_cpp! expands to include!(concat!(env!("OUT_DIR"), "/base/feature_list/ffi.h.rs")) or somesuch. (I fear we might need to do something more complex around hashing the paths of all included header files and known #defines, and then load /generated_cxx/<hash-goes-here>.rs. We might need a multi-line include_cpp! macro to include multiple header files and maybe #defines too. Details...).

Then the extern_c call expands to - as you say - crate::ffi::base::feature_list::FeatureList::whatever or wherever the path ends up.

At the codegen stage, include_cpp! then knows the header files to search for the declarations of the things needed later in the extern_c call.
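
A sketch of the hashing idea above (my illustration only; DefaultHasher is deterministic within a build but not guaranteed stable across Rust releases, so a real implementation would want a stable hash shared by the macro expansion and the codegen step):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a set of included headers and #defines to the path of the generated
// bridge file, e.g. OUT_DIR/generated_cxx/<hash>.rs.
fn generated_bridge_path(out_dir: &str, headers: &[&str], defines: &[&str]) -> String {
    let mut hasher = DefaultHasher::new();
    headers.hash(&mut hasher);
    defines.hash(&mut hasher);
    format!("{}/generated_cxx/{:016x}.rs", out_dir, hasher.finish())
}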

But depending on build system, this is a non-issue. I know in Buck we require the library author to declare what files constitute the crate, as a glob expression or as a list of files or as Python logic. It would be possible for us to use that same file list without attempting to trace the module hierarchy established by the source code.

In our case, right now, we don't need to list the .rs files that are pure Rust code, but we already do need to list the files which contain FFI (so we can pass them to cxxbridge). So this is fine with us too.

How would you practically expect to structure this?

We are now imagining these build stages:

  1. Codegen stage one: generates the ffi.h.rs files. Steps 2 and 3 are blocked on this.
  2. Rust build.
  3. Codegen stage two: cxxbridge generates .cc and .h files. Step 4 is blocked on this.
  4. C++ build
  5. Linking, in some far off distant target

Is there an argument that cxxbridge be restructured to expose a Rust API such that it can be called directly from the stage 1 codegen, and thus we spit out the Rust, C++ and .h files all at once?

A few more questions/thoughts:

dtolnay commented 4 years ago
include_cpp!("base/feature_list.h")

let ret = extern_c!(base::feature_list::FeatureList::whatever);

LGTM


Is there an argument that cxxbridge be restructured to expose a Rust API such that it can be called directly from the stage 1 codegen, and thus we spit out the Rust, C++ and .h files all at once?

I am open to this. We provide something almost like this already, in the form of https://docs.rs/cxx-build/0.3.4/cxx_build/ (see https://github.com/dtolnay/cxx/tree/0.3.4#cargo-based-setup). It's a Rust entrypoint for our C++ code generator, as opposed to cxxbridge which is the command line entrypoint. But it's pretty specialized toward the Cargo build script use case. I would be open to splitting out another crate where just Rust source code + config goes in and C++ code comes out. Then the existing two entrypoints would be adjusted to depend on it.
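
For reference, the existing Cargo-oriented entrypoint looks like this in a build script ("src/ffi.rs" stands in for wherever the bridge module, hand-written or generated, ends up):

// build.rs
fn main() {
    cxx_build::bridge("src/ffi.rs")
        .flag_if_supported("-std=c++14")
        .compile("cxxbridge-demo");
}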

Alternatively, it might be reasonable for your step 1 code generator to also do the cxxbridge run from step 3 by spawning a process to run cxxbridge (by std::process::Command or whatever). They don't necessarily need to be distinct steps from the build system's perspective. It depends on the build system though, whether this makes sense. In Buck we wouldn't do this.

I would call out that having a Rust API for the cxx C++ code generator would only help if you plan to implement step 1 in Rust. I would say that's not an obvious choice to me since it's going to involve some elaborate use of libclang or libtooling which might be easier in a different language. If your step 1 code generator is not Rust, likely you'd want one of the previous two approaches: either distinct steps from the build system's perspective, or step 1 running cxxbridge as a shell command.
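
A sketch of that last route, step 1 shelling out to the cxxbridge CLI (the --header flag and the use of stdout follow the cxxbridge command-line usage as I understand it; output paths are invented):

use std::fs;
use std::process::Command;

fn run_cxxbridge(bridge_rs: &str, out_dir: &str) -> std::io::Result<()> {
    // `cxxbridge <file> --header` prints the generated C++ header to stdout;
    // without --header it prints the generated .cc implementation.
    let header = Command::new("cxxbridge").arg(bridge_rs).arg("--header").output()?;
    fs::write(format!("{}/ffi.rs.h", out_dir), header.stdout)?;

    let source = Command::new("cxxbridge").arg(bridge_rs).output()?;
    fs::write(format!("{}/ffi.rs.cc", out_dir), source.stdout)?;
    Ok(())
}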


We want to pull enums, consts and maybe even constant structures in from the C++ in the future. Again I don't see anything incompatible here but thought I'd mention.

This sounds fine to me. To the extent that any feature work would be required in cxx, it's stuff that we would want to implement even independent of this work.

adetaylor commented 4 years ago

I would call out that having a Rust API for the cxx C++ code generator would only help if you plan to implement step 1 in Rust

Interesting, I hadn't thought of that. Insofar as I'd thought about this at all, I was probably imagining forking `bindgen` (temporarily), then abstracting out [parts of the codegen side](https://github.com/rust-lang/rust-bindgen/blob/master/src/codegen/mod.rs) so that it could spit out either Rust code as it currently does, or `cxx-bridge` Rust code as we want. It looks like bindgen [supports roughly the set of features we need](https://rust-lang.github.io/rust-bindgen/cpp.html) but it's been an awfully long time since I've fiddled with it.
adetaylor commented 4 years ago

Another aspect to discuss.

Supposing (as you propose in #228) this higher-level code generator adds support for std::make_unique (which is also something we keenly want).

i.e. we want to be able to see C++ code like this:

class Foo {
public:
   Foo(std::string& label, uint32_t number);
   ...
};

and write Rust code like this:

let x = ffi::make_unique_Foo(CxxString::new("MyLabel"), 42);
// ideally, we could call this ffi::Foo::make_unique, but that's a detail

I believe the higher-level code generator would currently need to generate .cc as well as .rs code.

We'd need some C++ generated like this:

std::unique_ptr<Foo> make_unique_Foo(std::string a, uint32_t b) {
  return std::make_unique<Foo>(a,b);
}

and then we'd want this to be represented in the #[cxx::bridge] like this:

#[cxx::bridge]
mod ffi {
    extern "C" {
        type Foo;
        fn make_unique_Foo(label: CxxString, number: u32) -> cxx::UniquePtr<Foo>;
    }
}

There are three approaches. Which would you prefer?

  1. The higher-level code generator does indeed generate .cc as well as a .rs file containing the #[cxx::bridge]
  2. cxx gains (something like a) passthrough_cc!("bunch-of-c-plus-plus-code-goes-here") macro which is ignored by Rust, but picked up by cxxbridge to pass code directly through to the .cc file which it generates.
  3. cxx supports make_unique directly.
adetaylor commented 4 years ago

Other thoughts happening here as I continue to think this through:

emosenkis commented 4 years ago

Re: collecting a whitelist of C++ types/functions used in Rust code, I took a hacky, yet reasonably successful approach: try to compile the Rust crate without any of the C++ bindings included and parse the cannot find X in this scope error messages to find out which C++ items are expected to be available. This has the advantages of not requiring a macro or annotation at the call site and of making it possible to transparently replace a C++ function with a Rust implementation or vice versa. I believe the disadvantages are fairly obvious and I can't say for sure whether this approach would be robust enough to be practical in a large codebase.
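
A rough sketch of that trick, assuming the crate is checked with cargo check --message-format=json and that the relevant diagnostics contain the text "cannot find ... in this scope":

use std::process::Command;

fn main() {
    let out = Command::new("cargo")
        .args(["check", "--message-format=json"])
        .output()
        .expect("failed to run cargo");
    for line in String::from_utf8_lossy(&out.stdout).lines() {
        // Pull the missing identifier out of "cannot find value `X` in this scope".
        if let Some(pos) = line.find("cannot find ") {
            if let Some(name) = line[pos..].split('`').nth(1) {
                println!("candidate C++ item: {}", name);
            }
        }
    }
}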

adetaylor commented 4 years ago

OK, there's an early attempt at such a higher-level code generator here: https://github.com/google/autocxx. It currently depends on a (slight) fork of cxx, and a (gross, appalling, hacky, make-your-eyeballs-bleed) fork of bindgen. Comments most welcome! It's still at the stage where I'm throwing random commits at it, rather than having any kind of PR-based process, but if anyone wants to join in I can certainly grow up a bit.

adetaylor commented 3 years ago

Update for anyone reading along here.

autocxx now no longer requires a fork of either bindgen or cxx. It is still, in every other way, "toy" code. It has a large number of test cases for individual cases of interop, but I suspect everything breaks horribly when it is asked to deal with a real C++ codebase. I hope to find out in the next couple of weeks.

dtolnay commented 3 years ago

I'll close out this issue and we can keep the rest of the discussion on this topic in the autocxx repo. Thanks all!