adetaylor closed this issue 3 years ago
You got it -- this is almost exactly the kind of thing I had in mind in that paragraph. The "almost" is because the idea of scanning Rust code to find which C++ functions need to be made callable is not something I had considered (but it's very interesting). In my organization we'd have been fine using a bindgen-style whitelist (this kind of thing, but in Buck).
I'll loop in @abrown here since we have been poking at the concept of generating `cxx::bridge` invocations recently in #228. One aspect of generating `cxx::bridge` relevant to #228 is that we're free to explore making such a generator more customizable and/or opinionated than a straight one-to-one translation of C++ signatures to `cxx::bridge` signatures. For example, it sounds like @abrown would be interested in a way to expose C++ constructors of possibly internal-pointer-y C++ types by automatically inserting UniquePtr in the right places to make them sensibly usable from Rust. I shared in that issue my considerations on not going for that kind of sugar in cxx directly, when this sort of generation of `cxx::bridge` is a possibility:

> The core in cxx remains solely responsible for guaranteeing safety, without a high amount of complexity/sugary functionality. The higher level generator becomes responsible for ergonomics and capturing the idioms of different codebases (like #16) without risk of introducing unsafety, because the safety is guaranteed by `cxx::bridge` as the lower level substrate.
I haven't addressed your whole post but wanted to get back to you quickly; I'll try to respond in more detail in the next couple days.
Great - glad I am not completely off in cloud cuckoo land. I'll look at #228 in detail as well.
When you get a chance to reply more thoroughly (no hurry) - can you comment on how you would practically expect such a higher-level translator to fit in with cxx in terms of crate and module arrangements? For example, were I to attempt something like an `include_cpp!` macro and/or a `call_cpp!` macro, which behind the scenes generated equivalent information to a `cxx::bridge` block, would you expect that to be done as a whole new crate separate from cxx? If so, to what extent does cxx need to be restructured to expose suitable APIs? Possibly "not at all", if the new crate's macros could generate a `cxx::bridge` macro - is that what you were thinking?
> - Generation of C++ wrapper functions only for those which are actually used. [...] Call sites into C++ from Rust would need to be recognizable such that the tool could generate the minimum number of shims. (Procedural macro in statement position? `cpp_call!`?)
I'll share the technical constraints as far as I understand them.
By `cpp_call!` I understand you to mean this kind of thing:

```rust
let ret = cpp_call!(path::to::Function)(arg, arg);
// or maybe this; the difference isn't consequential
let ret = cpp_call!(path::to::Function(arg, arg));
```

where `Function` is a C++ function inside of namespace `path::to`.
One important constraint is that every procedural macro invocation is executed by Rust in isolation, being given access only to the input tokens of that call site and producing the output tokens for that call site. That means the various `cpp_call!` invocations throughout a crate wouldn't be able to "collaborate" to produce one central `#[cxx::bridge]` module. That's not to say we won't want some kind of `cpp_call!` macro to mark C++ call sites; I'll come back to this. It just means it wouldn't be the full story, even as a procedural macro. There would still need to be something else crawling the crate to find `cpp_call!` call sites in order to have visibility into all of them together.
When it comes to "crawling the crate", today procedural macros are not powerful enough to do this. There is no such thing as a crate-level procedural macro (i.e. like `#![mymacro]` in the root module). Procedural macros operate at most on one "item" at a time, where an item is a top-level syntactic construct corresponding to one Rust keyword: for example a single `fn`, a single `struct`, a single `impl` block, a single `mod` (inline or out-of-line; for out-of-line, it would only see `mod m;` as the macro input), etc. In the future there is a good chance that we'll have crate-level procedural macros, but I would count on this being at least 2-3 years out, likely more.
As such, the options are (1) don't crawl the crate, (2) crawl the crate using something that is not a procedural macro.
Option 1 could entail something like this:

```rust
// at the crate root
#[extern_c]
mod ffi {
    pub use path::to::Function;
    pub use some::namespace::*;
    ...
}

// in a submodule
let ret = crate::ffi::Function(arg, arg); // or crate::ffi::path::to::Function?
```
The way this would operate is: `#[extern_c]` is a procedural macro that expands to `include!(concat!(env!("OUT_DIR"), "/ffi.rs"));` during macro expansion and that's it -- basically doing no work. Separately there would be an external code generator (not a procedural macro) responsible for loading the crate root (lib.rs or whatever), locating the `#[extern_c]` invocation in it, parsing the import paths out (which refer to C++ namespaces/items), parsing whatever C++ headers, extracting the requested signatures, converting them to Rust `#[cxx::bridge]` syntax, and writing that out to a file `ffi.rs` in the right place.
The rest is just like raw `cxx`. We'd run `cxxbridge` on the generated ffi.rs. From the user's perspective in Rust, it just looks like there's one special module `mod ffi` (tagged `#[extern_c]`) inside of which any `pub use` reexports refer to C++ namespaces/items rather than to the Rust module hierarchy. From outside the special module, we access those reexported C++ items normally, i.e. without some `cpp_call!` macro involved.
The downside is that, like `cxx`, we require the human to identify in one place all the reexports they want, though unlike `cxx` they now write just the name of the thing, not the complete signature. The signature is kept up to date for them with the C++ header as the source of truth.
Option 2 would be more like this:

```rust
// at the crate root
mod_ffi!();

// in a submodule
#[extern_c]
use path::to::Function;

let ret = Function(arg, arg);

// or without a `use`:
let ret = extern_c!(path::to::Function)(arg, arg);
```
In terms of how it expands, it would be quite similar to option 1. `mod_ffi!` becomes that same `include!` from before. The external code generation step, instead of looking at just the crate root, would look at all the files involved in the crate and scan them for `#[extern_c]` and/or `extern_c!` (or `cpp_call!` or whatever we make it) to find what signatures to include in the generated `#[cxx::bridge]`. During Rust compilation, `extern_c` would be a procedural macro which transforms `use path::to::Function` into `use crate::ffi::path::to::Function` and transforms `extern_c!(path::to::Function)` into `crate::ffi::path::to::Function`.
From a user's perspective in Rust, they get to "import" any item directly from C++ by writing an ordinary `use` but with `#[extern_c]` on top, or by naming it inside of `extern_c!(...)`.
In comparison to option 1, crawling the crate has the disadvantage that in the presence of macros and/or conditional compilation it isn't possible to have an accurate picture of what the exact set of input files is. Tools like srcfiles (https://github.com/Areredify/srcfiles) are able to provide an approximate answer but it's always best-effort. Look at https://github.com/rust-lang/rust/blob/438c59f010016a8a3a11fbcc4c18ae555d7adf94/library/std/src/sys/mod.rs#L25-L54 -- `cfg_if` is some random macro from a third party crate, and its behavior determines what paths become included in the crate, each of which likely wants a different set of C++ functions.
But depending on build system, this is a non-issue. I know in Buck we require the library author to declare what files constitute the crate, as a glob expression or as a list of files or as Python logic. It would be possible for us to use that same file list without attempting to trace the module hierarchy established by the source code.
Thanks, yes, that's what I was thinking. Thanks for taking the time to explain the current state of procedural macros; I appreciate it.
Option 2 seems more powerful, as it doesn't require any sort of pre-declarations at all. Since both options require an extra external code generator, it feels preferable to me to start with option 2, until or unless it's proven to be impossible?
The code generator would also need to know where to find C++ .h files, so there would also need to be an `include_cpp!` macro or similar. So how's about this sort of amended Option 2:

```rust
include_cpp!("base/feature_list.h")

let ret = extern_c!(base::feature_list::FeatureList::whatever)
```
This is beautifully similar to the way similar code would be written in C++ 👍
In the Rust build, `include_cpp!` expands to `include!(concat!(env!("OUT_DIR"), "/base/feature_list/ffi.h.rs"))` or somesuch. (I fear we might need to do something more complex around hashing the paths of all included header files and known `#define`s, and then load `/generated_cxx/<hash-goes-here>.rs`. We might need a multi-line `include_cpp!` macro to include multiple header files and maybe `#define`s too. Details...)
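The hashing idea could look something like the following sketch, where the generated file name is derived from the set of included headers plus the known defines. The path scheme and function name are hypothetical, and a real implementation would want a hash that is stable across toolchain versions, which `DefaultHasher` does not guarantee.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical sketch: derive the generated .rs file name from the headers
// and #define flags that fed the code generator, so distinct configurations
// land in distinct files.
fn generated_rs_path(headers: &[&str], defines: &[&str]) -> String {
    let mut hasher = DefaultHasher::new();
    for h in headers {
        h.hash(&mut hasher);
    }
    for d in defines {
        d.hash(&mut hasher);
    }
    format!("/generated_cxx/{:x}.rs", hasher.finish())
}

fn main() {
    let a = generated_rs_path(&["base/feature_list.h"], &["NDEBUG"]);
    let b = generated_rs_path(&["base/feature_list.h"], &["NDEBUG"]);
    let c = generated_rs_path(&["base/feature_list.h"], &[]);
    assert_eq!(a, b); // same inputs produce the same file name
    assert_ne!(a, c); // a different set of defines produces a different one
    println!("{}", a);
}
```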
Then the `extern_c` call expands to - as you say - `crate::ffi::base::feature_list::FeatureList::whatever` or wherever the path ends up.

At the codegen stage, `include_cpp!` then knows the header files to search for the declarations of the things needed later in the `extern_c` call.
> But depending on build system, this is a non-issue. I know in Buck we require the library author to declare what files constitute the crate, as a glob expression or as a list of files or as Python logic. It would be possible for us to use that same file list without attempting to trace the module hierarchy established by the source code.

In our case, right now, we don't need to list the .rs files that are pure Rust code, but we already do need to list the files which contain FFI (so we can pass them to `cxxbridge`). So this is fine with us too.
How would you practically expect to structure this?
We are now imagining these build stages:
1. Our new code generator parses the Rust source and the C++ headers, and generates the `ffi.rs` and `ffi.rs.h` files. Steps 2 and 3 are blocked on this.
2. `rustc` compiles the crate, including the generated `ffi.rs`.
3. `cxxbridge` generates `.cc` and `.h` files. Stage 4 is blocked on this.
4. The C++ build compiles those generated files.

Is there an argument that `cxxbridge` be restructured to expose a Rust API such that it can be called directly from the stage 1 codegen, and thus we spit out the Rust, C++ and .h files all at once?
A few more questions/thoughts:

- `extern_c` sounds good, but `extern_cpp` or `extern_cxx` sounds perhaps more accurate? We're going far beyond the built-in "extern C" functionality.
- In #228 you propose a `make_unique` API to enable construction of opaque C++ types from within Rust (which is also important for us). I don't see anything here which is incompatible with that plan, but I thought I'd mention it in case you see any concerns.
- We want to pull `enum`s, `const`s and maybe even constant structures in from the C++ in the future. Again I don't see anything incompatible here but thought I'd mention.
- There's going to be a lot of `bindgen`-equivalent code here. Not only does it need to do all this horrible parsing of C++ headers, but ideally (for example) we're going to need to work out which `struct`s are sufficiently cxx-compatible to be fully defined in Rust, vs which ones will need to be opaque types. We'll need to work out which functions are methods vs plain function calls etc. This plan is nothing if not ambitious!
- Because these functions already exist in C++, when they appear in a `cxx::bridge` we'll not want to include them in the generated `.h` and `.cc` files which we feed back to C++.

> ```rust
> include_cpp!("base/feature_list.h")
> let ret = extern_c!(base::feature_list::FeatureList::whatever);
> ```
LGTM
> Is there an argument that `cxxbridge` be restructured to expose a Rust API such that it can be called directly from the stage 1 codegen, and thus we spit out the Rust, C++ and .h files all at once?
I am open to this. We provide something almost like this already, in the form of https://docs.rs/cxx-build/0.3.4/cxx_build/ (see https://github.com/dtolnay/cxx/tree/0.3.4#cargo-based-setup). It's a Rust entrypoint for our C++ code generator, as opposed to `cxxbridge`, which is the command line entrypoint. But it's pretty specialized toward the Cargo build script use case. I would be open to splitting out another crate where just Rust source code + config goes in and C++ code comes out. Then the existing two entrypoints would be adjusted to depend on it.
Alternatively, it might be reasonable for your step 1 code generator to also do the `cxxbridge` run from step 3 by spawning a process to run `cxxbridge` (by `std::process::Command` or whatever). They don't necessarily need to be distinct steps from the build system's perspective. It depends on the build system though, whether this makes sense. In Buck we wouldn't do this.
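Driving `cxxbridge` from the step 1 generator might be sketched as below. The invocation shape assumed here (`cxxbridge <file> [--header]`, with output on stdout) follows the CLI's documented usage, but treat the details as an assumption; the sketch only builds the command without spawning it.

```rust
use std::ffi::OsStr;
use std::process::Command;

// Hypothetical sketch: the step 1 code generator constructs a cxxbridge
// invocation for the generated ffi.rs. Passing `header: true` asks for the
// generated C++ header instead of the .cc implementation.
fn cxxbridge_command(ffi_rs: &str, header: bool) -> Command {
    let mut cmd = Command::new("cxxbridge");
    cmd.arg(ffi_rs);
    if header {
        cmd.arg("--header");
    }
    cmd
}

fn main() {
    let cmd = cxxbridge_command("out/ffi.rs", true);
    assert_eq!(cmd.get_program(), OsStr::new("cxxbridge"));
    let args: Vec<_> = cmd.get_args().collect();
    assert_eq!(args, [OsStr::new("out/ffi.rs"), OsStr::new("--header")]);
    // The real generator would then run it and capture stdout, e.g.:
    // let output = cmd.output()?; // write output.stdout to the .h file
    println!("ok");
}
```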
I would call out that having a Rust API for the `cxx` C++ code generator would only help if you plan to implement step 1 in Rust. I would say that's not an obvious choice to me, since it's going to involve some elaborate use of libclang or libtooling which might be easier in a different language. If your step 1 code generator is not Rust, likely you'd want one of the previous two approaches: either distinct steps from the build system's perspective, or step 1 running `cxxbridge` as a shell command.
> We want to pull `enum`s, `const`s and maybe even constant structures in from the C++ in the future. Again I don't see anything incompatible here but thought I'd mention.
This sounds fine to me. To the extent that any feature work would be required in `cxx`, it's stuff that we would want to implement even independent of this work.
> I would call out that having a Rust API for the cxx C++ code generator would only help if you plan to implement step 1 in Rust
Another aspect to discuss.
Supposing (as you propose in #228) this higher-level code generator adds support for `std::make_unique` (which is also something which we keenly want), i.e. we want to be able to see C++ code like this:
```cpp
class Foo {
public:
  Foo(std::string& label, uint32_t number);
  ...
};
```
and write Rust code like this:

```rust
let x = ffi::make_unique_Foo(CxxString::new("MyLabel"), 42);
// ideally, we could call this ffi::Foo::make_unique, but that's a detail
```
I believe the higher-level code generator would currently need to generate .cc as well as .rs code.
We'd need some C++ generated like this:
```cpp
std::unique_ptr<Foo> make_unique_Foo(std::string a, uint32_t b) {
  return std::make_unique<Foo>(a, b);
}
```
and then we'd want this to be represented in the `#[cxx::bridge]` like this:
```rust
#[cxx::bridge]
mod ffi {
    extern "C" {
        type Foo;
        fn make_unique_Foo(label: CxxString, number: u32) -> cxx::UniquePtr<Foo>;
    }
}
```
There are three approaches. Which would you prefer?

1. The higher-level code generator emits the extra `.cc` code itself, alongside the `#[cxx::bridge]`.
2. A `passthrough_cc!("bunch-of-c-plus-plus-code-goes-here")` macro which is ignored by Rust, but picked up by `cxxbridge` to pass code directly through to the `.cc` file which it generates.
3. `cxxbridge` supports `make_unique` directly.

Other thoughts happening here as I continue to think this through:

- For structs which already exist in C++, we'd get: C++ definition -> `cxx::bridge` section -> { Rust definition, another C++ definition }. That seems daft and will end up violating the ODR. We'll need some annotation on a `cxx::bridge` section to say "don't generate a new C++ definition here", whilst retaining suitable levels of tests to ensure layout is the same.
- We may want some equivalent of the `cfg!` macro which can instead depend upon `#define`s and `#include`s from C++.
- Callbacks could be `Fn`s passed across via the existing cxx interface, but with a lot of syntactic sugar to make it nearly transparent. I haven't really thought about it beyond that.
error messages to find out which C++ items are expected to be available. This has the advantages of not requiring a macro or annotation at the call site and of making it possible to transparently replace a C++ function with a Rust implementation or vice versa. I believe the disadvantages are fairly obvious and I can't say for sure whether this approach would be robust enough to be practical in a large codebase.
OK, there's an early attempt at such a higher-level code generator here: https://github.com/google/autocxx. It currently depends on a (slight) fork of cxx, and a (gross, appalling, hacky, make-your-eyeballs-bleed) fork of bindgen. Comments most welcome! It's still at the stage where I'm throwing random commits at it, rather than having any kind of PR-based process, but if anyone wants to join in I can certainly grow up a bit.
Update for anyone reading along here.
autocxx now no longer requires a fork of either bindgen or cxx. It is still, in every other way, "toy" code. It has a large number of test cases for individual cases of interop, but I suspect everything breaks horribly when it is asked to deal with a real C++ codebase. I hope to find out in the next couple of weeks.
I'll close out this issue and we can keep the rest of the discussion on this topic in the autocxx repo. Thanks all!
The cxx help says:
It is looking to me like that's exactly the model we may need in order to get permission to use Rust within the large codebase in which I work.
Could we use this issue to speculate about how this might work? It's not something I'm about to start hacking together, but it's conceivable that we could dedicate significant effort in the medium term, if we can become convinced of a solid plan.
Here are my early thoughts.
Note that C++ remains the undisputed ruler in our codebase, so all of this is based around avoiding any C++ changes whatsoever. That might not be practical.
Bits of cxx to keep unchanged:

- The `unsafe` keyword is only used if you're doing something that actually is a bit iffy with C++ object lifetimes.
- The `cxxbridge` tool.

Bits where we'll need to do something different:
- No `cxx::bridge` section. Instead, declarations are read from C++ headers. A simple `include_cpp!` macro instead. ("Simple" in the user's sense, very much not in the implementation sense, where we'd need to replicate much of `bindgen`.)
- An `include_cpp!` macro may pull in hundreds of header files with thousands of functions. The `cxxbridge` C++-generation tool would have to work incredibly hard for every `.rs` file if it generated wrappers for each one. So, instead, the call sites into C++ from Rust would probably need to be recognizable by that tool such that it could generate the minimum number of shims. (Procedural macro in statement position? `cpp_call!`?)
- We'd want to avoid duplicating `cxxbridge` effort if hundreds of .rs files use the same `include_cpp!` macro(s). This might be purely a build-system thing; I haven't thought it through.
- If the C++ declares a `struct` which can be represented in `cxx::bridge` (e.g. because it contains only `unique_ptr`s and ints) then said struct is fully defined such that it's accessible to both Rust and C++. If a struct is not representable in `cxx` then it is declared in the virtual `cxx::bridge` as an opaque `type` (and thus can only be referenced from `unique_ptr`s etc.)
- We'll need some way to feed C++ `#define`s into `rustc`. I can't think of a better way to do this than to rely on the `env!` macro. Ugh.
- The expectation today seems to be that a codebase uses `cxx` in localized places where it wraps up C++ function calls with an idiomatic Rust wrapper. This proposal involves instead allowing production Rust code to freely call C++ classes and functions. There may be some major leaps here to make this practical, which I might not have thought of (as I say, these are early thoughts...)

Next steps our side:
- If our codebase's idioms turn out to be fundamentally incompatible with `cxx`, we're in trouble. My hope is that this isn't the case, but work is required to figure this out.

Meanwhile though I wanted to raise this to find out if this is the sort of thing you were thinking, when you wrote that comment in the docs!