Support for packaging dual header/module C++ code

Moving here a Slack discussion started by @kamrann:

> Wondering recently about C++ modules and build2 packaging. Two things in particular:
>
> 1. Regarding import std, given patchy compiler support it really needs to be handled independently of enabling/disabling modules. It can be done with a package config var, but for any given build configuration there surely wouldn't be a need to vary it: if the build configuration can support import std then it would make sense for it to be enabled across the board. So I'm wondering if there is an argument for build2 exposing something like config.cxx.features.import_std? As I see it, there may not be a need for build2 to actually touch this internally (unless perhaps to default it based on some auto-detection of compiler capabilities). More important is just to have a single accepted way to toggle this which can be accessed by package buildfiles in a consistent way (in order to pass poptions to upstream libs), and can be set by a user with a single configuration-wide command rather than messing around setting a package-specific var for a bunch of packages.
>
> 2. For packaging third party libraries that don't internally provide any modules support, what would be the best way to add an optional module wrapper around them? Would it be acceptable to embed it into the package with a config var to enable it, or is adding functionality like that considered too intrusive and it should be provided as an independent libfoo-modules package?

I think we need to step back a bit and consider what kind of variability in this area is practical/sensible and what we want to discourage. And then based on that understanding try to define build2 mechanisms to support this variability. In particular, I really don't want to add any mechanisms that will give C++ users even more rope to hang themselves (meaning worsen the "variability mess" which is modern C++ builds), especially if this also saddles build2 with extra complexity and maintenance burden.

What are the plausible approaches when switching a project from headers to modules? I think it makes sense to enumerate all the likely choices since these approaches will have to co-exist (i.e., different projects will make different choices but may end up in the same build). I can think of the following options:

1. Replace headers with modules (for example, in the next major version of the project).

   With this approach there is no attempt to make the headers and modules versions co-exist in the same build: everyone uses either the modules version or the headers version.

2. Create a new project (for example, libhello2) which uses modules while maintaining (or even actively developing) the original header-based version for some time.

   If the new project uses a new namespace (for example, hello2), then the two versions may even co-exist in the same build. Though allowing the two interfaces to inter-operate will most likely require extra effort (think vocabulary types).

3. Provide the dual headers/modules interface by providing independent headers and modules wrappers over the shared implementation (which will necessarily stay headers-based). Think of the pimpl idiom but applied to modules rather than classes.

   It feels like there should be no difficulty supporting the dual interface simultaneously from the same build. Though whether the two interfaces can inter-operate is questionable (essentially the same problem as in option (2) above).

4. Provide the dual headers/modules interface by somehow sharing most of the interface source code between headers and modules.

   Whether this approach can support the dual interface simultaneously from the same build depends on how exactly things are arranged (see below).

I think the first three options are pretty clear. So let's see what the practical/sensible ways to achieve (4) are.

In the early modules days we tried to support both headers and modules from a shared set of source files in a relatively small library (libbutl). It didn't go well, to put it mildly. The resulting header/module interfaces got really hairy due to all the macros and ifdef's.

One thing I found particularly dizzying (literally) is keeping straight all the imports/includes in the module interface and implementation units. Remember that when you do, for example, import std; in the module interface in the module's purview, all the imported names are automatically made visible in the module implementation units without an explicit import std;. But that's not the case with headers and you will need to pause and think where you need to include each header. If you are interested in seeing what it used to look like, here is the commit that ripped all this dual support out: https://github.com/build2/libbutl/commit/df1ef68cd8e85

Now, I am sure people will keep trying this approach (here is Boost exploring this idea) and it may even work for small projects. However, I think it's a dead end, generally, both technically and conceptually: modules were meant to make source code organization cleaner, not to turn it into an incomprehensible macro mess. So I don't think we need to go out of our way supporting this approach in build2. If someone wants to go down this rabbit hole, they should be able to cobble something together (as we did for our experiment in libbutl).

The only practical/sensible approach that I am aware of for implementing option (4) seems to be exporting names as attached to the global module fragment, which is how the standard library modules are done in both Clang/libc++ and MSVC/STL (GCC/libstdc++ is considering re-exporting standard library headers compiled as header units, though I doubt it will be the final choice). For details and additional nuances see this post on the Boost mailing list (the whole thread is recommended reading).

Specifically, there appear to be two variants of this approach:

1. Include the header into the module interface and then export the interface explicitly (this is how the standard libraries are done):
```
module;

#include <libhello/hello.hxx>

export module hello;

export namespace hello
{
  using hello::say_hello;
}
```
With this approach supporting the dual interface simultaneously from the same build comes pretty much automatically (there is no module interface without first having a header).

2. The alternative is to include the header in the module purview and wrap the header into extern "C++":
```
export module hello;

extern "C++"
{
#include <libhello/hello.hxx>
}
```
And inside hello.hxx we will need to do something like this:
```
#ifdef __cpp_modules
export
#endif
namespace hello
{
  ...
}
```
I am not aware of any substantial codebases that use this approach in practice. While it definitely feels less tedious compared to explicit export, I am not sure whether there are any gotchas (there most likely are). In particular, it seems one will have to export all the inter-included headers at once and from the same module. Also, it's not clear whether an interface compiled like this is compatible with the implementation unit compiled with a header (or vice versa).
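A minimal sketch of a variation on the header above, assuming a project-defined macro is used in place of __cpp_modules. A compiler feature-test macro like that may also be defined in translation units that reach the header via a plain #include, where export would not be valid. HELLO_EXPORT and greeting() are hypothetical names made up for illustration, not part of any existing library:

```cpp
// libhello/hello.hxx (sketch; HELLO_EXPORT and greeting() are hypothetical)
//
// The module interface unit would be expected to do something like:
//
//   export module hello;
//   #define HELLO_EXPORT export
//   extern "C++"
//   {
//   #include <libhello/hello.hxx>
//   }
//
// A plain #include leaves HELLO_EXPORT undefined, so it expands to nothing.

#ifndef LIBHELLO_HELLO_HXX
#define LIBHELLO_HELLO_HXX

#ifndef HELLO_EXPORT
#  define HELLO_EXPORT
#endif

HELLO_EXPORT namespace hello
{
  // Kept free of standard library includes so that nothing else ends up
  // inside the extern "C++" block in the module purview.
  constexpr const char*
  greeting () noexcept
  {
    return "Hello, World!";
  }
}

#endif // LIBHELLO_HELLO_HXX
```

Consumed via #include, this compiles as ordinary C++. Whether the same header also behaves well when pulled into the module purview is exactly the open question raised above.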

To sum up, the first approach for option (4) is tedious but is proven to work well and we can simultaneously support both headers and modules from the same build. The second approach looks less tedious (at the expense of some macro hackery) but is likely to have gotchas and it's unclear whether it can support both headers and modules simultaneously. Note also that with both approaches, at its core, the project stays headers-based. You will not be using any advanced modules features like partitions to organize your code.

Regarding using the standard library as modules vs headers, this feels largely orthogonal to the modules enablement issue discussed above. However, a couple of notes:

It's possible that a project may wish to import std but itself continue to use/provide headers.
This desire to be able to choose either modules or headers may extend to libraries other than the standard library, if such a library also provides the dual interface. At the extreme, one may wish to decide this on a library by library basis.

One immediate difficulty that I see with supporting both standard library modules and headers from the same codebase is keeping the correct set of #include directives. Though it's probably just an inconvenience (one can either resolve to use headers during development or rely on CI to catch any missing directives).
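The include-set issue can be sketched as follows, assuming a hypothetical LIBHELLO_IMPORT_STD macro that the build would pass via poptions when import std is enabled (both the macro and valid_name() are made up for illustration). The #else branch is the part that can silently go stale if development is always done with import std:

```cpp
// hello.hxx (sketch; LIBHELLO_IMPORT_STD and valid_name() are hypothetical)

#ifdef LIBHELLO_IMPORT_STD
import std;                 // Requires a compiler/library with C++23 import std.
#else
#include <string_view>      // Must be kept in sync by hand with what is used.
#endif

namespace hello
{
  // True if the name is non-empty and contains no spaces.
  constexpr bool
  valid_name (std::string_view n) noexcept
  {
    return !n.empty () && n.find (' ') == std::string_view::npos;
  }
}
```

If a later change starts using, say, std::string, the import std branch keeps compiling while the #include branch breaks until the matching header is added, which is the kind of omission CI on a headers-only configuration would catch.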

Thoughts?

> However, I think it's a dead end, generally, both technically and conceptually: modules were meant to make source code organization cleaner, not to turn it into an incomprehensible macro mess. So I don't think we need to go out of our way supporting this approach in build2.

I agree.

> The only practical/sensible approach that I am aware of for implementing option (4) seems to be exporting names as attached to the global module fragment, [...] I am not aware of any substantial codebases that use this approach in practice. [...]

fmt uses that approach in production and is widely used (not as a module, though). It has also provided an option for the fmt module to use import std; since v11.0.0. @kamrann's reflections, if I'm not mistaken, arise among other things from the packaging effort for that library, in addition to the experiments with modules that we discussed in private.

As a data point, if you go to https://arewemodulesyet.org/ and check the first ✅ you will see a top list of modularized libraries. I did a cursory check of the module source file of each of the libraries in that short list and the only ones that use the global module fragment injection approach, specifically alternative 2, are fmt, argparse and async-simple. I might have missed a few others using alternative 1 if it was not immediately obvious to me, but at least these ones are clear. Note that tgui is the one with the weirdest modules setup I've seen so far: one of its modules uses alternative 1 but not the others, or I'm confused by the juggling.

> Regarding using the standard library as modules vs headers, this feels largely orthogonal to the modules enablement issue discussed above. However, a couple of notes: [...]

Indeed. It looks like the more libraries provide the choice, the bigger the explosion of options for an end-user with a deep dependency graph.

If each library had a general way to determine by itself whether or not it can use import std; ("use import std; if you can" enabled by default), that would simplify the default situation, where the end-user doesn't need to specify any option per library. But because of the differences in implementation stability when using modules, at the moment at least, end-user projects might end up having to choose to use only-includes-std on some configurations or only for specific library+configuration combinations. Hence question 1.

A couple of additional sources of information:

There is a Clang document that goes into more detail on various headers-to-modules transition approaches. In particular, it lists the "shared set of source files without global module fragment" approach (i.e., what we tried in libbutl) as a viable option (it is referred to as "ABI breaking style"). I still think it's not going to be tractable except maybe in a few special cases (just look at all the macro soup in the examples).
There is a new post in that Boost thread with some additional insight.

Sorry for the delayed input, travelling and generally struggling to stay on top of things lately. Some basic thoughts before I put off replying again:

It's not clear to me what 3 is as distinct from 4?
Regarding the macro mess, I agree: for one of my libraries I took this approach and it does indeed become ugly quickly. I'm still persisting with it for now as it's the only way to both have dual support and use modules during development (4 implies rebuilds on every touch), but I may yet give up.
Agree that 4.1 is preferable to 4.2.
Exploding config options is indeed a problem. Unfortunately, the patchy support and plethora of compiler bugs mean that in practice there is a need to be able to tweak things for specific dependencies in order to have things compile. And I doubt this will change in the short term.

The way I see things right now, modules are just problematic during development if there is still some need for header support (in truth my experiments so far have left me somewhat downbeat on the prospects of modules generally). Given that, I think 4 suits (when 1 isn't viable) as far as modules wrt packaging libraries goes, with the assumption that library development is probably done with modules disabled. It's unfortunate since as noted, this means the code is in no way properly modularized; but it is at least convenient for the downstream consumer. It also fits well for making build2 packages of existing libraries, which is of course the common case.

A couple of other tangential points relating to modules with build2, while I think of it.

I think when attempting to convert projects to modules bit by bit and/or support dual mode, it's probably inevitable (though of course not ideal) that people will end up with occasional cases of header files which, maybe conditionally, import modules. With build2's approach to module resolution not being transitive in the same way include paths are - immediate lib prerequisites only - this leads to needing to add prerequisites on libraries that the code of the target in question doesn't directly reference. For example, C includes header from B which imports module from A. C will need a lib prerequisite on A as well as on B. I think build2's approach here is no doubt the right one in a properly modular world and I'm not suggesting it should be changed, but just wanted to point this out in case it hadn't been encountered.
There are command line length limit issues which I've hit on Windows. I've created a dedicated issue for this.

A further question after hitting some issues with the fmt modularization.

Edit. After writing the below it occurs to me that the problem is perhaps wider than just the symbol export macro. fmt also uses a FMT_MODULE macro to control various module/non-module conditional compilation, and this too would need to be defined when building the BMI for the consumer. I've left the below as is though as I think it sums things up well enough, and also I need lunch!

From what I've read, there are some fairly strict (though varying by implementation) requirements regarding matching compiler options between module and consumer in regards to building the BMI. I believe I'm correct in saying that CMake propagates such options some way or other so as to be able to build a BMI in an imported library with the same compiler options that were used when the module was built as part of the library. I'm wondering if build2 is doing something similar here? From looking at the .pc files from an installation of fmt there doesn't look to be anything special in there, beyond the module mapping.

To give the specific example that's caused me to wonder about this. fmt upstream currently contains the following:

```
#if !defined(FMT_HEADER_ONLY) && defined(_WIN32)
#  if defined(FMT_LIB_EXPORT)
#    define FMT_API __declspec(dllexport)
#  elif defined(FMT_SHARED)
#    define FMT_API __declspec(dllimport)
#  endif
#elif [...]
```

Now I guess something will need to be changed here - I hit linker errors when attempting to build the tests against installed fmt, and was reminded of what I read a while back in the build system manual regarding modules and symbol exports. It's not clear to me though how to deal with this, in particular when trying to support dual mode.

When building the library module, FMT_LIB_EXPORT is defined, giving dllexport, which is correct.
If I understand right, when building the BMI on the consumer side, we want it to again expand to dllexport and the compiler will deal with it (though maybe expanding to nothing would also work).
However if after building the library as a dual mode-supporting module we then try to consume it via headers, we need dllimport (actually only necessary for data I think, but anyway we definitely don't want dllexport).

From what I can see, when build2 builds the BMI on the consumer side, it is passing the same compiler options as it would use for consumer code that depended on the library providing the module (i.e. in this case I'm seeing -DFMT_SHARED, which is exported as poptions by the fmt library target). Unless I'm missing something though (apologies if I have, my head is a bit fried right now from juggling all the different combinations), this isn't quite enough information. We would need to either:

Somehow detect within the source file whether it is currently being built as a module unit (I don't know if this is possible and it doesn't seem good); or
Have a way to specify (in the buildfile of the library) that FMT_LIB_EXPORT should be defined when building the BMI for the consumer.
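The second option could look something like this on the header side, shown for a hypothetical libhello rather than fmt. HELLO_MODULE_BMI is an invented macro standing for whatever the library's buildfile would ask to have defined when the BMI is built on the consumer side, and mapping it to dllexport follows the tentative reading above rather than anything confirmed:

```cpp
// libhello/export.hxx (sketch; all HELLO_* names are hypothetical)

#ifndef LIBHELLO_EXPORT_HXX
#define LIBHELLO_EXPORT_HXX

#if !defined(HELLO_HEADER_ONLY) && defined(_WIN32)
#  if defined(HELLO_LIB_EXPORT) || defined(HELLO_MODULE_BMI)
     // Building the library itself, or (assumption) building the BMI for a
     // consumer that imports the module.
#    define HELLO_API __declspec(dllexport)
#  elif defined(HELLO_SHARED)
     // Consuming the shared library via headers.
#    define HELLO_API __declspec(dllimport)
#  endif
#endif

// Static builds, header-only use, and non-Windows targets in this sketch.
#ifndef HELLO_API
#  define HELLO_API
#endif

namespace hello
{
  // An out-of-line function whose symbol visibility depends on HELLO_API.
  HELLO_API int version () noexcept;
}

#endif // LIBHELLO_EXPORT_HXX
```

The header-consumption case still sees only HELLO_SHARED and so gets dllimport, while the BMI case is distinguished by the extra macro. What is missing is a mechanism for the library to have that macro defined only when its BMI is compiled on the consumer side.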

Thoughts?

build2/build2 issue #413