cps-org / cps

Common Package Specification — A cross-tool mechanism for locating software dependencies
https://cps-org.github.io/cps/
Other
99 stars 8 forks source link

Definitions are keys and values #42

Closed bretbrownjr closed 5 months ago

bretbrownjr commented 7 months ago

Concern

Preprocessor definitions aren't always just "set". They're often set as specific strings. Semantically these strings may then be parsed and interpreted as numbers or booleans by the application, but they're all strings at least at the level of the popular OS ABIs.

Right now, C

Proposal

I'd like to see definitions fields be objects that serve as string to string mappings instead of lists of strings.

I don't really understand the use case for language-specific definitions, but we can make another field named language_definitions if necessary. I'd like to see motivating examples for that though.

For undefines, we can use explicit null values to represent an explicitly undefined preprocessor symbol.

Aside: I would like CPS validation tools to warn about undefines. I realize they're needed, but they're technical debt at best in practice. I would actually be OK making undefines illegal until specific adoption hurdles are demonstrated... they're better set inside source files for maximum correctness and portability. Don't be surprised if I make another issue for that at some point.

Reasoning

While it's possible to parse the string value into values or undefines, it would be better (and friendlier to various tools like jq or basic python scripts) to start out with the semantic mapping instead of requiring every tool implement routines to parse CPS definition values.

Ideally tools that use CPS can rely on existing JSON support and not implement CPS specific parsing logic as well, however trivial.

Background

Relevant Specification Section

For convenience, I'll copy the current wording here:

Type: list of string or map of string to list of string

Applies To: component, configuration

Required: No

Specifies a list of compile definitions that must be defined when compiling code that consumes the component. Definitions should be in the form "FOO" or "FOO=BAR". Additionally, a definition in the form "!FOO" indicates that the specified symbol (FOO, in this example) shall be explicitly undefined (e.g. -UFOO passed to the compiler).

A map may be used instead to give different values depending on the language of the consuming source file. In this case, the build tool shall select the list from the map whose key matches the (case-insensitive) language of the source file being compiled. Recognized languages shall include "C", "C++", and "Fortran".

Existing Practice

In my mind, environment variable APIs are very similar, so I'll list some popular ways they're handled. I'm happy to add more existing practice here on request.

Python

Python uses a key-value object called os.environ.

Node.js

Node.js uses a Javascript object called process.env

C and C++

C and C++ have getenv that returns string value for a string key, which is pretty bad, but it does parse = for you.

If you want to inspect a lot of environment variables, you have no standard mechanisms. Windows and POSIX both support magic environ extern symbols that are lists of KEY=VALUE strings, which is all sorts of bug prone and hard to work with.

GitHub Actions

GitHub Actions use env key-value objects for environment variables.

mwoehlke commented 7 months ago

I'd like to see definitions fields be objects that serve as string to string mappings instead of lists of strings.

On the one hand, I see the appeal. On the other, in practice they're always going to get turned into -DFOO or -DFOO=BAR anyway, possibly even at the tool level (e.g. CMake). The point is that in practice these are just strings that get passed to the compiler. It is up to the compiler how they are interpreted.

For undefines, we can use explicit null values to represent an explicitly undefined preprocessor symbol.

That's... problematic. The "obvious" meaning of a null value is -DFOO (as opposed to -DFOO=)... and we need that distinction; they have different behavior. Something else is needed to represent explicit undefines. I picked !FOO because ! is not a permitted character in a PP definition in at least C/C++. Is there a language in which it is permitted?

I don't really understand the use case for language-specific definitions

Mixed-language headers (i.e. usable in either C or C++) might need different definitions when used in C vs. C++. A more explicit case would be if a library needs to know what language its consumer is using. Unfortunately, I don't have concrete examples to offer, but both @bradking and I feel like it's obviously TRTTD.

This probably calls for feedback from other build-system maintainers.

Ideally tools that use CPS can rely on existing JSON support and not implement CPS specific parsing logic as well

Does that opinion also apply to #6? :wink:

bruxisma commented 7 months ago

C and C++ have getenv that returns string value for a string key, which is pretty bad, but it does parse = for you.

If you want to inspect a lot of environment variables, you have no standard mechanisms. Windows and POSIX both support magic environ extern symbols that are lists of KEY=VALUE strings, which is all sorts of bug prone and hard to work with.

Just a brief reminder that I'd proposed something that would give us a more python-like interface back in 2018 with P1275. It was postponed in favor of papers from SG16 which have yet to materialize, I'm afraid.

dcbaker commented 7 months ago

Mixed-language headers (i.e. usable in either C or C++) might need different definitions when used in C vs. C++. A more explicit case would be if a library needs to know what language its consumer is using. Unfortunately, I don't have concrete examples to offer, but both @bradking and I feel like it's obviously TRTTD.

I agree that having the per-language support for defines is the right thing to do. It's not a large burden to implement since we already need per-language mappings for other things (non-define compile args, link arguments).

Ideally tools that use CPS can rely on existing JSON support and not implement CPS specific parsing logic as well, however trivial.

While I agree with this to a point, I obviously feel strongly that #6 is the right solution, so I want to agree with this with room for nuance.

In this case the code for parsing these (which may not be as robust as it should be), is 15 LOC in cps-config, that doesn't seem like a high burden.

That's... problematic. The "obvious" meaning of a null value is -DFOO (as opposed to -DFOO=)... and we need that distinction; they have different behavior. Something else is needed to represent explicit undefines. I picked !FOO because ! is not a permitted character in a PP definition in at least C/C++. Is there a language in which it is permitted?

This was my first concern as well. Defines aren't just strings, they can also be numbers, symbols, or even other defines. And, defining -DFOO="" isn't the same as -DFOO (I have the scars to).

I think if we were to move to such a mapping it would have to be {<lang>: {string : variant<int, string, bool>}}, with something like:

{
  "c": {
    "FOO": false,  # -UFOO
    "BAR": true, # -DBAR
    "ONE": 2,  # -DONE=1
    "ZERO": 0, #-DZERO=0
    "STR": "\"VALUE\"", # -DSTR="VALUE"
   "ID": "STR", -DID=STR
  }
}
dcbaker commented 7 months ago

Just a brief reminder that I'd proposed something that would give us a more python-like interface back in 2018 with P1275. It was postponed in favor of papers from SG16 which have yet to materialize, I'm afraid.

As someone who does a significant amount of Rust and Python work, I really like that proposal and I'm sad neither it, nor something similar, even materialized

bretbrownjr commented 7 months ago

As someone who does a significant amount of Rust and Python work, I really like that proposal and I'm sad neither it, nor something similar, even materialized

FWIW, I would consider proposing fixes for this to be in scope for the C++ Ecosystem Evolution group, as long as we have critical mass and enough interest to complete the project. It might be enough to have a sufficiently popular library library solution the way fmt is.

Part of the reason C++ Ecosystem Evolution not called "The CPS Meeting" or something is because ecosystems are interconnected things so dividing up around project lines or standards bodies has downsides (see also: ISO or even boost not clearly being the way out of the environment variable mess).

mwoehlke commented 7 months ago

Defines aren't just strings

...except they are (as was correctly noted in the original description, even). Or perhaps it's more correct to call them "token soup", which doesn't have semantic meaning until after (or during; it can get complicated) substitution. Note that you can do "fun" things like -DFOO=". Because of that, value types other than string for values (if we were to split them) are somewhat pointless; there is no difference between "foo": 1 and "foo": "1". Treating Boolean values specially seems like a recipe for users wondering why the preceding are the same, but "foo": true and "foo": "true" are different.

Show me a language where a) "compile definitions" are passed as two command line arguments, and b) names can contain =. If you can do that, then we have a reason (though still not a strict need) to express names and values as semantically separate entities. If not, I submit that the value proposition here is questionable, especially when at least one of the build tools that's going to be using this already models compile definitions as a string list.

I'm not saying the separation doesn't have semantic value, I'm saying it doesn't match how existing (C and C++, at least) compilers actually treat compile definitions. -D takes a single argument, not two.

bretbrownjr commented 7 months ago

In this case the code for parsing these (which may not be as robust as it should be), is 15 LOC in cps-config, that doesn't seem like a high burden.

Sure, and it's not a high burden to run a combination of strlen and find on a FOO=BAR string retrieved from environ, but why make it harder for no real benefit?

...in practice [preprocessor definitions are] always going to get turned into -DFOO or -DFOO=BAR anyway

...if you're only thinking about build systems. There are also systems that would generate CPS that would benefit from simpler json.dump workflows. And there are systems that validate CPS (like python or bash+jq tools) that would benefit from being able to, for instance, quickly find all the CPS files that manipulate a given preprocessor definition.

Is it the end of the world if there's some parsing? No. But there aren't any users yet, so why not model key-value pairs with a key-value data structure?

As to specific spec changes, I like the schema specifics Dylan proposed in his previous comment, though I'd suggest defining null to also be an undefine as, in my mind, that's sort of affirmatively not defining something. I might even consider "FOOBAR": false to be ambiguous between an undefined variable and a variable defined to be false. I could live with it false meaning undefined if it were documented clearly with examples, I suppose. Especially since I'll probably want a tool that warns about all undefinitions anyway. I'd actually be OK leaving them out until someone complains or points out a significant adoption hurdle that requires it. They're pretty bug-prone in my experience.

Does that opinion also apply to https://github.com/cps-org/cps/issues/6? 😉

lol; I don't know that my opinion matters a lot, but it seems like CPS isn't consistent about strings versus structured fields, but I'm an engineer, not an academic, so I'm willing to put up with a lot as long as it gets the job done.

dcbaker commented 7 months ago

Or perhaps it's more correct to call them "token soup"

Yeah, that's a good way to say it.

I'm fine with the current solution. I'm fine with something like what I proposed. Maybe we drop the variant to just variant<str, bool> if we go that way. I don't have a strong opinion on this either way though.

bretbrownjr commented 7 months ago

Treating Boolean values specially seems like a recipe for users wondering why the preceding are the same, but "foo": true and "foo": "true" are different.

I'm OK with just making all preprocessor definition values nullable strings, for the record. As far as I know, most of the environment variable APIs just return strings, so there's a lot of analogous precedent.

In that model, every definition is a string, even "\"" or some escaped unicode mess, I guess, though hopefully someone would contribute some validation that warns about things that create unbalanced parens and quotes and such. It's much more likely that's a bug if you see it in a CPS file. "0" and "1" would mean whatever that means after the preprocessor runs. "" could mean either defined to empty string or just -DWHATEVER with no = (I'm not sure there's a practical distinction). A null would mean undefined, but like I said, I think that should be a warning in a validation context, and I'd be fine leaving that feature out initially. We can always add it in later without breaking anything.

Regarding parsing not being that much to ask for, it's even easier to flip it around the other way and point out that joining keys and values is even more trivial. It's even easier for build systems to join on = or use a format string like -D${key}=${value}. Even a jq script can do that trivially.

Finally, please note that this is issue #42. I believe that calls for some deferential consideration for this issue.

mwoehlke commented 7 months ago

"" could mean either defined to empty string or just -DWHATEVER with no = (I'm not sure there's a practical distinction).

At least GCC treats -Dfoo as equivalent to -Dfoo=1 / #define foo 1. Whereas -Dfoo= is equivalent to #define foo.

$ echo | gcc -DFOO1 -DFOO2= -dM -E - | grep FOO
#define FOO1 1
#define FOO2
mwoehlke commented 7 months ago

So, it turns out at least CMake doesn't have "nice" support for explicit undefines. You'd need to use compiler flags. And CPS can do undefines via compiler flags. So, it seems not-unreasonable that we could drop support for explicit undefines.

That being the case, WDYT about this:

:attribute:`definitions`
------------------------

:Type: |map| to |map| to |string|
:Applies To: |component|, |configuration|
:Required: No

Specifies a collection of compile definitions that must be defined
when compiling code that consumes the component.
Each key in the inner map(s) is the name of a compile definition,
such that e.g. ``-Dkey=value`` is passed to the compiler.
A value may be |null|, indicating a definition with no value
(e.g. ``-Dkey`` is passed to the compiler).
Note that an *empty* string indicated ``-Dkey=``,
which may have a different effect than ``-Dkey``.

The outer map is used to describe
language-specific definitions.
The build tool shall include
only those definitions
whose language matches (case-sensitive)
that of the (lower case) language
of the source file being compiled.
Recognized languages shall include
:string:`"c"`,
:string:`"cpp"`, and
:string:`"fortran"`.
Additionally, the value :string:`"*"` indicates
that the corresponding definitions apply to all languages.

If a definition name is repeated
in both :string:`"*"` and a specific language,
the latter, when applicable to the source being compiled,
shall have precedence.
dcbaker commented 7 months ago

I'd have a preference to keep an undefine syntax and let the implementation convert that to compile flags at the level that's appropriate. Meson++ does (and I hope to eventually teach Meson) about undefined natively, and having them in an unconverted form would allow easier command line de-duplication in the event that you have something defined and undefined at the same time.

Having the syntax also lifts the requirement that the build system or cps-config needs to know about different formats for un-defining. For example, rust is working on an "env" command line option, that turns out to look really similar to C defines. it works like:

rustc --set-env="PATH=some:path:s" --unset-env="DONT_SET_THIS"

And then you can use it inside rust with:

let path: &'static str = env!("PATH");  // compile time evaluated to "some:path:s";
let unset: std::option<&'static str> = option_env!("DONT_SET_THIS");  // will but a null option

chances are fairly high that gcc-rs will use different flags because they try to match gcc rather than rustc (this is already true for some other flags). It would be preferable to not have to teach a build system or cps-config to parse both formats in, but only how to convert "!DONT_SET_THIS" into the right form for various kinds of compilers

mwoehlke commented 7 months ago

I also don't object to retaining !foo as syntax for "foo is undefined". Note that I don't think we otherwise need to change anything in my proposal, as e.g. -Ufoo=bar is perfectly valid (if silly) today.

Part of the motivation for dropping it was an impression that some possibly-desired tool features would be harder to implement if the definition key is not exactly the definition name (which, as I understand it, is much of the motivation to split things into key, value in the first place).

(BTW, I should note that the proposed change also implies changes to compile_flags and includes by way of "*" being added as a "language", but TBH I think the addition of "*" is TRTTD regardless of other changes we may or may not make. I went ahead and did it with the rest of the rewriting, but it could be separated.)

dcbaker commented 7 months ago

If we add "*", can we drop the array[string] | maping[language_key : str] and just have mapping[language_key : string] for the places that is valid? I could propose that as a followup change if we decide to add "*" if you'd prefer to discuss that separately

mwoehlke commented 7 months ago

If we add "*", can we drop the array[string] | maping[language_key : str] and just have mapping[language_key : string] for the places that is valid?

Can we? Certainly. Should we?

Right now we have simple syntax with limited expressiveness or more complex syntax. We don't force you to use the more complex syntax if you don't needs its additional expressiveness. Do we want to buy parsing simplicity at the cost of "gratuitous" syntax?

(TBH, if there was a straight-forward way to allow the no-language-specific defines to have only one level of mapping, I would have made that an option in the proposal. In the other cases, the JSON value type differs, which makes it trivial to tell if you have a language-map or not. Changing definitions from an array to a map means it is no longer simple to tell if you have a language map of definition maps, or just a definition map.)