Closed markus2330 closed 4 years ago
See also #2698 for discussion which character to be used for array (globbing)
First of all, if you list \
, you should also put /
on the list. The slash is clearly the prototypical reserved character (character with special meaning) in Elektra.
Second, "used for globbing" and "used for referring to parentKey, e.g. in conditionals plugin" are not valid use cases here. These are not special meanings during unescaping. If you want to prevent a special interpretation that happens later (e.g. by a plugin or library), you either have to know that this interpretation is based on the escaped key name (which would be bad, as outlined in https://github.com/ElektraInitiative/libelektra/issues/2698#issuecomment-554636250), or you have to ensure that the unescaped name cannot be specially interpreted.
Let's take #
as an example. Our globbing library interprets the key name /a/#
as meaning "match any array element in the array /a
". If we don't want that, but want to match /a/#
literally, we need to use some kind of escaping. However, /a/\#
cannot work. What does the backslash achieve during unescaping of the key name? To prevent the globbing library's special interpretation, it would have to stay there during unescaping. But that violates the rule (°) that a backslash must be escaped, if it is not an escape character itself (it is not an escape character, if doesn't disappear during unescaping). So we need to use /a/\\#
, which will be unescaped into ⊙a⊙\#
(⊙ in place of NUL, namespace and terminator omitted). But in that case #
has no special meaning for key name unescaping and doesn't have to be considered at all.
In other words, to give a reason why a character has to escaped (why it is reserved), you have to think about "What behaviour of elektraKeyNameCanonicalize
or elektraKeyNameUnescape
would escaping prevent?".
For the characters that we actually want to make "reserved", there is a decision we have to make here. Are these characters with special meaning or actual reserved characters?
In my mind, characters with special meaning are normal characters that can be used as such, but may have a special interpretation in a certain context (e.g. %
if it is the whole key part).
Reserved characters on the other hand are reserved for special contexts. They cannot be used normally and are only allowed in the special contexts, in which they have a special meaning.
A character with special meaning only has to be escaped in the context, in which it has its special interpretation, if this special interpretation is not desired. Outside of this context, escaping may be allowed or disallowed, depending on our stance on unnecessary escaping (°°).
A reserved character on the other hand must be escaped whenever it shall have no special meaning. This means, in the context where it has special meaning it can be escaped, to avoid the special meaning. Outside this context, it must be escaped.
As an example, let's look at %
. If a whole key part is just %
(e.g. /%/
), this part is unescaped into an empty part. In both cases, the backslash in /\%/
will prevent the special interpretation and it will be unescaped into a part that is %
. The difference becomes noticeable, however, if look at different contexts, like /a%b/
.
If we treat %
as a reserved character, then /a%b/
is an illegal key name/part and we should throw an error. The correct way to achieve the unescaped part a%b
with a reserved character would be using /a\%b/
.
If we see %
as a character with special meaning, then /a%b/
would be correct and unescape into a%b
. Again, whether /a\%b/
is also allowed, depends on our stance on unnecessary escaping. If it is allowed, it also unescapes into a%b
.
Note: For /
there is no decision, both interpretations are equivalent, since there is no context, in which /
has no special meaning. For \
we already decided (in #3115) to treat it as a reserved character and therefore require escaping, when it is not used as an escape character itself.
The decision above of course also requires us to define, in which contexts can these characters, now or in future, have special meanings.
There are two easy solutions here:
Treat everything (except /
and \
) as characters with special meaning, allow unecessary escaping and for the unused ones define that the have a special interpretation in all contexts. The special interpretation for these would be that they make a key name illegal. Therefore the unused ones always have to be escaped. Any other of the characters list can always be escaped, but only must be, if we want to prevent a specific special interpretation.
(°) As stated in #3115, this rule makes the whole system a lot easier to understand. In an escaped (valid) key name, a backslash always escapes something. In an unescaped (valid) key name, a backslash is always literally a backslash.
(°°) If we adopt a very strict stance on unnecessary escaping, it should be possible to get a 1:1 relation between escaped and unescaped key names. This of course requires more effort in validating key names and therefore will have a performance impact (whether this impact is in any way significant must be tested).
First of all, if you list \, you should also put / on the list. The slash is clearly the prototypical reserved character (character with special meaning) in Elektra.
Thank you, I added it. If you find further characters, you can simply add them to the top-post.
Second, "used for globbing" and "used for referring to parentKey, e.g. in conditionals plugin" are not valid use cases here. These are not special meanings during unescaping.
Yes, of course we can also use another escape character or two \ for some of these characters. But this will be more complicated for users?
What does the backslash achieve during unescaping of the key name?
I had in mind that keyAddBaseName(key, "") will actually add \_
. So to get a `` (with the meaning of the globbing character), you would need to use keyAddName.
E.g. @sanssecours currently actually uses _
inside key names for the directoryvalue plugin. Of course there is no way to escape this, as such escaping engines are non-trivial to write. I had the hope we could have a single place to implement such escaping.
For the characters that we actually want to make "reserved", there is a decision we have to make here. Are these characters with special meaning or actual reserved characters?
If they do not have a special meaning now, they are reserved so that we can give them a special meaning later. We cannot "reserve" or "give a special meaning" at a later point of time, as then applications might already use such characters.
To make it clear: In the top post the "special meaning" and "reserved" are synonyms for the escaping engine. And I would keep them as synonyms to make escaping easier.
Reserved characters on the other hand are reserved for special contexts. They cannot be used normally and are only allowed in the special contexts, in which they have a special meaning.
To make this whole escaping engine sane, I would simply assume that the special characters are always special, except when they are escaped.
In particular, I would like to avoid situations where characters get special meaning if an escape character was added. This is an ugly hack and makes e.g. regex unnecessarily complicated.
If we see % as a character with special meaning, then /a%b/ would be correct and unescape into a%b. Again, whether /a\%b/ is also allowed, depends on our stance on unnecessary escaping. If it is allowed, it also unescapes into a%b.
Exactly. The engine cannot fully prevent from creating names with wrong meanings. So I think we should not try to do it. E.g. % in the placeholders need to have an even number of occurrences but there are exceptions. And maybe some extension of contextual values will add further exceptions.
If we adopt a very strict stance on unnecessary escaping, it should be possible to get a 1:1 relation between escaped and unescaped key names. This of course requires more effort in validating key names and therefore will have a performance impact (whether this impact is in any way significant must be tested).
I think our most important goal should be that the escaping is easy to understand and use. The previous escaping tried hard that every name is valid. Such goals mostly cause headache but actually do not improve the usability. If someone places \ at arbitrary places, it is okay to reject some names. But in front of special characters, \ should be always valid and take away any (future) special meaning.
I had in mind that keyAddBaseName(key, "") will actually add
\_
. So to get a `` (with the meaning of the globbing character), you would need to use keyAddName.
This is super confusing to me...
The semantics of keyAddBaseName (Key * key, const char * x)
are:
x
is an unescaped part. It will be added directly to the end of key->ukey
. Its escaped form will be added to the end of key->key
.
The semantics of keyAddName (Key * key, const char * x)
are:
x
is an escaped name suffix (i.e. zero or more parts, no namespace). It will canonicalized and added to the end of key->key
. The unescaped form will be added to the end of key->ukey
.
Now, you want keyAddBaseName (key, "_")
to add the part \_
to the escaped name and, the part _
to the unescaped name. And that should not have any impact on globbing. But keyAddName (key, "_")
should have meaning for globbing, so what does it do? _
can't be unescaped in any way, it's just a single character. So we have to add _
to both escaped and unescaped name. But then both versions produce the same unescaped name, which means the should have the same meaning for globbing...
currently actually uses
_
inside key names for the directoryvalue plugin.
AFAIK directoryvalue
only uses the key part /__directoryvalue/
. This in no way collides with globbing's use of /_/
.
We cannot "reserve" or "give a special meaning" at a later point of time, as then applications might already use such characters.
It would just be a big breaking change, like the one we are doing to key names now.
I think we need to distinguish two things:
keySetName
(or more specifically elektraKeyNameCanonicalize
and elektraKeyNameUnescape
).The first group will be some form of reserved character, improper use of which will result in a failed keySetName
(and keyNew
will return NULL
). This is where escaping with a backslash helps, as the backslash tells keySetName
to take the following character literally and ignore any meaning it might have.
The second group is entirely different. There is nothing we can do during keySetName
(or keyAddName
, keyAddBaseName
, etc.) to change whether a character of this group has special meaning to some plugin or not. It is entirely up to the plugin to decide that and any form of escaping the plugin might allow to prevent the special meaning must make it to the plugin. Keep in mind, that we said keySetName
fails and keyNew
returns NULL for invalid names. In keySetName
we cannot for example say /#
has special meaning, /#a
is not allowed, but /\#
has no special meaning an /\#a
is allowed. We have no idea, how this is going to be used. If we disallow some keys, no plugin can use them. (Note: in the paragraph above: plugin = plugin, library, application, etc.)
If you really want to unify when something can have special meaning to a plugin/library, I can only see something like this working:
Define some escaping rules for %
, .
, /
and \
, as the affect keySetName
. Do the array index stuff from #2698. Key name parts (escaped and unescaped) starting with @
must be of the form (regex) @[^@]+@.+
. The character @
may not appear in any other context. Everything else has no special meaning during keySetName
and may be used unrestricted. Plugins, libraries, etc. MAY NOT attribute special meaning to any individual character or to a key name part except, to unescaped key name parts that start with @
. An unescaped key name part of the form (regex) @[^@]+@.+
can only have special meaning to the plugin named by the [^@]
part of the regex. What that meaning is, is up to the plugin and determined by the part after the second @
.
I recognize that this is totally different from the current approach and entirely unreasonable to implement, if we want to release 1.0 any time soon, but it is the only working approach I can see, that allows us to definitively say whether a key part has any special meaning or not.
I think we need to distinguish two things:
My hope was that we can discard this distinction to make the whole escaping easier to understand. Then keyAddBaseName
never adds something that has a meaning to Elektra.
If this is not possible, the minimum requirement is that it is possible to pass any string to keyAddBaseName
and get back the same string when asking for the base name with keyBaseName
, see #2698.
Yes, it would break some occurrences of keyAddBaseName
where array names were added. But it will break some cases anyway, as now not every array name is valid, so we need to break some code either:
It is very clear to me that we should not break 1., i.e. code that assumed every string can be passed to keyAddBaseName
. "2." is not so important, these are a few occurrences that can be fixed.
Another example next to SSID: fstab entries can have basically everything including /. Or the mountpoint names within Elektra.
Then keyAddBaseName never adds something that has a meaning to Elektra.
I see now that this once again comes done the major problem at the heart of Elektra. Everything should be coherent and well defined system, but at the same time everything should be flexible and plugins should be used as much as possible. This just can't work.
The problem specifically is the phrase "meaning to Elektra". We can ensure keyAddBaseName
never adds something that has meaning to elektra-core
, or in more clear terms, that keyAddBaseName
adds its argument as a new unescaped key name part. But to an application "Elektra" also means any plugin of Elektra. In that interpretation it becomes very hard to prevent keyAddBaseName
from adding something "with meaning". Any plugin could decide to attribute meaning to any unescaped key name at any point. Like I said, the only solution is to introduce a very restrictive system, like the one with @
proposed above. If you come up with something different, let me know, but for everything I came up with until know, I found a problem.
It is very clear to me that we should not break 1., i.e. code that assumed every string can be passed to keyAddBaseName
Why? It is all but clear to me, why this is preferable? UNIX filenames for example are not allowed to contain /
and nobody ever thought that was a problem. (Not allowing /
makes splitting UNIX paths into parts a lot simpler than with key names).
We are already breaking (almost) EVERY piece of code interfacing with Elektra by adding the :
after the namespace. At the current state of affairs, only cascading keys without any array parts and also excluding some other special character things will continue to work without requiring change. Backwards compatibility is long out the window.
"2." is not so important, these are a few occurrences that can be fixed.
I would not be so sure about that. In fact I know of only one part of Elektra that truely assumes that keyAddBaseName
escapes everything and that is elektra-kdb
and the tools that create mountpoints.
fstab entries can have basically everything including /
Not sure why that is relevant?
Or the mountpoint names within Elektra.
To create the mountpoint user:/my/mount
the tools put the mountpoint configuration under system:/elektra/mountpoints/user:\/my\/mount
, but the could just as well use any other unique key directly below system:/elektra/mountpoints
, because the actual mountpoint is the value of e.g. system:/elektra/mountpoints/user:\/my\/mount/mountpoint
. AFAICT system:/elektra/mountpoints
might as well be an array. I have not found any place that calls keyBaseName
on a key directly below system:/elektra/mountpoints
.
There might be some place that uses mountGetMountpoint
or mountGetBackend
to get the mountpoint and reconstruct system:/elektra/mountpoints/user:\/my\/mount/
to access some of the config, but that could easily be solved by adding the system:/elektra/mountpoints
key below which the config is stored as a field to struct _Backend
.
In short:
the minimum requirement is that it is possible to pass any string to
keyAddBaseName
and get back the same string when asking for the base name withkeyBaseName
This is easy. But combining it with "keyAddBaseName
shall accept all strings" is very hard or almost impossible.
I see now that this once again comes done the major problem at the heart of Elektra. Everything should be coherent and well defined system, but at the same time everything should be flexible and plugins should be used as much as possible. This just can't work.
Yes, but I would not call it a problem but a challenge. We want the best possible Elektra and of course sometimes contradicting goals occur.
Like I said, the only solution is to introduce a very restrictive system, like the one with @ proposed above.
Yes, I also see that it will not be possible to have everything. With array escaping we are already quite limited to a prefix escaping, i.e. we cannot escape other characters within the base names.
But this prefix escaping might be enough. And this should be extensible to all the characters proposed above. The limitation would be that either the whole key name part is escaped or nothing at all.
Of course the escaping would not be magically but applications would need to check if the meaning of the keyBaseName has been taken away or we provide a different keyUnescapedBaseName so that API users can have the string with or without prefix escaping.
Why? It is all but clear to me, why this is preferable? UNIX filenames for example are not allowed to contain / and nobody ever thought that was a problem. (Not allowing / makes splitting UNIX paths into parts a lot simpler than with key names).
It is a problem for everyone implementing a file browser. Some applications had URL encoding techniques to get / into the file names. The more modern "solution" is to use the Unicode Character 'DIVISION SLASH' (U+2215) for slashes.
Furthermore, keep in mind that the whole reason for existence of Elektra is that UNIX file system semantics are not suitable for configuration. In configuration arbitrary sequences for key part names including / are commonly used and thus should be supported. E.g. in YAML / are allowed characters in key name parts and also in TOML there are "Quoted keys".
We are already breaking (almost) EVERY piece of code interfacing with Elektra by adding the : after the namespace. At the current state of affairs, only cascading keys without any array parts and also excluding some other special character things will continue to work without requiring change. Backwards compatibility is long out the window.
Yes, and I am super-grateful that you are doing this. I would not have been able to do this alone.
I would not be so sure about that. In fact I know of only one part of Elektra that truely assumes that keyAddBaseName escapes everything and that is elektra-kdb and the tools that create mountpoints.
Another part is the SSIDs. This was not a synthetic example but Broadcom has implemented this for their bluetooth devices. So they rely on that they can add any base name.
This is easy. But combining it with "keyAddBaseName shall accept all strings" is very hard or almost impossible.
Yes, very hard but also a very nice unique feature that not even UNIX had before. We can be proud of that.
E.g. in YAML / are allowed characters in key name parts and also in TOML there are "Quoted keys".
This is actually what I thought we could introduce, see below.
I think all of this can be solved by introducing something similar to "quoted keys". I am going to call them literal key parts.
If a key part starts with an @
(could be another character of course), the following key part is a literal key part. A literal key part can contain ANY byte sequence including zero bytes.
Their unescaped form is as follows:
@
character.@
characters are encoded as @@
.@0
.@
may not occur otherwise.The escaped form is a bit different, to allow easier printing:
@
character./
. Only another @
followed by a /
or a zero byte (end of string), terminates the _literal key part.@XX@
, where XX
is the hexadecimal form of the byte. Therefore @
characters must be encoded as @40@
.Now keySetBaseName
and keyAddBaseName
accept ANY valid unescaped key part as arguments. They set/add their argument unmodified as the last part of the unescaped key name. Therefore keyBaseName
will return the same argument. In addition keySetBaseName
and keyAddBaseName
turn the unescaped part into its escaped form and set/add it as the last part of the escaped name.
To solve the problem of "we want to add anything as a key part", we introduce:
size_t keySetLiteralBaseName (Key * key, char * bytes, size_t size);
size_t keyAddLiteralBaseName (Key * key, char * bytes, size_t size);
These functions accept ANY byte sequence as their bytes
argument. They then turn the argument into an unecaped literal key part by prefixing it with an @
, replacing other @
with @@
and zero bytes with @0
. This is then passed to keySetBaseName
or keyAddBaseName
.
To make decoding literal key part easier we could add decoding functions to the public API as well.
I like this much better, because it makes it very clear when the user intends to add arbitrary things to the key name and when this is not the case.
keySet/AddLiteralBaseName is a proposal to add a new feature. Currently we do not have a way to encode null bytes in key names. When somebody needs this, we can get back to your proposal.
For now, however, we need good semantics for keyAdd/SetBaseName. As outlined very long, this ideally should be any string, as otherwise many plugins and applications will be buggy because they do not expect that some strings will not work. #_10
is a valid key name part, it can be used as key name part in basically all storage plugins or applications, we cannot discuss around that. It is definitely not a good idea to let a storage plugin fail because it found #_10
somewhere as key name part during parsing. And it also not a good idea to require them to use a different API with a size argument if they do not have null bytes anyway. Actually, that such cases as #_10
or /
do not need special handling is the reason for the existence of keyAdd/SetBaseName.
Btw. maybe it best if we meet, e.g. on Tuesday?
Currently we do not have a way to encode null bytes in key names. When somebody needs this, we can get back to your proposal.
This is the least important point of the proposal. It only allows zero bytes, because I don't see a reason why it wouldn't.
Anyways, there already is a use case:
Another part is the SSIDs. This was not a synthetic example but Broadcom has implemented this for their bluetooth devices. So they rely on that they can add any base name.
Apart from the fact that SSIDs have nothing to do with bluetooth, EEE 802.11 clearly defines an SSID as any byte sequence up to 32 bytes long. So either Broadcom already does some pre-processing to handle zero bytes or their code is already buggy. In any case, this is clearly a use case for encoding zero bytes in key base names.
In any case I have no idea about possible future extensions to the set of special characters, but for the current set, I think the solution implemented in #3115 might work as is.
If we try the scenario from https://github.com/ElektraInitiative/libelektra/issues/2698#issuecomment-554680422 in this version, the following happens:
Calling keyAddBaseName (key, "#_10")
results in both the last part of key->key
and of key->ukey
being #_10
. Why? Because #_10
is a valid array element. Yes, if #_10
was an SSID we wouldn't see it as an array element, but globbing with #
would match it, but does that really matter?
If we take an invalid array element like #10
or #abc
, keyAddBaseName
will escape and the unescaped name's last part will be #10
or #abc
, while the escaped name's last part will be \#10
or \#abc
.
I have not yet thought about the theoretical rules we need to make this work for future extensions, but for now I think we can leave it like this. It would be nice to have array elements without the starting character, but does it really matter? Storage plugins are free to encode arrays differently, if they have a problem with the #
character and for the command-line we should just add kdb array*
commands that take normal integer parameters, e.g. kdb arrayset /my/array 10 a
.
Btw. maybe it best if we meet, e.g. on Tuesday?
Tuesday is bad for me... I will send you an email.
Apart from the fact that SSIDs have nothing to do with bluetooth,
I meant Blueray devices (that have wifi capability).
EEE 802.11 clearly defines an SSID as any byte sequence up to 32 bytes long. So either Broadcom already does some pre-processing to handle zero bytes or their code is already buggy. In any case, this is clearly a use case for encoding zero bytes in key base names.
Yes, you are right. But it is a new feature. Any string in keySetBaseName is not a new feature.
Because #_10 is a valid array element. Yes, if #_10 was an SSID we wouldn't see it as an array element, but globbing with # would match it, but does that really matter?
No that does not matter.
It would be nice to have array elements without the starting character, but does it really matter?
No, the most important thing is that any string can be fed to keySetBaseName, and keyBaseName returns what was fed before. With the need of start characters in arrays I can definitely (and happily) live.
[...] and for the command-line we should just add kdb array* commands that take normal integer parameters, e.g. kdb arrayset /my/array 10 a.
Sounds like a good plan.
After re-examining everything, I think there is only one conflict. The following to statements cannot be unified.
In our particular case. The first statement follows from the guarantees for keyBaseName
et al. together with the fact the you don't want keyAddBaseName
to reject an argument. In addition, the proposal "unesaped key parts starting with #
are always array indices" is a special case of 2.
This essentially means to keep 1, which we want to, we becomes necessary to always re-check a full unescaped key part, if you want to test a certain property. In particular, you have to check the full unescaped key part to know that it is an array index. I find this an unfortunate situation, but it seems unavoidable.
This also means that we could still do the whole "escaped key parts consisting entirely of digits are array indices" thing. The side effect would be that in the SSID example keyAddBaseName (key, "#_10")
would result in an escaped name ending in /10
instead of /#_10
. But since you already accepted the fact that keyAddBaseName (key, "#abc")
yields an escaped name ending in /\#abc
, I assume this would not be a problem either?
Implementing this would be a bigger task, so it would be a follow-up PR to #3115. I would need to check that everything that works with array indices actually uses the unescaped name, in addition to updating everything that creates keys with array parts.
Actually 2. is not necessarily needed. We can also store a 3rd string which contains the basename to be returned (maybe only then when it is actually needed to not waste too much memory).
Implementing this would be a bigger task
This is not good, we also need to somehow come to an end with this. Maybe the 3rd string will rescue us, we can optimize this later (if it is needed).
Either we make a full list of prioritized requirements in a decision or we meet? The decision will be handy anyway so that people later will understand why we did it like this and which requirements we sacrificed for other requirements.
Actually 2. is not necessarily needed.
Like I said, it was a proposal. It would be nice to have, but unfortunately it doesn't work.
We can also store a 3rd string which contains the basename to be returned (maybe only then when it is actually needed to not waste too much memory).
We'd need to store a whole 3rd version of the key name, otherwise keySetBaseName (key, NULL)
would work properly.
Right now there are not many properties a key part can have. At least not properties that are interesting across all of Elektra. "Is it an array index part?" is in fact the only one I can think of. If in future we have more properties (e.g. if we add the concept of "contextual placeholders" to the core), we can think about optimizations.
This is not good, we also need to somehow come to an end with this.
I know, which is exactly why I said I won't do it in #3115. If we decide we need it, I'll open another PR. This second PR will be a lot of work and it will be a breaking change. However, it should be a much smaller breaking change and (most) other simultaneous PRs shouldn't be affected as much as they will when we merge #3115.
Either we make a full list of prioritized requirements in a decision or we meet?
I will write documentation, including a decision, when the code of #3115 is done. The open question right now are:
/
and \
, which characters can have special meaning at the start of a key part?
In #3115 these currently are #
, %
, .
and @
.#
means array part, if the rest of the key part is only digits or first n
underscores and then n + 1
digits (n >= 0
) and nothing else. A part of just #
has no special meaning in general (it can mean array globbing to some parts of Elektra). Any other part starting with #
means the key name is invalid.%
means empty part if the part is just %
. Any other part starting with %
means the key name is invalid..
has special meaning only if the part is just .
or exactly ..
. All other parts starting with .
currently have no special meaning, i.e. the key name is not invalid. (Previously a part starting with .
made a key inactive)@
is reserved. Any part starting with @
means the key name is invalid.#
then just digits or the same as the unescaped version, i.e. #
and underscores?
In #3115 we always require array parts to start with #
and their canonical representation is the same as the unescaped one.My suggested answers are:
#
, %
, .
, @
, $
, &
and ?
should be enough for possible future uses. In particular I think we should not reserve _
that will only cause problems.$
, &
and ?
are reserved for future use like @
.kdb
tool should just have special array commands.Slightly unrelated thing: It might be possible to turn directoryvalue
into a library (or even make it part of the core) and give it a private interface that allows creating keys that would not be possible through the public API. Then we wouldn't have to worry about collisions because of the name __directoryvalue
. One of the reserved characters would be used for that. I haven't thought much about it, but it might be a good idea, since most of the storage formats are good for human readability don't support keys with both a value and keys below.
Other than / and \, which characters can have special meaning at the start of a key part? In #3115 these currently are #, %, . and @.
This is what this issue basically is about: to collect characters which potentially have a special meaning (now or reserved).
Should escaping the special characters from 1 always be allowed or only when required?
Yes, always be allowed.
Do we want that key parts consisting entirely of digits are interpreted as array parts? If so, what is the canonical representation? Just digits as well, starting with a # then just digits or the same as the unescaped version, i.e. # and underscores?
This heavily depends on which properties of base names we lose. From user perspective it is "nice" if no #
is needed, from API perspective it is very important that keySet/AddBaseName
just works.
In particular I think we should not reserve _ that will only cause problems.
And to change the globbing to *
?
Having the starting character makes it easier to spot array parts when looking at key names.
I do not think it is a big issue: numbers within a string are also quite easy to spot:
user/here/is/a/test/1/not/a/number/4/here/is/a/number
Then we wouldn't have to worry about collisions because of the name __directoryvalue.
If correctly implemented we do not need to worry also if implemented as plugin, the plugin could escape already present __directoryvalue.
My dream is that keySetBaseName would escape _directoryvalue (one underscore or @ would be enough), but the directoryvalue plugin would use keyAddName to get a raw _directoryvalue inside the key name. (Which would be protected because underscore or @ is a reserved character not to be used for applications).
Hopefully we can discuss this today, I hope this way we can come easier to conclusion.
This heavily depends on which properties of base names we lose.
As I said, we shouldn't loose anything compared to the current state of #3115, but we also wouldn't necessarily gain a lot. The main question is how much more work do we want to invest into the key name change. Lets talk about this today.
And to change the globbing to
*
?
*
already has meaning to the globbing library. It means any key part. #
matches just array parts and _
matches non-array parts.
I do not think it is a big issue: numbers within a string are also quite easy to spot
This might just be me, but I didn't spot the /1/
at first.
If correctly implemented we do not need to worry also if implemented as plugin, the plugin could escape already present __directoryvalue.
See #3256 for an alternative solution using metakeys.
My dream is that keySetBaseName would escape _directoryvalue
I am strongly against that. The core should not be concerned in any way with what plugins do, or what requirements they have. It also sets a bad precedent. What if another plugin uses _averyspecialkey
for something? Are we going to update keySetBaseName
every time this happens?
Outcome of talk:
characters above will be escaped if they are on the first position, except: #
The current behaviour (in #3115) of keySetBaseName
and keyAddBaseName
is defined by this function:
We only escape when it is needed to avoid unnecessary escaping in the escaped name.
Also since you asked, if there is a 1:1 relation between escaped and unescaped key name:
No, there is not. We decided to allow unnecessary escaping:
Should escaping the special characters from 1 always be allowed or only when required?
Yes, always be allowed.
This means that e.g. /key/%abc
and /key/\%abc
are different escaped key names for the same unescaped key name. Namely the one with cascading namespace and the two parts key
and %abc
.
If we want to ensure a 1:1 relation, we could either say that escaping is only allowed when required (don't like that), or remove unnecessary backslashes during the canonicalization step.
In any case the 1:1 relation requires that keySetBaseName
and keyAddBaseName
behave as described above, i.e. it has to know when to escaping is required.
Personally, I don't think the 1:1 relation is required. We only need a n:1 relation to make ksLookup
work. It normally shouldn't be a problem, if we look up /key/\%abc
and get back /key/%abc
or vice versa.
Thank you for investigating the round-trip behavior. Yes, the lookup is not a big deal (it is already 1:n because of cascading) but that keys with different names remove themselves when appending them to the keyset is ugly. And of course that it does not round-trip is another important goal: Every KeySet should after serializing and serializing be exactly the same, and this would not be given if \%
and %
are "identical" in the internal representation.
Do you maybe have a solution to the ksAppend and round-tripping problem?
Its a matter of definition. What uniquely identifies a key?
The technical answer is the unescaped name. It is what a KeySet uses to differentiate between keys.
Users just have to get used to knowing when two key names are the same. It's kind of similar to spotting that the fractions 2/3
and 4/6
are the same.
If you still want the 1:1 relation, escaping must only be allowed when necessary. That is not more effort to implement, but might be annoying to users.
Its a matter of definition. What uniquely identifies a key?
Until now saying "the name identifies a key" was enough because there was 1:1 between escaped and unescaped. They were only stored both as performance improvement. (space vs. time)
I would prefer if it can stay this way.
If you still want the 1:1 relation, escaping must only be allowed when necessary. That is not more effort to implement, but might be annoying to users.
Why would it be annoying?
My idea of \#123
vs. #123
was that it is not the same key name and does not have the same meaning (one means the literal #123
, the other is the 124th array entry). They both would have different escaped and unescaped names.
If they now have the same unescaped name, we can simply reject keyAddName(k, "\\#123")
. There is no benefit of having it, or is there?
My idea of
\#123
vs.#123
was that it is not the same key name and does not have the same meaning (one means the literal#123
, the other is the 124th array entry). They both would have different escaped and unescaped names.
That example does not demonstrate anything. Obviously \#123
and #123
have different unescaped versions, otherwise it wouldn't be possible to get the unescaped part #123
in any way.
Arrays are also a bad example, because we said #
must introduce a valid array part or be escaped, to avoid easy mistakes. Instead lets look at %
.
If the escaped part is /%/
, we get an empty unescaped part and if it is /\%/
we get the unescaped part %
. Obviously both of these must be valid escaped key parts. So far so good.
The question now is about the parts /%abc/
and /\%abc/
. Both map to the unescaped part %abc
. Are they both allowed, or is one of them invalid? If both are allowed, we don't have a 1:1 relation.
This is what I meant with:
Should escaping the special characters from 1 always be allowed or only when required?
To which you replied:
Yes, always be allowed.
Which I interpreted as both /%abc/
and /\%abc/
are allowed. I realise now that you might have meant "Yes, always required", i.e. only /\%abc/
is valid.
Anyways... We have two options here:
/\%abc/
is the valid./%abc/
is valid.The choice can be made independently for each of the reserved characters. Only one of these options can be chosen per character to retain the 1:1 relation.
In #3115 it is currently option 1 for #
and option 2 for .
and %
. For @
both options are the same, it always requires escaping. There are no valid parts starting with @
. I'll have to check everything else is correct, but I think we have a 1:1 relation in #3115 right now.
Thank you for the detailed explanation. Yes, it is not so important that /%abc/
is escapable with a single back slash. Round-tripping is much more important. We can also use something else to escape this (in the layer above.)
There are no valid parts starting with @.
How will it work when we assign @
a meaning?
I'll have to check everything else is correct, but I think we have a 1:1 relation in #3115 right now.
Thank you :100:
How will it work when we assign @ a meaning?
You are right, to ensure future compatibility, we have to choose option 1. It is easy to see:
/\%abc/
, /%/
and /\%/
are valid not, but /%abc/
isn't. All parts starting with @
are invalid right now, but starting with \@
is allowed and will still be correct, if some parts starting with @
become valid.
I deleted the rule set that I posted earlier, because after playing with an ANTLR grammar I noticed that the rules don't help. I think having a 1:1 relation AND ensuring backwards compatibility when giving reserved characters meaning is basically impossible.
Let's take @
as an example. Currently only parts starting with \@
are valid, those starting with @
are always invalid. Parts of the form \@XYZ
(XYZ
is arbitrary) are unescaped into @XYZ
.
Now we decide to introduce meaning for @
. All parts of the form @XYZ
(XYZ
arbitrary) for which the predicate foo(XYZ)
(foo
is some string predicate) is true are now unescaped into bar(XYZ)
(bar
is some function that maps strings to strings). To keep the 1:1 relation, we must obviously ensure that no other part is unescaped into bar(XYZ)
. But that means some currently valid keys will become invalid, because currently it is possible to obtain keys with arbitrary unescaped names. This breaks compatibility. (Note: It might be possible to keep the 1:1 relation and loose round-trip behaviour between the versions "no meaning for @
" and "meaning for @
", but thats just a different breaking change.)
I might have a solution to this, but I already know you won't like it, as it involves disallowing certain unescaped names.
The explanation of my solution became quite long, so I put into a file in #3115.
The basic idea is:
The detailed explanation and what it means can be found here. You can add comments here: https://github.com/ElektraInitiative/libelektra/commit/86d8e7806bfb92e95a7c519e09e95fa278eadb05
Thank you very much for working on this! @sanssecours can you maybe take a look at it if you also like the idea?
I might have a solution to this, but I already know you won't like it, as it involves disallowing certain unescaped names.
Btw. it is not that "I won't like it", it is that nobody will like Elektra if all storage plugins are buggy because they cannot accept some strings.
are buggy because they cannot accept some strings
It's not buggy, if we define it that way. JSON requires escaping similar to C-strings. Nobody complains about that. Any Elektra plugin using JSON would just require using a few more escapes.
While it should be visible from the more detailed definitions, I maybe should have explicitly stated that no actual functionality is lost. It is just slightly more complicated in a few cases.
For every sequence of bytes there is still a single unescaped key name that represents this sequence of bytes. It is just that the original sequence of bytes and the unescaped key name representing it will no longer always be the same.
It's not buggy, if we define it that way. JSON requires escaping similar to C-strings. Nobody complains about that. Any Elektra plugin using JSON would just require using a few more escapes.
You are comparing apples with bananas here. JSON is a serialization format, we are discussing a data structure here (KeySet). On serialization level, storage plugins still need to escape/unescape key names and key values so that separators can be distinguished.
While it should be visible from the more detailed definitions, I maybe should have explicitly stated that no actual functionality is lost. It is just slightly more complicated in a few cases.
Without API it is difficult to tell what "slightly more complicated" means.
It is just that the original sequence of bytes and the unescaped key name representing it will no longer always be the same.
I think we could live with that.
I am only worried that this change will be too big (it seems like also a modification in all storage plugin will be needed) and we will not be able to finish it. Maybe it is better to forget the natural sorting which brought most of this trouble?
Maybe it is better to forget the natural sorting which brought most of this trouble?
The natural sorting has nothing to do with this. AFAIK, it isn't even mentioned anywhere in the proposal. The problem is about reversing characters for future use while avoiding breaking changes.
Without API it is difficult to tell what "slightly more complicated" means.
The API is in the proposal. In this case "slightly more complicated" just means choosing between two operations (pushL
and pushU
, names are placeholders) instead of always using the same one (keyAddBaseName
). And sometimes ignoring a leading backslash when comparing key parts to strings.
I am only worried that this change will be too big
Of course it is a big change, but it will reduce future breaking changes.
we will not be able to finish it
Well that entirely depends on the deadline...
As discussed, I think we should rather forget about this issue (reservation of special characters). Instead higher-level escaping (like arrays #
, %
) or reservation of special characters (@
) should be directly done in the storage plugins.
Well that entirely depends on the deadline...
If you want to make a master thesis about future-proof software libraries, you are free to reopen this topic :wink:
As discussed [...]
That's not what I remember from the discussion. I thought we agreed to finish #3115 and then think about this issue again.
I will finish #3115 and I will also add some examples and plain English explanations to the proposal.
Instead higher-level escaping (like arrays #, %) or reservation of special characters (@) should be directly done in the storage plugins.
This is frankly a stupid solution...
Everytime we add something new to key names we will have to update all storage plugins. That is exactly the reason you used against my solution (which only requires this once).
It also doesn't avoid updating the plugins now, since AFAIK none of them actually handle key names 100% correct (array parts).
If you really don't want my proposal, please give a reason against it, instead of proposing a solution that solves nothing and is nearly as much effort.
If you want to make a master thesis about future-proof software libraries
No, I don't want to do that... I just want to make Electra's key names future proof, which is a much smaller and simpler task.
That's not what I remember from the discussion. I thought we agreed to finish #3115 and then think about this issue again.
Yes, we did not finish the discussion but:
It also doesn't avoid updating the plugins now, since AFAIK none of them actually handle key names 100% correct (array parts).
Please report bugs. Using arrays as specified in doc/decisions/array.md should be correct. Then #0
is a "simple name without meaning" and arrays only get meaning with a key directly below that has metadata "array".
If you really don't want my proposal, please give a reason against it,
I like the proposal very much. But things like #1074 (and see https://github.com/ElektraInitiative/libelektra/projects/9 for more) are IMHO much more critical for the success of Elektra.
instead of proposing a solution that solves nothing and is nearly as much effort.
What do you mean? My last proposal was to simply not reserve characters and leave it up to storage plugins to do what they want. I do not see any effort in "implementing" this proposal (it is a NOP).
No, I don't want to do that... I just want to make Electra's key names future proof, which is a much smaller and simpler task.
Maybe this is already enough for a master thesis if you finish the proofs. I did not look into related work though, maybe someone else already did something like this.
Using arrays as specified in doc/decisions/array.md should be correct.
That might be the case, but the tutorial for storage plugins disagrees and even shows that yamlcpp
doesn't work this way.
But things like #1074 are IMHO much more critical for the success of Elektra.
I am concerned with changes to key names, because that is what I am already doing and I have a lot of insight right now. Additionally, we already have a breaking change in key names, so it doesn't matter that much, if we break more things.
The user/dir mountpoints would be nice yes, but I have no idea how to solve that. It would probably take me as long or longer to find and implement a solution, as it would take me to implement my proposed changes to key names.
My last proposal was to simply not reserve characters
You mentioned reserving @
.
leave it up to storage plugins to do what they want
That only leads to incompatibilities. Of course we can't stop storage plugins from using a completely different format, but not giving any guidelines is just asking for problems.
That might be the case, but the tutorial for storage plugins disagrees and even shows that yamlcpp doesn't work this way.
@sanssecours can you look into this? If we recognize arrays just because of the structure (names with #0
) the KeySet will not round-tip.
It would probably take me as long or longer to find and implement a solution
Yes, but user/dir mountpoint is something every Elektra user is frequently confronted with. If we ever need reserved characters is not known.
You mentioned reserving @.
Yes, storage plugins can reserve and escape whatever they want.
That only leads to incompatibilities. Of course we can't stop storage plugins from using a completely different format, but not giving any guidelines is just asking for problems.
I fully agree. Maybe some people in some years will hate us for not reserving some characters (but maybe also not). But I am quite sure it will not contribute to the success of Elektra 1.0. To be clear: I am totally not against implementing it but other things (https://github.com/ElektraInitiative/libelektra/projects/9) are IMHO much higher in the priority list. You are free to disagree and implement it anyway. I would not reject the PR as long as it does not get too complicated for storage plugins (and all plugins get fixed). But there is a high risk that other stuff (which is essential for the success) does not get implemented and Elektra will not be used in important projects like KDE.
@sanssecours can you look into this? If we recognize arrays just because of the structure (names with
#0
) the KeySet will not round-tip.
I updated the behavior of YAML CPP and the Directory Value plugin in PR 3357. The pull request also updates part of the documentation to take the mandatory array
metadata into account.
I have now updated the proposal in #3447. I hope it is now easier to understand (it is still very long).
@markus2330 I'm gonna put this here even though the previous discussion was in #3223, because I think it fits better.
I just discovered/remembered that in the current version of Elektra [1], it's already the case (and implemented that way) that "any ESCAPED Key Name part that start with @
makes the whole Key invalid". This means we could -- without implementing any of the other changes from the proposal -- use field names starting with @
for special purposes in storage plugins, like the proposal suggests with $meta
and $value
.
@bauhaus93 This might help you when you implement and metadata in toml
(and also for non-leaf value keys, to some extent).
For those that haven't read the proposal, the idea would essentially be (for JSON):
{
"key": {
"@value": "127.0.0.1",
"sub": "xyz",
"@meta": {
"check/ipaddr": "ipv4"
}
}
Turns into two keys:
key = 127.0.0.1
key/sub = xyz
and key
also has this metadata:
check/ipaddr = ipv4
~Since both key/@value
and key/@meta
are reserved key names (or even invalid in #3447), this cannot cause any conflicts.~
[1] master
only mentions this in the documentation, but does let you create such Keys. #3447 on the other hand ~forbids these Keys entirely~ requires that @
is escaped at the start of key name parts in escaped key names. The documentation also has a similar provision for _
, but that is not enforced in #3447 either.
Sorry for the confusion, but I had to clarify parts of the comment above. The part about the @
only applies to the escaped Key Name. The unescaped name (as a C-String) "\01\00key\00@meta\00"
is still valid in #3447, like it ever was. This means @meta
cannot be used for metadata without further considerations in the storage plugin.
However, since it has always been a reserved name and we will require escaping in #3447, I think it is still fair to use it in a storage plugin. IMO a user should not expect that a reserved name can always be used without issues. But of course we will need to take some precautions. One possibility would be this (JSON again):
{
"key": {
"@value": "127.0.0.1",
"@sub": "xyz",
"@meta": {
"check/ipaddr": "ipv4"
}
}
Turns into two keys (note the \
for escaping):
key = 127.0.0.1
key/\@sub = xyz
But this
{
"key": {
"@@value": "127.0.0.1",
"@@sub": "xyz",
"@@@sub": "abc",
"@@meta": {
"check/ipaddr": "ipv4"
}
}
Turns into (again \
for escaping):
key/\@value = 127.0.0.1
key/\@sub = xyz
key/\@@sub = abc
key/\@meta/check\/ipaddr = ipv4
The rules here would be:
@
-> use the field name and add with keyAddBaseName
@
-> remove a first @
and use the rest for keyAddBaseName
@
-> check if field name has special meaning (e.g. @value
or @meta
):
keyAddBaseName
(e.g. @sub
)(Note: keyAddBaseName
takes an unescaped name and does the required escaping to keep the key name valid)
I do not understand the aim of the last two comments, is this a proposal about the directoryvalue plugin replacement?
(note the \ for escaping)
For @
we usually used @
for escaping, so to simply discard the first @
if there are several. Would be good if this is also documented in the key name description. We could change to \
(if it makes the escaping for key names in general easier understandable) but this needs changes at least in the base64 plugin. (null plugin gets deleted in #3491)
I do not understand the aim of the last two comments, is this a proposal about the directoryvalue plugin replacement?
The @value
part is yes. But it is more general. @meta
could be used for storing metadata inside a format with a single nested object structure, like JSON or TOML (still not sure about YAML, because tags).
Basically, because everything in the JSON file refers to some key somewhere, there is simply no place to store metadata. Unless impose some limitations on the structure of the JSON file, but then you probably can't just mount some config file. [1]
Like I said, you need to impose restrictions to fit metadata into JSON (AFAIK the same goes for TOML). So yes, my proposal in the comments above does impose a restriction on the JSON files, but applications that use @
at the start of a JSON field are probably quite rare.
For @ we usually used @ for escaping, so to simply discard the first @ if there are several. Would be good if this is also documented in the key name description. We could change to \ (if it makes the escaping for key names in general easier understandable) but this needs changes at least in the base64 plugin.
Not sure how the base64
plugin is relevant here, but like I to be very clear:
On the current master
(6a1535c094c7be2f441bd0e7b0dda1c50e97ee1d) keyNew ("/elektra/@abc", KEY_END)
works and creates an unescaped name with the parts elektra
and @abc
.
In #3447 keyNew ("/elektra/@abc", KEY_END)
returns NULL
because the name is deemed invalid, but keyNew ("/elektra/\@abc", KEY_END)
works and creates an unescaped name with the parts elektra
and @abc
.
Not sure what keyNew ("/elektra/\@abc", KEY_END)
does on master.
keyAddBaseName (k, "@abc")
works in both cases, the only difference is what escaped name it creates.
I also don't remember, why I made this change. There probably was a good reason, but in any case it is done now and I don't think changing back would be easy.
[1] As always, the problem here is that Elektra wants to do everything and please everyone. On the one hand we want to allow storage plugins that can read arbitrary files in a standard format (e.g. JSON) and turn them into KeySets. On the other hand we want to allow applications to use arbitrary strings as a part of a key name. This might work for some time, but at least when you get to metadata it breaks for certain formats.
JSON is a single tree with values only at the leaf nodes. KeySets are a tree with values at every node, and some of the node have a link to whole separate tree as well. You might be able to convert between the two, but one of them must impose limitations on the other to have a lossless conversion.
Like I said, you need to impose restrictions to fit metadata into
@kodebach does this mean you are in favour of restricting the formats s.t. metadata can be stored?
I never stumbled upon this, because I mostly used dump and then mmapstorage and only rarely imported/exported configs using other plugins. This is quite unfortunate. As a user I would rather be able to import/export to standard spec (losing metadata) while storing the real data (incl. metadata) using an arbitrary elektra internal format.
I think restricting the format could be confusing to users. For example, if Elektra provides a non-experimental JSON plugin, then I would expect it to be able to handle arbitrary spec-conforming inputs. It's unclear to me how the user would be able to check that his config files are conforming to the restricted spec.
@markus2330 what do you think about imposing restrictions on the format?
EDIT: I agree that "Elektra wants to do everything and please everyone" is a problem.
does this mean you are in favour of restricting the formats s.t. metadata can be stored?
We need to define our priorities. What exactly is an absolute must, and where can things change. As it is right now, the only way for many formats is to store metadata in comments (@bauhaus93 did that in #3500 for TOML). IMO this is not a solution, but merely a clunky work around. The worst part is that is abuses the one part that should never be touched by a parser and can drastically affect the readability of a storage file. Some formats (namely JSON) also do not have comments.
Very basic restrictions we could impose (using JSON as an example again) would be that all files must look like this:
{
"kdb": {
"elektra": {
"version": 1
}
},
"meta": {
"elektra/version": {
"type": "long"
}
}
}
Other "restrictions" include interpreting certain key names like @meta
in special ways and providing escape methods.
The kdb
object contains the actual Keys and meta
contains the metadata. (Note: to solve non-leaf keys with values, you'd need another object)
Instead of a two JSON objects we could also use separate files. But that would need support from at least the resolver and maybe even the backend code/plugin. Separate files would make it possible to import standard JSON files, without any metadata. How non-leaf keys with values would be handled in this case is not clear.
As a user I would rather be able to import/export to standard spec (losing metadata) while storing the real data (incl. metadata) using an arbitrary elektra internal format.
I agree, but importing/exporting everything goes a bit against the mountpoint idea of Elektra.
I already know, nobody wants to implement this, but there is a very general solution:
If a mountpoint uses a storage plugin that doesn't not support all of Elektra's features, the backend plugin creates a secondary fallback storage file in a format that does support all features (e.g. quickdump
, mmapstorage
seems like a waste of disk space here). When writing such a mountpoint, we simply write with both plugins. When reading we first read the secondary fallback file and then the primary file. This would require a merge process, but we will take the primary file as a source of through so the merge is quite simple.
I think restricting the format could be confusing to users.
Definitely. But it is clear to me that currently Elektra allows to much and something has to be restricted.
Even the solution with multiple files requires some kind of restriction to config file names to avoid collisions.
Originally I thought the easiest solution would be to restrict Elektra's Key Names, since that is the part we have the most authority over. But apparently there are applications that use WiFi SSIDs [1] directly as a Key Name Part, so that doesn't work. At least not without huge effort, as my proposal showed... IMO it would still be the best solution for all of this to just say "Key Name Parts are not arbitrary sequences of non-zero bytes anymore" and be done with it. For comparison, POSIX only allows the following characters in filenames:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
And that seems to be enough for people. All of the file systems listed on Wikipedia (except the ones without directories) have some kind of restriction on the filenames.
I want to stress again: All of this boils down to the fact that we have taken "Key Name Parts are arbitrary sequences of non-zero bytes" as an axiom [2]. Changing this would be every easy and it would fix everything. It could even be a tiny restriction e.g. "Key Name Parts are arbitrary sequences of non-zero bytes, except those that both start with @$%
and end in !%&
at the same time".
Just reserving a single Key Name Part e.g. "Key Name Parts are arbitrary sequences of non-zero bytes, except !@&
is not", would give us lots of possibilities. Heck, even the restriction "the last Key Name Part in a Key Name must not end with @%&
" should suffice. Any kind of restriction [3] no matter how tiny would give us leeway for implementing all sorts of things.
Of course, many restrictions make solutions possible, but good and usable solution might require bigger restrictions.
Sorry, if this seems like a rant, but at some point it just gets really frustrating that everyone agrees there is a problem, but every solution is deemed to have a flaw or is seen as too complicated.
[1] According to Wikipedia until the 802.11-2012 a WiFi SSID was "zero to 32 octets (32 bytes) long" including zero-bytes and "Wireless network stacks must still be prepared to handle arbitrary values in the SSID field", even though SSIDs must now be valid UTF-8. So just keyAddBaseName
ing a WiFi SSID without any additional logic would not be standards compliant anyway.
[2] That is actually, why the new documentation in #3447 starts with Key Name Parts. It is easiest (and IMO the most logical) to explain the current Key Names by starting with Key Name Parts and going from there. But the old way of
take a Unix/POSIX file system path and add a namespace, but also there is this special thing like empty parts and there are escape sequences, and then there is the unescaped version -- it is semi-hidden but should probably be used for everything other than printing -- oh and arrays are a thing, also some parts are reserved but not actually reserved reserved so we don't actually use them for anything low-level, ....
shows a lot better how insanely complex and complicated Key Names are.
[3] In case you are wondering, yes #3447 imposes some new restrictions, but those are useless, because they do not affect the Unescaped Key Name (except for the namespace) in any way.
Separate files would make it possible to import standard JSON files, without any metadata. How non-leaf keys with values would be handled in this case is not clear.
We discussed this regarding the cache implementation and decided against it. The current implementation of the resolver has stronger consistency guarantees than can be achieved when using two files. The resolver uses the atomic rename()
function to ensure that a process either reads the old or the new file, but always a consistent one. Unfortunately this cannot (to my knowledge) be done with multiple files while having the same guarantees. I would be reluctant to use multiple files where we need to ensure consistency.
I agree, but importing/exporting everything goes a bit against the mountpoint idea of Elektra.
True.
All of this boils down to the fact that we have taken "Key Name Parts are arbitrary sequences of non-zero bytes" as an axiom [2].
I agree, this is the part where restrictions make the most sense. As you've said, filenames are quite restricted and it still works really well. Unless anyone knows a good reason not to restrict Key names (maybe @markus2330)?
Sorry, if this seems like a rant
No, and thank you for discussing this. Unless @markus2330 knows some reasons why we can not change this, I am fully open to your suggestion (like restricting !@&
).
I would be reluctant to use multiple files where we need to ensure consistency.
That is a good reason, against multiple files yes...
I am fully open to your suggestion (like restricting
!@&
).
More realistically we could impose the restriction: "A Key Name must not contain the Key Name Part @elektra
" (possibly a more obscure prefix, if that helps). I think using elektra
might make the whole thing a bit more reasonable to users.
Personally, I think we should leave room for the future, but if restricting all Key Names is too much, we could also restrict what kdbGet
can return and what kdbSet
can accept. So the restriction above would only apply to what the last plugin during kdbGet
returns and to the KeySet
given to kdbSet
. Basically we'd only restrict what Keys can be stored from an application's point of view. This would be similar to what we already have with system:/elektra
(although this time we'd actually enforce the restriction).
Thank you very much for these valuable discussions. I started #3514 to summarize the important decisions scattered around everywhere. Please prefer to directly comment on the text in #3514, so that we can come to conclusions.
The @value part is yes. But it is more general. @meta could be used for storing metadata inside a format with a single nested object structure, like JSON or TOML (still not sure about YAML, because tags). Basically, because everything in the JSON file refers to some key somewhere, there is simply no place to store metadata.
I do not think anymore that it should be our goal to circumvent restrictions of file formats. If JSON does not support meta-data and non-leaf values, Elektra with a JSON file mounted, simply forwards its restrictions on KeySets. We can document them, like proposed in #3504 and the storage plugins should fail on such KeySets but nothing more. (Of course someone can come up with a super-JSON file format with less restrictions but this is beyond our current man power and as such irrelevant.)
but applications that use @ at the start of a JSON field are probably quite rare.
Who knows?
keyAddBaseName (k, "@abc") works in both cases, the only difference is what escaped name it creates.
Sounds good.
but at least when you get to metadata it breaks for certain formats.
Yes, we need to drop the idea that any format can support anything.
It's unclear to me how the user would be able to check that his config files are conforming to the restricted spec.
I think that most applications will not be very demanding anyway (e.g. only have leaf-values and do not need any meta-data except of type), so all we discuss here is only relevant to exceptional applications. For these applications, I think it is best that application developers develop their own storage plugins or they use our best storage plugins, which support such things.
@markus2330 what do you think about imposing restrictions on the format?
It is not a good idea. We should support configuration file formats as standardized. It can be a big problem if some valid JSON/XML files get rejected by Elektra. But on the other hand, we do not need to support anything more than standardized. To try this was a mistake from the past (trying to please anybody).
We need to define our priorities. What exactly is an absolute must, and where can things change.
doc/GOALS.md in #3514
The worst part is that is abuses the one part that should never be touched by a parser and can drastically affect the readability of a storage file.
I agree it is not the best idea, it already had plenty of problems in the ini
plugin. badcec5407df6947a67750afd8f0946245c5a5bb
Instead of a two JSON objects we could also use separate files.
See below.
For comparison, POSIX only allows the following characters in filenames:
This is the "portable Filename Character Set" and I doubt that there is any file system in use anymore which only supports this. In my experience any interpretation of the file names only calls for trouble (e.g. JFS had such features).
And that seems to be enough for people.
For Elektra's source it nearly is #1615 :wink:
But real applications immediately wanted more... (e.g. spaces, umlauts, ....)
Changing this would be every easy and it would fix everything.
It is not as simple as that: it makes storage plugins much more complicated. It simply moves logic to other places.
except those that both start with @$% and end in !%& at the same time
This idea is similar to CDATA
. It is not a pretty solution and unhandy to use in practice.
HERE
documents are a bit better, as the user can define the escape sequence.
But for what feature exactly do you think we need it? I think we should rather forget about these features and properly use meta-data to give semantics to keys.
Sorry, if this seems like a rant, but at some point it just gets really frustrating that everyone agrees there is a problem, but every solution is deemed to have a flaw or is seen as too complicated.
My point is: maybe we do not need a solution because the problem does not exist.
must now be valid UTF-8
Can they still have null-characters? Btw. I came up with the SSID example because this was a real requirement of a real customer (broadcom). Null-characters were not required, it is obvious that a char*
without size cannot support this, so we did not even discuss about this.
shows a lot better how insanely complex and complicated Key Names are.
Yes, but better they are like this than to have such logic in every application and storage plugin.
In case you are wondering, yes #3447 imposes some new restriction
Yes, the key name can have any restrictions. There always was the restriction of dangling escape, so a few more restrictions do not matter. #3447 will be a huge improvement for Elektra.
Unfortunately this cannot (to my knowledge) be done with multiple files while having the same guarantees
It can be done by writing a journal log where you write everything you plan to do and in case of any problems you can recover by replaying the journal. To implement this is very much beyond our man power. It also does not really fit to a configuration system (nor file systems), it is a technique for structured databases.
I made it a non-goal in multiple_file_backends.md 59066aa99bae707d0f0178841513fd3b8590b5bc
Basically we'd only restrict what Keys can be stored from an application's point of view. This would be similar to what we already have with system:/elektra (although this time we'd actually enforce the restriction).
We can enforce the restriction of system:/elektra
. But why do you want to restrict what applications are storing in their part of their hierarchy?
I think it is best that application developers develop their own storage plugins
Thereby defeating one of the main benefits of Elektra: "The admin/user decides the config format not the application developer".
It can be a big problem if some valid JSON/XML files get rejected by Elektra
That was never my suggestion. Of course we should accept every valid JSON. The question is what KeySet
do we produce. If we use "@value"
for non-leaf values as suggested above, we could still accept every valid JSON. The resulting KeySet
would just have a slightly different structure some of the time.
For comparison, POSIX only allows the following characters in filenames:
This is the "portable Filename Character Set" and I doubt that there is any file system in use anymore which only supports this. In my experience any interpretation of the file names only calls for trouble (e.g. JFS had such features).
And that seems to be enough for people.
For Elektra's source it nearly is #1615
But real applications immediately wanted more... (e.g. spaces, umlauts, ....)
They still don't allow /
or empty names. Most also have maximum lengths for file names and for paths. The point is applications find ways to work around limitations. And in our case we can even help them with libraries.
It is not as simple as that: it makes storage plugins much more complicated. It simply moves logic to other places.
I do not agree. Storage plugins are already extremely complex, relatively speaking this would only be a minor inconvenience. (assuming we provide a good storage
library with helper functions)
This idea is similar to CDATA. It is not a pretty solution and unhandy to use in practice.
HERE documents are a bit better, as the user can define the escape sequence.
Both CDATA
and HERE
documents need to reserve some syntax, from Elektra's current point of view there is no difference. Both versions are a restriction on key names.
Also both are much more powerful than what I proposed. My proposal would only "escape" a full key name part at a time.
I think we should rather forget about these features and properly use meta-data to give semantics to keys.
You just said, that some storage plugins might not support metadata. So it seems this is not an option.
My point is: maybe we do not need a solution because the problem does not exist.
Of course, if you just change your mind and say a storage format can limit Elektra's abilities the problem goes away. But that fundamentally changes Elektra's promise to users. We would limit all users just to allow some developer to safe a few lines of code dealing with encoding special values.
Can they still have null-characters?
Probably, NUL is a valid UTF-8 character.
without size cannot support this
Well, the maximum size of an SSID is 32 bytes, so you don't have to pass the size arround, as long as the buffer is always 32 bytes and zero-padded.
Yes, the key name can have any restrictions.
Please make up your mind. This whole time you have argued that there cannot be any restrictions.
There always was the restriction of dangling escape, so a few more restrictions do not matter.
"Dangling escape" is not a restriction. It is a syntax error in the escaped key name. Escaped key names never were arbitrary strings. This is a totally different thing.
We can enforce the restriction of
system:/elektra
.
We could require a flag in keyNew
to allow such names. Otherwise an application that accepts key names from the user, would have to sanitize the input. See also the kdb import
problem in #3299.
But why do you want to restrict what applications are storing in their part of their hierarchy?
I don't want to restrict what applications are store, but how they are storing it. This gives (storage) plugins more freedom.
I propose following characters to be reserved and escapable (to remove special meaning) within key names:
What do you think?