epics-base / pvDataCPP

pvDataCPP is an EPICS V4 C++ module
https://epics-base.github.io/pvDataCPP/
Other
6 stars 16 forks source link

validate field names #12

Closed mdavidsaver closed 8 years ago

mdavidsaver commented 9 years ago

Check for invalid characters in field names (eg '.') as they are created. Restrict character set to ascii alpha numeric and '_' (can extend later).

dhickin commented 9 years ago

Checking for "."s in field names is reasonable as we've definitely disallowed this to allow getting subfields of structure subfields (timeStamp.nanoseconds, for example).

However, other than that I don't believe we've ever made any specification on the set of allowed characters. I think this is definitely a case where we need to leave time for discussion.

My immediate reaction is that the set of characters is too restrictive. Should "-" be included, for example?

Anyway it's not something that should go into the 4.5 release - definitely, as you say, an enhancement, not a bug fix. I suggest we hold off a merge into master until after 4.5. It will give time to discuss a specification and will make the release simpler.

anjohnson commented 9 years ago

I don't disagree with the suggestion that we hold off to allow time for discussion, but there is no technical reason to avoid merging this into master as soon as the discussion has concluded. We use release branches precisely so we can continue doing development on master at the same time as a release is being prepared.

On the subject of what characters to allow in field names, I agree with Michael's idea of starting with the basic set of alpha-numerics plus underscore, we can expand later if necessary.

On the implementation though wouldn't isascii(c) && isalnum(c) || c == '_' be simpler? In that case c should be an int though.

gregoryraymondwhite commented 9 years ago

Parenthesis? Hash? “@“? Any reason why not?

On Sep 23, 2015, at 11:09 PM, Andrew Johnson notifications@github.com wrote:

I don't disagree with the suggestion that we hold off to allow time for discussion, but there is no technical reason to avoid merging this into master as soon as the discussion has concluded. We use release branches precisely so we can continue doing development on master at the same time as a release is being prepared.

On the subject of what characters to allow in field names, I agree with Michael's idea of starting with the basic set of alpha-numerics plus underscore, we can expand later if necessary.

On the implementation though wouldn't isascii(c) && isalnum(c) || c == '_' be simpler? In that case c should be an int though.

— Reply to this email directly or view it on GitHub.

ralphlange commented 9 years ago

Sounds like a case for a solid religious discussion at the November meeting. I will be packing enough German umlauts.

anjohnson commented 9 years ago

In V3, record field names become C identifiers, thus they must start with a letter or underscore and may be followed by any number of underscores or ASCII alphanumeric characters. The 3.15 server-side filtering code added some specific use of other characters in channel names, which was only possible because those characters could never have been used in field names. If we allow other characters in V4 field names we might reduce the possibility of adding extra functionality at a later date.

mdavidsaver commented 9 years ago

@anjohnson I've factored out the range tests as xisalnum() for the present. I generally avoid is*() from ctypes.h for network messages to avoid potential problems with system locale, although in this case that may not be relevant. I do see a note in the ctypes.h manpage that isascii() is spec'd by posix, not C89.

I must also admit to having avoided learning about C locale handling, so I'll defer to your judgment on safety and portability.

mdavidsaver commented 9 years ago

I agree that this should not be in a 4.5 release, however I'd like to merge this now to avoid further inadvertent errors. @arkilic had mongodb names with '.' sneaking in.

mdavidsaver commented 9 years ago

... Any reason why not?

None what so ever.

I would strongly encourage the adoption of an existing spec (C identifier) or convention (epics record name).

anjohnson commented 9 years ago

@mdavidsaver You're right, isascii() is not available on VxWorks, and without that qualifier the locale would probably adversely affect isalnum().

anjohnson commented 8 years ago

F2F Meeting agrees that for now field names should be restricted to follow the C89 identifier rules. This github issue does not propose or discuss limitations on PV names.

mdavidsaver commented 8 years ago

Updated to implement C89 identifier syntax (regex "[A-Za-z][A-Za-z0-9]*").

Strictly speaking this test should also check and fail for C keywords (eg. "goto" or "double") as keywords may not be used as identifiers. I'm inclined not to enforce this technically (just discourage it) as doing so would not be useful unless java and c++ keywords were also blacklisted. The union of these three sets has 104 entries.

FYI this isn't entirely academic, testFieldBuilder uses "double" as a field name.

mdavidsaver commented 8 years ago

Since there was agreement last week, unless someone wants to veto I'll merge this on Wednesday.

anjohnson commented 8 years ago

The Perl DBD-file parser in 3.15 has a list of C++ reserved keywords (which might be out of date by now) and prevents their lower-case versions from being used as record field names. This was found to be necessary with the aSub record type's field named NOT which lower-cases to a C++ keyword, so it was impossible to write an aSub record subroutine in C++. The record structure in the generated header file now uses the original (upper-case) name if the lower-case version is reserved, while the parser aborts if the original name is reserved.

I don't see a particular need to do that same check in pvData though, problems would only arise if the name were to be used as a member name in a structure that gets output to a header file.

dhickin commented 8 years ago

The request itself looks good.

However we definitely shouldn't go any further in not allowing reserved C keywords. (I hadn't realised this was intended - I assumed the references to C89 identifiers meant using this for the allowed character set).

Specifying the allowed character set is valid and useful, particular to avoid issues like we hit with "."s. There is no good reason to disallow keywords from any language and in any case pvData is independent of any language binding. At what point to we stop (do we ban Python keywords?).

The only reason I could possibly see is to avoid the situation where a programmer tries to name a variable to match a field name, but this should be handled by the programmer by choosing a suitable variable name.

There might be an argument for avoiding pvData meta language keywords. However I don't think this causes an issue and we have seen already people naturally use double as a field name.

I suggest we just say that (unless we extend at a later date) the allowed field names are "[A-Za-z_][A-Za-z0-9_]*" which is completely clear and avoid specifying in terms of C89.

mdavidsaver commented 8 years ago

With a vote of 3-0 I consider that the keywords are out leaving only the simple lexical definition ([A-Za-z][A-Za-z0-9]*).