facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net
Other
23.55k stars 2.1k forks source link

Tag error enum #116

Closed nemequ closed 8 years ago

nemequ commented 8 years ago

The enum in error_public.h is anonymous. It would be nice if you included a tag or a typedef, if only to help make -Wswitch / -Wswitch-enum usable. AFAIK there is currently no way to get the compiler to warn if a switch does not include a case for every enum value. I would like to be able to do something like

size_t res = …;
if (ZSTD_isError(res)) {
  switch ((ZSTD_ErrorCode) -res) {
    case ZSTD_error_No_Error:
      /* … */
      break;
  }
}

Note the cast in the controlling expression; without it the type will just be size_t, and the compiler doesn't realize that there is any association to the enum so it will not emit a warning when there is no case for a particular enum value.

FWIW, I still think it would be easier to do something like

typedef enum {
  ZSTD_error_No_Error = 0,
  ZSTD_error_GENERIC,
  ZSTD_error_prefix_uknown,
  …
} ZSTD_ErrorCode;

ZSTD_ErrorCode ZSTD_getError(size_t code);

ZSTD_getError would work just like ZSTD_isError does now; you could still do if (ZSTD_getError(code)) { … } since ZSTD_error_No_Error would be 0 and errors would be positive integers, but it would make the API a bit easier to understand since you hide the rather weird detail of negating an unsigned type.

Another thing that would be nice is a consistency check. For example, note that each of the values I wrote in the enum above (chosen because they're the first three entries in the enum in zstd, not to make this point) have different capitalization conventions.

Cyan4973 commented 8 years ago

Thanks @nemequ, these are very good points.

Just updated v05x branch following your suggestions :

As a consequence, I'm also wondering about the existing ZSTD_getErrorName() function. Currently, it takes a size_t as input type. Now that a ZSTD_errorCode enum type exists, I'm wondering if it's still the right input type.

Thinking harder about it, I note that ZSTD_isError() and ZSTD_getErrorName() are part of the "stable" API zstd.h, while the enum list and type remain in the experimental section zstd_static.h, which should not be required for basic tasks. It is likely to remain there for some time, since the list will probably need some updates (i.e not stable) in the foreseeable future.

t-mat commented 8 years ago

I post my opinion for the discussion.

In this case, we treat return value size_t as some kind of anonymous/hidden union union { size_t value; ZSTD_errorCode hidden_error; }. I suppose size_t and ZSTD_isError() are public (in the meaning of OO language term) type and its accessor, on the other hand ZSTD_errorCode and ZSTD_getError() are private type and accessor. Since ZSTD_getErrorName() is "public" function, public type size_t would be good enough for this purpose.

nemequ commented 8 years ago

I would probably make ZSTD_getErrorName() take a ZSTD_errorCode. It isn't obvious how to convert an error code back to a size_t, and even if it were it feels much stranger to have to do so than it would to provide the function with a ZSTD_errorCode. That said, I don't think it really matters much; you should go with the one which you feel more comfortable with.

Calling it ZSTD_errorCode instead of ZSTD_ErrorCode surprised me a bit. The other types (CCtx, DCtx) start with a capital… It's okay if you want to make a distinction between structs and enums; just wanted to call your attention to the consistency issue.

I thought the whole zstd API was currently unstable in the sense that you haven't committed to anything yet? Given that the binary format isn't stable, I don't see why you should worry about the API being stable… If you haven't committed to anything yet, IMHO you should think about getting rid of isError in favor of getError now. If you want, I think enums can be incomplete, so you could do something like

/* zstd.h */
typedef enum ZSTD_errorCode_ ZSTD_errorCode;
ZSTD_errorCode ZSTD_getError(size_t code);

/* error_public.h */
enum ZSTD_errorCode_ {
  ZSTD_error_no_error,
  …
};

A couple minor points: if you want to allow people to use error_public.h (which the "public" certainly implies), you should prefix the include guards (ZSTD_ERROR_PUBLIC_H_MODULE instead of ERROR_PUBLIC_H_MODULE). Also, it might be wise to make it easier to distinguish between internal, experimental, and private headers… I would suggest just using "zstd" for public, "zstd_experimental__" for experimental, and "zstd_internal" or "zstd_private__" for internal, or subdirectories would work just as well.

luben commented 8 years ago

@nemequ :-1: for the subdirectories, currently you can build the lib with "cc -shared -o libztd.so *.c", let's keep it simple.

nemequ commented 8 years ago

@luben you would still be able to. I'm only talking about for the headers.

luben commented 8 years ago

Yes, I understand now. Sorry for the confusion

Cyan4973 commented 8 years ago

If you haven't committed to anything yet, IMHO you should think about getting rid of isError in favor of getError now. If you want, I think enums can be incomplete, so you could do something like

I suspect it would kill your central proposition to make ZSTD_getError() a replacement for ZSTD_isError(), since the result could no longer be interpreted as a 0 or 1 signal.

Plus, anyway, it's possible to have pointers to incomplete types, not incomplete type per se as function argument or result.

if you want to allow people to use error_public.h

Well, I'm not so sure about that.

While today the difference between zstd.h and zstd_static.h is "rather stable" vs "experimental, will probably change", the final goal is to have : zstd.h : suitable for DLL zstd_static.h : not suitable for DLL, only use in the context of static linking.

And the point is, I'm not sure if error_public.h will ever be part of the DLL interface. It might as well remain forever into zstd_static.h, because it's difficult to guarantee that the list of enums is complete and stable. It's less of a problem for static linking, since most issues can be caught at compile time. For DLL, on the other hand, there are a number of problematic scenarios possible if the list used to compile the program is different from the list used to compile the library.

I thought the whole zstd API was currently unstable in the sense that you haven't committed to anything yet?

It's not so clear cut. Building trust around a library implies a minimum of stability. Last time I changed something within zstd.h, it was just to add a compressionLevel field to zstd_compress(). It produced way more fuss and problems than one would like to deal with.

At least with this separation into 2 parts, the message that "a part of API is stable, a part is still experimental" seems to be better accepted. Now, user know that advanced features are not yet stable, but still feel secure using basic ones.

ZSTD_isError() is one of the most used functions of the API. Even within ZSTD project itself. Removing it would create a lot of short terms problems. If needed, I would prefer a longer transition period, like LZ4 : deprecate section, deprecate warnings, commented declaration, etc.

But, anyway, considering the issue already mentioned, that ZSTD_errorCode enum can't be part of zstd.h because it's not stable, and as a consequence, cannot be used as a result type of a function declared within zstd.h, it seems there is no transition scenario at this point.

Calling it ZSTD_errorCode instead of ZSTD_ErrorCode surprised me a bit. The other types (CCtx, DCtx) start with a capital… It's okay if you want to make a distinction between structs and enums; just wanted to call your attention to the consistency issue.

Good point. I will have another look into it.

Cyan4973 commented 8 years ago

I tried to look for naming convention but found nothing that would determine a clear direction.

The initial concept regarding ZSTD naming convention is :

Naturally, quickly arrived exceptions :

So, with ZSTD_errorCode, I just restored the initially defined rule. I did not considered that functions and types should have different naming conventions, although they obviously could (and enums could have their own one). But I miss a good enough reason to create such distinctions.

Btw, I found this article on coding and naming convention a pretty good read : http://www.joelonsoftware.com/articles/Wrong.html Quoting :

This is the real art: making robust code by literally inventing conventions that make errors stand out on the screen.

Bottom line : coding and naming convention is about spotting errors more easily, making human code review more efficient.

nemequ commented 8 years ago

I suspect it would kill your central proposition to make ZSTD_getError() a replacement for ZSTD_isError(), since the result could no longer be interpreted as a 0 or 1 signal.

You could still use it like a boolean without knowing anything about the values. All values other than 0 indicate an error; it doesn't matter what the name(s) corresponding to the integer value are.

Plus, anyway, it's possible to have pointers to incomplete types, not incomplete type per se as function argument or result.

You're right; I was thinking it might not be the case for enums… Obviously you can't for structs since the storage size is unknown, but I forgot that sizeof(enum) == sizeof(int) necessarily true according to the C standard, it's determined by the ABI.

For DLL, on the other hand, there are a number of problematic scenarios possible if the list used to compile the program is different from the list used to compile the library.

At the API level there isn't really an issue unless you remove a member. Obviously you shouldn't ever do that, just like you shouldn't remove any other element of the API (once the API has been declared stable, of course).

For ABI, in theory things are slightly more complicated, but not much. If you keep the values below CHAR_MAX there really isn't an issue anywhere. Or you could add a max entry which is larger than you'll ever need which would force the compiler to choose a type wide enough to contain that value. Since it's unlikely you'll ever need more than 127 error codes I think you can just ignore the enum size issue.

In practice, virtually everyone just uses sizeof(int) for the enum size. My understanding is that most (all?) Linux architectures require it, as does Windows. Basically, the only place you're likely to find a compiler that will use anything other than sizeof(int) for an enum is on a system that doesn't do dynamic loading.

It's not so clear cut.

Given that the binary format isn't stable I'm surprised anyone would think that the API/ABI would be stable, but it's up to you.

If needed, I would prefer a longer transition period, like LZ4 : deprecate section, deprecate warnings, commented declaration, etc.

Even for a stable API, I don't see the need for moving it to a deprecated section prior to deprecating it. Deprecation is warning people that code shouldn't be used, warning people that soon you're going to warn them that code shouldn't be used is silly.

Btw, I found this reading on coding and naming convention to be pretty good : http://www.joelonsoftware.com/articles/Wrong.html

The part about Hungarian notation makes me really pine for Ada. It's one of my favorite programming languages, and I never get to use it :(

One of the best resources I've found for API design is CERT's C Coding Standard. Especially the API recommendations (except API08-C: Avoid parameter names in a function prototype, that one is BS).

Bottom line : it's not about looking pretty, it's about making code review more efficient.

Making code easier to review (a.k.a. readability) is the most important thing, and it should always be the priority, but there there are two other big parts of good API design: making it easy to write good code, and making it hard to write bad code.

Making it hard to write bad code is definitely an art form, especially in a language like C where you have typedefs instead of subtypes or derived types. A lot of it is about working with the compiler to generate warnings or errors if something is used incorrectly. From a very quick review, there are several places where Zstd falls short here:

On a related note, you should make use of GCC's nonnull function attribute wherever appropriate. I like to use assert(param != NULL) too, since it makes a good fallback for other compilers which don't support the attribute, but the attribute is nice because it can catch things at compile time (and at runtime in ubsan).

The biggest trick to making it easy to write good code is to avoid surprises. Names should be predictable and consistent, and things should do what you expect without reading the documentation. That means that you should follow de facto standards when possible, even when you (think you) have a more clever solution.

Having to actually RTFM indicates a failure of the API just like having to RTFM for a GUI application indicates a design failure. For some people it is a very important realization that, for programmers, the API is the UI.

There are also places where it's not obvious what the API does, or the API isn't consistent, both of which code harder to read:

Cyan4973 commented 8 years ago

You could still use it like a boolean without knowing anything about the values.

Made a quick test, for completeness. Using a typedef to an incomplete enum : error: invalid use of incomplete typedef ‘ZSTD_ErrorCode’

Bottom line : using the enum as a return type requires to get access to the enum definition. A pity, this would indeed have been nice to rely only on int == 0 | any conditions.

At the API level there isn't really an issue unless you remove a member.

Sure, removing a member, changing order of members, are easy to flag as bad practices. I was more thinking about the consequences of adding a member. Due to the current construction, adding a member pushes the value of ZSTD_error_maxCode. So, should someone decide to rely on it (even if he shouldn't, that's not the question : as this part of the API is exposed, he can) for example for his own homemade version of ZSTD_isError(), a new error code could be mistaken as a valid result.

Such issue could be avoided by forcing ZSTD_error_maxCode to a large enough value, so that it will likely not change in the future.

Actually, thinking about it, there is another objective implied by the zstd.h and zstd_static.h separation : the first one shall expose public symbols easy to understand, safe and usable by "junior developers", the second one exposes advanced functions and types, opening new possibilities but also with more complex behaviors and potentially risky side effects if incorrectly misused. Basically, getting into zstd.h means respecting quite a set of conditions :

zstd_static.h gets all the rest.

The enum is currently planned to remain within zstd_static.h because I have little idea of what kind of mis-usages become possible once exposed. I understand that once it's stable, it reduces potential problems, but I'm also pretty sure it's not yet stable.

Judging by the current usages of zstd I know of, all developers are interested in knowing if the function result is successful of not (ZSTD_isError()), but almost none care about which error precisely it is. So it looks logical to categorize this requirement as an "advanced" feature.

My understanding is that most (all?) Linux architectures require it, as does Windows. Basically, the only place you're likely to find a compiler that will use anything other than sizeof(int) for an enum is on a system that doesn't do dynamic loading.

Well, as pointer by @t-mat, there is also -fshort-enum gcc compilation flag.

It seems this could be counter-balanced by using a large enough ZSTD_error_maxCode to force compiler to select an int type even with this flag enabled.

On the other hand, requiring an enum to be an int rather than a char is only useful in the context of DLL ABI. For static linking, it doesn't matter much, and could even be considered a waste of space for some limited environments.

One of the best resources I've found for API design is CERT's C Coding Standard. Especially the API recommendations

Thanks for the link !

A lot of it is about working with the compiler to generate warnings or errors if something is used incorrectly.

I fully agree with the objective. Beware though of over-engineering : the API (especially zstd.h) should remain dead-symbol to use, even by Junior C developers and wrapper callers. Sometimes, a conceptually correct construction also increases complexity, increasing risks of misuses and basically deterring non-experts developers from using the library at all. So a right balance must be found.

There are a lot more interesting points in your post, but it's difficult to answer them all in a single post.

Cyan4973 commented 8 years ago

Actually, there is a potential migration path : makes ZSTD_isError() prototype the copycat of ZSTD_getError().

This is only possible if the enum list is part of zstd.h. But I don't feel it's the right time for this.

However, the good news is that, if the enum list ever become a part of zstd.h in the future, the change of ZSTD_isError() prototype will be impactless (from an API standpoint, for ABI see comments regarding sizeof(enum)). Which means existing code using ZSTD_isError() will still work as intended. Zero pain migration.

Check parameters, at least in the public API. assert that they're not null, or within the expected range, etc.

Sounds like a good advise. Note that I had some bad experience in the past with assert() not being disabled in release mode, costing a large toll on performance. Sure it was easy to put the blame on the user, saying it's up to him to know how to disable them, but that simply did not worked : too many times, the code was used verbatim, without any effort to look for NDEBUG, so assert() were still there, and product performance and reputation suffered.

So I believe NDEBUG should in fact be automatically set (in release mode) to avoid such issue.

On a related note, you should make use of GCC's nonnull function attribute wherever appropriate

Seems like a logical continuation.

keep in mind that changing the return value of a function from void to something else is neither an API or ABI break; if you're not using it, it may be best to just return void here and keep your options open in the future.

I'm surprised that it doesn't break API/ABI, but if that's the case, then okay.

I already mentioned the capitalization issue with CCtx/DCtx vs. errorCode.

I'm not going to fight for this. If you prefer ErrorCode, then it's okay. I see no drawback, so fine.

nemequ commented 8 years ago

Well, as pointer by @t-mat, there is also -fshort-enum gcc compilation flag.

If you go the route of adding an entry for the max code, nothing would change as long as the max is <= 127.

Note that I had some bad experience in the past with assert() not being disabled in release mode, costing a large toll on performance. Sure it was easy to put the blame on the user, saying it's up to him to know how to disable them, but that simply did not worked : too many times, the code was used verbatim, without any effort to look for NDEBUG, so assert() were still there, and product performance and reputation suffered.

Yeah, I've been there, too :(

Like I mentioned before, I think of removing assertions as an optimization. It shouldn't be done in user-facing API (since you clearly cannot trust users), but if you're confident that the assertion can never be triggered it's not a good idea to remove it from tight inner loops.

If you do remove them from inner functions, the nonnull attribute can really help, though of course it's limited to detecting NULL, not any invalid values. When combined with a good test suite, possibly fuzzing, and -fsanitize=undefined you should be able add checks higher up (outside of the loop) to make sure. With recent versions of GCC (5+?), UBsan will add instrumentation to functions which are annotated with nonnull, but when you're not compiling with UBsan there is 0 cost for the attribute (actually, it could potentially help the compiler and/or a static analyzer notice dead code if you're checking for NULL somewhere else).

Another option would be to do something like

#if !defined(ZSTD_DEBUG) && !defined(NDEBUG)
#define NDEBUG
#endif

(before including assert.h, of course). That would make it so assertions are only enabled if the user explicitly defines ZSTD_DEBUG. It changes the default to safe instead of secure, which I generally prefer to avoid, but many programmers are more willing to give up some security for a little speed.

You could also have two levels of assertions; do something like

#if !defined(ZSTD_DEBUG)
#define zstd_assert(expr)
#else
#define zstd_assert(expr) assert(expr)
#endif

Use assert for checking arguments in user-facing API, and zstd_assert on internal functions. When you're building zstd for unit tests, fuzzing, etc., you can define ZSTD_DEBUG, but the inner assertions would still be disabled for everyone else, even if they forget to define NDEBUG.

So I believe NDEBUG should in fact be automatically set (in release mode) to avoid such issue.

AFAIK it depends on the build system. Autotools and CMake won't, but I believe Visual Studio does. Also, thinking ahead to when shared libraries are available for Linux, it's worth noting that most Linux packages are actually compiled in debug mode, then the debugging symbols are stripped into a separate package. In this case, assertions would still be checked.

Again, removing assertions is an optimization; it makes sense for inner loops where you control the input. For user-facing API where you don't control the input the performance penalty probably doesn't matter since the checks only need to be run once, not hundreds of thousands of times.

I'm surprised that it doesn't break API/ABI, but if that's the case, then okay.

To be clear, changing it to void would be an API break. But no, changing it from void is considered safe, at least from C/C++.

I'm not going to fight for this. If you prefer ErrorCode, then it's okay. I see no drawback, so fine.

I prefer consistency; the only other types in zstd right now are ZSTD_CCtx and ZSTD_DCtx, which start with capitals. You can brush the difference aside by declaring that the coding standard is to start with lowercase for enums and uppercase for structs (you could even do something different for callbacks). Treating typedefs to enums and structs differently is a bit odd, but it's not completely unreasonable. However, if you want the convention to be lowercase for everything be aware that, good intentions aside, those two types violate the convention.

Cyan4973 commented 8 years ago

I feel we have answered some of these points, especially regarding error public enum type. There are other points mentioned later in the discussion, but they address larger topic, and do no longer correspond to initial title. Maybe another issue should be created if some of them deserve another study.

nemequ commented 8 years ago

Agreed.