fenbf / cppstories-discussions


2021/strong-types-pesel/ #21

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Strong Types in C++: A Concrete Example - C++ Stories

When you create a model for your domain, C++ offers you flexibility and increases type safety with so-called Strong Types. Rather than working with simple built-in types, you can create a set of well-defined classes that better suit your needs. In a new blog post, you can see one concrete example of such a design practice.

https://www.cppstories.com/2021/strong-types-pesel/

ninja-on-rye commented 3 years ago

"Useful but still does not free us from checking if sizeof( unsigned long long ) > 36 on all systems" Actually, long long and unsigned long long are guaranteed to be at least 64 bits. Also, sizeof yields a size in bytes (multiples of sizeof(char)), not bits, so you wouldn't be comparing it against 36.

Philippe-Allain commented 3 years ago

Choosing what flavor of int type to use to store 36 bits is done at compile time, not at runtime, so there is no overhead due to invocation of sizeof(int), which, by the way, does not return a number of bits.

The first question is really whether you want to store your 'long number' as a number or as a string. The PESEL constructor is ultimately fed a string, and so is the TLongNumberFor constructor.

You will never add nor multiply two social security numbers, so trying to store this thing as a number does not seem natural. A standard string is probably the type one would naturally choose, allowing use of non-digit characters as well. Extracting date of birth or sex is also easier in that case. I understand the idea to save memory space by using BCD format for string data made exclusively of digits. But this does not make an object that can be seen as a long number. The name TLongNumberFor is misleading.

I also don't understand what 'strong type' means. It's nothing else than a simple class.

2kaud commented 3 years ago

There is uint64_t which is always unsigned 64 bits.

If some bits of the number have built-in meanings, then a union combined with bit fields can be used. As an example using 16 bits:

union Instructn {
    struct sf0 {
        uint16_t op = 0;        // whole 16-bit instruction word
    } if0;

    struct sf1 {
        uint16_t dreg  : 3;     // destination register
        uint16_t dmode : 3;     // destination addressing mode
        uint16_t op    : 10;    // opcode
    } if1;                      // the three fields pack into the same 16 bits
};

This allows either the entire number to be set/get or its individual elements. This allows for easy validation of individual elements and the easy extraction of those elements. The downside, of course, is that you need to know the endianness. However, with C++20 there is std::endian, which provides a compile-time constant. If the data consists of mixed numeric/alpha, then we'd simply use an appropriate struct. E.g. UK National Insurance numbers have the format aannnnnna, which we'd store as:

struct NI {
    char lead[2] {};    // two leading letters
    uint32_t num {};    // six digits
    char trail {};      // trailing letter
};

with num being a union if the parts of the number have a specific meaning.

If the number has meaning for some of its digits (not bit positions), then a struct can be used to represent this as necessary - either as uintnn_t or char x[n].

IMO, every different type of 'number' etc should have its own type with its own specific set/get elements as needed.

ninja-on-rye commented 3 years ago

"There is uint64_t which is always unsigned 64 bits" If it exists - it's optional. However, uint_least64_t is guaranteed to exist.

dangolick commented 3 years ago

I can't see any good reason for the obfuscation: static const auto kNumOfBytes = (kMaxNumbers >> 1) + (kMaxNumbers & 0x01);

Why "hide" what you're doing with bit twiddling? There is no performance advantage: since kNumOfBytes is const, it is calculated at compile time.

But even in the case of RecomputeIndex, where it is called at runtime, any decent compiler will convert division by a constant 2 into a shift.

The only advantage I can see of storing in BCD is that you are saving two bytes. It's relatively rare to need to extract a decimal digit by position, but this can be easily accomplished by (x/10^n)%10. On any CPU with integer division this will be only slightly slower than the BCD code. (If you create a table for the powers of 10, the conversion is fewer operations and may well be faster.)

irfan-mirza commented 3 years ago

Hello Bartlomiej, I have never used unary operators, nor were we taught those at university. As a result, I have very little understanding of what they do. Would it be possible for you to create a separate post explaining what all the bit twiddling in this post is actually doing? If you decide to do that, that would be great.

xtofl commented 3 years ago

I'm a big fan of using the type system to streamline your interfaces, and to prevent bugs. Even a simple 'wrapper' may already prevent accidents, and that is, for me, the true meaning of 'strong types'. This expresses the ideas directly in code.

auto matrix::element(int i, int j) const {
  return rows[i][j];
}

matrix m{{1,2,3},{4,5,6}};

assert(m.element(2,0) == 2);

Something as simple as that can be made more solid by defining rows and columns as actual types:

struct Row { int v; };
struct Column { int v; };
auto matrix::element(Row r, Column c) const {
  return rows[r.v][c.v];
}

matrix m{{1,2,3},{4,5,6}};
assert(m.element(Row{2},Column{0}) == 2);  // <<<< this can't be right!?

This is a trivialized example of a situation we had with more dimensions, and more types of elements. By introducing strong types like that, we effectively found 3 bugs!

The type system is my best friend.

mirazabal commented 3 years ago

Hi, I don't want to sound pedantic, but there is a bug: the range of position is not checked, at least in int GetNumberAt( int position ) const and void SetNumberAt( int position, int val ). Since position is an int, it could even be negative...

BogCyg commented 3 years ago

Hi and thanks for the comments! In this post we show a real but short & simple example of designing with strong types – the classes TLongNumberFor and PESEL are just examples and should be verified/modified/extended for real applications (e.g. there are real cases where a PESEL doesn't convey a valid birth date, etc.).

The main idea of strong types is to replace the built-in types with wrapper-like classes that better express our intentions and usually lead to better design. They also let us avoid errors like those in a function void SetUserData( int age, int pesel, … ), whose arguments are easily confused: it can be called as SetUserData( 33, 71123455 ) or mistakenly as SetUserData( 71123455, 33 ). If we change the design to something like SetUserData( Age age, PESEL pesel, … ), then at least we avoid pitfalls like this. The cost is additional code; however, it can frequently be reused – e.g. TLongNumberFor can be wrapped into a class to express an ISBN or other IDs.

There is also another technique I'm a big fan of – programming by contract. Simple exemplary preconditions are implemented with assert in the functions GetNumberAt and SetNumberAt. They verify the range in this case, although a pedantic programmer might ask why we don't use strong types to express the positions as well. Anyway, I highly recommend this style of programming – it has saved my skin many times ;) – and I also recommend the contracts feature that was proposed for C++20. Happy day & happy coding!

jasonzio commented 3 years ago

Any value known at compile time should be declared constexpr. A few of the class statics, for example, fall into that category.

A 6-octet std::array is vastly smaller than an 11-character string; if your program is dealing with literally millions of instances of the class, that memory adds up.

And while a 6-octet array is only 25% smaller than a 64-bit int, again, when storing millions of them in a database, especially when the field might appear in multiple indexes (since it is intended to be a unique identifier), that savings adds up. It isn't so much about reducing disk space as it is about reducing in-memory footprint; the more pages of index you can keep in memory, the faster you go. At small levels of scale, the difference is unimportant. At larger levels of scale, the difference can be significant.

The actual lesson of this post isn't about the data representation; it's about building a class that accurately models the abstract datum, providing only the operations actually required on that datum.

When you build programs around that model, rather than some concrete type, you give yourself the freedom to try different representations of the abstract type without altering any of the code that uses it. You can run the experiment of string vs uint64_t vs std::array<NibblePair, 6>. You can, based on real-world data, change your mind a year after deployment and write simple one-time conversion code.