SpenceKonde / ATTinyCore

Arduino core for ATtiny 1634, 828, x313, x4, x41, x5, x61, x7 and x8

Enable C++17 support #557

Closed henrygab closed 2 years ago

henrygab commented 3 years ago
Problem: RAM and PROGMEM (click to expand)

ATTiny devices are extremely constrained in RAM, which makes it critical to get as much data into `PROGMEM` as possible. However, `PROGMEM` requires special accessor functions to pull data into RAM for immediate use (otherwise the user is forced to write a lot of extra code everywhere the data is touched). Moving data into `PROGMEM` also requires that the data be evaluated at compile time, so it can be placed in ROM. Thus, the more robust the compiler's `constexpr` support, the more data can be determined at compile time, saving precious RAM.
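For context, here is a minimal sketch of the usual pattern (the table name and values are hypothetical): data declared `PROGMEM` stays in flash, but every read has to go through a `pgm_read_*` accessor.

```C++
#include <avr/pgmspace.h>
#include <stdint.h>

// Hypothetical lookup table: lives in flash, not copied into RAM at startup.
const uint16_t kTable[4] PROGMEM = { 10, 20, 30, 40 };

uint16_t readEntry(uint8_t i) {
  // Flash can't be dereferenced like normal data on classic AVR;
  // each access goes through a pgm_read_* helper.
  return pgm_read_word(&kTable[i]);
}
```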


Why C++17 / Trivial changes to enable (click to expand)

C++17 not only greatly improves `constexpr` capabilities, but also relaxes template restrictions, allowing the simpler metaprogramming that is sometimes necessary to (politely) explain to the compiler that the data really and truly is a compile-time constant.


Platform.txt change summary (click to expand)

`avr-gcc` doesn't support `-std=c17`, but that doesn't matter. Per the [docs](https://gcc.gnu.org/onlinedocs/gcc/Standards.html#C-Language), `-std=c11` is treated identically to `-std=c17`, so no update to `CFLAGS` is needed:

> A version with corrections integrated was prepared in 2017 and published in 2018 as ISO/IEC 9899:2018; it is known as C17 and is supported with -std=c17 or -std=iso9899:2017; the corrections are also applied with -std=c11, and the only difference between the options is the value of __STDC_VERSION__.

`CPPFLAGS` has two changes. First, the change from `-std=gnu++11` to `-std=c++17`, for the reasons stated above. Second, the addition of `-fno-sized-deallocation`, since sized deallocation is not a relevant feature for ATTiny and the flag addresses the following compiler warning:

```
...\hardware\avr\1.5.2\cores\tiny\new.cpp:29:6: warning: the program should also define 'void operator delete(void*, unsigned int)' [-Wsized-deallocation]
 void operator delete(void * ptr) {
      ^~~~~~~~
```
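For illustration only (a sketch, not the verbatim ATTinyCore line, which carries many more options), the relevant `compiler.cpp.flags` entry in platform.txt would end up looking something like:

```
compiler.cpp.flags=-c -g -Os -Wall -std=c++17 -fno-sized-deallocation -fno-exceptions -ffunction-sections -fdata-sections
```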


The good news: Initial compilations for ATTiny85 appear to generate identical assembly listings and .HEX files. Please consider updating to support C++17 ... it will make seeding data in `PROGMEM` much easier. Thank you!
SpenceKonde commented 3 years ago

Hi, thanks - I'm a bit confused though. Can you provide any code where C++17 standard support makes a relevant difference? I'm only competent with straight C stuff and inline assembly - and that may be giving me too much credit - not all that wack shit they added in C++. You could say I'm not a class-y programmer ;) At the start you say it makes it easier to put constants into progmem, but at the bottom you say it generates identical hex files, which would mean it has the same behavior as before and we're just shuffling deck chairs - in which case I shouldn't merge it.

Can you make it a bit clearer, in general, what cases this would be expected to change behavior in?

(as an aside, I don't think I would ever describe a word I have said to the compiler as "polite". Like I said, I'm not a classy programmer)

henrygab commented 3 years ago

OK, I can definitely try. It's late, so I'll likely have some mistakes below....

First, my mention of getting the "same" results above related to compilation of existing examples both with and without this change. In that case, you really want the binaries to be identical ... it means the compilation had no regressions. Code that is written to use C++14 or C++17 features would not compile without the change, so there is no such baseline to compare against.

I will use expandable/collapsible sections to make it easier to parse my attempt at this, trying to focus on some features of particular interest to PROGMEM optimization.


constexpr functions

C++11 enabled [constexpr functions](https://en.cppreference.com/w/cpp/language/constexpr). Unfortunately, in C++11, they were really restricted. As in, single-return-statement restricted. This forced all complex if/else statements to be rewritten as multiple chained ternary (`?:`) operators. C++14 greatly relaxed the restrictions, allowing `constexpr` functions to declare (and reuse) variables, have multiple statements, and more. This one feature alone is probably worth the update. See also https://github.com/AnthonyCalandra/modern-cpp-features#relaxing-constraints-on-constexpr-functions
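A quick sketch of the difference (hypothetical function, not from the core):

```C++
#include <stdint.h>

// C++11: a constexpr body is essentially one return statement,
// so branching has to be written as chained ternaries.
constexpr uint8_t clamp7_cxx11(int v) {
  return (v < 0) ? 0 : (v > 127) ? 127 : static_cast<uint8_t>(v);
}

// C++14 and later: locals, if/else, loops, and multiple statements are allowed.
constexpr uint8_t clamp7_cxx14(int v) {
  int r = v;
  if (r < 0)   { r = 0; }
  if (r > 127) { r = 127; }
  return static_cast<uint8_t>(r);
}
```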


constexpr if

[`constexpr if`](https://en.cppreference.com/w/cpp/language/if#Constexpr_If) is a really useful feature ... it allows a template's return type to be `auto`, while the discarded (false) branch of the `constexpr if` is excluded from return-type deduction. This can reduce duplicated code.
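A small sketch of what that looks like (hypothetical template): the discarded branch is never instantiated, so two returns with different types can live in the same function.

```C++
#include <stdint.h>

template <bool kWide>
constexpr auto readValue(uint32_t wide, uint8_t narrow) {
  if constexpr (kWide) {
    return wide;    // return type deduced as uint32_t when kWide is true
  } else {
    return narrow;  // return type deduced as uint8_t when kWide is false
  }
}
```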


Template metaprogramming

Getting memory-compact definitions of structures into PROGMEM can sometimes be quite difficult, especially when needing to change the memory layout between PROGMEM and normal memory.


Many more useful bits...

But, the ones that can be relevant to space optimization (and `PROGMEM` use) may include:

* [Structured Bindings](https://github.com/AnthonyCalandra/modern-cpp-features#structured-bindings) -- useful to more easily initialize aggregate types (see the sketch below)
* [Return type deduction](https://github.com/AnthonyCalandra/modern-cpp-features#return-type-deduction) -- esp. with `constexpr if` in templates....
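For example, a structured-binding sketch (the `Pixel` type here is hypothetical):

```C++
#include <stdint.h>

struct Pixel { uint8_t r, g, b; };

constexpr Pixel kDefaultColor{ 12, 34, 56 };

void demo() {
  // C++17 structured bindings: unpack the aggregate into named locals
  // without writing any accessor boilerplate.
  auto [r, g, b] = kDefaultColor;
  (void)r; (void)g; (void)b;
}
```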


Ok, that's a starting point, anyways.
An example that's challenging...

In C++11 (where the standard library is not available, as on many constrained MCUs), consider the following. Given an aggregate structure:

```C++
struct Element {
    uint32_t Foo;
    uint8_t  Bar;
};
```

Each `Element` instance takes 8 bytes (due to alignment / padding bytes). But, if wanting to store a really large array of those in PROGMEM, maybe you'd want a container defined similar to:

```C++
template <size_t ELEMENT_COUNT>
class ElementArray_P {
    uint32_t FooValues[ELEMENT_COUNT];
    uint8_t  BarValues[ELEMENT_COUNT];
    // plus operator overloads and an iterator definition
    // to generate Element values on-the-fly
};
```

Let's say you get the operator overloads working, and even define a working iterator ... so that it's easy to loop through while only creating a single `Element` in RAM at any time. Next, you realize you really need to support aggregate initialization to get the data into PROGMEM, which makes the next part more difficult. Finally, you want to minimize rewriting the data structure initialization. For example, you'd want the following to all be equivalent:

```C++
Element alpha[2] { {0x89ABCDEF, 0x12}, {0xFEDCBA98, 0xF3} }; /* aggregate initialization */
Element beta[2]  {  0x89ABCDEF, 0x12,   0xFEDCBA98, 0xF3  }; /* aggregate initialization */
ElementArray_P<2> theta  { {0x89ABCDEF, 0x12}, {0xFEDCBA98, 0xF3} }; /* aggregate initialization */
ElementArray_P<2> theta2 {  0x89ABCDEF, 0x12,   0xFEDCBA98, 0xF3  }; /* aggregate initialization */
```

Getting this to work (esp. since `std::array` isn't usable) without C++17 features ... well, let's just call it difficult. :) Template metaprogramming with liberal `constexpr` usage is required, and C++17 features make this somewhat less onerous.


SpenceKonde commented 3 years ago

Thanks for that - If I'm being honest, most of those points went over my head... it's becoming increasingly clear (and initially was not) that while I write a significant amount of code that is compiled as C++... basically the only thing I do that isn't compatible with straight C is operator overloading. I struggle with classes and namespaces, and templates are black magic to me. But it sounds like, for those who dabble in such sorcery (sourcery?) it would be a valuable addition.

I will put those changes into megaTinyCore first (there is a release I need to get out this weekend for megaTinyCore; I've improved the upload speed by something like a factor of 8 when using one of the programmer options... and I want to get that into the hands of users).

SpenceKonde commented 3 years ago

I put C++17 mode into megaTinyCore so I could run the CI. There are indeed a few cases where it resulted in a difference in the compiled code - something about the Servo library gets handled ever so slightly more efficiently (I mean, I looked at the listing and saw where the difference was, but it was deep enough in that I couldn't figure out what the heck they were doing). It uses 2 words less there, and 1 word more on all the tinyNeoPixel ones. But yeah, this needs testing with CI running here before I can put something like that in. We will see if anyone squawks about problems on megaTinyCore, as well...

henrygab commented 3 years ago

[admittedly, low priority, but fascinating!]

> There are indeed a few cases where it resulted in a difference in the compiled code - something about the servo library gets handled ever so slightly more efficiently (I mean, I looked at the listing, and saw where it was and all, but it was deep enough in that I couldn't figure out what the heck they were doing.). It uses 2 words less there, and 1 word more on all the tinyNeoPixel ones.

I admit to extreme curiosity. I tried looking through the action logs in megaTinyCore to find how you determined the size changed (due to build changes). However, I didn't see anything that saved the build artifacts from the automated builds.

Did you do this manually somehow? If you have a ZIP of the old vs. new build artifacts, I'd love to take a look. Or, if you have specific example sketches that you could point to, I could build locally?

SpenceKonde commented 3 years ago

https://github.com/SpenceKonde/megaTinyCore/pull/424

That's how I figured out that there were differences. I downloaded the full report CSV linked there.

I opened it in Sublime Text, copied the first line to a new file, and searched with regex to grab (I think) the names of the example sketches. Then find-all, cut, and paste into another scratch document, and find-replace the `<br/>\n` with `,,,,` (each sketch has 4 columns). That became the first line of the original CSV file; the second scratch document needed some trivial cleanup with regex, then became the second line. Saved it, opened it in Excel, and looked for the changes. Then I manually compiled one example of each of the two, with and without the change, to compare.

The examples for the tinyNeoPixel library gain 1 instruction word in size. The examples for Servo lose 2.

SpenceKonde commented 3 years ago

I had to use WinMerge on the assembly listings to find the difference, but of course it was very annoying that a bunch of addresses were off by 2 or 4.

Thinking about it, if I had to do that again, I'd make a copy of them, regex away all the line addresses and jmp targets, compare those, then go back and search for the scenery I now knew was around the site of the differences.

SpenceKonde commented 3 years ago

And, by the way, over in megatinycore, this caused https://github.com/SpenceKonde/megaTinyCore/issues/528

henrygab commented 3 years ago

Thanks for the update; I find that G++ change interesting.

Yes, both C++14 and C++17 added some new allocation and deallocation functions.

It should be simple with any conforming compiler...

Quoting cppreference.com for `new()`:

> Thus, replacing the throwing single object allocation functions is sufficient to handle all allocations.

And quoting cppreference.com for `delete()`:

> Thus, replacing the throwing single object deallocation functions (1,3) is sufficient to handle all deallocations.

But, those are for **_user_** replacements of the functions, and presume use of the standard library. Here, I think you are providing the standard library?


There may be a latent bug...

The new C++17 alignment-aware allocators should not be called when the object being allocated has an alignment requirement `<= __STDCPP_DEFAULT_NEW_ALIGNMENT__`. If they are being called in "legacy" code, then one of three things is likely:

1. The alignment requirement was reduced relative to the prior compiler???
2. There is a bug causing the compiler to call the alignment-aware allocator
3. The legacy code had a bug, because it was relying on UB ... an allocation having greater alignment than was guaranteed.

Do you know what the value is being used for `__STDCPP_DEFAULT_NEW_ALIGNMENT__` in megatinycore?


Prototypes of new/delete functions...

If you are providing all the underlying `new` and `delete` operators, then you might need to support the following alloc/free functions for C++17 compliance:

```C++
// For C++11, only need the following:
void* operator new  ( std::size_t count );
void* operator new[]( std::size_t count );
void* operator new  ( std::size_t count, const std::nothrow_t& tag ) noexcept;
void* operator new[]( std::size_t count, const std::nothrow_t& tag ) noexcept;
void operator delete  ( void* ptr ) noexcept;
void operator delete[]( void* ptr ) noexcept;

// Since C++14, also need two more delete variants...
// but, can just forward to the C++11 versions and ignore
// the parameter ... it's only for optimizations
#if (__cpp_sized_deallocation >= 201309L)
void operator delete  ( void* ptr, std::size_t sz ) noexcept;
void operator delete[]( void* ptr, std::size_t sz ) noexcept;
#endif

// Since C++17, there's four more each for new / delete, to support allocation
// of objects with alignment greater than __STDCPP_DEFAULT_NEW_ALIGNMENT__.
#if (__cpp_aligned_new >= 201606L)
void* operator new  ( std::size_t count, std::align_val_t al );
void* operator new[]( std::size_t count, std::align_val_t al );
void* operator new  ( std::size_t count, std::align_val_t al, const std::nothrow_t& ) noexcept;
void* operator new[]( std::size_t count, std::align_val_t al, const std::nothrow_t& ) noexcept;
void operator delete  ( void* ptr, std::align_val_t al ) noexcept;
void operator delete[]( void* ptr, std::align_val_t al ) noexcept;
void operator delete  ( void* ptr, std::size_t sz, std::align_val_t al ) noexcept;
void operator delete[]( void* ptr, std::size_t sz, std::align_val_t al ) noexcept;
#endif
```

Note: I've never seen the alignment allocators (those with `std::align_val_t`) used in any code targeting C++14 or earlier. For C++17, although I can imagine it being useful to be able to allocate DMA-aligned memory for hardware, or page-aligned memory on PCs, I have never actually seen it used.
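As a rough sketch (not the core's actual new.cpp, and assuming the `std::align_val_t` declarations are available from `<new>` or the core's own header), the extra variants really can just forward to the basic ones on a target where no type needs more than the default alignment:

```C++
#include <stdlib.h>
#include <new>

// Error handling omitted; this only shows the forwarding shape.
void* operator new(std::size_t count)       { return malloc(count); }
void  operator delete(void* ptr) noexcept   { free(ptr); }

#if (__cpp_sized_deallocation >= 201309L)
// The size argument is only an optimization hint; ignore it and forward.
void operator delete(void* ptr, std::size_t) noexcept { operator delete(ptr); }
#endif

#if (__cpp_aligned_new >= 201606L)
// If nothing can require more than __STDCPP_DEFAULT_NEW_ALIGNMENT__,
// the alignment argument carries no extra information; forward as well.
void* operator new(std::size_t count, std::align_val_t)      { return operator new(count); }
void  operator delete(void* ptr, std::align_val_t) noexcept  { operator delete(ptr); }
#endif
```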


Reference pages for you...

A couple useful pages:

* [Feature testing](https://en.cppreference.com/w/cpp/feature_test)
* [align_val_t](https://en.cppreference.com/w/cpp/memory/new/align_val_t), [new](https://en.cppreference.com/w/cpp/memory/new/operator_new), [free](https://en.cppreference.com/w/cpp/memory/new/operator_delete) on cppreference.com
* [SO description of relevant constants](https://stackoverflow.com/questions/59172291/order-between-stdcpp-default-new-alignment-and-alignofstdmax-align-t)
* [SO discussion of requirements for allocators](https://stackoverflow.com/questions/56713868/overloading-operator-new-with-smaller-default-alignment)


Click to expand / collapse each section. Hope that helps.
pstolarz commented 3 years ago

> Moving data into PROGMEM requires that the data be evaluated at compilation time, so the data can be placed in ROM

@henrygab I don't follow your remark regarding the PROGMEM and `constexpr` issue. `constexpr` simply says that a statement/function can be evaluated at compile time, and therefore it doesn't have any representation in the binary (compiled) form (placed in RAM or ROM) corresponding to the statement/function logic. For example, a `constexpr` function is reduced to a single const value in the compiled code, and the value need not have any memory associated with it (it is a prvalue).

EDIT: I probably grasp your idea; paraphrasing: make as much data const as possible and use `constexpr` funcs/statements to handle it, to move as much logic as possible to compile time and reduce the binary footprint.
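As a tiny sketch of that idea (hypothetical names): the `constexpr` function below emits no code of its own; only the precomputed table ends up in flash.

```C++
#include <avr/pgmspace.h>
#include <stdint.h>

constexpr uint8_t squared(uint8_t x) { return x * x; }

// Evaluated entirely at compile time; only the 8 result bytes land in flash.
const uint8_t kSquares[8] PROGMEM = {
  squared(0), squared(1), squared(2), squared(3),
  squared(4), squared(5), squared(6), squared(7),
};
```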

SpenceKonde commented 3 years ago

> Do you know what the value is being used for __STDCPP_DEFAULT_NEW_ALIGNMENT__ in megatinycore?

Not a clue!

pstolarz commented 3 years ago

> Do you know what the value is being used for __STDCPP_DEFAULT_NEW_ALIGNMENT__ in megatinycore?

`__STDCPP_DEFAULT_NEW_ALIGNMENT__` = `alignof(max_align_t)` = 1, and `sizeof(max_align_t)` = 12. In other words, this MCU doesn't have any alignment requirements.
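(A minimal sketch of how one might confirm those values on the toolchain in question; adjust the expected numbers if your avr-gcc build differs:)

```C++
#include <stddef.h>

static_assert(alignof(max_align_t) == 1,  "expected alignof(max_align_t) == 1 on AVR");
static_assert(sizeof(max_align_t)  == 12, "expected sizeof(max_align_t) == 12 on this avr-gcc");
#ifdef __STDCPP_DEFAULT_NEW_ALIGNMENT__
static_assert(__STDCPP_DEFAULT_NEW_ALIGNMENT__ == 1,
              "expected a default new alignment of 1");
#endif
```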

SpenceKonde commented 3 years ago

> this MCU doesn't have any alignment requirements.

Well, no, it wouldn't - but how does alignof(max_align_t) == 1 and sizeof(max_align_t) == 12 tell you that? The cppreference entry doesn't really help me on that.

I can imagine user code that would want overaligned structures in memory. It's harder to imagine cases where that would be required for objects or anything allocated dynamically... The examples I'm thinking of are all where there's a buffer that would have been statically allocated, that gets accessed with some unholy snippet of inline assembly... (which is probably located in an ISR, quite possibly a naked one, otherwise you wouldn't be looking at such desperate measures to make it run a fraction of a microsecond faster). I'm not sure if there are ways to improve performance significantly that require such overalignment and don't involve inline assembly.

henrygab commented 3 years ago

You are right. On AVR, likely no such thing is ever going to be required.

Of course, there are platforms where over-aligned memory is useful....

Consider more than just the alignment required by various instructions in the instruction set ... in other words, anything with a DMA engine. For example, maybe the DMA engine has a 32-bit alignment requirement, and only transmits in multiples of 32-bit. As another example, maybe a compression or encryption offload has a 512-bit alignment requirement, and only acts on buffers that are multiples of 512-bit. Maybe the DMA engine doesn't normally require special alignment ... unless you're using chained memory descriptor lists (with one DMA engine triggered to update the config of the second DMA engine) ... when it gets finicky unless using aligned buffers. Hardware designs vary, and not always in the most software-friendly way.

pstolarz commented 3 years ago

Sorry, maybe I was not so precise saying *any*, but I didn't expect this sentence to be taken so seriously ;) In this context I meant memory alignment for any (basic, primitive) object type, as defined for C `malloc()` here. The fundamental alignment, as specified by `alignof(max_align_t)` = 1, implies a basic object doesn't require any alignment on this MCU (e.g. `alignof(int)` = `alignof(void*)` = `alignof(size_t)` = 1). I don't exclude that there could be some (system) objects that would require it (like DMA, for example).

As an example: Intel has quite relaxed requirements for memory alignment of basic types (integers, pointers), but unaligned access (as far as I recall) is less efficient than aligned access. An unaligned access spans more than a single memory access cycle, so the CPU needs to lock the bus to guarantee atomicity. For this efficiency reason `alignof(uint32_t)` is 4, not 1. On the other hand, the ESP8266 (Xtensa) has very strict alignment requirements and raises a LoadStoreAlignmentCause exception in case of unaligned access (I painfully found out about this recently via this bug report).

BTW, I wonder if the AVR MCU guarantees memory access atomicity for innate multi-byte objects (e.g. `uint16_t`). For example, when storing a 2-byte integer (address) in memory, is it possible for an ISR to fire after the 1st byte has been stored but before the 2nd one?

SpenceKonde commented 3 years ago

AVR does not guarantee atomic access; you have to briefly disable interrupts while reading a variable written by an ISR, or writing one read by an ISR, if the code could be called while the ISR is enabled. This was one of the many bugs that could cause time travel in early versions of this core (brief jaunts a millisecond or so backwards or forwards in time, returning to the present on the next call). These have all been fixed, and while cleaning up code in wiring.c around them, I wrote about it yesterday.
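(For reference, the usual avr-libc idiom for that - a sketch, not code from the core:)

```C++
#include <util/atomic.h>
#include <stdint.h>

volatile uint16_t isr_count;   // written from an ISR

uint16_t readCount(void) {
  uint16_t copy;
  // Briefly disable interrupts so both bytes of the 16-bit value
  // are read as one consistent unit, then restore the previous state.
  ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
    copy = isr_count;
  }
  return copy;
}
```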

DMA is not present on any AVRs that are not branded as XMega.

Yeah - I think we're all on the same page with what the actual alignment requirements of the MCU are. I've gone as far as writing out theoretical ISRs that could perform lightning-fast operations with a 256-entry lookup table, but which required it to be aligned to 256 bytes (28-30 clock cycles including all interrupt overhead) - for, say, outputting a sine wave on the DAC at high resolution, where you're controlling the rate at which interrupts are fired to set the frequency, while also doing something else. If the high byte of the address includes the interrupt flag you need to clear, that saves one cycle, and if the table is in flash instead of RAM, that adds one. The 400 ksps that the tiny 1-series datasheet claims would eat something like 60% of the CPU time entering, in, or exiting the ISR at 20 MHz - painful but viable - while dropping back to the 140 ksps that the Dx claims as max leaves the fraction of time spent in the ISR looking almost sane.
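(A sketch of the alignment part of that trick, with a hypothetical table: aligning a 256-entry table to a 256-byte boundary means the low byte of any entry's address is just the index, so the ISR can keep the high byte fixed and only update the low byte.)

```C++
#include <stdint.h>

// Hypothetical wavetable; the aligned(256) attribute puts it on a 256-byte
// boundary, so entry i lives at address (table_high_byte << 8) | i.
__attribute__((aligned(256)))
static uint8_t sine_table[256] = { /* precomputed samples */ };
```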

pstolarz commented 3 years ago

> This was one of the many bugs that could cause time travel in early versions of this core (brief jaunts a millisecond or so backwards or forwards in time, returning to the present on the next call.

I'd be interested to get to know more about it if possible.

SpenceKonde commented 3 years ago

I checked in changes just now over on megaTinyCore's wiring.c.

SpenceKonde commented 3 years ago

Before, as well as in and around it, there is a ton of stuff about millis(), micros() and delay() implementation pitfalls and considerations. The biggest problem with micros() isn't that, so much as the fact that you need to do division, and division is way too slow. Avoid division like the plague on 8-bit AVRs.
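(A sketch of the usual workaround, with an illustrative divisor: replace the runtime division with a multiply-and-shift approximation, which costs only a few cycles on AVR. The constant and valid input range should be re-derived for the real divisor.)

```C++
#include <stdint.h>

// Approximate x / 5 as (x * 205) >> 10. This matches exact integer division
// for inputs up to roughly 1023; verify the bound before relying on it.
static inline uint16_t div5_approx(uint16_t x) {
  return (uint16_t)(((uint32_t)x * 205u) >> 10);
}
```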

SpenceKonde commented 3 years ago

Changes will be in the 1.3.7 DxCore release too. Platform.txt is updated.

SpenceKonde commented 2 years ago

This is in the 2.0.0-dev branch, which is now available for public testing.