How to make F() more efficient...?

SpenceKonde commented 4 years ago

With 2.1.0 - F() once again turns the text it wraps into a PSTR cast to a __FlashStringHelper - It made me sad.... but we needed the library compatibility.... Now, how can we get that performance back?

One thing that comes to my mind (since research shows that libraries do exist in the wild that have their own functions that call pgm_read_byte() on the characters of a __FlashStringHelper) would be a hacked version of avr/pgmspace.h that took advantage of the memory mapped flash - could we rely on our version being used? If so - is this the golden ticket? Would have the benefit of also making the life of anyone who was using PROGMEM variables better too...

MCUdude commented 4 years ago

Any thoughts on this? https://github.com/arduino/ArduinoCore-megaavr/pull/87

SpenceKonde commented 4 years ago

It is incompatible with existing libraries, see discussion in https://github.com/SpenceKonde/megaTinyCore/issues/216

MCUdude commented 4 years ago

Is it though? Because with this patch, I'm able to compile the MFRC522 examples that previously didn't compile in #216. And it makes sense too.

SpenceKonde commented 3 years ago

Lost track of this thread - and yeah, it is a problem. It works with the MFRC522, but it breaks with other code, where the library provides a method that takes a const __FlashStringHelper* and one that takes a const char* - this is an error because now there are two methods with the same signature.

I can't see a way short of copying the avrlibc pgmspace library and replacing the pgmread macros, which I think would still be less efficient :-/ I think before doing that, it would be good to get an idea of how much worse the performance/flash use is when we just stubbornly stick to the old way.... If we're chasing a tiny reward, it's not a good use of time...

Performance can be tested with just declaring a big array with and without PROGMEM, and a matching volatile buffer in ram, turn on TCB clocked off CLK_PER, copy every value in the array to the buffer, stop TCB and check the count..... that simulates the dominant use case, printing over serial. I'm curious about just how much slower it will be. Reading that assemby is beyond me, but the fact that it's not smart enough to use the lpm Z+ instruction doesn't leave me optimistic about the overall performance.

The flash overhead is only ~60 bytes (if you have any printing of __FlashStringHelpers), with no additional penalty using F() - so the flash usage does not justify additional development effort.

SpenceKonde commented 3 years ago

To clarify above waffling on whether it impacts DxCore.... This impacts DxCore for AVR32DA, AVR32DB, AVR32DD, AVR16DD with the same severity as the tinyAVR 0/1/2-series and megaAVR 0-series parts (as well as any future part where AVR_ARCH == 103). It does not impact the AVR64DA, AVR64DB, AVR64DD, (AVR_ARCH==102) AVR128DA, or AVR128DB (AVR_ARCH__ == 104) since only portions of the flash are mapped to the memory address space; if those are being utilized, they should be declared with the MAPPED_PROGMEM attribute (or on platforms other than DxCore, by defining a section starting at that address with a section_start directive passed to the linker coupled with a function attribute specifying that it is to go in the section so defined. If that is done, they're just treated like any variable and the function won't have a clue that it was actually pulling the data from flash instead of ram.

The one thing that's rather annoying here is that one could solve it totally transparently for all cases, except when library authors tried to get clever - in which case it seeems nearly insoluble.

SpenceKonde commented 3 years ago

The F() case: for (uint8_t i=0; i < 255; i++) { dest[i]=pgm_read_byte_near(&sourcePGM[i]); }

  63a: fc 01         movw  r30, r24
  63c: e4 58         subi  r30, 0x84 ; 132
  63e: ff 4f         sbci  r31, 0xFF ; 255
  640: e4 91         lpm r30, Z
  642: dc 01         movw  r26, r24
  644: ae 5f         subi  r26, 0xFE ; 254
  646: b7 4c         sbci  r27, 0xC7 ; 199
  648: ec 93         st  X, r30
  64a: 01 96         adiw  r24, 0x01 ; 1
  64c: 8f 3f         cpi r24, 0xFF ; 255
  64e: 91 05         cpc r25, r1
  650: a1 f7         brne  .-24

The memory-mapped global variable case: for (uint8_t i=0; i < 255; i++) { dest[i]=sourceRO[i]; }

  6c2: 21 91         ld  r18, Z+
  6c4: dc 01         movw  r26, r24
  6c6: ae 5f         subi  r26, 0xFE ; 254
  6c8: b7 4c         sbci  r27, 0xC7 ; 199
  6ca: 2c 93         st  X, r18
  6cc: 01 96         adiw  r24, 0x01 ; 1
  6ce: 8f 3f         cpi r24, 0xFF ; 255
  6d0: 91 05         cpc r25, r1 
  6d2: b9 f7         brne  .-18      ; 0

Runtimes were: PGM flash readss were 4008 clocks + overhead do 255 bytes. reads of memory mapped flash were 3320 clocks for same.

Screw it, conssidering the difficuty of the solutions, getting charachters into the serial buffer 20% faster? not worth it .... if it was 2:1? Okay then I'd be interested, but this is a piddling benefit considering what we were rtalking abougt doing!

SpenceKonde commented 3 years ago

olnh

SpenceKonde / megaTinyCore

How to make F() more efficient...? #243