Closed by bgort 7 years ago
This benchmark code is very helpful!
> There's a speed difference of roughly 2%, which I assume won't ultimately matter.
I'm not trying to nit-pick... But there was a 2% speed difference in your very abusive test that was pounding on the PROGMEM array:
```cpp
const uint16_t tcycles = 10000;
const uint8_t  vars    = 16;

for (x = 0, t = 0; x < tcycles; x++)
  for (y = 0; y < vars; y++)
    t += test_sram_floats[y];
```
In the ubl_movement.cpp file we will be accessing 8 or 10 mesh cell boundary numbers for a movement that crosses mesh cells. That 2% number is going to get heavily diluted with all of the other logic and calculations. It would be interesting to look at the disassembly of what the compiler produced because then we could say something like "an access to a PROGMEM float costs 15 us and the access to a RAM based float costs 8 us." But obviously... we have enough information to continue the thought process.
I'm okay with using PROGMEM, I believe. Messes up the code a bit, but no big deal, really, to save 120 bytes.
One idea would be to pre-initialize 10 elements in each array, and then do an #if to check the size of GRID_MAX_POINTS_X and if it was more than 10 fill out up to 15. That way... We wouldn't have line after line of nested #if's. And we would not be wasting that much program memory. Not many people run a mesh smaller than 8x8. And even if they did... we wouldn't be eating that much program memory. And my guess is there are not going to be that many people that run a 13x13 mesh. If a user goes over 10x10, my guess is they usually will go to 15x15. But still... even in the 11x11 case, this approach would only waste 8 floats in program memory.
Your thoughts?
> I'm not trying to nit-pick... But there was a 2% speed difference in your very abusive test that was pounding on the PROGMEM array:
It's not nitpicking! It was meant to be an abusive-but-indicative-of-relative-access-times test.
> It would be interesting to look at the disassembly of what the compiler produced because then we could say something like "an access to a PROGMEM float costs 15 us and the access to a RAM based float costs 8 us." But obviously... we have enough information to continue the thought process.
I was considering doing that until I saw it was 2%, which seems low enough not to matter (to me). Looking at the code, we should consider reusing some of whatever we retrieve from PROGMEM by storing it in a temporary float — there are a few places where the same memory location is accessed more than once. We can rough-profile that pretty quickly by changing the sketch above; it might end up being unnecessary, or even slower.
Anyway, it's not too hard to look at the disassembly, compare instructions, add up clock cycles, etc. I'll see about pulling that out of Atmel Studio in a few minutes; looks like it allows you to do it live, even.
My Atmel-ICE just arrived early, so want to play with it a bit first.
> One idea would be to pre-initialize 10 elements in each array, and then do an #if to check the size of GRID_MAX_POINTS_X and if it was more than 10 fill out up to 15. That way... We wouldn't have line after line of nested #if's. And we would not be wasting that much program memory. Not many people run a mesh smaller than 8x8. And even if they did... we wouldn't be eating that much program memory. And my guess is there are not going to be that many people that run a 13x13 mesh. If a user goes over 10x10, my guess is they usually will go to 15x15. But still... even in the 11x11 case, this approach would only waste 8 floats in program memory.
I say we just do all 15; it's cleaner than using #ifs and there's really no practical shortage of flash on the 2560. I mean, we're talking about 120 bytes, right?
> I say we just do all 15; it's cleaner than using #ifs and there's really no practical shortage of flash on the 2560.
OK!
> I mean, we're talking about 120 bytes, right?
Probably the smallest mesh size that makes any sense is 3x3. So that would be a waste of 2x(15-3)x4 = 96 bytes of program memory. That just doesn't matter if it helps keep the code clean and easy to understand.
> Probably the smallest mesh size that makes any sense is 3x3. So that would be a waste of 2x(15-3)x4 = 96 bytes of program memory. That just doesn't matter if it helps keep the code clean and easy to understand.
Agreed.
More nit-picking! :) That 96-byte number is wrong. We would be collapsing this constructor:
```cpp
unified_bed_leveling::unified_bed_leveling() {
  ubl_cnt++;
  for (uint8_t i = 0; i < COUNT(mesh_index_to_xpos); i++)
    mesh_index_to_xpos[i] = UBL_MESH_MIN_X + i * (MESH_X_DIST);
  for (uint8_t i = 0; i < COUNT(mesh_index_to_ypos); i++)
    mesh_index_to_ypos[i] = UBL_MESH_MIN_Y + i * (MESH_Y_DIST);
  reset();
}
```
so that it no longer has the two loops with their calculations and assignments. We probably get more than half of that 96 bytes back from doing that! :)
Good point! Hadn't considered that... though if we're going to, we should probably also consider whatever size increase we get from using the PROGMEM helpers/functions/however it's done, too. Maybe we can do a before/after comparison when we're finished.
Inside a proper Arduino sketch (instead of my bastard half-AVR, half-Arduino program):
```
Starting up...
Testing 10000 cycles.
Final: 25472, 25472
Finished 10000 cycles: SRAM took 2767 millis; PROGMEM took 2796 millis; difference of 29 millis (1.05%)
Final: 25472, 25472
Finished 10000 cycles: SRAM took 2767 millis; PROGMEM took 2798 millis; difference of 31 millis (1.12%)
Final: 25472, 25472
Finished 10000 cycles: SRAM took 2767 millis; PROGMEM took 2797 millis; difference of 30 millis (1.08%)
```
No idea why the difference would be less, but it is. Same code.
Couldn't get debugging to work properly outside an Arduino sketch.
Yeah -- the default Atmel Studio 7 (serial) 'debugger' is not very good at all. It's about worthless, in fact.
Here's what they each look like:
```
t = test_sram_floats[1];
  ea: 80 91 04 01   lds  r24, 0x0104   ; 0x800104 <__data_start+0x4>  ##2
  ee: 90 91 05 01   lds  r25, 0x0105   ; 0x800105 <__data_start+0x5>  ##2
  f2: a0 91 06 01   lds  r26, 0x0106   ; 0x800106 <__data_start+0x6>  ##2
  f6: b0 91 07 01   lds  r27, 0x0107   ; 0x800107 <__data_start+0x7>  ##2
  fa: 89 83         std  Y+1, r24      ; 0x01                         ##2
  fc: 9a 83         std  Y+2, r25      ; 0x02                         ##2
  fe: ab 83         std  Y+3, r26      ; 0x03                         ##2
 100: bc 83         std  Y+4, r27      ; 0x04                         ##2

### 16 cycles ###
```
```
t = pgm_read_float(&test_pmem_floats[1]);
 102: 8c e6         ldi  r24, 0x6C     ; 108   ##1
 104: 90 e0         ldi  r25, 0x00     ; 0     ##1
 106: 9e 83         std  Y+6, r25      ; 0x06  ##2
 108: 8d 83         std  Y+5, r24      ; 0x05  ##2
 10a: 8d 81         ldd  r24, Y+5      ; 0x05  ##2
 10c: 9e 81         ldd  r25, Y+6      ; 0x06  ##2
 10e: 9c 01         movw r18, r24              ##1
 110: f9 01         movw r30, r18              ##1
 112: 85 91         lpm  r24, Z+               ##3
 114: 95 91         lpm  r25, Z+               ##3
 116: a5 91         lpm  r26, Z+               ##3
 118: b4 91         lpm  r27, Z                ##3
 11a: 9f 01         movw r18, r30              ##1
 11c: 8f 83         std  Y+7, r24      ; 0x07  ##2
 11e: 98 87         std  Y+8, r25      ; 0x08  ##2
 120: a9 87         std  Y+9, r26      ; 0x09  ##2
 122: ba 87         std  Y+10, r27     ; 0x0a  ##2
 124: 3e 83         std  Y+6, r19      ; 0x06  ##2
 126: 2d 83         std  Y+5, r18      ; 0x05  ##2
 128: 8f 81         ldd  r24, Y+7      ; 0x07  ##2
 12a: 98 85         ldd  r25, Y+8      ; 0x08  ##2
 12c: a9 85         ldd  r26, Y+9      ; 0x09  ##2
 12e: ba 85         ldd  r27, Y+10     ; 0x0a  ##2
 130: 89 83         std  Y+1, r24      ; 0x01  ##2
 132: 9a 83         std  Y+2, r25      ; 0x02  ##2
 134: ab 83         std  Y+3, r26      ; 0x03  ##2
 136: bc 83         std  Y+4, r27      ; 0x04  ##2

### 53 cycles ###
```
Obviously a lot more instructions, and more cycles.
To sort out the instructions: http://www.atmel.com/images/Atmel-0856-AVR-Instruction-Set-Manual.pdf
Cycle counts per instruction (from the instruction set manual) are annotated in the listings above as ##N.
So in our contrived and probably-non-real-world example, we have 16 cycles to access SRAM vs. 53 cycles to access the same value stored in PROGMEM. That works out to 1.0 µs (0.000001 s) for the SRAM access vs. 3.3125 µs (0.0000033125 s) for PROGMEM, i.e. the PROGMEM access takes 3.3125× as long.
That was a fun exercise; haven't done anything like that in probably 20+ years.
So at 16 MHz... that is a difference of (53 − 16) = 37 cycles per access. Assuming 10 accesses per mesh cell (which is probably too high):
(53 − 16) × 10 / 16 MHz = 0.000023125 seconds of penalty per mesh cell for putting this in PROGMEM. That is tolerable! Nobody is going to notice that.
Yep. I just updated my last post with the timings and difference.
Yeah, I think it's tolerable...
I don't see how an LDI can be one cycle. There is the opcode byte that gets fetched, and then the data gets fetched. So how is that done in one cycle? The AVR chip isn't pipelined, is it?
What happened to #6420 ??? It was ready to merge... and now it can't be merged.
I just verified in the manual - 1 cycle. Not sure how it works.
> What happened to #6420 ??? It was ready to merge... and now it can't be merged.
@thinkyhead decided to make some cosmetic-but-PR6420-breaking changes to configuration_store.cpp. I resolved the conflict using the github.com interface and it seems to be okay on this end, now. Is it not okay after refreshing on your end?
I'm going to start on this PROGMEM thing unless you already have, or you want to do it? Shouldn't take long at all.
> I'm going to start on this PROGMEM thing unless you already have, or you want to do it? Shouldn't take long at all.
Please do! I'm fighting matrix math stuff and how to rotate coordinate systems right now. And will be for a number of days.
Sounds like fun!
If you can... It might make sense to start with a pristine, current copy of RCBugFix. And when you have the PROGMEM changes checked out and verified working... If that is the only thing in the Pull Request we can hurry up and merge it before any conflicts appear.
That's what I did -- started with current RCBugFix. Almost finished; headed to testing in a few minutes.
> Almost finished -- headed to testing in a few minutes.
G29 W is a very simple way to verify things are as you expect. That will print the X & Y mesh line locations.
I'll do that and also print something small.
The 15 mesh point bug has struck again! Had to go back to 11x11.
Otherwise, everything with the progmem changes is working fine. G29 W is fine, and printing a small thing with no sign of trouble now. PR on the way.
Closing -- no longer needed.
Going to move on to the bltouch-repeatability-EMI-pause-resume issue, I think.
@bgort Do you have a Delta printer sitting around????
No, but I've been wanting one; they look cool, but most of what I've seen out there looks like junk. What do you recommend?
I don't really know much about Delta printer models... I have a very low quality Geeetech Delta that I recommend you stay clear of! The reason I was asking is it would be really nice if UBL worked on Deltas for the Golden Master release. That kind of seemed like something you would enjoy messing with. And I was thinking I could join in in about 3 or 4 days after getting my current stuff done and synced up.
What about BL-Touch probes? Do you have a machine with one? If you don't have one and have a way to put one on a machine, I can send you one.
> What about BL-Touch probes? Do you have a machine with one? If you don't have one and have a way to put one on a machine, I can send you one.
I do have one - yes. But thank you very much.
> I don't really know much about Delta printer models... I have a very low quality Geeetech Delta that I recommend you stay clear of! The reason I was asking is it would be really nice if UBL worked on Deltas for the Golden Master release. That kind of seemed like something you would enjoy messing with. And I was thinking I could join in in about 3 or 4 days after getting my current stuff done and synced up.
Yeah, I've looked at quite a few deltas and just haven't been excited by anything (other than the concept). Then again, I'm not excited by many of the cartesians either, at this point.
@bgort @Roxy-3D
Question: are we going to write to PROGMEM every time we execute G29?
My concern is wear-out of that memory... please comment after reading https://electronics.stackexchange.com/questions/67038/what-are-the-implications-of-using-progmem...
Thanks in advance.
No, we write only when flashing in this case. No need to worry about extra write cycles.
It is a non-issue. It is stored as part of the program. These data arrays used to be part of the firmware's data set that was generated in RAM at startup. (When the UBL object was instantiated, the constructor filled in the SRAM values with correct numbers.) Now... The values are pre-calculated by the compiler and are stored as part of the Flash image. Just as the compiler predetermines what code is needed for this or that function, we have the compiler doing the same thing for this data array that describes the floating point values of the mesh lines.
If you update the firmware 10,000 times, then you can start to worry. But even so... It isn't this change that would be causing you to wear out the Flash memory.
@bgort @Roxy-3D OK, thanks, I missed the fact that those are pre-calculated values that will not change between code compilation.
BTW, sorry for asking... I do claim that the stupid question is the one unasked... I do realize that some questions may upset those in the know... so... Thanks for answering with kindness.
No worries at all! Happy to help.
Continuation of discussion started here: https://github.com/MarlinFirmware/Marlin/pull/6420
> @Roxy-3D: I'm okay with using PROGMEM, I believe. Messes up the code a bit, but no big deal, really, to save 120 bytes.
Couldn't find an access speed comparison of SRAM vs. PROGMEM anywhere, so I whipped up a quick-and-dirty test this morning:
Gives us:
There's a speed difference of roughly 2%, which I assume won't ultimately matter.