Closed by bgort 7 years ago
This benchmark code is very helpful!
> There's a speed difference of roughly 2%, which I assume won't ultimately matter.
I'm not trying to nit-pick... But there was a 2% speed difference in your very abusive test that was pounding on the PROGMEM array:
```cpp
const uint16_t tcycles = 10000;
const uint8_t  vars    = 16;

for (x = 0, t = 0; x < tcycles; x++)
  for (y = 0; y < vars; y++)
    t += test_sram_floats[y];
```
In the ubl_movement.cpp file we will be accessing 8 or 10 mesh cell boundary numbers for a movement that crosses mesh cells. That 2% number is going to get heavily diluted with all of the other logic and calculations. It would be interesting to look at the disassembly of what the compiler produced because then we could say something like "an access to a PROGMEM float costs 15 us and the access to a RAM based float costs 8 us." But obviously... we have enough information to continue the thought process.
I'm okay with using PROGMEM, I believe. Messes up the code a bit, but no big deal, really, to save 120 bytes.
One idea would be to pre-initialize 10 elements in each array, and then do an #if to check the size of GRID_MAX_POINTS_X and if it was more than 10 fill out up to 15. That way... We wouldn't have line after line of nested #if's. And we would not be wasting that much program memory. Not many people run a mesh smaller than 8x8. And even if they did... we wouldn't be eating that much program memory. And my guess is there are not going to be that many people that run a 13x13 mesh. If a user goes over 10x10, my guess is they usually will go to 15x15. But still... even in the 11x11 case, this approach would only waste 8 floats in program memory.
Your thoughts?
> I'm not trying to nit-pick... But there was a 2% speed difference in your very abusive test that was pounding on the PROGMEM array:
It's not nitpicking! It was meant to be an abusive-but-indicative-of-relative-access-times test.
> It would be interesting to look at the disassembly of what the compiler produced because then we could say something like "an access to a PROGMEM float costs 15 us and the access to a RAM based float costs 8 us." But obviously... we have enough information to continue the thought process.
I was considering doing that until I saw it was 2%, which seems low enough not to matter (to me). Looking at the code, we should consider reusing some of whatever we retrieve from PROGMEM by storing it in a temporary float — there are a few places where the same memory location is accessed more than once. We can rough-profile that pretty quickly by changing the sketch above; it might end up being unnecessary, or even slower.
Anyway, it's not too hard to look at the disassembly, compare instructions, add up clock cycles, etc. I'll see about pulling that out of Atmel Studio in a few minutes; looks like it allows you to do it live, even.
My Atmel-ICE just arrived early, so want to play with it a bit first.
> One idea would be to pre-initialize 10 elements in each array, and then do an #if to check the size of GRID_MAX_POINTS_X and if it was more than 10 fill out up to 15. That way... We wouldn't have line after line of nested #if's. And we would not be wasting that much program memory. Not many people run a mesh smaller than 8x8. And even if they did... we wouldn't be eating that much program memory. And my guess is there are not going to be that many people that run a 13x13 mesh. If a user goes over 10x10, my guess is they usually will go to 15x15. But still... even in the 11x11 case, this approach would only waste 8 floats in program memory.
I say we just do all 15; it's cleaner than using #ifs and there's really no practical shortage of flash on the 2560. I mean, we're talking about 120 bytes, right?
> I say we just do all 15; it's cleaner than using #ifs and there's really no practical shortage of flash on the 2560.
OK!
> I mean, we're talking about 120 bytes, right?
Probably the smallest mesh size that makes any sense is 3x3. So that would be a waste of 2x(15-3)x4 = 96 bytes of program memory. That just doesn't matter if it helps keep the code clean and easy to understand.
> Probably the smallest mesh size that makes any sense is 3x3. So that would be a waste of 2x(15-3)x4 = 96 bytes of program memory. That just doesn't matter if it helps keep the code clean and easy to understand.
Agreed.
More nit-picking! :) That 96-byte number is wrong. We would be collapsing this constructor:
```cpp
unified_bed_leveling::unified_bed_leveling() {
  ubl_cnt++;
  for (uint8_t i = 0; i < COUNT(mesh_index_to_xpos); i++)
    mesh_index_to_xpos[i] = UBL_MESH_MIN_X + i * (MESH_X_DIST);
  for (uint8_t i = 0; i < COUNT(mesh_index_to_ypos); i++)
    mesh_index_to_ypos[i] = UBL_MESH_MIN_Y + i * (MESH_Y_DIST);
  reset();
}
```
so that it no longer has the two loops with their calculations and assignments. We probably get more than half of that 96 bytes back from doing that! :)
Good point! Hadn't considered that... though if we're going to, we should probably also consider whatever size increase we get from using the PROGMEM helpers/functions/however it's done, too. Maybe we can do a before/after comparison when we're finished.
Inside a proper Arduino sketch (instead of my bastard half-AVR, half-Arduino program):
```
Starting up...
Testing 10000 cycles.
Final: 25472, 25472
Finished 10000 cycles: SRAM took 2767 millis; PROGMEM took 2796 millis; difference of 29 millis (1.05%)
Final: 25472, 25472
Finished 10000 cycles: SRAM took 2767 millis; PROGMEM took 2798 millis; difference of 31 millis (1.12%)
Final: 25472, 25472
Finished 10000 cycles: SRAM took 2767 millis; PROGMEM took 2797 millis; difference of 30 millis (1.08%)
```
No idea why the difference would be less, but it is. Same code.
Couldn't get debugging to work properly outside an Arduino sketch.
Yeah -- the default Atmel Studio 7 (serial) 'debugger' is not very good at all. It's about worthless, in fact.
Here's what they each look like:
```
t = test_sram_floats[1];
  ea: 80 91 04 01   lds  r24, 0x0104   ; 0x800104 <__data_start+0x4>  ##2
  ee: 90 91 05 01   lds  r25, 0x0105   ; 0x800105 <__data_start+0x5>  ##2
  f2: a0 91 06 01   lds  r26, 0x0106   ; 0x800106 <__data_start+0x6>  ##2
  f6: b0 91 07 01   lds  r27, 0x0107   ; 0x800107 <__data_start+0x7>  ##2
  fa: 89 83         std  Y+1, r24      ; 0x01                         ##2
  fc: 9a 83         std  Y+2, r25      ; 0x02                         ##2
  fe: ab 83         std  Y+3, r26      ; 0x03                         ##2
 100: bc 83         std  Y+4, r27      ; 0x04                         ##2

### 16 cycles ###
```
```
t = pgm_read_float(&test_pmem_floats[1]);
 102: 8c e6         ldi  r24, 0x6C     ; 108   ##1
 104: 90 e0         ldi  r25, 0x00     ; 0     ##1
 106: 9e 83         std  Y+6, r25      ; 0x06  ##2
 108: 8d 83         std  Y+5, r24      ; 0x05  ##2
 10a: 8d 81         ldd  r24, Y+5      ; 0x05  ##2
 10c: 9e 81         ldd  r25, Y+6      ; 0x06  ##2
 10e: 9c 01         movw r18, r24              ##1
 110: f9 01         movw r30, r18              ##1
 112: 85 91         lpm  r24, Z+               ##3
 114: 95 91         lpm  r25, Z+               ##3
 116: a5 91         lpm  r26, Z+               ##3
 118: b4 91         lpm  r27, Z                ##3
 11a: 9f 01         movw r18, r30              ##1
 11c: 8f 83         std  Y+7, r24      ; 0x07  ##2
 11e: 98 87         std  Y+8, r25      ; 0x08  ##2
 120: a9 87         std  Y+9, r26      ; 0x09  ##2
 122: ba 87         std  Y+10, r27     ; 0x0a  ##2
 124: 3e 83         std  Y+6, r19      ; 0x06  ##2
 126: 2d 83         std  Y+5, r18      ; 0x05  ##2
 128: 8f 81         ldd  r24, Y+7      ; 0x07  ##2
 12a: 98 85         ldd  r25, Y+8      ; 0x08  ##2
 12c: a9 85         ldd  r26, Y+9      ; 0x09  ##2
 12e: ba 85         ldd  r27, Y+10     ; 0x0a  ##2
 130: 89 83         std  Y+1, r24      ; 0x01  ##2
 132: 9a 83         std  Y+2, r25      ; 0x02  ##2
 134: ab 83         std  Y+3, r26      ; 0x03  ##2
 136: bc 83         std  Y+4, r27      ; 0x04  ##2

### 53 cycles ###
```
Obviously a lot more instructions, and more cycles.
To sort out the instructions: http://www.atmel.com/images/Atmel-0856-AVR-Instruction-Set-Manual.pdf
Cycle counts per instruction (from the instruction set manual) are annotated in the listings above as ##N.
So in our contrived and probably-non-real-world example, we have 16 cycles to access SRAM vs. 53 cycles to access the same value stored in PROGMEM. That works out to 1.0 µs (0.000001 s) for the SRAM access vs. 3.3125 µs (0.0000033125 s) for PROGMEM, i.e. the PROGMEM access takes 3.3125× as long.
That was a fun exercise; haven't done anything like that in probably 20+ years.
So at 16 MHz... that is a difference of (53 − 16) = 37 cycles per access. Assuming 10 accesses per mesh cell (which is probably too high):
(53 − 16) × 10 / 16 MHz = 0.000023125 seconds of penalty per mesh cell for putting this in PROGMEM. That is tolerable! Nobody is going to notice that.
Yep. I just updated my last post with the timings and difference.
Yeah, I think it's tolerable...
I don't see how an LDI can be one cycle. There is the opcode byte that gets fetched, and then the data gets fetched. So how is that done in one cycle? The AVR chip isn't pipelined, is it?
What happened to #6420 ??? It was ready to merge... and now it can't be merged.
I just verified in the manual - 1 cycle. Not sure how it works.
> What happened to #6420 ??? It was ready to merge... and now it can't be merged.
@thinkyhead decided to make some cosmetic-but-PR6420-breaking changes to configuration_store.cpp. I resolved the conflict using the github.com interface and it seems to be okay on this end, now. Is it not okay after refreshing on your end?
I'm going to start on this PROGMEM thing unless you already have, or you want to do it? Shouldn't take long at all.
> I'm going to start on this PROGMEM thing unless you already have, or you want to do it? Shouldn't take long at all.
Please do! I'm fighting matrix math stuff and how to rotate coordinate systems right now. And will be for a number of days.
Sounds like fun!
If you can... It might make sense to start with a pristine, current copy of RCBugFix. And when you have the PROGMEM changes checked out and verified working... If that is the only thing in the Pull Request we can hurry up and merge it before any conflicts appear.
That's what I did -- started with current RCBugFix. Almost finished; headed to testing in a few minutes.
> Almost finished -- headed to testing in a few minutes.
G29 W is a very simple way to verify things are as you expect. That will print the X & Y mesh line locations.
I'll do that and also print something small.
The 15 mesh point bug has struck again! Had to go back to 11x11.
Otherwise, everything with the progmem changes is working fine. G29 W is fine, and printing a small thing with no sign of trouble now. PR on the way.
Closing -- no longer needed.
Going to move on to the bltouch-repeatability-EMI-pause-resume issue, I think.
@bgort Do you have a Delta printer sitting around????
No, but I've been wanting one; they look cool, but most of what I've seen out there looks like junk. What do you recommend?
I don't really know much about Delta printer models... I have a very low quality Geeetech Delta that I recommend you stay clear of! The reason I was asking is it would be really nice if UBL worked on Deltas for the Golden Master release. That kind of seemed like something you would enjoy messing with. And I was thinking I could join in in about 3 or 4 days after getting my current stuff done and synced up.
What about BL-Touch probes? Do you have a machine with one? If you don't have one and have a way to put one on a machine, I can send you one.
> What about BL-Touch probes? Do you have a machine with one? If you don't have one and have a way to put one on a machine, I can send you one.
I do have one - yes. But thank you very much.
> I don't really know much about Delta printer models... I have a very low quality Geeetech Delta that I recommend you stay clear of! The reason I was asking is it would be really nice if UBL worked on Deltas for the Golden Master release. That kind of seemed like something you would enjoy messing with. And I was thinking I could join in in about 3 or 4 days after getting my current stuff done and synced up.
Yeah, I've looked at quite a few deltas and just haven't been excited by anything (other than the concept). Then again, I'm not excited by many of the cartesians either, at this point.
@bgort @Roxy-3D
Question: are we going to write to PROGMEM every time we execute G29?
My concern is wear-out of that memory... please comment after reading https://electronics.stackexchange.com/questions/67038/what-are-the-implications-of-using-progmem...
Thanks in advance.
No, we write only when flashing in this case. No need to worry about extra write cycles.
It is a non-issue. It is stored as part of the program. These data arrays used to be part of the firmware's data set that was generated in RAM at startup. (When the UBL object was instantiated, the constructor filled in the SRAM values with correct numbers.) Now... The values are pre-calculated by the compiler and are stored as part of the Flash image. Just as the compiler predetermines what code is needed for this or that function, we have the compiler doing the same thing for this data array that describes the floating point values of the mesh lines.
If you update the firmware 10,000 times, then you can start to worry. But even so... It isn't this change that would be causing you to wear out the Flash memory.
@bgort @Roxy-3D OK, thanks, I missed the fact that those are pre-calculated values that will not change between code compilation.
BTW, sorry for asking... I do claim that the stupid question is the one unasked... I do realize that some questions may upset those in the know... so... Thanks for answering with kindness.
No worries at all! Happy to help.
Continuation of discussion started here: https://github.com/MarlinFirmware/Marlin/pull/6420
> @Roxy-3D: I'm okay with using PROGMEM, I believe. Messes up the code a bit, but no big deal, really, to save 120 bytes.
Couldn't find an access speed comparison of SRAM vs. PROGMEM anywhere, so I whipped up a quick-and-dirty test this morning:
Gives us:
There's a speed difference of roughly 2%, which I assume won't ultimately matter.