code clean up and some optimisations

Tamaren1 / Freevalve_Arduino

Codebase for the Arduino Freevalve Project- found here: www.youtube.com/c/wesleykagan

94 stars 26 forks source link

code clean up and some optimisations #1

Open jrsall92 opened 3 years ago

jrsall92 commented 3 years ago

have a look in the comments as well, I think there's more that can be done to improve performance.

RedBeardErik commented 3 years ago

Can you explain the code logic the comments? How are you using time? Why not just base the open and close events on a straight pulse count based on the tooth wheel? Are you trying to relate valve duration to RPM at this point?

ricallinson commented 3 years ago

Other than my one minor question nice code cleanup dude! +1

valeriu-balaban commented 3 years ago

General advice, avoid using print inside an interrupt. Print statements take a considerable long time and in this case being on time is crucial for the success of the program. The workaround is to use a flag variable that you set inside the interrupt to indicate a change and in the loop function check for the flag, print, and reset. Also, all variables that are used both inside an interrupt and outside must be declared volatile to instruct the compiler no to cache their value.

void loop() {
   if(magnet_detected){
      Serial.print(degree)

      magnet_detected = 0;
   }
}

void magnet_detect() {
   hallcounter++;
   timeGap = millis() - timeGap;

   if (timeGap >= last_timeGap * 3 / 2) {
      hallcounter = 1;
      cycle = -1 * cycle; 
   }
   last_timeGap = timeGap;

   degree = hallcounter * 6;

   if (cycle > 0)
         digitalWrite(INTAKE_V,  degree <= 180); 
   else
         digitalWrite(EXHAUST_V, 180 <= degree);

  magnet_detected = 1;
}

Note, you can use a bool expression to set a pin value which makes the code more readable than nested ifs.

badger200 commented 3 years ago

@valeriu-balaban Wow my mind is 100% blown right now!🤯

I’m a high performance street car turbo engine tuner, and also somewhat of a computer programmer. This video randomly appeared on my YouTube recommended tonight and I was intrigued to see a GitHub and wanted to see what’s up and Aha! It’s buzzing with activity in the past 4 hours!

I just learned a few things from your improvements that were just the kind of thing I need for the iOS jailbreak programming I’m doing right now.

The idea of controlling my race engine’s valves with code instead of a custom camshaft grind like I have now, is truly next level! 👏🏼👏🏼👏🏼👏🏼

I’ve always posted my tuning results publicly and I always hated what I call “Secret Hoarders” - racers who won’t dare share any details for fear that if anyone learns their “secret”, then they’ll easily be defeated and won’t feel so special anymore.

So it’s a triple-whammy for me to have this video with the code operated valves, tuning an engine performance, and using open source code & discussion to do so!🏆

jrsall92 commented 3 years ago

Can you explain the code logic the comments? How are you using time? Why not just base the open and close events on a straight pulse count based on the tooth wheel? Are you trying to relate valve duration to RPM at this point?

Valid questions all of them. I didn't change the logic, I just gave it a clean up and mved thing around a bit. From what I can see the timing is depedent on the RPM. Time is used to decide wheter the sensor is above a tooth/magnet or not, if not it will take more time than before to register the next tooth/magnet, approx. 1.5*(time from tooth n-2 to n-1). The count already happens, it's the hallcounter. Hope these answer your questions.

General advice, avoid using print inside an interrupt. Print statements take a considerable long time and in this case being on time is crucial for the success of the program. The workaround is to use a flag variable that you set inside the interrupt to indicate a change and in the loop function check for the flag, print, and reset. Also, all variables that are used both inside an interrupt and outside must be declared volatile to instruct the compiler no to cache their value.
void loop() {
   if(magnet_detected){
      Serial.print(degree)

      magnet_detected = 0;
   }
}

void magnet_detect() {
   hallcounter++;
   timeGap = millis() - timeGap;

   if (timeGap >= last_timeGap * 3 / 2) {
      hallcounter = 1;
      cycle = -1 * cycle; 
   }
   last_timeGap = timeGap;

   degree = hallcounter * 6;

   if (cycle > 0)
         digitalWrite(INTAKE_V,  degree <= 180); 
   else
         digitalWrite(EXHAUST_V, 180 <= degree);

  magnet_detected = 1;
}
Note, you can use a bool expression to set a pin value which makes the code more readable than nested ifs.

I left the prints where they were since the OP is still testing the system and the prints can be very usefull, otherwise I agree with you on that abd I like the way you're using the comparison in the digital write. Although, since this is a time critical system, we should test which version would be faster.

Neywiny commented 3 years ago

tl;dr compiler no dummy In fact, doing (timeGap >= last_timeGap * 3 / 2) Is the opposite of an optimization. And here is why: The original guard compiles down to (with timeGap in r20-23):

 lds    r17, 0x013D ; 0x80013d <last_timeGap+0x1>
 lds    r18, 0x013E ; 0x80013e <last_timeGap+0x2>
 lds    r19, 0x013F ; 0x80013f <last_timeGap+0x3>
 movw   r26, r18
 movw   r24, r16
 lsr    r27
 ror    r26
 ror    r25
 ror    r24
 add    r24, r16
 adc    r25, r17
 adc    r26, r18
 adc    r27, r19
 cp r20, r24
 cpc    r21, r25
 cpc    r22, r26
 cpc    r23, r27
 brcs   .+46        ; 0x598 <_Z13magnet_detectv+0xc8>

Fairly lengthy, but that's because we're using 32 bit values on an 8 bit cpu. Here's what *3/2 does, with the 2 (!!) function calls to different multiplication functions put inline

 lds    r19, 0x013D ; 0x80013d <last_timeGap+0x1>
 lds    r20, 0x013E ; 0x80013e <last_timeGap+0x2>
 lds    r21, 0x013F ; 0x80013f <last_timeGap+0x3>
 ldi    r26, 0x03   ; 3
 ldi    r27, 0x00   ; 0
 call   0xa0c   ; 0xa0c <__muluhisi3>
   call 0xa2e   ; 0xa2e <__umulhisi3
     mul    r26, r18
     movw   r22, r0
     mul    r27, r19
     movw   r24, r0
     mul    r26, r19
     add    r23, r0
     adc    r24, r1
     eor    r1, r1
     adc    r25, r1
     mul    r27, r18
     add    r23, r0
     adc    r24, r1
     eor    r1, r1
     adc    r25, r1
     ret ; end of __umulhisi3
   mul  r26, r21
   add  r25, r0
   mul  r27, r20
   add  r25, r0
   mul  r26, r20
   add  r24, r0
   adc  r25, r1
   eor  r1, r1
   ret ; end of __muluhisi3
 lsr    r25
 ror    r24
 ror    r23
 ror    r22
 cp r12, r22
 cpc    r13, r23
 cpc    r14, r24
 cpc    r15, r25
 brcs   .+46        ; 0x594 <_Z13magnet_detectv+0xc4>

It's twice as many instructions and has two calls. Significantly better than floating point operations, but not better than the original. And remember, the call instruction takes 4 cycles, and the return takes 4. Original code takes 24 cycles, "optimized" takes 65 (assuming branch is taken, else 23 and 64). As in, ~2.7x as long to execute that segment.

As for the digitalWrites, the compiler still only does one call to write for each pin and needs to conditionally store a 0 or 1 in the argument anyway. So I get it, and honestly I prefer the single line version, but if you're going to be optimizing digitalWrite it'll have to be done with direct port manipulation.

The long and short of it is that the compiler knows what's up. Before sacrificing understanding from the author, maybe see if you need to. The rest of it I completely agree with. As it is the ISR is way too long, and Serial commands are blocking unless they've changed that up on me so it will take too long too. And millis() uses interrupts to keep track of time. So the longer we're in an interrupt, the higher the chance something is missed, so this is actually important.

ricallinson commented 3 years ago

@Neywiny - Great analysis. What's your opinion on using timeGap * 10 >= lastTimeGap * 15 insead?

Neywiny commented 3 years ago

Thank you. For that you basically, take the downsides of *3 / 2, and double them. It's doing 2 calls to __mul... meaning 8 calls/returns (I did check the assembly, that's not speculation/opinion). As in, if the AVR ISA manual is correct in putting each at 4 cycles, it spends more time just calling and returning than the original code did in total, let alone everything else it's doing.

If I'm counting correctly, it works out to 112 cycles. Which is 4.66x. I should write a tool for this... counting cycles by hand is not exactly easy to keep track of.

And the worst part is that this is all just for the if statement guard. I suppose the lesson here why I personally don't use the Uno for time sensitive applications. Even with hardware multiplication it's just too slow.

valeriu-balaban commented 3 years ago

Indeed a nice and detailed analysis. Didn't expect that, my intuition was that the compiler will optimize the two and produce the same byte code but I guess the Arduino does not enable any compiler optimizations.

In fact, doing (timeGap >= last_timeGap * 3 / 2) Is the opposite of an optimization.

The main advantage of Arduino is simplicity, a lot of resources, and makes the project accessible to a larger community. I think we can throw any code inside the interrupt as long as the interrupt is executed fast. Now, the question is how fast is fast and how to measure it. For an engine running at N RPM, the crankshaft does N / 60 rotations per second, but each rotation triggers 60 interrupts, thus the microcontroller must service N interrupts per second. Assuming an interrupt executes in 2000 clock cycles and Arduino runs at 16 MHz then the maximum number of interrupts is 16M / 2k = 8k interrupts (which is also maximum RPM). Moreover, once the engine reaches 8k RPM, no code inside the loop function will run as the Arduino will have no spare cycles. If we want to run the engine just up to 2k RPM then we will have 8k cycles available for the interrupt, same as about sending 5 characters over the serial at 115200 baud rate, and maybe that's the reason he could not rev the engine towards the end of the video.

@Neywiny made a good point, the interrupt code must be precisely tracked in terms of cycles. Also, the loop code I wrote above is also not good, there are not enough cycles in the Arduino to send data over serial after each interrupt. This complicates debugging but does not mean an Arduino can not control the engine valves.

Note, pushing the interrupt latency to 100% CPU utilization is not a good practice (it just made the math easier above) since interrupts might not be serviced on time or at all. Usually, the CPU utilization of all interrupts is kept below 20% to 50% depending on the application.

To measure the duration of the interrupt we can use the TCNT0 Timer 0 register which in Arduino is incremented every 64 cycles, info based on the micros() source code. Moreover, since TCNT0 is an 8-bit register and not a long int the byte code should end up pretty compact.

volatile uint8_t interrupt_duration = 0;
volatile uint8_t magnet_detected = 0;

void loop() {
   if(magnet_detected){
      magnet_detected = 0;

      if(interrupt_duration > 16){  // Interrupt longer than 1024 clock cycles
          Serial.print("!");
      }
   }
}

void magnet_detect() {
  uint8_t start_tcnt0 = TCNT0;

  // Interrupt code here

  interrupt_duration = TCNT0 - start_tcnt0;
  magnet_detected = 1;
}

Note, not sure if Timer 0 is initialized by default when the code does not call any timing related function.

Neywiny commented 3 years ago

Similar to your TCNT0, we can actually pull the least significant byte of the millis() result faster. This is because the millis() function has some steps for ensuring all 4 bytes are returned correctly, but if we don't care about that (which we don't) we can avoid a few instructions. (For those curious, it disables/re-enables interrupts and saves/restores status register). It's important to work in values that the ALU operates in. Using 8-bit values is crucial here. And for anyone asking, because this does take some getting used to, no matter what the original value was it can keep track of time as long as the difference is < 256. For example, if the last interrupt fired at 184 milliseconds, and we pull the least significant byte of millis() which gets 183, the math works out. 183-184 = -1, which is represented as 255. We can verify this using 16-bit numbers. 184+255=439, in which case the lower byte is 183. The engine would need to be spinning very slowly for us to overflow that. We need < 170 mS between interrupts for our *1.5 comparison, so it needs to be doing one revolution every ~10 seconds. Which is fine. By working in 8-bit values execution time is ~1/4 what it would be otherwise, and more things can stay in registers which is good for speed.

Likewise, doing things like multiplying by 6 just for readability is wasteful. I prefer to say something like degree_6 to remind myself there's something about a 6 in there. That's also what comments should be used for. I like to have things along the lines of // remember this is divided by 6. This means the constants for comparison are smaller, which is nice. On AVR that's not a big deal but some ISA's have instructions shorter than register widths.

What does this all mean? It means that I have gotten the code to ~51 cycles with optimizations and clean up. Total. Depending on what's needed for debugging in loop(), a lot of these values should get moved to be static inside magnet_detect. Cleans up the namespace, but no speed up.

const uint8_t hall = 2;
uint8_t degrees_6 = 0;
uint8_t hallstate = 0;
uint8_t lasthallstate = 0;
uint8_t triggerTime;
uint8_t last_triggerTime;
uint8_t timeGap;
uint8_t last_timeGap;
uint8_t state;
uint8_t cycle;
extern uint32_t timer0_millis; // need to tell compiler it'll find this variable later

void setup() {
  pinMode(hall, INPUT);
  Serial.begin(115200);
  pinMode(13, OUTPUT);    // 13 is the pin the exhaust valve is connected to
  pinMode(12, OUTPUT);    // 12 is the pin the input valve is connected to
  attachInterrupt(0, magnet_detect, RISING);//Initialize the intterrupt pin digital pin 2
}

void loop()
{
  //leave this here just to be extra extra super duper sure we'll still have millis() running
  Serial.println(millis());
  delay(255);
}

void magnet_detect() {
  degrees_6++;
  triggerTime = timer0_millis;// note you need a "extern uint32_t timer0_millis;" somewhere
  timeGap = triggerTime - last_triggerTime;
  last_triggerTime = triggerTime;
  if (timeGap >= last_timeGap + last_timeGap / 2) {
    degrees_6 = 0; // changing this from 0 won't be the end of the world
    cycle += 1;// ++, +=, who cares
  }
  last_timeGap = timeGap;

  if (cycle % 2 == 0)
    if (degrees_6 <= 30) { // adding lower bounds will add a clock cycle or two
      PORTB |= 1 << 5;
    } else {
      PORTB &= ~(1 << 5);
    }
  else {
    if (30 <= degrees_6) {
      PORTB |= 1 << 4;
    } else {
      PORTB &= ~(1 << 4);
    }
  }
}

If you count the cycles of each instruction its > 51, but remember we're only doing 1 degree_6 comparison. gpio setting, and return each time.

Where do we go from here? Once the final values are known, we can do some tricks to generate a 256 byte table, aligned to 256 bytes, indexed by an 8-bit register. The most significant bit being the cycle 0-1, the remaining being the degrees_6 as they're < 127. This collapses the entire if (cycle % 2 == 0) onwards as an LDI, MOV (if the offset is already here, no need), LD, OUT which is 3-4 cycles + flash access time, unless we have enough ram to keep it in RAM. This saves us ~6-7 cycles. If we need the flash, this might not be worth it? I'm unsure on the access times.

Now this is pod racing.

whitfijs-jw commented 3 years ago

Similar to your TCNT0, we can actually pull the least significant byte of the millis() result faster. This is because the millis() function has some steps for ensuring all 4 bytes are returned correctly, but if we don't care about that (which we don't) we can avoid a few instructions. (For those curious, it disables/re-enables interrupts and saves/restores status register). It's important to work in values that the ALU operates in. Using 8-bit values is crucial here. And for anyone asking, because this does take some getting used to, no matter what the original value was it can keep track of time as long as the difference is < 256. For example, if the last interrupt fired at 184 milliseconds, and we pull the least significant byte of millis() which gets 183, the math works out. 183-184 = -1, which is represented as 255. We can verify this using 16-bit numbers. 184+255=439, in which case the lower byte is 183. The engine would need to be spinning very slowly for us to overflow that. We need < 170 mS between interrupts for our *1.5 comparison, so it needs to be doing one revolution every ~10 seconds. Which is fine. By working in 8-bit values execution time is ~1/4 what it would be otherwise, and more things can stay in registers which is good for speed.

Likewise, doing things like multiplying by 6 just for readability is wasteful. I prefer to say something like degree_6 to remind myself there's something about a 6 in there. That's also what comments should be used for. I like to have things along the lines of // remember this is divided by 6. This means the constants for comparison are smaller, which is nice. On AVR that's not a big deal but some ISA's have instructions shorter than register widths.

What does this all mean? It means that I have gotten the code to ~51 cycles with optimizations and clean up. Total. Depending on what's needed for debugging in loop(), a lot of these values should get moved to be static inside magnet_detect. Cleans up the namespace, but no speed up.
const uint8_t hall = 2;
uint8_t degrees_6 = 0;
uint8_t hallstate = 0;
uint8_t lasthallstate = 0;
uint8_t triggerTime;
uint8_t last_triggerTime;
uint8_t timeGap;
uint8_t last_timeGap;
uint8_t state;
uint8_t cycle;
extern uint32_t timer0_millis; // need to tell compiler it'll find this variable later

void setup() {
  pinMode(hall, INPUT);
  Serial.begin(115200);
  pinMode(13, OUTPUT);    // 13 is the pin the exhaust valve is connected to
  pinMode(12, OUTPUT);    // 12 is the pin the input valve is connected to
  attachInterrupt(0, magnet_detect, RISING);//Initialize the intterrupt pin digital pin 2
}

void loop()
{
  //leave this here just to be extra extra super duper sure we'll still have millis() running
  Serial.println(millis());
  delay(255);
}

void magnet_detect() {
  degrees_6++;
  triggerTime = timer0_millis;// note you need a "extern uint32_t timer0_millis;" somewhere
  timeGap = triggerTime - last_triggerTime;
  last_triggerTime = triggerTime;
  if (timeGap >= last_timeGap + last_timeGap / 2) {
    degrees_6 = 0; // changing this from 0 won't be the end of the world
    cycle += 1;// ++, +=, who cares
  }
  last_timeGap = timeGap;

  if (cycle % 2 == 0)
    if (degrees_6 <= 30) { // adding lower bounds will add a clock cycle or two
      PORTB |= 1 << 5;
    } else {
      PORTB &= ~(1 << 5);
    }
  else {
    if (30 <= degrees_6) {
      PORTB |= 1 << 4;
    } else {
      PORTB &= ~(1 << 4);
    }
  }
}
If you count the cycles of each instruction its > 51, but remember we're only doing 1 degree_6 comparison. gpio setting, and return each time.

Where do we go from here? Once the final values are known, we can do some tricks to generate a 256 byte table, aligned to 256 bytes, indexed by an 8-bit register. The most significant bit being the cycle 0-1, the remaining being the degrees_6 as they're < 127. This collapses the entire if (cycle % 2 == 0) onwards as an LDI, MOV (if the offset is already here, no need), LD, OUT which is 3-4 cycles + flash access time, unless we have enough ram to keep it in RAM. This saves us ~6-7 cycles. If we need the flash, this might not be worth it? I'm unsure on the access times.

Now this is pod racing.

All well and good except the interrupt is occurring every 1ms with a 60 tooth wheel at 1000rpm, necessitating the use of a usec time base to span any useful rpm range. Solve the problem, then optimize.

Neywiny commented 3 years ago

Original code used millis(), so I used millis(). If we need micros(), we can do micros(). It's the same problem and optimization, just replace timer0_millis with TCNT0.

whitfijs-jw commented 3 years ago

Again, solve the problem, then focus on optimization. The unsigned 8 bit integers will overflow when using a usec time base. There is no need to focus on optimizing every cycle while experimenting. Logic will almost certainly have to be changed from this point going forward. Focusing on optimizing for speed in the middle of developing the logic will only make that more cumbersome.

Neywiny commented 3 years ago

My understanding from the video was that the logic is fine as the engine did start eventually.

As for overflow, we just up the prescaler to 256. Easy fix. At 10,000 RPM we'll see 100 uSec between interrupts regularly. 256 prescaler gives us 16uS of precision, allowing us to go to 4.096 mS, which is ~250 RPM. we could even switch from starting up to running on the fly with a dynamic prescale switch after a certain RPM from 1024 to 256, allowing us to start the engine at 60 RPM, which is well in range of regular cranking speed. Or, switch functions to use 16-bit values at the start then go down to 8-bit once we're running.

Also, this pull request is on optimization. If you don't want to be looking at optimizations, don't.

ricallinson commented 3 years ago

It's good to know the constraints of critical optimizations so you can keep them in mind when working on the logic. There are some key information points missing in the test code we have been looking at here;

There could be from 1-12 cylinders each with their own intake/exhaust mappings.
The timing wheel could have more or less teeth changing the degrees used.
RPM could be from ~400 to 10,000.

My guess would be the most common use is a v8 with an RPM range from 1000 to 8000.

Neywiny commented 3 years ago

That is true, there's a lot we don't know about the final goals of this project. As you point out, constraints are important and at the moment we don't have very many except "be fast and easy". Luckily, the optimizations still hold true whether we're using microseconds or milliseconds or anything else. We should be working in 8-bit values wherever we can and avoiding any unnecessary function calls. If the timing wheel has 120 teeth my table idea still holds out. It potentially could with switching the high byte as needed for up to 255 teeth but I'm unsure on the speed implications. I do think though that using timing to measure the start of the wheel is not the greatest solution, though it is the easiest for the hardware. It would be nice if the other single pin interrupt could be connected to another sensor with one pulse/revolution, that way none of the timing matters anyway.

As for number of cylinders, that's going to be a limitation on the chip and board. There are 20 pins for us on the R3 board, since we need 2 per cylinder anyway we can budget for my desired second sensor. That gives us 10 pairs to work with. Note that the Analog pins can also be used as digital pins. So max of 10 configurations of intake/exhaust open/closed. I'm not too much of a gear head, but I think often cylinders are run in a lockstep equivalent, so that should be fine. The nice thing about my table idea is that even if each of the 10 pairs has a different mapping for duration and phase and whatnot, you can just tabulate that and it'll take care of it, and to top it all off it'll be done in parallel just as a real camshaft would do it.

The more I think about it the more I like it. Some initial testing shows that compared to 6 calls to digitalWrite, indexing a table is ~15x faster, and no change when it's stored in flash or changing alignment.

ricallinson commented 3 years ago

A second pulse per rotation would solve all the timing issues for me. Going to add it to #3.