MarlinFirmware / Marlin

Marlin is an optimized firmware for RepRap 3D printers based on the Arduino platform. Many commercial 3D printers come with Marlin installed. Check with your vendor if you need source code for your specific machine.
https://marlinfw.org
GNU General Public License v3.0
16.34k stars 19.26k forks source link

Bug in PID integrator anti-windup #4881

Closed AndyZuerich closed 7 years ago

AndyZuerich commented 8 years ago

I am using a stepdown converter for my heated bed. There, I found the following issue: When I drive the bed with a temperature that is out of the range of the heater power, the PID loop starts to oscillate. I think nobody else can directly see the same issue, since typically the PWM frequency for the heated bed is much lower. My heated bed pwm frequency is 30 kHz. First, I encountered the problem due to audible noise (sound variation/periodical duty ratio changing) in my buck converter inductor.

I was now digging into the issue and found the reason:

      pid_error_bed = target_temperature_bed - current_temperature_bed;
      pTerm_bed = bedKp * pid_error_bed;
      temp_iState_bed += pid_error_bed;
      temp_iState_bed = constrain(temp_iState_bed, temp_iState_min_bed, temp_iState_max_bed);
      iTerm_bed = bedKi * temp_iState_bed;

      dTerm_bed = K2 * bedKd * (current_temperature_bed - temp_dState_bed) + K1 * dTerm_bed;
      temp_dState_bed = current_temperature_bed;

      pid_output = pTerm_bed - dTerm_bed + iTerm_bed;
      if (pid_output > MAX_BED_POWER) {
        if (pid_error_bed > 0) temp_iState_bed -= pid_error_bed; // conditional un-integration
        pid_output = MAX_BED_POWER;
      }

Here, the conditional un-integration seems to implement a simple anti-windup protection. But there is a bug:

temp_iState_bed += pid_error_bed;
temp_iState_bed = constrain(temp_iState_bed, temp_iState_min_bed, temp_iState_max_bed);

In case the temp_iState is constrained to temp_iState_max_bed (which is as default 255), then the anti windup is flawed, since the original value "pid_error_bed" is subtracted in the un-integration, instead of the constrained value.

This can be fixed easily by some small code changes.

Besides the heated bed, the same issue applies for hotend PID loops. I think the same issue can happen when heating up to a "normal" temperature. Then, at the initial heating phase the duty ratio is 1 for a while -> integrator saturates. When I remember right, I have seen some "unexplainable" temperature swings during heated bed or extruder warmup phases. There, the heating power was suddenly reduced halfway during the warmup phase. At that time, I explained this for myself with a PID instability. But now, I think this is related to the above mentioned bug. I will do some tests, and I will propose a code improvemend / pull request, soon.

thinkyhead commented 8 years ago

Thanks for looking into this. I don't know enough about the issue (or PID generally) to comment on it, but I wouldn't be surprised if the logic is off. It's a tricky concept. Is there a way we can test the code in simulation, I wonder? Might save some time and make the situation clearer.

Roxy-3D commented 8 years ago

In case the temp_iState is constrained to temp_iState_max_bed (which is as default 255), then the anti windup is flawed, since the original value "pid_error_bed" is subtracted in the un-integration, instead of the constrained value. This can be fixed easily by some small code changes. Besides the heated bed, the same issue applies for hotend PID loops.

Looking at the data types, it is strange that this is 255 as the max because all the computations are carried through as floats. It may be it is just trying to cap the PID algorithm because the final output needs to be in the 0 to 255 range???

I guess it makes sense to truncate or limit the integral term, but it would seem a sliding window of time would be a better approach than just turning off the integration term. @AndyZuerich Do you agree?

Rerouter commented 8 years ago

Made a fix for it with pull request #4903

AndyZuerich commented 8 years ago

@Rerouter: Thanks for your pull request #4903. I think that should fix it. I don' have time for testing within the next days, but I will reply/comment on it as soon as possible.

Looking at the data types, it is strange that this is 255 as the max because all the computations are carried through as floats. It may be it is just trying to cap the PID algorithm because the final output needs to be in the 0 to 255 range???

Yes, indeed, the 255 value and mixing with floats is a bit confusing. And I think the code is there quite unreadable. As I mentioned before, I am using the Arduino PWM with 30 kHz for a PWM output for my heated bed. When I changed the code for this, I found that all output values for the on/off PWM, as it is implemented at the moment, is done like so:

soft_pwm_bed = (bias - d) >> 1; or, for instance: soft_pwm_bed = bias = d = (MAX_BED_POWER) / 2; soft_pwm_bed = MAX_BED_POWER >> 1;

Now, some comments on this for generating some clarity:

There is an outer loop (8 Hz?) for actually implementing the heated bed PWM power. Then, there is the "soft_pwm_bed" variable (unsigned char, i.e. values from 0...255) for a faster PWM frequency (490 Hz), done with a arduino analogWrite. This is here used for limiting the maximum bed power. This might be handy for using a 12 V bed on 24 V supply voltage, i.e. the inner (soft_pwm_bed) pwm can be set accordingly via #define MAX_BED_POWER 255 // limits duty cycle to bed; 255=fullcurrent
in Configuration.h. Now, the actual output of soft_pwm_bed is the following:

Here, I see the following problems:

I think it would help a lot when we would change the output of soft_pwm_bed to be fully on with the value of 255 (and not 127). Then, all the complications here could be avoided and the code would me much more readable. At the moment, I don't know why 127 is the maximum of the analogWrite. Possibly, the maximum counter for generating the PWM in Marlin was modified. The pid control for the hotends shows the same thing:

// Check if temperature is within the correct range                                                                                                    
    soft_pwm[e] = (current_temperature[e] > minttemp[e] || is_preheating(e)) && current_temperature[e] < maxttemp[e] ? (int)pid_output >> 1 : 0;

The pid_output value is divided by 2 (>>1) before writing it to the soft_pwm array.

My impression is that this is some "historic" thing, and nobody did a good refactoring of the 255/127 issue. For instance, I found this discussion: [(http://forums.reprap.org/read.php?146,554826)] .

The next code smell is, that the bed pwm code is separated from the hotend pwm/temperature control. When you look at the changes that rerouter did in his pull request, you can see that everything is done twice. My impression is that all PID things should be done within the same code, and the bed PID should share its routines completely with the hotend PID. The best code is always the code that is not existing...

Now, for me, the current setting of the PWM is working. But the code is unnecessary ugly, here. I will possibly do a refactoring of the 127/255 issue in the future. However, I think this change should not be made in the release candidate version of Marlin, but it should be planned for the next release.

Wurstnase commented 8 years ago

Sometimes, the division by 2 is done explicitly, sometimes by right-shifting (>> 1). This is possibly an optimization, since dividing by 2 might be slower than right-shifting.

Never try to be more clever then the compiler. You aren't!

aljscott commented 8 years ago

When I remember right, I have seen some "unexplainable" temperature swings during heated bed or extruder warmup phases.

On RCBugFix - I think I pulled it last week. This probably explains why I'm seeing some random heater failures and over temps? The printer is shutting down usually a few layers into a dual extruder printer. I've not been there to observe what happens every time. Last night the 2nd extruder was heating up it got to the target temp then it just plummeted and I got a heater failed halt error. Frustrated, I tried another print.. switched to using a single extruder, this time the print was going well go half an hour in and I went to check it and found it had halted with a max temp exceeded error. I dunno if its relevant but it was always the 2nd extruder that failed - so I was actually thinking it might be a thermistor / wiring issue.

thinkyhead commented 8 years ago

I will possibly do a refactoring of the 127/255 issue in the future. However, I think this change should not be made in the release candidate version of Marlin, but it should be planned for the next release.

@AndyZuerich If the bug fix is pretty straightforward, I think it would be good to have in the queue, even if it doesn't get in until 1.1.1. Especially since it addresses a real issue. The focus leading up to 1.1.0-RC8 (any day now) has been on cleanup, cleanup, and more cleanup of longstanding cruft. These areas have indeed been neglected, and only get attention when somebody gets a bee in their bonnet. Only recently I got pulled into a SCARA project, and suddenly all the deficits of bed leveling and kinematics became glaring. I had to clean up a lot just to get to a good starting point to begin. Well, I'm sure you know how it is, especially with ad hoc code.

If nothing else, I hope we can keep the discussion lively and ongoing and keep the momentum. I enjoy seeing these cobwebs getting cleared out.

thinkyhead commented 8 years ago

Yes, indeed, the 255 value and mixing with floats is a bit confusing. And I think the code is there quite unreadable.

Perhaps it makes sense in this area to use fixed-point. Unless the use of float is somehow vital.

AndyZuerich commented 8 years ago

Ok. I suggest the following: I will try to test rerouters pull request within this week. As far as I can see, it should solve the anti-windup issue. However, I would suggest not to use a global array for the integrator overflow (as rerouter suggested), but simply a l local variable. This keeps things more readable and better maintainable. Since this change would only affect a few lines of code in the temperature.cpp file, it can be included in the 1.1.0-RC version. If not, the pull request can still be used by anybody who has the same issue and some coding skills. Then, the next step would be to refactor parts of the heatermanager (255/127 values, duplication, etc...). This is clearly not for the RC-version, but a change that improves the code on the long term. My background is in power electronics modeling, programming and simulation. Therefore, I think I am the right one to tidy things up in the marlin heatermanager, since I am quite familiar with PWM and PID.

I guess it makes sense to truncate or limit the integral term, but it would seem a sliding window of time would be a better approach than just turning off the integration term. @AndyZuerich Do you agree?

@Roxy-3D The integral anti-windup is typically introduced so that during a constant-on phase (when heating the hotend or heated bed from room temp), then the integrator does not integrate (wind-up) to a really huge value. If it would so, then, when the aimed temperature is reached, it would require quite some time to undo (down-integrate) the unnecessary large i-Term. This means, with a correct anti-windup, the PID controller shows much less (initial) overshoot, and can settle down to the correct temperature value much quicker than without anti-windup. From that point of view, I think the small bugfix should be sufficient for now. @Roxy-3D : I don't exactly know how you would implement this with a sliding window, or what the advantage would be. If you like, please send me a private email for some short explanation.

Rerouter commented 8 years ago

Hello again,

It turns out the root cause that made me investigate it was a loose thermistor, which was causing a weird 0.4Hz PWM effect of off an on in spikes, caused in part by the windup, the patch did originally reduce it away from spiking, but since fixing the thermistor i can no longer see the effect for myself, and i'm not too willing to deliberately loosen it again and risk a meltdown. technically an under or oversized heater should cause similar freak outs, without the patch,

The reason i used global variables was only to maintain the style of the code as it was. and is free to change.

Reading over the datasheet of the normal mega2560 micro, it uses either 8 bit or 16 bit PWM depending on the pin used, I would imagine 8 bit is more than enough for something that loses heat this quickly,

AndyZuerich commented 8 years ago

Hi,

ok, now the simplest solution for this issue:

pTerm_bed = bedKp * pid_error_bed;
temp_iState_bed += pid_error_bed;
iTerm_bed = bedKi * constrain(temp_iState_bed, temp_iState_min_bed, temp_iState_max_bed);

I think the constrain() simply does not make any sense here. There is no reason to constrain the integral value to some limit at this place. a) The integral is giving a large output value for the pid_output. Then, the anti-windup (which is implemented later on) would work totally correct, without the "constrain", i.e. iTerm_bed = bedKi * temp_iState_bed; would be enough to make the anti-windup code correct. Actually, the anti-windup then limits the maximum integral automatically to some maximum value. b) The integral is resulting with a negative value. Theoretically, this should never happen, since the p-Term alone should always result in an output temperature lower than the target temperature. The integral term then corrects the long-term offset to a correct value (positive) value pTerm + iTerm (both positive). And even if a negative integral value would be reality (e.g. by lowering the target temperature during operation from e.g. 100C to 50C, with heavily de-tuned ki, kp, kd), then the anti-windup for zero-values should limit the integral term to a small value.

The only explanation that I have for the constrain expression is, that in the past there was no anti-windup implemented at all. At that time, the constrain acted as simple anti-windup replacement (possibly the sliding window, that @Roxy-3D mentioned above). Then, someone added the anti-windup as it is today, but without removing the original constrain(), which then became useless and even introduced the bug that is discussed in this thread.

Rerouter commented 8 years ago

Ok, i can confirm strippling out the constraint from the original, (aswell as the min/max iState variables) works as well as the original code so far, it also saves 24 bytes of ram, and 72 Bytes of program space.

To test i set the PID functional range to 100, and it does indeed begin to ramp down before it reaches set point, indicating its not locked there by the iTerm,

Original pid response original

Patched pid response patched

Constraint Removed from original pid response unconstrained

thinkyhead commented 8 years ago

Very nice. Does removing the constrain obviate the need for #4903, or should that change also be included? Would you (@Rerouter or @AndyZuerich) like to contribute the suggested change?

Rerouter commented 8 years ago

It makes 4903 obsolete, i will knock it up in 9 hours.

thinkyhead commented 8 years ago

Sweet, thanks! This is a most welcome fix.

AndyZuerich commented 8 years ago

Hi,

I just tested to remove the constrain()... it works without problems. Thanks @Rerouter for doing the modification in your pull request - this saves me some time and effort. Then, please don't forget to remove the following define in configuration.h:

#define PID_INTEGRAL_DRIVE_MAX PID_MAX  //limit for the integral term

which is then also obsolete. The less variables we have in configuration.h to set for the normal/standard user, the better it is. The next thing that should be removed (but not now and not for the RC version) is

  #define PID_FUNCTIONAL_RANGE 50 // If the temperature difference between the target temperature and the actual temperature
// is more than PID_FUNCTIONAL_RANGE then the PID will be shut off and the heater will be set to min/max.

PID with anti-windup works for all ranges. There is no need for setting an arbitrary value, and use on/off outside of this interval. I have a deltaprintr-mini hotend. This is heating up so quick that the default value of the PID functional range made the PID loop basically useless, and I had to increase the functional range to 50. Other Marlin users possibly don't understand this and they will be frustrated when their settings of p, i and d constants have not the effect that they expect... so, there are many things to clean up here later on.

thinkyhead commented 8 years ago

I had to increase the functional range to 50

@AndyZuerich Does M303 PID tuning do anything to help mitigate the issue?

As I read about PID, it seems that automating its tuning should be possible. But apparently it's no substitute for having good starting Kp, Ki, Kd values in the first place.

Roxy-3D commented 8 years ago

Does M303 PID tuning do anything to help mitigate the issue?

Please allow me to jump in and maybe @AndyZuerich 's answer can be even more beneficial! The problem is the PID algorithm by itself can't go plus and minus enough to just let it do its thing. The problem is the 'correction' that the algorithm comes up with can't actually be done by the hardware. So the 'answer' is capped by the hardware. (In this case, by the 100% duty cycle of the 0v to 5v wave)

That is why there is a 'range' setting. This 'range' setting is a way of saying "The current temperature is so far off... We can't use the PID algorithm. We are better off just plain turning on the heater at full blast."

For me, the question isn't so much "Can the M303 PID Tuning do anything to help mitigate the issue." For me an additional question to @AndyZuerich is "How should we decide that we are close enough to turn on the PID Algorithm?" or "How do we know it is appropriate to be in PID mode?" And this is too much to hope for, but one more answer would be nice: "When we drop out of Bang Bang mode into PID mode, is there a way to preload the I & D terms so the 'right' thing happens?"

Rerouter commented 8 years ago

@Roxy-3D the PID effective range also has another positive, the cpu is not having to think while in this mode,

the issue it can face is on overpowered hot ends, where killing power when the sensor only has 10C to go will lead to a large overshoot,

Now that we have isolated out what can cause a PID issue, i would imagine it is safe to remove the effective range, or to wind it up to 25 to better cover these corner cases, (10 -15 would cover most hot ends, 50 would cover the grossly overpowered, so strike in the middle)

AndyZuerich commented 8 years ago

Hi,

the PID effective range also has another positive, the cpu is not having to think while in this mode,

I think this is not a real argument pro limiting the PID range. PID is within the functional range always then, when we need the CPU cycles, i.e. during printing. With the #define PID_FUNCTIONAL_RANGE we only save CPU cycles during the heatup phases. Also, I think with the 8 Hz sampling frequency, any computation done here should not hurt too much.

Now that we have isolated out what can cause a PID issue, i would imagine it is safe to remove the effective range, or to wind it up to 25 to better cover these corner cases, (10 -15 would cover most hot ends, 50 would cover the grossly overpowered, so strike in the middle)

I agree. I would remove it completely, since this simplifies the code. Well, there is one case where the range limit is actually doing something: Imagine a heavily de-tuned set of PID parameters. This could be instable and oscillate. With the functional range of e.g. 10 or 25 degrees, the system behaves as a rough bang-bang controller, where the PID range is then the hysteresis range for bang-bang. But: In this case, any user will see that something is wrong, and that he needs to adjust his PID parameters. And it makes it just more complicated that the PID range might also come into play: Instead of the expected 3 parameters (degrees of freedom), the user has to cope with 4 parameters to understand what is going on.

Please allow me to jump in and maybe @AndyZuerich 's answer can be even more beneficial! The problem is the PID algorithm by itself can't go plus and minus enough to just let it do its thing. The problem is the 'correction' that the algorithm comes up with can't actually be done by the hardware. So the 'answer' is capped by the hardware. (In this case, by the 100% duty cycle of the 0v to 5v wave)

This is correct. PID with integration suggests values that are sometimes < 0 for the duty ratio, and sometimes > 1, but obviously the hardware cannot do it, i.e. 0<d<1 must always be fulfilled. This are the cases where anti-windup limits the integrator from accumulating too much of a wrong value.

That is why there is a 'range' setting. This 'range' setting is a way of saying "The current temperature is so far off... We can't use the PID algorithm. We are better off just plain turning on the heater at full blast."

Yes and no. When PID is wound up, then it would to the same thing as the range limit that is implemented currently, i.e. the duty ration is then stick to 0 or 1. In the other case, when the range limit applies, but PID is not wound up, then we have the case that I had: the range limit is too tight for the combination of hardware and PID parameters.

For me, the question isn't so much "Can the M303 PID Tuning do anything to help mitigate the issue." For me an additional question to @AndyZuerich is "How should we decide that we are close enough to turn on the PID Algorithm?" or "How do we know it is appropriate to be in PID mode?" ... we are always close enough for PID tuning, resp. it is always appropriate to be in PID mode. Except: the user has not set the three parameters correctly.

And this is too much to hope for, but one more answer would be nice: "When we drop out of Bang Bang mode into PID mode, is there a way to preload the I & D terms so the 'right' thing happens?"

Well, the p term is immediate and based on a measurement-> needs no preloading. D depends only on the sensor values of "now" and the prevevious step -> needs no preloading, since the sampling time of 8 Hz is fast enough. I term: Preloading it to the "best" value which is appropriate here is just using the correct anti-windup and nothing else. One could try to get a faster response after winding up, by setting the integral value to, for instance, zero when leaving anti-windup mode. But this is "cheating" and will introduce possible ringing issues as discussed above. So, my conclusion is to implement just a pure PID algorithm with anti-windupg and let it do its job. That's it.

@Roxy-3D well there is always hope for something better. A much better and faster control strategy for our heaters compared to PID would be based on "model predictive control" (=MPC). PID uses three parameters which can most of the time be set to get a medium speed settling time and a stable control loop. MPC uses a model for the control loop, in order to precalculate the behavior on control actions in the future. Therefore, MPC can make better suggestions for setting just the right duty ratio values, since the algorithm knows some details of the hardware and measurement system (in our case this would be something like simple models for the heat capacity, thermal resistance to ambient, distance from heater to the temperature sensor and the thermal resistance in between, etc.). But:

I have a suggestion: I am the developer of a power electronics simulation software. With this software, we can easily set up a model to simulate the PID behavior of Marlin and an example hardware. I will try to build such a model and then make it accessible to you. Then, we can play with all the pid-settings and PID functional range to evaluate if, for instance, the PID functional range can be savely removed.

Rerouter commented 8 years ago

OK all the cleanup required to remove the I constraint have been written up into a pull request #4914, including removal of the variables that where used for it.

AndyZuerich commented 8 years ago

thanks @Rerouter

sorry for me being to "verbose" in this issue thread. Some additional comments:

There is a nice summary for PID controllers: caltechbook There, you can find the basic theory about PID controllers, some application examples, etc. Especially interesting is the following:

The document explains other practical PID things, for instance

On the long term, wouldn't it be the nicest thing that we could do to just follow the algorithm as described in the book? I mean, even to use the variable names that are used in the book. Then, we have a short and concise code, and the documentation is given in the book, available as pdf or hard-copy. If anybody else in the future wants to work with this code, he could understand everything just by reading the book chapter.

boelle commented 8 years ago

@AndyZuerich @thinkyhead

was this one fixed or is the issue still there?

Rerouter commented 8 years ago

The bug itself was resolved. However the integrator can still wind up to the output limit, this is currently held in check slightly by the pid functional range, I have been looking into dynamic integral clamping instead. But I am verifying my math before i suggest it here.

AndyZuerich commented 8 years ago

Hi,

yes, I can confirm this. Rerouter, possibly we could close this issue now, since the original bug was removed. Or you can close it as soon as you verified your integral clamping.

Just to let you know: As discussed elsewhere, I am not really happy in general with the PID controller for the heaters. Therefore, I was working on an alternative to PID - a pure digital control algorithm (a "Dahlin controller") which takes into account only three real physical values of the heater setup: the dead time, gain and time constant (in control theory terms: the heater is modeled by a first order plus time delay process). This should perform much better than any PID control (faster response time, less overshoot, and easier to tune). In my new approach, the tuning (measuring) of the three parameters is done with some initial bang-bang cycling periods (if you like to search the web: this approach is called "relay tuning").

In my circuit simulation software, this Dahlin controller works already quite well! The digital controller even takes into account the controller sampling rate, which then could be increased even to something like 1 second (instead of 1/8.0 seconds, as it is at the moment).

At the moment, I have not enough time for implementing my solution into C++/Marlin (too busy at my job), but I will present this solution in the near future. In case the marlin people and users accept this new approach, we could offer this solution initially as alternative to PID.

thinkyhead commented 8 years ago

Dahlin sounds very interesting. I'm'a go look it up right now!

AndyZuerich commented 8 years ago

This book was quite helpful for me to understand all the theory about z transforms and the Dahlin control:

https://books.google.ch/books?id=6ktg72UpQPEC&pg=PA26&lpg=PA26&dq=dahlin+control&source=bl&ots=PX7KZU5G54&sig=-MQd4U8z3Fp0ShEWHxJzhkYDQOA&hl=de&sa=X&ved=0ahUKEwiwic7-i4zQAhXBVRoKHXfOB8gQ6AEINTAG#v=onepage&q=dahlin%20control&f=false

If you cannot find a pdf or hardcopy (I got mine via a university library), please write me a personal email.

thinkyhead commented 8 years ago

Ach du lieber! Why didn't I take a calculus course?

boelle commented 8 years ago

@thinkyhead like @AndyZuerich said you might have fixed this one allready... or ?

Roxy-3D commented 8 years ago

Ach du lieber! Why didn't I take a calculus course?

You still can! But you really need a full year of it to be beneficial. If you ever get motivated and start studying it, you will be talking about limits, summations, dx's and integrals. You will sound like a nerd!

thinkyhead commented 8 years ago

@Roxy-3D I'm very slowly making my way towards Calculus through Khan Academy. Apparently I need a lot of review before I will arrive there.

boelle commented 8 years ago

@thinkyhead is this one going to wait until RC9?

Roxy-3D commented 8 years ago

I'm very slowly making my way towards Calculus through Khan Academy. Apparently I need a lot of review before I will arrive there.

I missed this post. Calculus is not something you learn over a long weekend. It is painful and difficult to learn. But very literally... It changes how you think about things. It is a very good place to spend spare time as a hobby!

thinkyhead commented 8 years ago

Is this one going to wait until RC9?

Until 1.1.1, or until whenever someone makes an acceptable pull request.

thinkyhead commented 7 years ago

I presume this was resolved with changes in the PID integration. If not, please reopen!

github-actions[bot] commented 2 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.