bugfix 2.0.x: Occasional hotend temperature glitch when printing

gloomyandy commented 6 years ago

Description

I've recently been seeing drops in temperature during printing. My hotend temperature will be set at say 200 and will be being maintained at that +/- a few degrees. However from time to time I will see a reading of around 176.

I've spent some time digging into this by adding some debug code into the temperature.cpp file. What I see is that normally over a period of a second or so the min and max raw ADC readings for the temperature sensor are 86/87 however from time to time I get a range of 86/937. Digging into this further I added code to record the min and max time from the call to start the ADC to reading it. Normally these values are around 944/1140 uS. However when I get the above "odd" reading I get a minimum period of only 28uS. My understanding is that the Atmega 2560 ADC conversion time minimum is around 100uS so I suspect this is what is causing the glitch. My guess is that at times other interrupt routines are delaying the start of the ADC conversion resulting in the short conversion time.

This will almost certainly be Arduino specific and probably depends upon what options you have enabled (and hence how long the other interrupt calls run for).

I'm going to add code to ensure that there is always a minimum delta of 100uS between successive calls to the temperature interrupt routine. But is there a better way to fix this?

Steps to Reproduce

Run a print job and closely monitor the hotend temperature reports.

Expected behavior:

Temperature should remain constant within a few degrees of the set point

Actual behavior:

Temperature occasionally drops by 25 or so degrees.

Additional Information

Configuration files are here: https://github.com/gloomyandy/Marlin/tree/bugfix-2.0.x/Marlin

AnHardt commented 6 years ago

There aren't so many things what could delay/interrupt the temperature interrupt. About the only one coming to my mind is the stepper-IRQ Try to move slower in z or go down with steps/mm.

A bit more challenging would be to investigate/change the temp-IRQ. Are the new endstop tests inserted before or after the ADC handling. Moving ADC handling before everything else would give the stepper-IRQ less chance to interrupt the temp-IRQ before the ADCs are handled - should result in less jitter - on cost of more jitter in the soft-PWM, endstops, ... .

ejtagle commented 6 years ago

All ADC peripherals have CONVERSION_DONE bit. This bit should ALWAYS be polled to ensure the read value is the result of a full conversion. Yes. using the timer ISR as "delay" between phases has the risk of something delaying the execution of one temperature ISR (for example, the temperature ISR is preempted by the stepper ISR and the UART isr) and then the delay btween starting the ADC conversion and the read of the result will not be respected. There is an easy fix: As the ADC is being driven by a state machine, if the ADC result is not ready, just instead of reading the value, keep the state machine at the read state, and let the next ISR read the value

gloomyandy commented 6 years ago

I added a simple check on the time since the last call, if it was less than 100uS then I simply did a return right at the start of the isr and in effect waited for the next isr tick. What was surprising was that I think this is happening far more often than I originally thought. Before the above change I was seeing a fluctuation in hotend temperature during printing of around 5 degrees or so on a reasonably frequent basis. After adding the above change not only do I not see the occasional big dip, but I now see hardly any fluctuation in temperature at all, so I suspect that "missing" the conversion time must be happening reasonably often though probably resulting in smaller dips in temperature readings. These dips are probably triggering an incorrect increase in the hotend temperature which is not actually required.

@ejtagle & @thinkyhead is the ADC complete bit currently exposed via the HAL in some way? If so I can try testing for this rather than my rather crude time based approach. If the complete bit is not exposed then I guess we will either need to expose it, or perhaps define the "ADC" macro in some way such that it can return a not ready indication?

p3p commented 6 years ago

There is currently no HAL exposed method for retrieving the conversion state of the ADC peripherals, but it can easily be added, The MACROS are HAL_READ_ADC and HAL_START_ADC in HAL.h, HAL_READY_ADC (or something) would work.

edit: Although I did add the HAL_adc_finished() function to the LPC176x platform it appears, so this must have crossed my mind at the time.

gloomyandy commented 6 years ago

Hmm, I was reasonably happy to produce a PR for a crude time based fix for this, but I'm not sure I fancy creating one that requires adding new functionality to all of the HALs. What is the process for such a change? It seems unlikely that anyone could easily test them all?

I'm also a little concerned about some of the other code in this isr, will all of the other functionality (like the soft PWM code) be OK when you can have a call to call delta that is as low as 28uS? Does anything that hangs off that code require any sort of minimum pulse width?

I suspect that a lot of people will hit this issue as I really don't have anything very extreme in my configuration or machine setup. I'm just using junction deviation, linear advance and UBL with pretty standard step rates (only 16 microsteps, standard belts for X/Y and a standard screw thread on Z), operated at pretty modest speeds. I've had a few thermal runaway halts that I'm pretty sure have been caused by this.

ejtagle commented 6 years ago

@gloomyandy: Just add for each hal a dummy function HAL_IS_ADC_CONVERSION_DONE that returns true, and properly implement it for AVR. I can implement the DUE version. Do not skip the whole timer interrupt, just the temperature measurement part :+1:

boelle commented 6 years ago

i think i'm hit by this issue too

negspikes2 negspikes3

what i tried to "fix" it:

PID tune New wires for thermistor all the way from Ramps to Thermistor New thermistor cartridge from [e3d-online.com] New heater cartridge also from [e3d-online.com] rechecked that all wires in screw therminal are secured in place and no wires are sticking out. rechecked that all screws in screw terminals are tight tried to switch arround the thermistor to plug with no noise on tried new mega board tried another PSU tried with a noname 100K thermistor directly to ramps board, wires are just long enough checked that thermistor pin headers make good contact by bending them just slightly apart tried to drive extruder heater from bed output to rule out a bad FET

boelle commented 6 years ago

@gloomyandy

to test this am i correct that the only 2 files i have to overwrite are

Marlin/src/HAL/HAL_AVR/HAL.h Marlin/src/module/temperature.cpp

?

gloomyandy commented 6 years ago

Yes that should do it!

gloomyandy commented 6 years ago

Assuming you are using AVR of course!

boelle commented 6 years ago

oki... you will know shortly if it did the trick for me

i just need to move wires for nozzle and bed back to where they belong

and btw the 2 graphs i put in are not the worst ones :-P

boelle commented 6 years ago

i'm.... RAmps+Mega

boelle commented 6 years ago

it helped a LOT for the extruder heater, but printing in PLA i have the bed off and now that gets crazy amount of positive spikes, and extruder 2 only have a temp sensor connected for now and that also gets spikes even thou its just hanging outside far away

pos_spikes

both blue and orange should be flat at this point, blue will rise slowly in the start as the hot plastic is put down

gloomyandy commented 6 years ago

Perhaps you should try without my changes but exactly the same hardware setup and see if you still get the spikes on the other sensors, they may not be anything to do with the firmware changes.

boelle commented 6 years ago

without your changes but the same setup i dont get spikes on the bed, look at my graph just a few post up

what is your setup?

boelle commented 6 years ago

would it make sense if i had the excat same setup but just did the same print without your changes?

gloomyandy commented 6 years ago

But that graph was produced before you moved around various wires/connections (you said above that you needed to change some things back before you tested the updated firmware), perhaps in making these changes you are now picking up noise. Best to only change one thing at a time if possible, so worth doing the test with and without the firmware changes with exactly the same hardware setup.

boelle commented 6 years ago

yep i had the extruder heater on the bed output to test if the FET's was going bad and making noise

that was all

but yes... i will let this print run to end, then undo ONLY your changes and do it again with the same gcode

boelle commented 6 years ago

neg_spikes

same gcode and no wires touched - without your changes

boelle commented 6 years ago

with your changes

pos_spikes

boelle commented 6 years ago

and still with no changes to hardware but i printed a fitment/tolrance test and turned the bed on with bed

what is strange is that the orange line has spikes and its just a dummy sensor i have to get ambient temp

gloomyandy commented 6 years ago

Thanks for trying that I've just fixed a couple of stupid copy and paste errors one of which may have been causing the problems you are seeing. Could you give the updated version a try?

boelle commented 6 years ago

yep sure

i'm trying to get my tolerances right (square box with circle hole and a matching disc) so i will be doing a lot of those small prints anyway

boelle commented 6 years ago

still about the same with bed1

gloomyandy commented 6 years ago

I'm surprised at that can you check to make sure that you have the following lines

if (HAL_IS_CONVERSION_DONE()) \
       var += HAL_READ_ADC; \
else \
       adc_sensor_state = (ADCSensorState)(int(adc_sensor_state) - 1);

and in particular that the last line is not

       adc_sensor_state = (ADCSensorState)(int(SensorsReady) - 1);

Thanks

boelle commented 6 years ago

  /*
   * This gives each ADC 0.9765ms to charge up.
   */
  #define ADD_ADC_SAMPLE_IF_READY(var) \
    if (HAL_IS_CONVERSION_DONE()) \
      var += HAL_READ_ADC; \
    else \
      adc_sensor_state = (ADCSensorState)(int(adc_sensor_state) - 1);

  switch (adc_sensor_state) {

    case SensorsReady: {

i copied the whole file so everything is ok... what i do on github is viewing the whole file in raw and i just select everything and copy... and in notepad++ i select everything and hit insert

but yeah its a strange one

boelle commented 6 years ago

whole file as is here: https://pastebin.com/0XdCdaWg

gloomyandy commented 6 years ago

Hmm that is strange. I'll see if I can reproduce the temp glitches on the bed. Could you post the section from your configuration file that has the various temp sensor configurations...

boelle commented 6 years ago

the whole configuration.h: https://pastebin.com/SXvUbXLW

partner is home and asked if i could go help with shopping bags hence the link to whole file

gloomyandy commented 6 years ago

No problem!

I've just fixed a couple of other mistakes (that I missed in the last change), I doubt they will make any difference. But it is probably worth updating. I'm going to try and set things up to match how you have things and see if I get the same glitches.

boelle commented 6 years ago

might be worth noting that i have a dummy extruder in my setup to allow me to see ambient/chamber temp directly in octoprint

the chamber temp feature only sends the temp on serial and i cant see that in octoprint, only if i look at the terminal..

chamber is the orange line

what i did was adding a stepper with free pins and then i changed settings to say i have 2 extruders

so yeah i used free pins to create another extruder stepper and there was also free pins for a 2nd heater and fan.... my goal was to be able to use that and have an electrical heater at some point for ABS prints, but that would prob not work as it would be to slow in heating up. so for now i just use it to check temp

boelle commented 6 years ago

for the fun of it my configuration_adv.h: https://pastebin.com/SL3nMPBi

gloomyandy commented 6 years ago

Thanks! I don't think I can recreate your setup as I have my extruder cooling fan on the mosfet that would be used for E2. But I think I've been able to reproduce the issue with just a single extruder. Will let you know!

boelle commented 6 years ago

I dont have a 2nd extruder heater or fan connected, only the temp sensor for a 2nd extruder, the rest are just dummies created with free pin numbers

:-D

gloomyandy commented 6 years ago

OK I've just pushed a couple of changes that I think may help things. You will need temperature.cpp and temperature.h. If you get chance could you give it a try?

boelle commented 6 years ago

I will do the first thing in the morning while its still reasonble cold all this heat makes my brain go very slow :-(

boelle commented 6 years ago

i had a cold drink and did a test late_test tomorrow i will go back to using my 48A ATX psu but i expect no changes other than psu config :-D

boelle commented 6 years ago

someone on reprap.org suggested that temp readings should be an avg of the last 10 readings i dont know if that would have any negative effects?

but lets say that we have 60 readings every sec (maybe overkill) so instead of reporting 60 readings we report the avg of every 10 and only report back 6 readings... just an example as i dont know how many readings are actually taken

gloomyandy commented 6 years ago

The reading you see is already an average of 16 readings a reading is taken every 10mS and a new averaged sample is produced every 160mS, so you do indeed get approx 6 readings per second. There may be additional processing on this reading before you see it, but I've not looked at that.

boelle commented 6 years ago

my thought was just that do we need that much?

gloomyandy commented 6 years ago

Well I guess you need a reasonably high frequency for the PID controller to be able to maintain the temperature. The averaging will help with any noise issues. Perhaps you could sample less often, but I suspect the cpu saving would not be that great.

brucehvn commented 6 years ago

I think I'm seeing the same problem. Most of the time the hotend temp is very stable, but then I get times where the temperature becomes very erratic and sometimes causes a runaway thermal reset. Like @boelle I've replaced the entire heater block with the latest from e3d which has the thermistor cartridge with the nice plastic connector that hooks to a single run of wire to the RAMPS board. I'm using a version of bugfix-2.0.x by tcm0116 that fixes another bug involving retraction. I'll probably try this fix and see if I see things being more stable.

brucehvn commented 6 years ago

@thinkyhead, has this fix been merged into the main Marlin bugfix-2.0.x branch?

thinkyhead commented 6 years ago

I assume with all this gabbing that someone actually did something. The last comment was July 29. If it hasn't been addressed by now then someone who has a clue needs to get on it.

brucehvn commented 6 years ago

@gloomyandy, @boelle can you confirm that this fix got into the main bugfix branch? Today, I had the temp jump to about 450C in a couple seconds. I was really lucky it came down before the thermal protection kicked in.

gloomyandy commented 6 years ago

Yes the code has been in for some time, I don't think the problem you are seeing is this one. I think you need to open a new issue and provide a lot more details.

brucehvn commented 6 years ago

It could still be the same problem. I've been on a branch from around the time you started this thread due to another issue that was fixed by that fork. Once I verify that other fix has also been merged, then I can go back on the main bugfix-2.0.x branch. But either way, if the problem still exists after I go back on the main branch, I'll put in another ticket.

boelle commented 6 years ago

@brucehvn

at least provide a temp graph like i did, it will tell a lot more than words :-D

brucehvn commented 6 years ago

Ok, luckily I was able to just grab one when I noticed the temp was fluctuating again. I'm using Repetier Server and it doesn't show a large time frame in the graph. This is a fairly small fluctuation compared to some I get. selection_173

brucehvn commented 6 years ago

Another example during the same print after about 20 minutes of pretty stable temps. selection_174

MarlinFirmware / Marlin