[BUG] Printer freezes spontaneously during print jobs via SD card.

tiwanacote commented 2 years ago

Did you test the latest `bugfix-2.0.x` code?

No, but I will test it now!

Bug Description

The motherboard reboots spontaneously during a print job. The issue occurs at random.

If the watchdog is enabled, the printer reboots, and when it starts again it reads the PLR file on the SD card and shows "Power loss recovery". If the watchdog is NOT enabled, the printer simply freezes: the LCD is frozen, the encoder does not respond, and the heaters stay ON.

Hardware:

MKS GEN L V2.1 (Also we have tested SKR MINI)
RepRapDiscount Full Graphic Smart Controller (Using SD from LCD)
Power supply: MeanWell LRS-350-24 (350W and 24V) --> We think this is not the problem
TMC2208 (with UART mode and without) and Pololu

We think we have narrowed down the origin of the problem by running each of the experiments below on more than 15 machines at the same time, over long periods. The failure rate is around 20~30%.

Ruled out electromagnetic noise on the SD SPI connector cables:

Shortened the EXP1 and EXP2 cables to 10 cm
Moved the power supply out of the printer's metal housing, far away from the motherboard
Lowered the SPI speed: SD_SPI_SPEED set to SPI_EIGHTH_SPEED
Cut the RST pin wire in the EXP2 cable
Set KILL_PIN and RESET_PIN = -1
Eliminated duplicated GPIOs on other ports like EXP3 and EXP4
We have tried to shield the cables with aluminum foil (Floating and connected to COM potential)
Checked 5V over LCD electronics, no under voltages.
We have turned off bed to avoid power electric noise.

Ruled out custom configuration issues:

We have used a very stripped-down configuration with few options enabled, only the fundamental changes (printer size, driver selection, etc.), without success.

Ruled out power supply and hardware issues:

Printing from a PC via USB (Repetier) works WITHOUT issues. This is very relevant: the experiment was run with the RepRapDiscount LCD connected and configured in the firmware. We are using one of the best-known power supply brands (MeanWell). On the other hand, on other printer models, where we use an MKS TFT (connected by UART), we do not have this issue. We have also swapped in RepRapDiscount Full Graphic Smart Controllers of different models and from different suppliers, and we have tested on an SKR Mini with the same results.

Tested firmware versions:

2.0.9.1
2.0.9.3
1.1.9.1 -----> WITHOUT ISSUE

This is an important experiment, because the SD card code changed drastically from version 1.1.9.1 to 2.x.x.x. It rules out the hypothesis of damaged or bad-quality SD cards.

Our strongest hypothesis: we suspect the problem is in the SD management / SPI path.

Bug Timeline

This is an old issue. (2021)

Expected behavior

The print jobs must finish when printing using SD card without trouble

Actual behavior

The motherboard reboots spontaneously during a print job when using SD

Steps to Reproduce

Print from SD (SPI connection). In my case RepRapDiscount Full Graphic Smart Controller. It is a RANDOM issue.

Version of Marlin Firmware

2.0.9.1 and 2.0.9.3. We have tested 1.1.9.1 with good success

Electronics

MKS GEN L, RepRapDiscount Smart Controller and TMC2208 (also tested with SKR MINI and Pololu)

Add-ons

Some of the machines use an inductive sensor and bed calibration, others do not.

Bed Leveling

No response

Your Slicer

No response

Host Software

SD Card (headless)

Additional information & file uploads

Link to config files

Roxy-3D commented 2 years ago

From Discord messages: https://discord.com/channels/461605380783472640/491105274464043026/974011553215025263

InsanityAutomation, today at 1:14 PM: Would you be willing to test all the way back to https://github.com/MarlinFirmware/Marlin/releases/tag/2.0.7.2 @Tiwanacote? It would narrow down suspicions a bit, then we just close the gap between 2.0.9 and 2.0.7 to limit which offending changes we're eyeing.

This would be very helpful! If a few (5 or 6) of those machines you are testing could be loaded with v2.0.7.2 it will help narrow the scope of what we are looking at.

tiwanacote commented 2 years ago

Hi @Roxy-3D, yes, I will try it tomorrow and leave it running all night, so the result will be posted on Friday the 13th.

roel8032 commented 2 years ago

When printing from SD with the USB cable connected to the PC, the print can stop when there is some activity in the host software, slicing software, Arduino IDE or similar. This has happened to me several times.

Do not connect the USB cable to the printer when SD printing.

tiwanacote commented 2 years ago

Thanks @roel8032 for the advice, but that is not the case here.

tiwanacote commented 2 years ago

New report: We tested version 2.0.4.3 all weekend and found the same issue. Version 1.1.9.1 is still working without showing this problem. Now we are going to test version 2.0.0.

Roxy-3D commented 2 years ago

I've been looking at v1.1.9.1 because a lot of people are saying that SD Cards work better on that version as compared to v2.0.x. I'm seeing a real problem where certain cards have a hard time initializing. There are a number of places where the code was compressed but the logic isn't identical. For example:

v1.1.9.1: https://github.com/MarlinFirmware/Marlin/blob/1.1.x/Marlin/Sd2Card.cpp#L180-L184

    // send CRC
    uint8_t crc = 0xFF;
    if (cmd == CMD0) crc = 0x95; // correct crc for CMD0 with arg 0
    if (cmd == CMD8) crc = 0x87; // correct crc for CMD8 with arg 0x1AA
    spiSend(crc);

and v2.0.x: https://github.com/MarlinFirmware/Marlin/blob/bugfix-2.0.x/Marlin/src/sd/Sd2Card.cpp#L121-L124

    for (int8_t i = 3; i >= 0; i--) spiSend(pa[i]);
    // Send CRC - correct for CMD0 with arg zero or CMD8 with arg 0X1AA
    spiSend(cmd == CMD0 ? 0X95 : 0X87);

The v2.0.x code erroneously assumes the command is either CMD0 or CMD8, so one of the two CRC values from v1.1.9.1 is always applied, even for other commands. This same type of error happens again here:

v1.1.9.1: https://github.com/MarlinFirmware/Marlin/blob/1.1.x/Marlin/Sd2Card.cpp#L342-L345

  // check SD version
  if ((cardCommand(CMD8, 0x1AA) & R1_ILLEGAL_COMMAND)) {
    type(SD_CARD_TYPE_SD1);
  }

and v2.0.x: https://github.com/MarlinFirmware/Marlin/blob/bugfix-2.0.x/Marlin/src/sd/Sd2Card.cpp#L289-L294

  // check SD version
  for (;;) {
    if (cardCommand(CMD8, 0x1AA) == (R1_ILLEGAL_COMMAND | R1_IDLE_STATE)) {
      type(SD_CARD_TYPE_SD1);
      break;
    }

In this case the v2.0.x code is assuming the status will be returned as an Illegal Command and that the card is now idle. That logic is definitely different than it was in v.1.1.9.1.
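
To make the two differences concrete, here is a small standalone sketch (not Marlin code) that puts the v1.1.9.1 and v2.0.x decisions side by side. The function names are made up for illustration, CMD17 stands in for "any other command", and the R1 bit values (idle = 0x01, illegal command = 0x04) are the standard SPI-mode R1 flags.

```cpp
#include <cstdint>

// Command indices; CMD17 is only an example of a command other than CMD0/CMD8.
constexpr uint8_t CMD0 = 0x00, CMD8 = 0x08, CMD17 = 0x11;
// Standard SPI-mode R1 response bits.
constexpr uint8_t R1_IDLE_STATE = 0x01, R1_ILLEGAL_COMMAND = 0x04;

// CRC byte as chosen in the v1.1.9.1 snippet: 0xFF unless CMD0 or CMD8.
constexpr uint8_t crcByteV1(uint8_t cmd) {
  return cmd == CMD0 ? 0x95 : cmd == CMD8 ? 0x87 : 0xFF;
}

// CRC byte as chosen in the v2.0.x snippet: everything that is not CMD0 gets 0x87.
constexpr uint8_t crcByteV2(uint8_t cmd) {
  return cmd == CMD0 ? 0x95 : 0x87;
}

// v1.1.9.1 CMD8 check: any response with the illegal-command bit set means SD1.
constexpr bool isSd1V1(uint8_t r1) {
  return (r1 & R1_ILLEGAL_COMMAND) != 0;
}

// v2.0.x CMD8 check: the response must be exactly "illegal command + idle".
constexpr bool isSd1V2(uint8_t r1) {
  return r1 == (R1_ILLEGAL_COMMAND | R1_IDLE_STATE);
}
```

With these helpers, `crcByteV1(CMD17)` yields 0xFF while `crcByteV2(CMD17)` yields 0x87, and a response of 0x04 (illegal command without the idle bit) counts as an SD1 card only under the v1.1.9.1 test. Whether either difference matters on real cards is a separate question.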

Roxy-3D commented 2 years ago

I have done some extensive testing of the 'Print from SD Card' feature. I've used both old normal capacity and new high capacity cards. The two items I was concerned about above seem to be OK. They work fine and are not causing a problem.

I am more and more convinced the 'Media Init Failure' messages were being caused by an intermittently bad cable to the SD Card socket on the LCD Display.

Let's leave this open for a few more days. But if nobody else jumps in with SD Card problems we can probably close this.

tiwanacote commented 2 years ago

@Roxy-3D it is curious that my test machines work without the issue when Marlin 1.1.9.1 is loaded, and that it appears again when the latest version is used. Also, when branded SD cards are used (SanDisk, Kingston) the problem disappears or is reduced (we cannot guarantee that it disappears), so we suspect a bug in the handling of communication errors, which Marlin 1.1.9.1 manages without problems but 2.0.x does not.

Roxy-3D commented 2 years ago

That is possible. But what might be happening is the higher quality SD Cards (like SanDisk and Kingston) don't lock up. After all, they are 'Higher Quality'. I'm wondering what might be different about the access pattern between v1.1.9.1 and v2.x that is irritating the unbranded (and possibly lower quality) SD Cards.

The reading and buffering of the card's data is the same between v1.1.9.1 and v2.x. I'm wondering if it is the actual timing of the signals being used to transmit commands and data between the board and the SD Card that is causing the lock ups.

dewhisna commented 2 years ago

I think I might have hit this same issue today. It was about 3 hours into a 7-1/2 hour print. It's the first time I've encountered a problem with this configuration, but most of my previous prints with this exact setup have all been under 2 hours and maybe it's not happening on shorter prints?

It was on the bugfix-2.1.x branch at 12a869e2ad3b36d6b965be2738308956963e2da4, instead of bugfix-2.0.x. And it was on a BTT Octopus 1.1 board instead of the SKR. It was printing from SD, but the on-board SD card on the Octopus board itself, not the LCD, so definitely no cable interference and also SDIO instead of SPI. And it was a definite reset of the Octopus board without warning.

It was also a "higher quality" SD Card. I believe it was a Kingston (but might have been a SanDisk -- definitely one of the two). I would go check, but it's all sealed up inside the electronics bay of the printer.

I'm also running a MeanWell LRS-350-24, as well. While I can't 100% rule out a power supply issue, there was no noticed brownout or reboot of the Raspberry Pi 2B running OctoPrint that's also being powered from that supply. The RPi didn't reboot and OctoPrint continued on to note the print failure and document it with the timelapse video capture.

All that's in the system logs on the RPi are messages about the Octopus board being present but not responding at the point it reset, but since the RPi itself didn't reboot or anything, I don't think there was an issue with the power supply. Also, the highest loads on the power supply is when the heat bed is initially getting up to temperature, which pulls around 130W. Once it's up to temperature, there's usually little enough load on the power supply to maintain that temperature that the power supply fan often cycles off (i.e. it's not running overly hot or anything).

@roel8032 noted that unplugging the USB connector is needed to keep it from locking up, but that really isn't an option for me as I have the RPi running OctoPrint connected permanently there via USB [all sealed up inside the electronics bay of the printer]. I don't print directly from the RPi itself [due to issues with that], but instead I mount the Octopus board's SD Card onto the OctoPi, copy files to print to it, unmount it (it wasn't mounted at the time of the incident), and then launch the print job via the OctoPrint control panel in the browser and then I use the OctoPi to remotely monitor the printer and the video feed from it.

I've just reflashed it to a slightly newer firmware at 4ba35d3284284d99de757483c38dafa392a0b84c to pick up the changes through July, as I was just in the process of branching my local repo to set things up for an SKR3 board for a different printer. I guess I'll be restarting this print again in a few minutes to try again, but I hope this isn't going to happen every time, as it's a 7-1/2 hour print job and I need to run 4 of them.

I also don't think I can even go to the 1.x versions, since there's things in the 2.x family needed for this Octopus board configuration. So I'm hoping this gets resolved soon. But if this is the same issue, maybe the fact that my setup is using SDIO instead of SPI for the SD card could provide some additional clues?

As for what all was connected to mine and talking to the Octopus board at the time of the incident, I have the aforementioned OctoPi connected via serial over USB monitoring temps and such, and I have a BTT TFT35 v3 LCD board on the control panel of the printer, which I use for doing leveling, preheating, etc. It's on a separate serial channel (i.e. the one on the TFT connector instead of the USB port), but is still sending commands to monitor printer status during the print -- however, just passively monitoring things instead of controlling any functionality.

As for this printer, it's a former FlashForge Creator Pro 2016 printer that I gutted and reworked its electronics to use the Octopus board running TMC2209 drivers (if the driver type matters in your investigations) and the TFT35 v3 LCD, added bulkhead connectors for the extruders to make them easier to swap around, switched to a MicroSwiss all-metal hotend with a 50W heater, and switched to Marlin firmware. Since the new control board is the Octopus and since it's also got a RPi running OctoPrint or OctoPi, I call the new printer setup the OctoForge.

thinkyhead commented 2 years ago

@dewhisna — It will be worthwhile (possibly) to enable POSTMORTEM_DEBUGGING and maybe that will give some clue why the board is resetting. A complete reboot is highly unusual, and may be due to watchdog reset, so you may disable USE_WATCHDOG for testing, allowing an infinite loop to persist, if that is the issue. There aren't too many other reasons for a reboot other than a stack overflow, some other kind of memory corruption, or overheating of the board. Be sure to eliminate all environmental factors that could lead to hardware failure, especially focus on board cooling.

tiwanacote commented 2 years ago

We are continuing to test on 8 printers at the same time. We have ordered some adapter PCBs so we can capture the SD SPI data with logic analyzers. I will keep you informed.

dewhisna commented 2 years ago

@thinkyhead -- I didn't see your reply until after I restarted it, so didn't get to enable POSTMORTEM_DEBUGGING, plus I assume for it I would need to enable serial data capturing on my RPi or something to capture that data?? Which serial channel does it use for that data, BTW?

I was wrong in the print times. The first failure was 4-1/2 hours into a 10 hour print. I restarted it around 17h30 yesterday evening and before turning in for the night, went and checked on it, which is when I realized it was a 10 hour print and not 7-1/2 -- I was misremembering its time due to the 7-1/2 hour print that I have in the queue to print after doing these.

It was printing fine around 23h00, when I checked on it, but I awoke this morning to find that at 23h40, it had reset again. This time making it a little over 6 hours. Same exact symptoms. That does seem to indicate it's not a bad sector on the SD card at some particular fixed spot.

Your idea of it being a heat/cooling issue of the Octopus board is a possibility. However, the environment it's in is about as good and stable as can be. The printers are in my basement and the ambient temperature is 22degC, or 71degF, even though we are in the hottest time of the year and supposed to reach 100degF or so outside tomorrow. And in the winter, on the coldest of days of like -40deg [C or F], it's still like 16 to 17 degC in my basement. And I have a cooling fan on the board, so if it is a heat issue, I'm not sure what more I can do about it.

I see from the ST datasheet that this micro has an on-board temperature sensor readable as one of the ADC channels. Are there any provisions in the firmware code to read and report that? I would think reading it once every 5 minutes or so would be more than sufficient. The datasheet says it's not too accurate for absolute measurements, but is repeatable for measuring temp rise.
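
The arithmetic behind such a reading is short. As a rough sketch only, assuming the typical STM32F4 datasheet figures (V25 = 0.76 V, Avg_Slope = 2.5 mV/°C), a 12-bit ADC and a 3.3 V reference, all of which should be checked against the datasheet for the exact part; the function names are hypothetical and nothing here shows how Marlin itself would expose the value:

```cpp
#include <cstdint>

// Assumed typical values; verify against the datasheet of the actual MCU.
constexpr double kV25 = 0.76;             // sensor voltage at 25 degC, in volts
constexpr double kAvgSlope = 0.0025;      // volts per degC
constexpr double kVref = 3.3;             // assumed ADC reference voltage
constexpr double kAdcFullScale = 4095.0;  // 12-bit conversion

// Absolute estimate; the offset varies from chip to chip, so treat it as rough.
double mcuTempC(uint16_t raw) {
  const double vsense = raw * kVref / kAdcFullScale;
  return (vsense - kV25) / kAvgSlope + 25.0;
}

// Temperature rise between two samples; the chip-specific offset cancels out.
double mcuTempRiseC(uint16_t rawStart, uint16_t rawNow) {
  const double deltaV = (static_cast<int>(rawNow) - static_cast<int>(rawStart))
                        * kVref / kAdcFullScale;
  return deltaV / kAvgSlope;
}
```

Logging `mcuTempRiseC(firstSample, currentSample)` every few minutes would be enough to see whether the chip warms up noticeably before a reset.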

I'm fairly sure it's a watchdog or something that's intentionally resetting the board, because the TFT35 LCD seems to have gotten a reset signal too. According to the ST datasheets, the NRST line is bidirectional. It's a little undescriptive about what things cause the micro itself to generate the reset, but it seems that both a watchdog timeout that ends in a reset and writing of the magic value to the AIRCR register are both candidates. But, a temperature problem could still cause the micro to run off into the weeds and lead to a watchdog reset.

I'm leery to just disable the watchdog since I can't stand guard over the printer for this long of a print to watch it, and leaving it unattended without the watchdog is too much of a fire hazard.

And with the watchdog enabled, to where everything shuts down and cools off when the reset happens, the print is a total loss. Even if I had all of the power-loss resume functionality enabled and knew the exact point in the gcode where it stopped, it's ABS, and once the bed cools to some magic point between 45 and 55 degC, depending on the particular part's dimensions and contact points with the bed, it will suddenly pop loose from the bed as the ABS contracts.

This print is actually 4 separate pieces and on the first failure, not only did they all pop loose from the bed, but one of the pieces did so in such spectacular fashion that it had flown completely off the bed and was lying in the bottom of the printer (and the piece is 125mm long!).

I do have two more Octopus v1.1 boards here, as I have two more FFCP printers I had planned to do identical rebuilds with. So I suppose at some point, I could try swapping out the board and see if that affects things.

In the meantime, I luckily hadn't done the rebuild on the other two FFCP printers and have now moved this print over to one of them to print, as I have four of them to get printed and was needing to get them finished this weekend.

These parts are actually for use on a Creality Ender 5+ I'm building, where I am planning a swap to a SKR3 board. That board wasn't available when I started the rebuild on the FFCPs, so I went with the Octopus. But if I had known about the SKR3 then, I would have waited another month or two on the rebuild and used it on the FFCPs, as the Octopus is a bit overkill and is a bit more annoying to deal with due to its size and connector differences. Plus, the SKR3 micro is better than the Octopus.

What I'm getting at is that if it is a heat problem on this Octopus board, I'm more inclined to just swap to the SKR3 instead of replacing it with another Octopus. Either way, with these being made in China and with counterfeit parts so rampant these days, that could certainly be the source of problems.

But regardless, I need to get it figured out. If it is a software issue, then the SKR3 and the E5+ build will suffer too, and I have some very long prints planned for that printer that make use of its large print volume.

dewhisna commented 2 years ago

@thinkyhead -- While here working on the configuration settings for my other printer build and comparing them with this setup, I found something that caught my eye. I went back to the RPi log files and sure enough the two reset incidents were within a few seconds of being exactly 2 hours different in length, when looking at the timestamps from OctoPrint of when it received the E1 heating notification from the printer to start the print to when it failed to read data from the serial port at the moment of the reset.

The setting that caught my eye, was this one:

#define PRINTCOUNTER
#if ENABLED(PRINTCOUNTER)
  #define PRINTCOUNTER_SAVE_INTERVAL 30 // (minutes) EEPROM save interval during print
#endif

I had enabled it to write-back EEPROM values every 30 minutes while printing.

Is it possible that the EEPROM write function isn't properly servicing watchdog if the EEPROM write takes too long to complete?

I'm thinking that before I start swapping boards around and digging into hardware, I should first try disabling this write-back operation and see if it resets again. Thoughts?
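
For anyone unfamiliar with the failure mode being asked about, here is a toy model of it. It is a simulation, not Marlin code: `FakeWatchdog`, `writePages`, the 4000 ms timeout and the page timings are all made-up values for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-in for a hardware watchdog: it "bites" (would reset the
// MCU) when more than timeoutMs pass without a refresh.
struct FakeWatchdog {
  uint32_t timeoutMs;
  uint32_t lastRefreshMs;
  bool bitten;

  explicit FakeWatchdog(uint32_t timeout)
    : timeoutMs(timeout), lastRefreshMs(0), bitten(false) {}

  // Check whether the deadline has been missed at time nowMs.
  void tick(uint32_t nowMs) {
    if (nowMs - lastRefreshMs > timeoutMs) bitten = true;
  }

  // Service the watchdog; too late if it has already bitten.
  void refresh(uint32_t nowMs) {
    tick(nowMs);
    if (!bitten) lastRefreshMs = nowMs;
  }
};

// Simulate a blocking write of `pages` pages that each take `pageMs`.
// Returns true if the watchdog never expired during the write.
bool writePages(FakeWatchdog& wd, uint32_t pages, uint32_t pageMs,
                bool refreshBetweenPages) {
  uint32_t now = wd.lastRefreshMs;
  for (uint32_t i = 0; i < pages; ++i) {
    now += pageMs;  // time spent inside the blocking write
    if (refreshBetweenPages) wd.refresh(now);
    else                     wd.tick(now);
  }
  return !wd.bitten;
}
```

In this model, 100 pages at 50 ms each (5 s in total) survive a 4 s watchdog only when it is refreshed between pages, which is the behavior the question is probing for.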

dewhisna commented 2 years ago

Nope ... it wasn't the EEPROM writeback. It took forever for me to test because of some bizarre (unrelated) nozzle jamming problems... But, setting PRINTCOUNTER_SAVE_INTERVAL to 0 (which if I'm reading the code correctly keeps stats enabled but just disables writeback during printing) had no effect on my reset issue. At least this time it reset 2 hours into the 10 hour print job instead of 6+. So now I guess it's time for me to swap main boards and see if it has any effect... ... this hasn't been a good week for me with 3D printers...

dewhisna commented 2 years ago

Success!! -- Either I just found the culprit for my reset issue or this was a really wild coincidence, and either way it was a serious WTF. I just successfully managed to complete the 10 hour print without it resetting on me!

I searched the web numerous hours figuring there had to be someone else with a similar problem. I wasn't convinced that my problem was hardware or temperature related. And finally I discovered this issue in the BTT SKR 2 repo titled "Random Reboot".

Now, that was the SKR 2 and mine is the Octopus v1.1, but both are STM32F4 variants, so I figured they might very well be using similar if not identical boot loaders. Unfortunately, BTT hasn't posted the source for the boot loader any place that I can locate (if someone knows otherwise, please let me know!).

According to that issue, the resets are caused by having a FIRMWARE.CUR file on the main SD Card, which you would if you've previously reflashed your firmware using their boot loader and haven't deleted it. They were citing reboot times on the order of about 2 minutes, whereas I was seeing multiples of 2 hours, but that may be differences between the two micros and their configuration?

I did have a FIRMWARE.CUR file on my SD Card from the last firmware I flashed. I hadn't bothered deleting it, because I figured it would be good to keep for reference in case I wanted to remember what I had flashed last, especially since I am storing the configuration detail there too.

Anyway, I figured what would it hurt to try and delete it to see if it fixes it? ... So I deleted it, started it printing, and now about 10 hours later, it has successfully finished printing -- finally!

I have no idea if this is a Marlin Firmware issue or a BTT Boot Loader issue or how they could interact with each other. All I know is that from this one test, it sure seems to have solved my problem. I still have one more set of these brackets to print, so perhaps tomorrow I'll know if this was a fluke or really the culprit.

@thinkyhead -- one idea did occur to me. Is there a Marlin bug releasing memory for files that are ignored on the SD Card? OctoPrint does monitor the list of files periodically and maybe it's a problem of running the micro out of memory?

ellensp commented 2 years ago

Leaving firmware.cur on the SD card should not cause a reboot / crash / freeze. Deleting the file is not a real solution; it is only a clue to the real issue. Thus reopening.

EvilGremlin commented 2 years ago

While Marlin is running, yep, it doesn't matter. But my ZNP Robin Nano falls into a boot loop at startup with ELEGOO.CUR on the SD card.

tiwanacote commented 2 years ago

Hi, we think we have found the origin of the problem (but not the solution, for the moment). After tens of hours of work, we have found something curious in the hardware that affects the firmware. We noticed that some pulses appear at the same time on different logic analyzer traces, as the following picture shows:

[Image: nse_on_SPI]

Also we have noted that during normal system operation, it is detected that the SPI clock signal (controlled by hardware) presents “jumps” of pulses. That is, instead of presenting 8 rising and falling edges during a byte read, "5" or "6" edges are detected, but the byte period remains constant, lengthening the last clock pulse.

[Image: SPI_clock]

Following this bit and frame synchronization error, the SD card fails to correctly interpret subsequent messages and can no longer be accessed. The attempts to re-initialize the card fail, which ultimately trips the microcontroller's watchdog timer and forces the system to reboot. @Roxy-3D can correct me if something is missing.
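
A toy model shows why a few missing clock edges are so damaging. This is a simulation under simplifying assumptions, not a capture of the real bus: the hypothetical `receive` function models a receiver that shifts in one bit per clock edge (MSB first) and emits a byte after every 8 edges, and `edgesPerByte` says how many edges actually arrive during each byte slot.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of an SPI receiver that counts clock edges to find byte boundaries.
// sent[i] is the byte the sender puts on the wire in slot i; edgesPerByte[i]
// is how many clock edges reach the receiver in that slot (normally 8).
// Both vectors are expected to have the same length.
std::vector<uint8_t> receive(const std::vector<uint8_t>& sent,
                             const std::vector<int>& edgesPerByte) {
  std::vector<uint8_t> out;
  uint8_t shift = 0;
  int bitCount = 0;
  for (std::size_t i = 0; i < sent.size(); ++i) {
    // Only the first edgesPerByte[i] bits of this slot are clocked in.
    for (int b = 0; b < edgesPerByte[i] && b < 8; ++b) {
      const uint8_t bit = (sent[i] >> (7 - b)) & 1;
      shift = static_cast<uint8_t>((shift << 1) | bit);
      if (++bitCount == 8) {  // the receiver believes a byte is complete
        out.push_back(shift);
        shift = 0;
        bitCount = 0;
      }
    }
  }
  return out;
}
```

Sending 0xFE 0x12 0x34 0x56 with only 6 edges in the second slot makes this receiver see 0xFE 0x10 0xD1: every byte after the short slot straddles two real bytes, and nothing in the stream itself restores the alignment.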

We investigated further: we attached oscilloscope probes to SPI CLK while executing a G-code file over UART (oscilloscope in threshold-trigger mode) and found EMI pulses. These pulses appear not only on the CLK pin but also on other pins. They are produced by the stepper motors during some (not all) deceleration moves.

[Image: EMI_SPI_noise]

In our humble opinion, this is a PCB design mistake: freewheeling diodes and 5V isolation are not used on the stepper motors, so the noise is not filtered and passes through all the logic traces (I would not like to imagine it reaching the step pins too). This issue occurs on at least the SKR MINI and MKS GEN-L motherboards (only two boards tested). Maybe @makerbase-mks and @bigtreetech can give an opinion.

We think the solution to this issue could involve both hardware and firmware. Regarding the latter, we are also trying to write a contingency routine to handle the SPI desynchronization in firmware but, as explained above, we cannot re-init the SD card.

Opinions are welcome. Thanks.

dewhisna commented 2 years ago

@tiwanacote -- since the motors aren't driven by the +5V (or +3.3V) logic voltages, I doubt freewheeling diodes would do much for the logic circuits.

I would think it's more of a ground isolation issue where the current from the motors gets dumped back into the ground plane on the PCB -- still a PCB design mistake as you note.

The grounds have to be common, but they should be joined at one and only one point and have a very low-impedance direct feed from the power supply to the motor circuits. The better boards have a separate power connection (both V+ and ground) for the motors separate from the logic power feed.

I would be curious if you were to isolate the ground pins on the driver boards from the main PCB and tie them together and directly back to the power supply feed, if that would eliminate the noise and help resolve the issue. Maybe do the same with the VM (motor voltage) feed too.

It probably also wouldn't hurt to sprinkle a few more decoupling capacitors around the PCB logic chips either.

tiwanacote commented 2 years ago

@dewhisna , yes, at a hardware level, the best way to solve this is to have a separate power connection as you mentioned.

At the firmware level, we have simulated clock desynchronization (using SOFTWARE_SPI) in order to emulate noisy SPI communication. We have found that it is very important to use #define SD_CHECK_AND_RETRY in the configuration so that CRC is used to detect errors and retry, but some SD cards (unbranded ones) sometimes fail when CRC checking is activated in Sd2Card.cpp via cardCommand(CMD59, 1). We are trying to understand why.
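
For context on what CRC activation changes: once the card checks CRCs, every command frame needs a real CRC7 instead of a fixed byte. The sketch below is the standard bitwise CRC7 (polynomial x^7 + x^3 + 1) over the 5-byte command frame; it is generic SD protocol arithmetic with hypothetical function names, not a copy of Marlin's implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC7 (polynomial x^7 + x^3 + 1) over `len` bytes. The result is
// returned in the on-wire form: CRC in bits 7..1, end bit (1) in bit 0.
uint8_t crc7Frame(const uint8_t* data, std::size_t len) {
  uint8_t crc = 0;
  for (std::size_t i = 0; i < len; ++i) {
    uint8_t d = data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = static_cast<uint8_t>(crc << 1);
      if ((d ^ crc) & 0x80) crc ^= 0x09;  // feedback when data bit != CRC MSB
      d = static_cast<uint8_t>(d << 1);
    }
  }
  return static_cast<uint8_t>((crc << 1) | 1);
}

// Build the frame for command index `cmd` with 32-bit argument `arg`
// (start/transmission bits 0x40, then the argument MSB first) and
// return its final CRC byte.
uint8_t commandCrc(uint8_t cmd, uint32_t arg) {
  const uint8_t frame[5] = {
    static_cast<uint8_t>(0x40 | cmd),
    static_cast<uint8_t>(arg >> 24),
    static_cast<uint8_t>(arg >> 16),
    static_cast<uint8_t>(arg >> 8),
    static_cast<uint8_t>(arg)
  };
  return crc7Frame(frame, 5);
}
```

`commandCrc(0, 0)` gives 0x95 and `commandCrc(8, 0x1AA)` gives 0x87, the two constants hard-coded in the Sd2Card.cpp snippets quoted earlier in this thread. A card that rejects correctly formed frames after CMD59 would point at the card rather than at the CRC arithmetic.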

github-actions[bot] commented 1 year ago

This issue has had no activity in the last 60 days. Please add a reply if you want to keep this issue active, otherwise it will be automatically closed within 10 days.

jisle15064468204 commented 1 year ago

Has anyone found a solution? Does this problem exist in 2.1?

dewhisna commented 1 year ago

While removing the FIRMWARE.CUR file (i.e. non-printable files) worked once or twice to let me complete the 10+ hour prints I was doing, it wasn't the whole solution, as the problem returned again.

I didn't have the time nor the extra filament to waste on failed print jobs, and since I was experiencing other bugs and quirks in Marlin related to things like filament loading/unloading and filament runout sensor pausing/resuming, I ended up jumping ships and switched to Klipper firmware on that printer. That introduced a whole new set of challenges and changes to my setup/configuration, but I haven't experienced any random resets there and no failed prints.

I'm fairly certain there's a bug somewhere in the Marlin code relating to SDCard I/O during printing. I'm not sure if it's watchdog related, such as getting stuck in retries, or if there's a memory leak and heap/stack smash or something. I often print with the printer unattended, meaning it's too dangerous to disable watchdog timers, so I haven't done much in the way of deep-dive debugging.

As for whether it exists in 2.1? All I know is that it definitely existed in the last snapshot I was trying on the bugfix-2.1.x branch at 4bd4c1f3bc00056da4fe008de9aeda8424422d3f, which was dated 2022-08-12 (or I guess technically 2022-08-11, since that commit was the cron job that bumped the date to 2022-08-12). That was when I gave up and moved to Klipper.

sbaeder commented 1 year ago

I'll check in 2.1.2 which was just released...

mrigi commented 1 year ago

Got this issue today on the "SKR Mini E3 v3" running Marlin 2.1.2. It randomly reboots without sending anything to the serial console.

tiwanacote commented 1 year ago

We worked very hard on this up to version 2.0.9.3. We have spent a lot of time on it and we think we have found where the problem appears, but we have not solved it. Instead, we have implemented two partial solutions: 1) Use good-quality SD cards (original SanDisk ones). This greatly reduces the problem but does not eliminate it completely. 2) We have implemented a routine that distinguishes a real power-loss reboot from a false one (a reboot caused by an SD failure or a watchdog trigger).

The partial solution was implemented as follows: a) When the watchdog ISR triggers a reboot, we save to EEPROM the X, Y, Z, E position, the SD position and a boolean failure flag that distinguishes a false reboot from a power-loss failure. b) On restart, we load the stored variables. If the failure flag is true, we call Marlin's power-loss recovery function without homing and simply continue printing from the same position (without showing the LCD message either). The only thing you may notice while printing is a short pause of about 5 seconds.
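
The control flow of steps a) and b) can be sketched roughly as follows. This is a simplified model, not the code from our firmware: the record layout, `FakeEeprom`, `BootAction` and the function names are all hypothetical, and a plain struct stands in for the EEPROM that survives the reboot.

```cpp
#include <cstdint>

// Hypothetical layout of what gets saved when the watchdog fires.
struct ResumeRecord {
  float x, y, z, e;    // last known axis positions
  uint32_t sdPos;      // byte offset in the G-code file on the SD card
  bool watchdogReset;  // true = "false" reboot, not a real power loss
};

// Stand-in for EEPROM: state that survives the simulated reboot.
struct FakeEeprom {
  ResumeRecord rec;
  bool valid;
  FakeEeprom() : rec(), valid(false) {}
};

enum class BootAction { NormalStart, PowerLossPrompt, SilentResume };

// Step a): called from the watchdog handler just before the reset.
void onWatchdogReset(FakeEeprom& ee, const ResumeRecord& current) {
  ee.rec = current;
  ee.rec.watchdogReset = true;
  ee.valid = true;
}

// Step b): decide at boot how to continue.
BootAction onBoot(FakeEeprom& ee, bool plrFileOnSd) {
  if (ee.valid && ee.rec.watchdogReset) {
    ee.rec.watchdogReset = false;     // consume the flag so it is used only once
    return BootAction::SilentResume;  // no homing, no LCD prompt
  }
  return plrFileOnSd ? BootAction::PowerLossPrompt : BootAction::NormalStart;
}
```

The important design point is that the flag is cleared as soon as it is read, so a later genuine power loss still goes through the normal "Power loss recovery" prompt.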

For a deeper exploration of the problem follow this:

We have been testing by forcing the software SPI clock to produce a desynchronized communication (the origin of the problem). I have attached the firmware if you want to test it (the firmware emulates an SPI desynchronization). I have found that the watchdog triggers in queue.cpp, in the while loop of inline void GCodeQueue::get_sdcard_commands(). The firmware cannot get out of the infinite loop, where the condition n < 0 is always met.

    while (!ring_buffer.full() && !card.eof()) {
      const int16_t n = card.get();      // keeps returning a negative value
      const bool card_eof = card.eof();
      if (n < 0 && !card_eof) {          // this condition is always met
        ++counter_maxi;
        if (100 % counter_maxi == 0)
          SERIAL_ERROR_MSG(STR_SD_ERR_READ);
        continue;                        // HERE: the loop never exits
      }

If you follow the code from queue.cpp to cardreader.h, you will find the inline definition of get(), which calls file.read().

Then, in SdBaseFile.cpp, you can find the read method, int16_t SdBaseFile::read(void *buf, uint16_t nbyte).

The following line is the problem that causes the infinite loop described above:

else if (!vol_->fatGet(curCluster_, &curCluster_)){ // get next cluster from FAT

To be honest, I do not understand what it is doing with the FAT, and I don't have the time to solve it.
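
One possible direction, shown only as an untested sketch, is to bound the number of consecutive read errors so the loop gives up instead of spinning until the watchdog fires. `FakeCard`, `ReadResult` and `readWithRetryLimit` are hypothetical names used to model the idea; this is not a patch for queue.cpp.

```cpp
#include <cstdint>

// Hypothetical card model: fails `failuresLeft` reads (forever if negative),
// then returns one newline and reports end of file.
struct FakeCard {
  int failuresLeft;
  bool atEof;

  explicit FakeCard(int failures) : failuresLeft(failures), atEof(false) {}

  int16_t get() {
    if (failuresLeft != 0) {
      if (failuresLeft > 0) --failuresLeft;
      return -1;  // read error
    }
    atEof = true;
    return '\n';
  }

  bool eof() const { return atEof; }
};

enum class ReadResult { Ok, GaveUp };

// Same loop shape as the excerpt above, but with an exit after too many
// consecutive errors, so the caller can abort or pause the print instead
// of hanging.
ReadResult readWithRetryLimit(FakeCard& card, int maxConsecutiveErrors) {
  int errors = 0;
  while (!card.eof()) {
    const int16_t n = card.get();
    if (n < 0 && !card.eof()) {
      if (++errors >= maxConsecutiveErrors) return ReadResult::GaveUp;
      continue;
    }
    errors = 0;  // a successful read resets the counter
    // ... the character would be queued here ...
  }
  return ReadResult::Ok;
}
```

What the firmware should do after `GaveUp` (pause, retry the card init, or abort the print) is a separate decision; the sketch only shows how to avoid the endless loop.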

Firmware - Soft SPI - Forced clock

gus-abreu commented 1 year ago

+1 on this issue. Using SKR Mini E3 v3 running 2.1.2, printing through SD gives random crashes. Using USB seems to be working fine.

macem commented 1 year ago

+1. I have a new Octopus v1.1 and TFT24 and the same problem with Marlin 2.1.x; over USB everything works. When I use an SD card (I tested flash drives and fast SD cards), the print stops at the same position and the extruder motor makes a strange noise. I have to restart the printer. Everything had been working for a month, but I recently changed the Cura 5.2.0 configuration and something went wrong. I checked, and it seems that setting the 'Maximum Deviation' option below 0.015 causes the freezes. With 0.065 everything works fine, but I need more testing to confirm it.

Bkara1981 commented 1 year ago

I have the same situation with two STM32F407 boards: two identical printers in different locations, with different SD cards, the same firmware and the same configuration. While printing from the SD card with USB connected to the printer, the Pronterface console just shows an SD card read error, and the board reboots. Also, while printing, some G-code is skipped (not missed steps; it skips parts of the model), the offset changes so that a few layers are shifted on one axis, and the head goes off to some position and comes back. The original firmware and Klipper work fine, and USB printing with Pronterface works fine.

I can clearly say the root cause is not SD card quality, noise or EMF, because the original firmware works perfectly fine with cheap Chinese SD cards. I have tried many things mentioned in the many threads about this issue, from lowering the SD card SPI speed to even disabling the boot logo; nothing fixed it. I even tried updating sdio.cpp to the bugfix release and enabling SD_CHECK_AND_RETRY; nothing helped.

I think people who hit this problem are switching to old releases or other firmware and not submitting a bug report, and many of them assume the existing bug reports will be solved in time (but this one will be marked as stale for lack of activity and probably will not be fixed).

mechase2000 commented 1 year ago

Can you please post the code you used to get it to continue printing after the reboot?

robotinos3d commented 1 year ago

@mechase2000 yes, of course. The only problem is that this feature is available only for MKS GEN L (mega2560) and MKS SGEN L (LPC1768). For other motherboards you have to modify the HAL's watchdog.cpp file (compare the original file with mine). Refer to my description above: we also changed some EEPROM variables, and the archive contains further changes specific to our machines. These changes were made on Marlin 2.0.9.3.

marlin-2.0.9.3.zip

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
