NatronGitHub / Natron

Open-source video compositing software. Node-graph based. Similar in functionalities to Adobe After Effects and Nuke by The Foundry.
http://NatronGitHub.github.io
GNU General Public License v2.0

Rendering silently stalls after X frames #248

Open devernay opened 6 years ago

devernay commented 6 years ago

From @unfa on April 6, 2018 10:58

Problem

I worked on an animated audio visualiser for my music album. I had to render out 2400 frames of a looped sequence that I could use later in another Natron project to produce the final comp.

However, I was only able to render 50-70 frames at a time. After that the rendering would stall and the CPU usage would drop to near zero. I tried both the GUI and the NatronRenderer CLI, with the same behaviour. I tried different Natron versions (2.3.5, 2.3.8, 2.3.9) and hit the same problem. I have yet to try Natron 2.3.10.

It also brought my whole system down multiple times - that might be a SWAP issue that I have (I've moved my system to ZFS and since then I had system freezes when SWAP is being used) - probably unrelated to Natron.

After a few days of trying different things I decided to give up on using Natron for this production: if I have to manually restart rendering for the 100 000 frames I'm going to need, it's just not worth it. The system freezes also made it a poor fit for rendering non-stop, mostly unattended, for a week or so.

Now I'm rendering this project with Blender.

I managed to render out the 2400 frames of a looped sequence in one go by rendering only a small subset of my initial project - so I guess some part of that node graph causes a problem.

I had some warnings about NaNs converted to white in a Crop node, but I don't think this is a reason why the whole thing stops rendering after a while.

It saves out proper frames. It looks like Natron stops starting new rendering processes: it lets the current ones run and finish properly, but then it "forgets" to render more frames.

I first thought that maybe the problem was a SlitScan node, but that one worked well. I wonder how I can debug this issue to find out what could have caused the stalled rendering.

Expected behavior:

Rendering all frames in the range.

Actual behavior:

Rendered a couple of dozen frames, then it just stops - no crash, no error messages, no smoke.

Steps to Reproduce

I can host my project but it requires lots of data to run, not sure if I can reproduce the issue without that.

Versions

Multiple versions of Natron before the 2.3.10 release.

Linux Mint 18.3 KDE5, linux-lowlatency-hwe kernel. Ryzen 7 1700 CPU.

Copied from original issue: MrKepzie/Natron#1756

devernay commented 6 years ago

Would you share that project with me?

This is obviously a deadlock somewhere.

Things to try:

- if the rod of the input to the write node is very large, put a crop before the write (not necessary with 2.3.10)
- reduce the number of threads used to render.

devernay commented 6 years ago

Did you get time to test with 2.3.10? Did you take a look at Natron memory usage? Use any process information display / top / htop. Did you try limiting the cache size in the parameters? Did you try limiting the number of rendering threads to 1 in the parameters (yes, I know this is not a solution, but it helps narrow down the problem)?
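If watching top/htop by hand is impractical during a long render, the memory check can also be logged. The following is only a minimal sketch for Linux: it assumes the kernel exposes `/proc/<pid>/status` with a `VmRSS` line, and the function names are made up for this example (they are not part of Natron).

```python
import time
from pathlib import Path


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in KiB) found in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    123456 kB".
            return int(line.split()[1])
    return None


def log_memory(pid, interval=5.0, samples=3):
    """Print the resident memory of process `pid` every `interval` seconds."""
    status = Path(f"/proc/{pid}/status")
    for _ in range(samples):
        rss = parse_vmrss_kib(status.read_text())
        print(f"pid {pid}: VmRSS = {rss} KiB")
        time.sleep(interval)
```

Calling `log_memory(pid)` with the PID of the running Natron or NatronRenderer process shows whether memory keeps growing up to the point where the render stalls.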

Do you have the possibility to send me your project, so that I can take a look at it? Send me a link by email.

devernay commented 6 years ago

@unfa any feedback?

devernay commented 6 years ago

From @unfa on April 24, 2018 8:39

Yes, sorry for my latency. I can deliver the project file - no problem. I'm not sure if it's reproducible without the used input image sequences (which are huge). I'll post the project file here soon.

ziSo12 commented 4 years ago

Any update on this? I got pretty much the same issue, running Natron 2.3.14 on Linux Mint 19.1 Cinnamon. When I restart Natron, rendering works. In my case, up to 30.. 40% but then suddenly stops and CPU usage is going down to zero. Cannot render in a single run.

unfa commented 4 years ago

Hey, I don't think Natron is maintained any more. It's got a lot of stability issues which in any production I've been using it in made it a complete and utter waste of time - I'm not going to use Natron any more.

Unless you want to be running around in circles, wondering why 10% of your frames are garbled - I advise anyone to just let go of this software.

Unless there's someone who miraculously has all the time on their hands, knowledge and the will to fix all the problems and make it a reliable program.

I've tried using it for many projects, because it's very exciting and I love the tools it has, but every time I did so, I ended up wasting 3x the amount of time this should have taken me, trying to get it to work reliably, finally giving up and agreeing to lower the quality of my work.

Believe me: Natron will only give you trouble.

rodlie commented 4 years ago

@ziSo12 : Have you tried the latest 2.3.15 RC releases? Several things have been fixed since 2.3.14.

ziSo12 commented 4 years ago

@unfa Hi, thanks for your reply and your insights. I'm pretty new to Natron and didn't know that. Of course I ran into some crashes too, but apart from this issue it has worked pretty well so far. I totally get your point about wasting time and how frustrating it can be. Let's see; if I cannot resolve this issue, it doesn't make sense to keep using it anyway.

@rodlie Thanks for the hint, will try it out.

rodlie commented 4 years ago

@unfa : Please report issues (and a way to re-produce them) and they might get fixed.

ziSo12 commented 4 years ago

Ok, so I installed 2.3.15 RC 11. Issue is still there, however i was sometimes able to get it rendered to 100%. Not able to reproduce, though. Most of the time it isn't working.

What I tried so far:

- Disable/enable OpenGL Rendering
- Various numbers of render threads, including disabling threading (with -1)
- Rendering in a separate process

edit: project file removed (will send by email)

I feel like the first render after a reboot works better than the renders after it.

update:

Now I set the number of threads to 1 and tried rendering several times, restarting Natron / clearing caches in between (my total memory is 16 GiB):

update 2: it seems to be an OS specific or driver related issue. I installed Natron on Windows on the same machine (dual boot) and it rendered the same project on the first go. Note: Natron used a max. of 9.0 GiB of memory like before on Linux.

devernay commented 4 years ago

@ziSo12 there are several workarounds, please read https://natron.readthedocs.io/en/rb-2.3/guide/getstarted-troubleshooting.html#common-workarounds

If this is a large project, I would definitely recommend not rendering to video (item 1), and using the DiskCache node (item 3).

devernay commented 4 years ago

Hey, I don't think Natron is maintained any more.

Yes, it is. We're doing our best, which is not much, unfortunately.

2.3.15 fixes several threading bugs from 2.3.14. It's not released because there is no windows build yet, but release candidates are available for Linux and macOS.

Believe me: Natron will only give you trouble.

Thank you for your support, @unfa !

Using the DiskCache node and rendering to frames fixes most "stalling" projects, just try it for yourself.

ziSo12 commented 4 years ago

@devernay thank you for the suggestions, but the workarounds don't really change anything for me. I'm now rendering to image sequences only, as well as using a cache node, but the issue remains. It silently stops every x frames. As I said before, it's a linux specific issue. Maybe some ANR/dead loop

rodrigo-brito commented 4 years ago

This error occurs a lot here. Any update on it?

rodlie commented 4 years ago

@rodrigo-brito : Not until we can reproduce the issue. Feel free to share a project where this issue can be replicated.

makew0rld commented 3 years ago

I am also experiencing this on Arch Linux, with the latest version of Natron (2.3.15) installed from FlatHub. Rendering would hang periodically with CPU totally dropping down. I would have to restart it. Unlike others in this thread, I had "Overwrite" unchecked on my writer, so every time I restarted rendering it would go from where I left off. I was rendering to frames. Eventually it worked and completed. I was using 2 render threads, but also experienced this with higher numbers.

I will try and share my project soon, but it was nothing too complex. I followed this tutorial and added an EXR sequence write node after doing everything up to and including tutorial video 3.

Hope this can be fixed! Thanks for all your work.

Shrinks99 commented 3 years ago

Worth noting that I was able to render @makeworld-the-better-one's project file without issues on my Mac. Wasn't able to replicate this issue so it's not a cut-and-dry one :(

makew0rld commented 3 years ago

Definitely appears to be a Linux specific issue, yeah. Others have mentioned earlier in the thread that their projects worked fine on Windows as well.

makew0rld commented 3 years ago

Here is my Natron project file that caused the issue: tutorial.ntp.zip

The project file paths are absolute, but they can be easily changed. Resources: PNG sequence and graffiti and mural.

From the other comments in this thread, I doubt my project file will be of any specific relevance, but hopefully helpful if the devs can't reproduce this issue on Linux. I sent my installation details above.

cc @devernay @rodlie hope this helps!

rodlie commented 3 years ago

Here is my Natron project file that caused the issue: tutorial.ntp.zip

On Ubuntu 20.04 and Natron 2.3.15 Release the project render stops at 15% and the graph seems to be stuck at the reformat node at the end. Will test on a debug build (and other OS).

makew0rld commented 3 years ago

Thanks a lot for trying it out! I'm glad this bug can now be considered reproducible.

I also experienced the issue without the reformat node if I remember correctly.

I was also able to get past whatever percentage it stopped at (it was never a consistent number, and I don't remember it ever being 15%), by just trying again repeatedly, and keeping Overwrite disabled.
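When restarting like this, it helps to know which frames are still missing from the output folder. A minimal sketch, assuming the frames are written as numbered files; the file name pattern below is hypothetical and has to match whatever the Write node actually produces:

```python
from pathlib import Path


def missing_frames(out_dir, pattern, first, last):
    """Return the frame numbers in first..last that have no non-empty output file.

    `pattern` is a Python format string such as "frame_{:04d}.exr"
    (hypothetical naming; adapt it to the Write node's file name).
    """
    missing = []
    for frame in range(first, last + 1):
        path = Path(out_dir) / pattern.format(frame)
        # Treat an absent or zero-byte file as "not rendered yet".
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(frame)
    return missing
```

For example, `missing_frames("render", "frame_{:04d}.exr", 1, 2400)` would list the frames that a further run with Overwrite disabled still has to produce.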

Hopefully the debug build will succeed on Linux!

rodlie commented 3 years ago

When I disable the reformat node the render is successful on both Ubuntu 20.04 and RHEL 8.3, every time.

Screenshot from 2021-02-27 23-13-45

Shrinks99 commented 3 years ago

Something I was noticing as well with this file is that the first air conditioner on the left should be painted out but despite being frameheld the program doesn't always seem to register this on frame 1. You can see this happening in your screenshot, I was able to get this to work by resetting the paint stroke lifetime but it's also a weird bug.

makew0rld commented 3 years ago

When I disable the reformat node the render is successful on both Ubuntu 20.04 and RHEL 8.3, every time.

@rodlie On the first run (after rebooting) mine stopped at 94%, but all frames were written. To be honest I believe this is a separate bug where Natron is miscalculating the progress or the threads are doing nothing or something. Let me know if I should open a separate report for that. The threads were still using up CPU but not much in comparison.

The second time I ran this (after completely stopping Natron and re-opening), with the reformat node still disabled, it was faster and went all the way to 100%. No threads were really using noticeable CPU at the end, indicating rendering was completely finished.


The issue @Shrinks99 mentioned has never occurred to me, not sure what's up with that. Maybe open another issue?

makew0rld commented 3 years ago

After opening Natron again (third time since boot) and re-enabling the reformat node, my render got all the way to 96% without issues. All frames were written, and so I believe this is a possibly unrelated bug as I described in my previous comment.

It seems like the original rendering stopping issue is not completely reliable, I'm glad you were able to reproduce it. Is it just the reformat node causing the issue then? What made you decide to try disabling it in the first place?

makew0rld commented 3 years ago

@rodlie any update on this? Thanks!

rodlie commented 3 years ago

If I get the time I will take a look tomorrow.

devernay commented 3 years ago

@rodlie still working on this? btw the stable branch switched to RB-2.4

rodlie commented 3 years ago

Sorry, had to prioritize work all week. Got time this weekend.

btw the stable branch switched to RB-2.4

Great.

devernay commented 3 years ago

Will you have time to work on this soon, or should we release 2.4.0 with only https://github.com/NatronGitHub/Natron/pull/603 merged in?

rodlie commented 3 years ago

Yeah, never enough time... Let me try after work today.

devernay commented 3 years ago

Got any chance to work on this? We can release 2.4.0 as it is if you don't have the time.

rodlie commented 3 years ago

Tried to debug yesterday, but I got no solution yet. We should move this issue to 2.4.x (or 2.5) for now.

devernay commented 2 years ago

bump @rodlie @YakoYakoYokuYoku

YakoYakoYokuYoku commented 2 years ago

Will try to repro later, although I suspect that the issue is either in OpenFX-IO or Natron.

devernay commented 2 years ago

It's a deadlock, so I doubt it comes from plugins (multithreading is done cleanly in plugins). Most likely Natron, and the first test after repro is to disable threading or limit to 1 render thread

devernay commented 2 years ago

I can't repro on my MacBook (quad-core from 2016 running Monterey) with 2.4.2 binary release, it renders fine even with the Reformat node enabled

makew0rld commented 2 years ago

This issue is only with Linux as I understand it. Shrinks99 noted above that he wasn't able to reproduce the bug on macOS either. That was with Natron 2.3.15, the latest at the time.

odditica commented 2 years ago

This issue is only with Linux as I understand it.

Incorrect. I encountered this issue consistently in 2.4.1 on Windows every single time I wanted to export a certain project, to the point I had to hack together a script that would render image-by-image, restarting the process over and over whenever it detected a stall, until all 720 frames have been rendered out. I wasn't able to figure out what nodes were causing the issue, but turning off entire branches seemed to have no effect. However, I suspect it might have something to do with the fact most of it was very expression-heavy, which Natron doesn't seem to like one bit. Expressions would sometimes evaluate incorrectly on certain frames and I'd have to re-render them a couple of times until I got the correct result. Also, if it's any help, I didn't use any external plugins/extensions, only what comes with Natron out-of-the-box.
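For anyone who needs the same stopgap, a restart script of this kind could look roughly like the sketch below. It is not the script I actually used. It treats "the renderer did not exit within `timeout` seconds" as a stall, which is a simplification. The writer name, the project path and the exact `-w <writer> <first>-<last>` form in `natron_cmd` are placeholders and assumptions; check `NatronRenderer --help` for your version.

```python
import subprocess
import sys


def frame_chunks(first, last, size):
    """Yield inclusive (start, end) ranges that cover first..last."""
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        yield start, end
        start = end + 1


def render_with_restarts(build_cmd, first, last, size=50, timeout=600, max_tries=5):
    """Render chunk by chunk, each chunk in a fresh process.

    A chunk that has not exited after `timeout` seconds is treated as
    stalled: subprocess.run() kills it and the chunk is tried again.
    """
    for start, end in frame_chunks(first, last, size):
        for attempt in range(1, max_tries + 1):
            try:
                result = subprocess.run(build_cmd(start, end), timeout=timeout)
            except subprocess.TimeoutExpired:
                print(f"{start}-{end}: stalled (try {attempt})", file=sys.stderr)
                continue
            if result.returncode == 0:
                break  # chunk finished, move on to the next one
            print(f"{start}-{end}: exit code {result.returncode} (try {attempt})",
                  file=sys.stderr)
        else:
            raise RuntimeError(f"frames {start}-{end} failed {max_tries} times")


def natron_cmd(start, end):
    # ASSUMPTION: "Write1", "project.ntp" and the argument layout are
    # placeholders; verify them against your own project and Natron version.
    return ["NatronRenderer", "-w", "Write1", f"{start}-{end}", "project.ntp"]
```

It would be started with something like `render_with_restarts(natron_cmd, 1, 720, size=1)`. The timeout has to be longer than the slowest legitimate chunk, otherwise healthy renders get killed too.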

EDIT: Also, a curiosity I just remembered - for some reason, frame 16 would never export, no matter what I tried. I could preview it in the viewport, but I could never render it out (it would just get skipped entirely) and to this day I have no idea why.

makew0rld commented 2 years ago

Okay, thanks for the extra info. But it still seems the bug does not appear on macOS. Maybe it's a false flag and just no one has been able to reproduce it, I'm not sure.

devernay commented 2 years ago

@odditica you cannot fully blame Natron. It may also be an issue with your expressions: the execution order of expressions cannot be guaranteed, and you may be using expressions whose value depends on the execution order. Typical example: let us say A and B are zero before execution:

expression 1: A := B + 1
expression 2: B := A + 1

What are the respective values of A and B? Nobody knows, since it depends on the execution order. And it is very (and I mean very) hard to detect such errors automatically in the expression code.
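To make this concrete, here is a small plain-Python sketch (not Natron code) that evaluates the two expressions once each, in both possible orders. It assumes each expression runs exactly once per pass, which is a simplification.

```python
def run(order):
    """Evaluate the two expressions once each, in the given order."""
    values = {"A": 0, "B": 0}
    expressions = {
        "expression 1": lambda v: v.update(A=v["B"] + 1),  # A := B + 1
        "expression 2": lambda v: v.update(B=v["A"] + 1),  # B := A + 1
    }
    for name in order:
        expressions[name](values)
    return values["A"], values["B"]


print(run(["expression 1", "expression 2"]))  # (1, 2)
print(run(["expression 2", "expression 1"]))  # (2, 1)
```

The same two expressions give A=1, B=2 in one order and A=2, B=1 in the other.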

"Expression-heavy" may mean there are such issues where the result may depend on the expression order. Keep your expressions as simple as possible, the complicated expressions should be at one place and not depend on other expressions. If there are loops in your expression dependency graph, then the result may (or not) depend on the execution order.

odditica commented 2 years ago

@devernay

Well, in short, since my project was a looping animation, every single expression pulled a t variable from a "clock" NoOp node (which seemed the most appropriate type to use) that calculated it based on user-provided timing parameters. No cyclic graphs, however, so I see no reason why there should be issues deterministically resolving this. The most problematic part of the graph was a situation where another custom node used the t to generate periodic "wiggle" used in a few transform nodes - this did seem to behave deterministically only 90% of the time.
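One way to double-check the "no cyclic graphs" part is to write down by hand which parameter reads which and run a cycle search over that map. A minimal sketch in plain Python; the parameter names are hypothetical, and the map has to be maintained manually because nothing here extracts dependencies from the expressions themselves:

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none.

    `deps` maps each parameter to the parameters its expression reads.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, done
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: a loop was found.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

With a map such as `{"Transform1.translate": ["wiggle.value"], "wiggle.value": ["clock.t"], "clock.t": []}` the function returns `None`, while `{"A": ["B"], "B": ["A"]}` gives `["A", "B", "A"]`.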

devernay commented 2 years ago

@odditica can you share the project?

odditica commented 2 years ago

@odditica can you share the project?

Sorry for the delay, here it is.

After cleaning it up a little and relativising all the paths, I actually managed to render it out just fine (no idea why!). Hopefully it will still be useful. Also here's the final render for reference.

ltsaber86 commented 1 year ago

Got the same issue. Fixed by reducing "Number of parallel renders" to 4.

- Setting "System RAM to keep free" to 5% made the render stall even with "Number of parallel renders": 5.
- With "System RAM to keep free": 10% and "Number of parallel renders": 6, the render works in 50% of cases.
- With "System RAM to keep free": 20% and "Number of parallel renders": 5, the render works in 90% of cases.

Maybe that helps

Stable setting that works:

- "Number of render threads": 16
- "Number of parallel renders": 4
- "System RAM to keep free": 20

PC specs:

- Intel Xeon, 8 CPU cores with HT (16 total)
- 8 GB RAM
- Windows 8.1 64-bit
- Natron 2.5.0

rodlie commented 1 year ago

Yeah, this has been a major issue for a long time (on all platforms). In general I have to cap Natron to 4 parallel renders and max threads per effect to 2 to avoid this issue (regardless of CPU).

Users will need to experiment with the settings to get something usable, out-of-the-box settings will most likely not work.

timobuske commented 1 year ago

For me (at least in my current project) the following workaround saved my life: for each Read node, create a custom property and give it a Python expression that references any property of another node, such as a NoOp. It could be an int named "a" that is 0. The Read just needs some Python input (to trigger something?). I don't know why, but it worked.