HaveAGitGat / Tdarr

Tdarr - Distributed transcode automation using FFmpeg/HandBrake + Audio/Video library analytics + video health checking (Windows, macOS, Linux & Docker)

Transcode fails in Docker container under Tdarr but not when run by hand #634

Closed: adefaria closed this issue 1 year ago

adefaria commented 2 years ago

I've been transcoding a lot of things successfully, but I've also been looking into the transcode failures and trying to work out why they are not working. Sometimes it's just changing the output file extension from mkv -> mp4. Other times it's changing it from mp4 -> mkv. But there are times when the transcode just works when I replicate the command outside of the Docker container. So I decided that when that happened I'd stop the transcode running outside the Docker container and re-queue it inside the Docker container to see if it worked this time. It didn't.

The video in question is Grey's Anatomy/Grey's Anatomy S17E03 - My Happy Ending.mkv. I go to the transcode errors section and look in the log files for the ffmpeg command being run. In this case, it's:

tdarr-ffmpeg -c:v h264_cuvid -i "/input/TV/Grey's Anatomy/Grey's Anatomy S17E03 - My Happy Ending.mkv" -map 0 -c:v hevc_nvenc -cq:v 19 -b:v 4524k -minrate 3166k -maxrate 5881k -bufsize 9048k -spatial_aq:v 1 -rc-lookahead:v 32 -c:a copy -c:s copy -max_muxing_queue_size 9999 "/temp/Grey's Anatomy S17E03 - My Happy Ending-TdarrCacheFile-B4DzIkuKC.mkv"

BTW, it'd be a nice enhancement to protect the input and output files by surrounding them with quotes, as many video files contain spaces. Also, it would be nice to provide a field with just "This command was used to perform the transcode", perhaps with a copy link.
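
For example, a path containing spaces has to be quoted before bash will hand it to a program as a single argument; a trivial illustration (throwaway filename, just to show the splitting):

# bash splits the unquoted path into several arguments; quoting keeps it as one
touch "My Happy Ending.mkv"
ls -l My Happy Ending.mkv      # fails: ls is given three separate, non-existent names
ls -l "My Happy Ending.mkv"    # works: one argument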

Here's the error report: a3bAuBeRB-log.txt

I ran the ffmpeg command untouched (save for quoting the filenames and removing the leading "tdarr-") on my desktop and it started working. So I decided to do the same inside the Docker container. Note that I had limited my transcodes to 2 in Tdarr, so I wasn't hit with the error that happens when you run too many GPU processes (Nvidia limits it to 3 on my machine). The transcode ran fine until completion. So why didn't it run OK under Tdarr in the first place? And why did it fail when I simply re-queued it?

To Reproduce

Expected behavior: I expected it to run correctly the first time.

themooer1 commented 2 years ago

I had a similar problem where Tdarr in Docker was using app/Tdarr_Node/node_modules/@ffmpeg-installer/linux-x64/ffmpeg, but when I ran ffmpeg ... manually in the Docker container it was using /usr/local/bin/ffmpeg, which links to /usr/lib/jellyfin-ffmpeg/ffmpeg, which was compiled with better support for (in my case) hardware transcoding on Intel GPUs.
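
For anyone hitting the same thing, a quick way to compare the bundled and system ffmpeg builds from inside the container (paths are the ones mentioned in this thread; adjust if your image differs):

# Where do the ffmpeg names on the PATH actually point?
which -a ffmpeg tdarr-ffmpeg
readlink -f /usr/local/bin/ffmpeg /usr/local/bin/tdarr-ffmpeg
# Compare hardware-acceleration support between the two builds
/usr/local/bin/ffmpeg -hide_banner -buildconf | grep -Ei 'nvenc|cuda|qsv|vaapi'
/app/Tdarr_Node/node_modules/@ffmpeg-installer/linux-x64/ffmpeg -hide_banner -buildconf | grep -Ei 'nvenc|cuda|qsv|vaapi'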

supersnellehenk commented 2 years ago

BTW it'd be a nice enhancement to protect the input and output files by surrounding them with quotes as many video files contain spaces.

Tdarr passes the arguments to ffmpeg itself rather than through a shell, so it doesn't need the quotes for it to work.

The transcode ran fine until completion. So why didn't it run OK under Tdarr in the first place? And why did it fail when I simply re-queued it?

Encodes can fail for many different reasons. I'd need an error from the job report to figure out what it would be.

Sometimes it's just changing the output file extension from mkv -> mp4. Other times it's changing it from mp4 -> mkv

These fail because the plugin you're using is set to mp4, for example, but your original file is mkv. An mkv can contain codecs that mp4 doesn't support, hence the failure. In the case of your linked report, it's the ASS subtitles, which mp4 doesn't support; mp4 only supports mov_text.
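
For reference, the usual manual workaround when the target has to be mp4 is to re-encode the text subtitles instead of copying them, roughly (placeholder filenames):

# Copy video and audio, convert subtitles to mov_text (the subtitle codec mp4 accepts),
# and drop data/attachment streams that mp4 can't carry
ffmpeg -i input.mkv -map 0 -map -0:d -map -0:t -c:v copy -c:a copy -c:s mov_text output.mp4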

Not all plugins offer an 'original container' option, which would fix your issue with changing containers.

adefaria commented 2 years ago

BTW it'd be a nice enhancement to protect the input and output files by surrounding them with quotes as many video files contain spaces.

Tdarr passes the arguments to ffmpeg itself rather than through a shell, so it doesn't need the quotes for it to work.

The point is that if I copy and paste the ffmpeg line into my bash shell on my desktop, it doesn't execute unless I quote the filenames. I guess you could say that I shouldn't have to do that, but how else can I determine what the errors are? Plus, as this issue indicates, sometimes the transcode fails in Tdarr, but when I pull out that exact command and protect the filenames with quotes, it works fine on my desktop. Somebody else noted that Tdarr runs a tdarr_ffmpeg whereas I run just ffmpeg. They also noted that there is an ffmpeg, I think under /usr/local/bin. This could very well be an issue of different versions of ffmpeg, some working and others not. In fact, there was also an issue saying to update the version of ffmpeg that Tdarr uses.

The transcode ran fine until completion. So why didn't it run OK under Tdarr in the first place? And why did it fail when I simply re-queued it?

Encodes can fail for many different reasons. I'd need an error from the job report to figure out what it would be.

Sometimes it's just changing the output file extension from mkv -> mp4. Other times it's changing it from mp4 -> mkv

These fail because the plugin you're using is set to mp4, for example, but your original file is mkv. An mkv can contain codecs that mp4 doesn't support, hence the failure. In the case of your linked report, it's the ASS subtitles, which mp4 doesn't support; mp4 only supports mov_text.

I have experienced cases where the mkv -> mkv transcode fails, and if I try it by hand but change the output to mp4, it works. That's why I changed the container to mp4 and ran the transcoding over my videos. I've since changed it back to mkv, but shouldn't I be able to configure the plugins to try one first and, if that fails, try the next one? I really don't understand this plugin system. I would have thought that it might try, say, .mkv -> .mkv and if that fails try .mkv -> .mp4, or whatever. I tried hard to get my Tdarr node to use the Nvidia GPUs on my desktop and finally got that working using Migz Transcode Using Nvidia GPU & FFMpeg. As I understand it I can only set the container to one thing (mkv or mp4). I tried adding that plugin again on the theory that if the first plugin fails the second plugin would be tried, but the system would not allow me to add the same plugin a second time.

Also, these Transcode Options are associated with a library. I have Nvidia on my desktop, where I'm running a Tdarr node, and I have a laptop that doesn't have an Nvidia card. So how can I have both nodes transcoding the same library when it's configured to use an Nvidia card for GPU transcoding? When I tried that, I believe every time the laptop attempted a transcode it failed, since it didn't have an Nvidia GPU to perform the transcoding.

I'm also going to try moving tdarr_ffmpeg aside and trying to get /usr/local/bin/ffmpeg to be used and go through my remaining 3884 videos to see how many I can get transcoded successfully.

I've also seen where the transcode produces a file that is larger than the original. How do I mark those such that Tdarr doesn't constantly attempt to transcode them again and again?

supersnellehenk commented 2 years ago

how else can I determine what the errors are?

Those would be stated in the job report 99% of the time. For copy-pasting, I do agree that quoting them would be easier. Tdarr does log the errors ffmpeg outputs, though in some odd cases where ffmpeg just dies it doesn't, since there's no actual error.

As I understand it I can only set the container to one thing (mkv or mp4).

Yeah, that plugin doesn't have an 'original container' option to just keep the container the file is already in. You could add that manually, but in my opinion a PR should be made for that.

So how can I have both nodes transcoding the same library when it's configured to use an Nvidia card for GPU transcoding?

That's where the capabilities in the node options come in. Set your desktop node to NVENC, so it would only run NVENC plugins. Then you could add another CPU/whatever plugin in front of it. GPU workers skip past plugins until they find one they can run. I'm not exactly sure how it determines how/when it can.

I'm also going to try moving tdarr_ffmpeg aside and trying to get /usr/local/bin/ffmpeg

tdarr_ffmpeg is just a symlink to /usr/local/bin/ffmpeg. It's done this way so a GPU statistics plugin on Unraid can tell which processes come from Tdarr.

I've also seen where the transcode produces a file that is larger than the original.

That would be the size check plugin. You can set a lower/upper bound that the new file can't exceed; if it does, the plugin throws the file into errored.

adefaria commented 2 years ago

Take, for example, the file /Videos/TV/America's Got Talent/America's Got Talent S17E03 - Auditions 3.mkv. I have about 3888 transcode errors and I decided to pick this one.

65jH_claZm-log.txt

It says cu->cuInit(0) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected, yet I know there is a CUDA device, as other transcodes have used the GPU without issue and I only do at most 3 at a time, which is what my video card can support. This transcode was attempted with the GPU. So I decided to try again, but this time using a CPU instead of a GPU: I set the GPUs to 0 and the CPUs to 1, then requeued the file. That worked!

hyLFpz8Em-log.txt

I know that CPU transcoding is supposed to be better, but slower and more resource-intensive, than GPU transcoding; still, I expected the GPU transcode to at least work. Like I said, AFAICT there was a CUDA device available to transcode when using the GPU.
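
A quick sanity check from the host to confirm the container can still see the card (tdarr here is just whatever your container is named):

# Is the NVIDIA driver reachable from inside the container?
docker exec tdarr nvidia-smi
# Are the device nodes still present in the container?
docker exec tdarr sh -c 'ls -l /dev/nvidia*'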

OK, back to transcoding /Videos/TV/America's Got Talent/America's Got Talent S17E03 - Auditions 3.mkv. I set CPUs to 0, GPUs to 1, and requeued the video by copying a saved copy of the original file back into place. It transcodes OK, but it does so with

tdarr-ffmpeg -i /input/TV/America's Got Talent/America's Got Talent S17E03 - Auditions 3.mkv -map 0 -c:v libx265 -b:v 2349k -minrate 1644k -maxrate 3053k -bufsize 4698k -c:a copy -c:s copy -max_muxing_queue_size 9999 -map -0:d /temp/America's Got Talent S17E03 - Auditions 3-TdarrCacheFile-5TncZPJr-t.mkv

which nvtop does not report as using a GPU, yet the system monitor says it's using the CPU. And this is with CPUs set to 0 and GPUs set to 1. Shouldn't it only use GPUs? How do I force it to use GPUs only?

tdarr_ffmpeg is just a symlink to /usr/local/bin/ffmpeg.

Actually, both /usr/local/bin/tdarr-ffmpeg and /usr/local/bin/ffmpeg are symlinks to /usr/lib/jellyfin-ffmpeg/ffmpeg.

I've also seen where the transcode produces a file that is larger than the original.

That would be the size check plugin. You can set a lower/upper bound that the new file can't exceed; if it does, the plugin throws the file into errored.

I'm not sure where to check in the plugin - I see no lower/upper bound to configure. I'm using the Migz-Transcode Using GPU & FFMPEG and Migz-Transcode Using CPU & FFMPEG plugins.

Initially I had only the Migz-Transcode Using GPU & FFMPEG plugin set, because I always wanted to use my GPUs. Later I added the Migz-Transcode Using CPU & FFMPEG plugin on the theory that if the GPU transcode fails then the CPU plugin would be tried. Now you're telling me that I should put Migz-Transcode Using CPU & FFMPEG first, so I'll try that. Maybe my 3888 transcode errors will go down.

supersnellehenk commented 2 years ago

cu->cuInit(0) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

This error is due to one of two things. Either there's actually no CUDA device, which in your case would not be true, or it's down to the capabilities of said CUDA device: certain options like b-frames aren't supported on a Pascal generation card.

which nvtop does not report as using a GPU, yet the system monitor says it's using the CPU.

If you look at the command that's being run, it states -c:v libx265, which is a CPU encoder for HEVC. What you want is -c:v hevc_nvenc, which would be the Nvidia HEVC encoder.
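
If you want to double-check which HEVC encoders a given ffmpeg build actually offers, something like this works:

# hevc_nvenc should be listed if NVENC support was compiled in; libx265 is the software encoder
ffmpeg -hide_banner -encoders | grep hevc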

If you simply want to use only the GPU to convert video, disable the CPU plugin. Go to your node, open the node options, then toggle Allow GPU workers to do CPU tasks to on. Then you can set CPU workers to 0 and GPU workers to 3, since you have not patched your card.

If you want to do a hybrid option of both GPU and CPU, then you need to go to your node settings and disable said toggle. Then you can use both CPU and GPU workers.

You've also got node capabilities, which defines the type of hardware the node has available for encoding so it doesn't try to run an Intel encoder on an Nvidia card for example.

I'm not sure where to check in the plugin

The size check plugin is a different plugin, which you'll find in the plugin tab at the top. There you can copy the plugin ID for adding to your plugin stack. Then you'll find those two options.

on the theory that if the GPU transcode fails then the CPU plugin would be tried

That's not how plugins work. Mixing GPU & CPU complicates things, but the basic approach is to run the plugins in order, from top to bottom. If one fails, the entire file fails and gets thrown into error/cancelled.

GPU workers work a little weird, which I still don't fully understand either, but my basic understanding is they essentially 'skip' all plugins until they find one that produces an output they can run, then run that. So say you have 7 plugins, the 4th being the GPU plugin. The GPU worker would skip 1-3, run 4, then requeue the file. If another GPU worker picks it up, it sees there's no GPU stuff to run anymore and requeues it for a CPU worker.

adefaria commented 2 years ago

cu->cuInit(0) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

This error is due to one of two things. Either there's actually no CUDA device, which in your case would not be true, or it's down to the capabilities of said CUDA device: certain options like b-frames aren't supported on a Pascal generation card.

You see, that's the thing. I've had many, many of these failures. I can go into the report, copy and paste the ffmpeg command into a bash shell on my desktop (the same machine with the Nvidia GPUs), and it transcodes without issue. We know the CUDA device is there and we know it's the same video file with the same parameters. It doesn't work in the Docker container, but it works on the physical machine outside the Docker container.

Now granted, the ffmpeg version that I have installed on my Ubuntu 21.10 desktop is newer than the jellyfin-ffmpeg one installed in the Docker container. So I went into the Docker container and did an apt install ffmpeg, which installed it as /usr/bin/ffmpeg (whereas /usr/local/bin/tdarr-ffmpeg and /usr/local/bin/ffmpeg both point to /usr/lib/jellyfin-ffmpeg/ffmpeg via symlinks), and changed the symlinks in /usr/local/bin to point to /usr/bin/ffmpeg. Now, aside from it using only CPUs (even though the Tdarr web interface says Transcode GPU, the ffmpeg process does not show in nvtop and my CPUs are buzzing), all of my transcodes seem to be working.
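
Roughly what I ran inside the container (my own workaround, and it will be undone whenever the container is recreated):

# Install the distro ffmpeg alongside the bundled jellyfin-ffmpeg; it lands in /usr/bin/ffmpeg
apt update && apt install -y ffmpeg
# Point the Tdarr symlinks at it instead of /usr/lib/jellyfin-ffmpeg/ffmpeg
ln -sf /usr/bin/ffmpeg /usr/local/bin/ffmpeg
ln -sf /usr/bin/ffmpeg /usr/local/bin/tdarr-ffmpeg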

I just now disabled the CPU plugin and it's now using GPUs. Let's see if the transcodes start failing.

which nvtop does not report as using a GPU, yet the system monitor says it's using the CPU.

If you look at the command that's being run, it states -c:v libx265, which is a CPU encoder for HEVC. What you want is -c:v hevc_nvenc, which would be the Nvidia HEVC encoder.

I added the CPU plugin in front of the GPU plugin for the library as you suggested. Now it only seems to use the CPUs for transcoding. This really taxes my system (fans making lots of noise), which is why I want to use the GPUs only. I suspect that since the CPU transcoder was in effect, it used -c:v libx265 instead of -c:v hevc_nvenc. However, it seems clear that GPU transcoding will simply fail on these 3888 videos I have, as it did the last few times.

If you simply want to use only the GPU to convert video, disable the CPU plugin. Go to your node, open the node options, then toggle Allow GPU workers to do CPU tasks to on. Then you can set CPU workers to 0 and GPU workers to 3, since you have not patched your card.

This is exactly what I did before reporting this issue, and I had been running that way for months. Now I'm down to the stubborn few that will not transcode without error and I'm trying to fix them. In the past, I not only disabled the CPU plugin but removed it. Then I tried bringing it back but placing it after the GPU plugin, on the theory that if the GPU plugin failed it would try the CPU plugin that followed it. That didn't happen. If the GPU plugin failed it just reported that as a transcode error. Then, with your advice, I put the CPU plugin in front of the GPU plugin. Now only the CPU plugin is used.

If you want to do a hybrid option of both GPU and CPU, then you need to go to your node settings and disable said toggle. Then you can use both CPU and GPU workers.

I've now done that for my laptop. Thanks.

You've also got node capabilities, which defines the type of hardware the node has available for encoding so it doesn't try to run an Intel encoder on an Nvidia card for example.

I have an issue reported here (#614) asking how to use my Intel encoders on my laptop (a 2015 MacBook running Ubuntu) with Tdarr. I'd like to use the GPUs there too (I'm assuming they have GPUs) but have not managed to configure it to work and my issue has not been resolved.

I'm not sure where to check in the plugin

The size check plugin is a different plugin, which you'll find in the plugin tab at the top. There you can copy the plugin ID for adding to your plugin stack. Then you'll find those two options.

I guess I would have thought that the size plugin would have been in there by default. Added it.

on the theory that if the GPU transcode fails then the CPU plugin would be tried

That's not how plugins work. Mixing GPU & CPU complicates things, but the basic approach is to run the plugins in order, from top to bottom. If one fails, the entire file fails and gets thrown into error/cancelled.

GPU workers work a little weird, which I still don't fully understand either, but my basic understanding is they essentially 'skip' all plugins until they find one that produces an output they can run, then run that. So say you have 7 plugins, the 4th being the GPU plugin. The GPU worker would skip 1-3, run 4, then requeue the file. If another GPU worker picks it up, it sees there's no GPU stuff to run anymore and requeues it for a CPU worker.

That seems strange.

Well, the GPU transcodes seem to be working again, so I've left it set up that way.

This seems to be working much better: GPU transcodes are working well, and my laptop is only accepting CPU transcodes and seems to be transcoding successfully too! I'm not sure if it's my plugin configuration or the fact that I updated ffmpeg in the container.

And while my Not required slice of the pie has been growing, my Transcode success is still there. Considering I've Tdarr'ed through all my videos a few times, when does a Transcode success turn into a Not required? Isn't the end goal to have all your videos in Not required?

adefaria commented 2 years ago

OK, here's another one - 9baSn7kZ-MY-log.txt

The ffmpeg command is:

tdarr-ffmpeg -c:v h264_cuvid -i /input/TV/Criminal Minds/Criminal Minds S13E04 - Killer App.mkv -map 0 -c:v hevc_nvenc -cq:v 19 -b:v 370k -minrate 259k -maxrate 481k -bufsize 741k -spatial_aq:v 1 -rc-lookahead:v 32 -c:a copy -c:s copy -max_muxing_queue_size 9999 -map -0:d /temp/Criminal Minds S13E04 - Killer App-TdarrCacheFile-LuH_W4ICw9.mkv

And this is the error:

2022-06-21T12:47:11.000Z Error while opening decoder for input stream #0:0 : Unknown error occurred

If I take the ffmpeg command and run it outside of Docker on my desktop it transcodes without issue:

Earth:ffmpeg -c:v h264_cuvid -i "/input/TV/Criminal Minds/Criminal Minds S13E04 - Killer App.mkv" -map 0 -c:v hevc_nvenc -cq:v 19 -b:v 370k -minrate 259k -maxrate 481k -bufsize 741k -spatial_aq:v 1 -rc-lookahead:v 32 -c:a copy -c:s copy -max_muxing_queue_size 9999 -map -0:d "/temp/Criminal Minds S13E04 - Killer App-TdarrCacheFile-LuH_W4ICw9.mkv" && say "Transcode completed"
ffmpeg version 4.4.2-0ubuntu0.21.10.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-7ubuntu2)
  configuration: --prefix=/usr --extra-version=0ubuntu0.21.10.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  WARNING: library configuration mismatch
  avcodec     configuration: --prefix=/usr --extra-version=0ubuntu0.21.10.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared --enable-version3 --disable-doc --disable-programs --enable-libaribb24 --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libtesseract --enable-libvo_amrwbenc --enable-libsmbclient
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, matroska,webm, from '/input/TV/Criminal Minds/Criminal Minds S13E04 - Killer App.mkv':
  Metadata:
    ENCODER         : Lavf57.82.100
  Duration: 00:39:37.51, start: 0.000000, bitrate: 777 kb/s
  Stream #0:0: Video: h264 (High), yuv420p(tv, bt709/unknown/unknown, progressive), 720x404, SAR 1:1 DAR 180:101, 23.98 fps, 23.98 tbr, 1k tbn, 47.95 tbc (default)
    Metadata:
      DURATION        : 00:39:37.458000000
  Stream #0:1: Audio: aac (LC), 48000 Hz, stereo, fltp (default)
    Metadata:
      DURATION        : 00:39:37.513000000
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (h264_cuvid) -> hevc (hevc_nvenc))
  Stream #0:1 -> #0:1 (copy)
Press [q] to stop, [?] for help
Output #0, matroska, to '/temp/Criminal Minds S13E04 - Killer App-TdarrCacheFile-LuH_W4ICw9.mkv':
  Metadata:
    encoder         : Lavf58.76.100
  Stream #0:0: Video: hevc (Main), nv12(tv, bt709/unknown/unknown, progressive), 720x404 [SAR 1:1 DAR 180:101], q=2-31, 23.98 fps, 1k tbn (default)
    Metadata:
      DURATION        : 00:39:37.458000000
      encoder         : Lavc58.134.100 hevc_nvenc
    Side data:
      cpb: bitrate max/min/avg: 481000/0/0 buffer size: 741000 vbv_delay: N/A
  Stream #0:1: Audio: aac (LC) ([255][0][0][0] / 0x00FF), 48000 Hz, stereo, fltp (default)
    Metadata:
      DURATION        : 00:39:37.513000000
frame=57002 fps=664 q=17.0 Lsize=  171689kB time=00:39:37.49 bitrate= 591.6kbits/s speed=27.7x    
video:136282kB audio:34225kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.693164%
Earth:
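
Side note: a quicker way to isolate just the decoder inside the container is to decode a few seconds with h264_cuvid and discard the output; if CUDA initialization is the problem, this fails the same way without doing a full transcode:

tdarr-ffmpeg -hide_banner -c:v h264_cuvid -i "/input/TV/Criminal Minds/Criminal Minds S13E04 - Killer App.mkv" -t 5 -f null -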

Spot-checking further failures, I get a bunch of:

[AVHWDeviceContext @ 0x5630c2a25e40] cu->cuInit(0) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

It's as if my GPUs go missing. Restarting the Docker container often seems to fix this but then it breaks again.

HaveAGitGat commented 1 year ago

I think this is the same as this issue, but reopen if not: https://github.com/HaveAGitGat/Tdarr/issues/666