TomKaltz opened 6 years ago
Also please note that the stock 2.0.7 release records this file without dropped frames with following command...
ADD 1-7 FILE record7.mxf -vcodec dnxhd
Yea, but 2.0.7 doesn't output proper colors.
The bottleneck here seems to be the extra scale filter we add to correct the color range (and the fact that we don't run the encoder and filter in parallel).
There is no way around this issue. The scale=out_range=tv:out_color_matrix=bt709
filter doesn't run in real time with color transforms unless you use a powerful enough cpu in terms of SIMD, frequency and IPC. More cores does not help.
Is there any way we could do this at the card level (Decklink/Bluefish)?
I'm not sure why it's so slow.
Resolving this requires one of the following:
@5opr4ni no card is involved here, other than the gpu
I am wondering why this scale filter is so slow at this point in the pipeline. I'll have to check this with standalone ffmpeg next week. Maybe some additional flags provide more performance... I had no issues encoding high-bitrate long-GOP H.264 4:2:2 on an HP Z440 workstation.
Sorry! missed the consumer, thought about the producer.
@5opr4ni on input it's not a problem as far as I can tell
I wish the scale filter had slice threading...
It's one of ffmpeg's core components... I can't imagine bad performance here...
@premultiply: it has fast and slow paths, they don't optimize for every possible use case
Take a 59p video and see if you can build a corresponding command string that runs in realtime in standalone ffmpeg.
Yes. And have to check if it does scale twice for any reason.
One other alternative is to make the GPU mixer always output TV range RGB. That way we might not need the extra color transform (except for the screen consumer which could downgrade to experimental until we fix it). @premultiply? This would probably result in less accurate RGB=>YUV transforms though.
I'm not sure exactly how the FFMPEG default RGB->YUV conversion works. But I'm guessing it doesn't apply any color range calculations.
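For reference, the full-range RGB => TV-range BT.709 Y'CbCr transform the scale filter has to perform boils down to a matrix multiply plus range scaling. A minimal sketch in Python (coefficients are from the BT.709 spec; swscale's fixed-point implementation differs in rounding but computes the same transform):

```python
# Sketch: 8-bit full-range R'G'B' -> TV ("limited") range BT.709 Y'CbCr.

def rgb_to_ycbcr_bt709_tv(r, g, b):
    # Normalize 8-bit full-range components to 0..1
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    # BT.709 luma weights
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    # Color-difference signals, scaled to span -0.5..0.5
    cb = (b - y) / 1.8556   # 1.8556 = 2 * (1 - 0.0722)
    cr = (r - y) / 1.5748   # 1.5748 = 2 * (1 - 0.2126)
    # Quantize to TV range: Y' in 16..235, chroma centered on 128 in 16..240
    return (round(16 + 219 * y),
            round(128 + 224 * cb),
            round(128 + 224 * cr))

print(rgb_to_ycbcr_bt709_tv(0, 0, 0))        # black  -> (16, 128, 128)
print(rgb_to_ycbcr_bt709_tv(255, 255, 255))  # white  -> (235, 128, 128)
```

Doing this per pixel (plus 4:2:2 chroma subsampling) on every frame is why the filter is CPU-bound on SIMD and clock speed rather than core count.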
Another alternative is to output YUV444 (TV) from the mixer. Hmmm.... I think I like that best of all alternatives.
Will however require an extra GPU pass and also some work in the decklink and especially screen consumer. Would make ffmpeg and decklink consumers faster.
No, it is not that easy. When you convert to YUV (ignoring chroma subsampling here) the result is always TV range, BUT you have to decide on the COLOR MATRIX before you do the conversion from RGB. This would only be possible if the color matrix were locked to the channel mode and other input and output modes (resolutions) were not allowed. I do not think we want that. Converting the color matrix afterwards can be lossy for some color grades, so avoid that where possible (SD-HD-UHD conversion has to be done in one step). It's the same for converting broken full-range YUV to correct TV-range YUV. This is even worse. Keep in mind that we are only working with 8 bits per channel here...
The only workaround would be to do all these conversions on the GPU for each consumer. Only the screen consumer does not need any conversion (native progressive RGB from the mixer). Decklink can do a valid conversion through the SDK or hardware (needs interlaced RGB). But the FFMPEG consumer may need a prefiltered combination depending on the user's requested output format. At the moment our good common interface is progressive RGBA to each consumer, and every consumer converts this on its own as required.
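To illustrate @premultiply's point about the matrix: the same RGB pixel maps to a different luma value under BT.601 and BT.709, so the matrix must be decided before conversion, and converting between matrices afterwards in 8 bit quantizes twice. A rough numeric illustration (not CasparCG code):

```python
# Same normalized R'G'B' pixel, two different luma matrices.
def luma(r, g, b, kr, kb):
    kg = 1.0 - kr - kb
    return kr * r + kg * g + kb * b

pixel = (0.2, 0.9, 0.3)  # a saturated green-ish value

y601 = luma(*pixel, kr=0.299, kb=0.114)    # BT.601 coefficients
y709 = luma(*pixel, kr=0.2126, kb=0.0722)  # BT.709 coefficients

print(round(y601, 4), round(y709, 4))  # noticeably different code values
```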
@premultiply what about 16 bit output from mixer? We want to do that in the future anyway... i.e. YUV444 (bt609, bt709, bt2020 depending on channel format) 16 bit. YUV444_16 => YUV422_8 should be relatively fast on cpu.
Hmm... of course 16 bit to 8 bit will require dithering...
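On the 16 => 8 bit point: plain truncation causes visible banding in gradients, while diffusing the quantization error keeps the average level correct. A toy 1-D error-diffusion sketch (just to show the principle; not what swscale actually implements):

```python
def quantize_16_to_8(samples):
    """Quantize 16-bit samples to 8-bit with simple 1-D error diffusion."""
    out, err = [], 0
    for s in samples:
        v = s + err                            # carry quantization error forward
        q = min(255, max(0, (v + 128) >> 8))   # round to nearest 8-bit value
        err = v - (q << 8)                     # error left after quantization
        out.append(q)
    return out

# An exact 8-bit value stays stable; a half-step value alternates so the
# average is preserved instead of banding to one side.
print(quantize_16_to_8([32768] * 4))  # -> [128, 128, 128, 128]
print(quantize_16_to_8([32896] * 4))  # -> [129, 128, 129, 128]
```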
The only clean solution would be to have multiple different outputs from mixer depending on consumer... if we are to do this without CPU involvement...
Easiest is probably if FFMPEG could take advantage of multi-core for these transformations.
Since we are not doing any scaling in this conversion I could probably implement a slice threaded color transform util based on sws scale.
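The slice-threading idea can be sketched as: split the frame into horizontal bands and run the (otherwise single-threaded) per-pixel transform on each band concurrently, which is valid precisely because no vertical scaling is involved. A simplified stand-in in Python, with a dummy full => TV range transform in place of the sws_scale call:

```python
from concurrent.futures import ThreadPoolExecutor

SLICES = 8  # frame height must be divisible by the slice count

def convert_slice(frame, y0, y1):
    # Stand-in for an sws_scale call on rows y0..y1:
    # map full-range 0..255 to TV-range 16..235.
    for y in range(y0, y1):
        row = frame[y]
        for x, v in enumerate(row):
            row[x] = 16 + (v * 219) // 255

def convert_threaded(frame):
    height = len(frame)
    assert height % SLICES == 0
    band = height // SLICES
    # Each band is independent, so the slices can run in parallel.
    with ThreadPoolExecutor(max_workers=SLICES) as pool:
        for i in range(SLICES):
            pool.submit(convert_slice, frame, i * band, (i + 1) * band)
    return frame

frame = [[0, 128, 255] for _ in range(16)]
convert_threaded(frame)
print(frame[0])  # -> [16, 125, 235]
```

In the real util each worker would call sws_scale on its band via a per-thread SwsContext; the open question below about dithering across slice seams is exactly where this simplification stops being obviously safe.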
Ok, I've implemented a threaded color transform (https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2). HOWEVER, it will always convert to YUVA444P, BT709, and only works with channel heights divisible by 8. Which is far from optimal but should work well for most cases.
The most problematic case will be RGB(A) and/or full range recording... but that is very unusual.
@TomKaltz please verify. -filter:v interlace,format=yuv422p
Possible further optimizations:
@premultiply: please create separate issue for those
@ronag and I iterated on this today and it's getting better, but it's still very inefficient. In my testing it seems that omitting alpha and swscaling to AV_PIX_FMT_YUV422P helps slightly. The best performance I got was by manually changing all occurrences of AV_PIX_FMT_YUVA422P to AV_PIX_FMT_YUV422P in ffmpeg_consumer.cpp after commit 0d721847b49d022f7db09f48e92d8732b0db19c8 and using the following command...
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace -threads:v 4
My brand-new quad-core 3.1 GHz MacBook Pro could barely keep up, but it did. I'm hoping the color transform can be moved to the GPU, because currently using the ffmpeg consumer to record broadcast formats is not very performant.
Just to make sure: have you verified that it is not the dnxhd codec in ffmpeg that kills the performance? Same with x264?
I definitely tested with ProRes with same bad results. I did a quick test with x264 defaults and it was slightly better but still not good.
Prores and dnxhd are the slowest codecs in ffmpeg that I know of. Maybe they are single-threaded or something like that, I don't know. Anyway... their performance is bad.
@TomKaltz your computer is 2.3 GHz with 3.1 GHz turbo :)... no? Personally, I don't think recording is something for a laptop.
x264 defaults is not for realtime recording. You should be using -preset:v veryfast
One more optimization would be to do the color conversion and interlacing in the same step... but now we're moving things out of the ffmpeg filter => more complexity.
Interlacing outside of user defined filter would mean that high quality progressive recording is lost when channel is set to interlaced format. Might be ok for most recording applications but replays will suffer. Mmmmh...
Anyway, I would prefer to switch to AV_PIX_FMT_YUV422P as @TomKaltz said above, as it would reduce the buffer sizes/amount of data to move and increase performance for the common case.
We need to use an alpha based pixel format since some users record alpha. We should create a dummy filter graph and check the resolved pixel format and use that.
@premultiply: I'm unsure whether performing the transform in slices is actually valid given dithering etc... are you able to find out?
I can try to measure it if there is a build available.
The auto build should be running.
I'm removing this from 2.2. There is not much more we can do without more effort.
From what I can see in a first test, the new transform is inaccurate, with level shifts. But I have to do multi-pass tests.
@premultiply I think the issue is more the possible seams between the (8) slices since it might be using dithering for the full => tv range transformation to distribute quantization errors. I'm unsure of the exact implementation and its impact. @5opr4ni maybe you know someone that can shed light on it?
@premultiply maybe you could investigate the implementation (sws flags) and if it is possible to enable/disable?
@ronag Can we try to replace swscale by zscale filter? Maybe it has better performance and accuracy.
-filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p
should do it if RGBA is passed to ffmpeg (again). dither parameter may be added additionally.
https://ffmpeg.org/ffmpeg-filters.html#zscale-1
@premultiply we're not using a scale filter, we're doing the scale manually. We could go back to how it was before. But then we don't get a parallel scale filter.
Yes, I know. But it's also swscale. And I'd like to try doing it with zscale in ffmpeg, as before with scale and no manual pre-conversion, to compare the performance.
Run some benchmarks with vanilla ffmpeg. If there is any tangible advantage I’ll revert the parallel conversion.
@TomKaltz Can you try it again with a build before https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2 and -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p
?
I will report back...
Compiled commit 54997aedb7ee2372f42b9a10bcfc0304fdb735c3 and ran with 1080i5994 channel.....
[2018-03-07 13:50:19.180] [info] Received message from Console: add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p\r\n
#202 ADD OK
[2018-03-07 13:50:19.181] [info] ffmpeg[record700.mxf] Initialized.
[2018-03-07 13:50:19.243] [error] [ffmpeg] code 3074: no path between colorspaces
[2018-03-07 13:50:19.243] [error]
[2018-03-07 13:50:19.256] [error] C:\Users\Thomas\casparcg\src\modules\ffmpeg\consumer\ffmpeg_consumer.cpp(358): Throw in function void __cdecl caspar::ffmpeg::Stream::send(class caspar::core::const_frame,const struct caspar::core::video_format_desc &,class std::function<void __cdecl(class std::shared_ptr<struct AVPacket>)>)
[2018-03-07 13:50:19.256] [error] Dynamic exception type: class boost::exception_detail::clone_impl<struct caspar::ffmpeg::ffmpeg_error_t>
[2018-03-07 13:50:19.256] [error] [struct boost::errinfo_api_function_ * __ptr64] = av_buffersink_get_frame
[2018-03-07 13:50:19.256] [error] [struct boost::errinfo_errno_ * __ptr64] = 542398533, "Unknown error"
[2018-03-07 13:50:19.256] [error] [struct caspar::tag_stacktrace_info * __ptr64] = 0# 0x00007FF6F994755E in casparcg
[2018-03-07 13:50:19.256] [error] 1# 0x00007FF6F9969AE0 in casparcg
[2018-03-07 13:50:19.256] [error] 2# 0x00007FF6F9A91A0A in casparcg
[2018-03-07 13:50:19.256] [error] 3# 0x00007FF6F9A8F0BB in casparcg
[2018-03-07 13:50:19.256] [error] 4# tbb::interface7::internal::task_arena_base::internal_current_slot in tbb
[2018-03-07 13:50:19.256] [error] 5# 0x00007FF6F9A80AE4 in casparcg
[2018-03-07 13:50:19.256] [error] 6# 0x00007FF6F9A8B4F7 in casparcg
[2018-03-07 13:50:19.256] [error] 7# 0x00007FF6F9A8CD50 in casparcg
[2018-03-07 13:50:19.256] [error] 8# 0x00007FF6F9943849 in casparcg
[2018-03-07 13:50:19.256] [error] 9# iswascii in ucrtbase
[2018-03-07 13:50:19.256] [error] 10# BaseThreadInitThunk in KERNEL32
[2018-03-07 13:50:19.256] [error] 11# RtlUserThreadStart in ntdll
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.315] [info] ffmpeg[record700.mxf] Uninitialized.
@ronag the parallel conversion is significantly more performant. Is there any downside to having the transform output locked to yuva422p in this way?
Found out that the current implementation is wrong, as it assumes the BGRA input from the mixer to be TV range and not full range. This gives wrong levels in the output.
Complete filter chain for pre https://github.com/CasparCG/server/commit/0d721847b49d022f7db09f48e92d8732b0db19c8 is ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p
@TomKaltz Thanks for testing! The problem in your case is that dnxhd requires 10 bit, which zscale does not support in this conversion path. But this also hints at why dnxhd performance is low: swscale needs to upscale from 8 to 10 bit before writing to the dnxhd encoder.
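The 8 => 10 bit step itself is cheap per sample but touches every sample of every frame. The standard way to widen bit depth is bit replication, which maps the endpoints exactly (I haven't checked which exact method swscale uses; this just shows the cost and shape of the operation):

```python
def expand_8_to_10(v):
    # Replicate the top 2 bits into the low bits so that
    # 0 -> 0 and 255 -> 1023 map exactly (no range compression).
    return (v << 2) | (v >> 6)

print(expand_8_to_10(0), expand_8_to_10(255))  # -> 0 1023
```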
Some sort of XDCAM HD422 flavor (wrong audio track configuration) should give much better performance:
ADD 1-777 FILE test.mxf -codec:v mpeg2video -codec:a pcm_s24le -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p -b:v 50M -maxrate:v 50M -bufsize:v 3835k -minrate:v 50M -profile:v 0 -level:v 2 -flags:v ilme+ildct
This is a sort of a continuation of #883
Compiled 1fb0d9348d424a008d1e2ee97539aac15a1e0f1f myself.
Used command....
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=out_range=tv:out_color_matrix=bt709,format=yuv422p
The input buffer fills linearly from the start of the consumer, saturates, and never drains. Plenty of resources left on this brand-new MacBook Pro.