CasparCG / server

CasparCG Server is Windows and Linux software used to play out professional graphics, audio and video to multiple outputs. It has been in 24/7 broadcast production since 2006. Ready-to-use downloads are available under the Releases tab at https://casparcg.com.
GNU General Public License v3.0

ffmpeg consumer slow colorspace transform on i59 #901

Open TomKaltz opened 6 years ago

TomKaltz commented 6 years ago

This is sort of a continuation of #883.

Compiled 1fb0d9348d424a008d1e2ee97539aac15a1e0f1f myself.

Used command.... add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=out_range=tv:out_color_matrix=bt709,format=yuv422p

The input buffer fills linearly from the start of the consumer, saturates and never drains. There are plenty of resources left on this brand-new MacBook Pro.

(screenshot: capture)
TomKaltz commented 6 years ago

Also please note that the stock 2.0.7 release records this file without dropped frames with the following command...

ADD 1-7 FILE record7.mxf -vcodec dnxhd

ronag commented 6 years ago

Yea, but 2.0.7 doesn't output proper colors.

The bottleneck here seems to be the extra scale filter we add to correct the color range (and the fact that we don't run the encoder and filter in parallel).

ronag commented 6 years ago

There is no way around this issue. The scale=out_range=tv:out_color_matrix=bt709 filter doesn't run in real time with color transforms unless you use a powerful enough CPU in terms of SIMD, frequency and IPC. More cores do not help.

5opr4ni commented 6 years ago

Is there any way we could do this at the card level (Decklink/Bluefish)?

ronag commented 6 years ago

I'm not sure why it's so slow.

Resolving this requires one of the following:

ronag commented 6 years ago

@5opr4ni no card is involved here, other than the GPU.

premultiply commented 6 years ago

I am wondering why this scale filter should be/is so slow at this position. I have to check this with standalone ffmpeg next week. Maybe some additional flags provide more performance... I had no issues encoding high-bitrate long-GOP H.264 4:2:2 on an HP Z440 workstation.

5opr4ni commented 6 years ago

Sorry! I missed that this is about the consumer; I was thinking of the producer.

ronag commented 6 years ago

@5opr4ni on input it's not a problem as far as I can tell

ronag commented 6 years ago

I wish the scale filter had slice threading...

premultiply commented 6 years ago

It's one of ffmpeg's core components... I can't imagine bad performance here...

ronag commented 6 years ago

@premultiply: it has fast and slow paths; they don't optimize for every possible use case.

ronag commented 6 years ago

Take a 59p video and see if you can build a corresponding command string that runs in realtime in standalone ffmpeg.
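
As a rough example (the input file name is just a placeholder), a standalone run along these lines would isolate the filter cost from the encoder:

ffmpeg -benchmark -i input_1080p5994.mov -filter:v scale=out_range=tv:out_color_matrix=bt709,format=yuv422p -f null -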

premultiply commented 6 years ago

Yes. And I have to check whether it scales twice for any reason.

ronag commented 6 years ago

One other alternative is to make the GPU mixer always output TV-range RGB. That way we might not need the extra color transform (except for the screen consumer, which could be downgraded to experimental until we fix it). @premultiply? This would probably result in less accurate RGB => YUV transforms though.

I'm not sure exactly how the FFMPEG default RGB->YUV conversion works. But I'm guessing it doesn't apply any color range calculations.
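
For what it's worth, swscale does expose the range handling explicitly; a minimal sketch (a hypothetical helper, not CasparCG code, error handling omitted) of forcing a full-range RGB to TV-range BT.709 conversion on an existing context would look roughly like this:

extern "C" {
#include <libswscale/swscale.h>
}

// Hypothetical helper: map full-range RGB input to TV-range BT.709 YUV output
// on an already created SwsContext.
static void set_bt709_tv_range(SwsContext* sws)
{
    int* inv_table;
    int* table;
    int src_range, dst_range, brightness, contrast, saturation;

    // Read the current defaults so only the matrix and ranges are overridden.
    sws_getColorspaceDetails(sws, &inv_table, &src_range, &table, &dst_range,
                             &brightness, &contrast, &saturation);

    sws_setColorspaceDetails(sws,
                             sws_getCoefficients(SWS_CS_ITU709), 1 /* full-range in */,
                             sws_getCoefficients(SWS_CS_ITU709), 0 /* TV-range out */,
                             brightness, contrast, saturation);
}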

ronag commented 6 years ago

Another alternative is to output YUV444 (TV) from the mixer. Hmmm.... I think I like that best of all alternatives.

It will, however, require an extra GPU pass and also some work in the decklink and especially the screen consumer. It would make the ffmpeg and decklink consumers faster.

premultiply commented 6 years ago

No, it is not that easy. When you convert to YUV (never mind the color sampling here) it is always TV range, BUT you have to decide on the COLOR MATRIX before you do the conversion from RGB. This would only be possible if the color matrix were locked to the channel mode and other input and output modes (resolutions) were not allowed, and I do not think we want that. Converting the color matrix afterwards may be lossy in some color grades; avoid that where possible (for SD-HD-UHD conversion it has to be done in one step). The same goes for converting broken full-range YUV material to correct TV-range YUV, which is even worse. Keep in mind that we are only doing 8 bits per channel here...

The only workaround would be to do all these conversions on the GPU for each consumer. Only the screen consumer needs no conversion (native progressive RGB from the mixer). Decklink can do a valid conversion through the SDK or hardware (it needs interlaced RGB). But the FFMPEG consumer may need a prefiltered combination matching the user's requested output format. At the moment our common interface is progressive RGBA to each consumer, and every consumer does the conversion on its own as required.

ronag commented 6 years ago

@premultiply what about 16-bit output from the mixer? We want to do that in the future anyway... i.e. YUV444 (bt601, bt709 or bt2020 depending on channel format) at 16 bit. YUV444_16 => YUV422_8 should be relatively fast on the CPU.

ronag commented 6 years ago

Hmm... of course 16 bit to 8 bit will require dithering...

ronag commented 6 years ago

The only clean solution would be to have multiple different outputs from the mixer depending on the consumer... if we are to do this without CPU involvement...

The easiest option is probably if FFMPEG could take advantage of multiple cores for these transformations.

ronag commented 6 years ago

Since we are not doing any scaling in this conversion, I could probably implement a slice-threaded color transform utility based on sws_scale.

ronag commented 6 years ago

Ok, I've implemented a threaded color transform (https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2). HOWEVER, it will always convert to YUVA444P / BT709 and only works with channel heights divisible by 8, which is far from optimal but should work well for most cases.

The most problematic case will be RGB(A) and/or full-range recording... but that is very unusual.
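
In rough outline (a simplified sketch with made-up names, not the code from the commit above), the slice threading looks something like this:

extern "C" {
#include <libswscale/swscale.h>
#include <libavutil/pixfmt.h>
}
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical slice-threaded BGRA -> YUVA444P conversion. Eight slices mirror
// the "height divisible by 8" restriction. Each thread owns its own SwsContext
// (SwsContext is not thread-safe), and because no scaling is performed the
// slices can be converted independently.
void convert_bgra_to_yuva444p(const uint8_t* src, int src_linesize,
                              uint8_t* const dst[4], const int dst_linesize[4],
                              int width, int height)
{
    const int slice_count = 8;
    const int slice_h     = height / slice_count; // assumes height % 8 == 0
    std::vector<std::thread> workers;

    for (int i = 0; i < slice_count; ++i) {
        workers.emplace_back([=] {
            SwsContext* sws = sws_getContext(width, slice_h, AV_PIX_FMT_BGRA,
                                             width, slice_h, AV_PIX_FMT_YUVA444P,
                                             SWS_BILINEAR, nullptr, nullptr, nullptr);
            // BT.709 / TV-range coefficients would be applied here via
            // sws_setColorspaceDetails, as sketched earlier in the thread.

            const uint8_t* src_slice[1]  = { src + i * slice_h * src_linesize };
            const int      src_stride[1] = { src_linesize };
            uint8_t* dst_slice[4];
            for (int p = 0; p < 4; ++p) // YUVA444P: all four planes are full size
                dst_slice[p] = dst[p] + i * slice_h * dst_linesize[p];

            sws_scale(sws, src_slice, src_stride, 0, slice_h, dst_slice, dst_linesize);
            sws_freeContext(sws);
        });
    }
    for (auto& t : workers)
        t.join();
}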

ronag commented 6 years ago

@TomKaltz please verify. -filter:v interlace,format=yuv422p

premultiply commented 6 years ago

Possible further optimizations:

ronag commented 6 years ago

@premultiply: please create separate issue for those

TomKaltz commented 6 years ago

@ronag and I iterated on this today and it's getting better, but it is still very inefficient. In my testing it seems that omitting alpha and using swscale to AV_PIX_FMT_YUV422P helps slightly. The best performance I got was by manually changing all occurrences of AV_PIX_FMT_YUVA422P to AV_PIX_FMT_YUV422P in ffmpeg_consumer.cpp after commit 0d721847b49d022f7db09f48e92d8732b0db19c8 and using the following command...

add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace -threads:v 4

My brand-new quad-core 3.1 GHz MacBook Pro could barely keep up, but it did. I'm hoping the color transform can be moved to the GPU, because currently using the ffmpeg consumer to record broadcast formats is not very performant.

premultiply commented 6 years ago

Just to make sure: have you verified that it is not the dnxhd codec in ffmpeg that kills the performance? Is it the same with x264?

TomKaltz commented 6 years ago

I definitely tested with ProRes, with the same bad results. I did a quick test with x264 defaults and it was slightly better, but still not good.

premultiply commented 6 years ago

ProRes and dnxhd are the slowest codecs in ffmpeg that I know of. Maybe they are single-threaded or something like that, I don't know. Anyway... their performance is bad.

ronag commented 6 years ago

@TomKaltz your computer is 2.3 GHz with 3.1 turbo :)... no? Personally, I don't think recording is something for a laptop.

x264 defaults are not for realtime recording. You should be using -preset:v veryfast.

One more optimization would be to do the color conversion and interlacing in the same step... but now we're moving things out of the ffmpeg filter => more complexity.

premultiply commented 6 years ago

Interlacing outside the user-defined filter would mean that high-quality progressive recording is lost when the channel is set to an interlaced format. That might be OK for most recording applications, but replays will suffer. Mmmmh...

Anyway, I would prefer to switch to AV_PIX_FMT_YUV422P as @TomKaltz said above, as it would reduce the buffer sizes / amount of data to move and increase performance for the common use case.

ronag commented 6 years ago

We need to use an alpha-based pixel format since some users record alpha. We should create a dummy filter graph, check the resolved pixel format and use that.
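
Something along these lines (a rough sketch with hypothetical names, error handling omitted) could resolve that up front, by configuring the user's filter string against a dummy buffer source/sink and reading back the negotiated format:

extern "C" {
#include <libavfilter/avfilter.h>
#include <libavfilter/buffersrc.h>
#include <libavfilter/buffersink.h>
#include <libavutil/mem.h>
}
#include <cstdio>

// Hypothetical probe: build the user's filter chain once against a BGRA buffer
// source and a buffersink, configure the graph, and return the pixel format it
// negotiates (e.g. AV_PIX_FMT_YUV422P when the chain ends with format=yuv422p).
AVPixelFormat probe_output_format(const char* filter_spec,
                                  int width, int height,
                                  AVRational time_base, AVRational sar)
{
    AVFilterGraph*   graph = avfilter_graph_alloc();
    AVFilterContext* src   = nullptr;
    AVFilterContext* sink  = nullptr;

    // Describe the frames the mixer would deliver (progressive BGRA).
    char args[256];
    std::snprintf(args, sizeof(args),
                  "video_size=%dx%d:pix_fmt=bgra:time_base=%d/%d:pixel_aspect=%d/%d",
                  width, height, time_base.num, time_base.den, sar.num, sar.den);
    avfilter_graph_create_filter(&src,  avfilter_get_by_name("buffer"),     "in",  args,    nullptr, graph);
    avfilter_graph_create_filter(&sink, avfilter_get_by_name("buffersink"), "out", nullptr, nullptr, graph);

    // Wire "in" -> user filters -> "out" (same pattern as the FFmpeg examples).
    AVFilterInOut* outputs = avfilter_inout_alloc();
    AVFilterInOut* inputs  = avfilter_inout_alloc();
    outputs->name = av_strdup("in");  outputs->filter_ctx = src;  outputs->pad_idx = 0; outputs->next = nullptr;
    inputs->name  = av_strdup("out"); inputs->filter_ctx  = sink; inputs->pad_idx  = 0; inputs->next  = nullptr;
    avfilter_graph_parse_ptr(graph, filter_spec, &inputs, &outputs, nullptr);
    avfilter_graph_config(graph, nullptr);

    auto fmt = static_cast<AVPixelFormat>(av_buffersink_get_format(sink));
    avfilter_inout_free(&inputs);
    avfilter_inout_free(&outputs);
    avfilter_graph_free(&graph);
    return fmt;
}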

ronag commented 6 years ago

@premultiply: I'm unsure whether performing the transform in slices is actually valid given dithering etc... are you able to find out?

premultiply commented 6 years ago

I can try to measure it if there is a build available.

ronag commented 6 years ago

The auto build should be running.

ronag commented 6 years ago

I'm removing this from 2.2. There is not much more we can do without more effort.

premultiply commented 6 years ago

From what I can see from a first test, the new transform is inaccurate, with level shifts. But I have to do multi-pass tests.

ronag commented 6 years ago

@premultiply I think the issue is more the possible seams between the (8) slices, since swscale might be using dithering for the full => TV range transformation to distribute quantization errors. I'm unsure of the exact implementation and its impact. @5opr4ni maybe you know someone who can shed light on it?

@premultiply maybe you could investigate the implementation (sws flags) and whether it is possible to enable/disable it?
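
If it helps, swscale's dither mode is exposed as an AVOption on the context (the ffmpeg-scaler docs list sws_dither with values auto, none, bayer, ed, a_dither, x_dither), so a test along these lines should be possible (hypothetical sketch, sizes/formats are placeholders):

extern "C" {
#include <libswscale/swscale.h>
#include <libavutil/opt.h>
#include <libavutil/pixfmt.h>
}

// Hypothetical sketch: pick an ordered (slice-independent) dither, or "none",
// before the context is initialised, to rule out seams from error diffusion.
SwsContext* make_context_without_error_diffusion()
{
    SwsContext* sws = sws_alloc_context();
    av_opt_set_int(sws, "srcw", 1920, 0);
    av_opt_set_int(sws, "srch", 1080, 0);
    av_opt_set_int(sws, "src_format", AV_PIX_FMT_BGRA, 0);
    av_opt_set_int(sws, "dstw", 1920, 0);
    av_opt_set_int(sws, "dsth", 1080, 0);
    av_opt_set_int(sws, "dst_format", AV_PIX_FMT_YUVA444P, 0);
    av_opt_set(sws, "sws_dither", "bayer", 0);
    sws_init_context(sws, nullptr, nullptr);
    return sws;
}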

premultiply commented 6 years ago

@ronag Can we try replacing swscale with the zscale filter? Maybe it has better performance and accuracy.

premultiply commented 6 years ago

-filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p should do it if RGBA is passed to ffmpeg (again). A dither parameter may be added as well. https://ffmpeg.org/ffmpeg-filters.html#zscale-1

ronag commented 6 years ago

@premultiply we're not using a scale filter, we're doing the scale manually. We could go back to how it was before. But then we don't get a parallel scale filter.

premultiply commented 6 years ago

Yes, I know. But it's also swscale. And I'd like to try doing it with zscale in ffmpeg, as before with scale and no manual pre-conversion, to compare the performance.

ronag commented 6 years ago

Run some benchmarks with vanilla ffmpeg. If there is any tangible advantage I’ll revert the parallel conversion.

premultiply commented 6 years ago

@TomKaltz Can you try it again with a build before https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2 and -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p?

TomKaltz commented 6 years ago

I will report back...

TomKaltz commented 6 years ago

Compiled commit 54997aedb7ee2372f42b9a10bcfc0304fdb735c3 and ran with 1080i5994 channel.....

[2018-03-07 13:50:19.180] [info]    Received message from Console: add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p\r\n
#202 ADD OK
[2018-03-07 13:50:19.181] [info]    ffmpeg[record700.mxf] Initialized.
[2018-03-07 13:50:19.243] [error]   [ffmpeg] code 3074: no path between colorspaces
[2018-03-07 13:50:19.243] [error]
[2018-03-07 13:50:19.256] [error]   C:\Users\Thomas\casparcg\src\modules\ffmpeg\consumer\ffmpeg_consumer.cpp(358): Throw in function void __cdecl caspar::ffmpeg::Stream::send(class caspar::core::const_frame,const struct caspar::core::video_format_desc &,class std::function<void __cdecl(class std::shared_ptr<struct AVPacket>)>)
[2018-03-07 13:50:19.256] [error]   Dynamic exception type: class boost::exception_detail::clone_impl<struct caspar::ffmpeg::ffmpeg_error_t>
[2018-03-07 13:50:19.256] [error]   [struct boost::errinfo_api_function_ * __ptr64] = av_buffersink_get_frame
[2018-03-07 13:50:19.256] [error]   [struct boost::errinfo_errno_ * __ptr64] = 542398533, "Unknown error"
[2018-03-07 13:50:19.256] [error]   [struct caspar::tag_stacktrace_info * __ptr64] =  0# 0x00007FF6F994755E in casparcg
[2018-03-07 13:50:19.256] [error]    1# 0x00007FF6F9969AE0 in casparcg
[2018-03-07 13:50:19.256] [error]    2# 0x00007FF6F9A91A0A in casparcg
[2018-03-07 13:50:19.256] [error]    3# 0x00007FF6F9A8F0BB in casparcg
[2018-03-07 13:50:19.256] [error]    4# tbb::interface7::internal::task_arena_base::internal_current_slot in tbb
[2018-03-07 13:50:19.256] [error]    5# 0x00007FF6F9A80AE4 in casparcg
[2018-03-07 13:50:19.256] [error]    6# 0x00007FF6F9A8B4F7 in casparcg
[2018-03-07 13:50:19.256] [error]    7# 0x00007FF6F9A8CD50 in casparcg
[2018-03-07 13:50:19.256] [error]    8# 0x00007FF6F9943849 in casparcg
[2018-03-07 13:50:19.256] [error]    9# iswascii in ucrtbase
[2018-03-07 13:50:19.256] [error]   10# BaseThreadInitThunk in KERNEL32
[2018-03-07 13:50:19.256] [error]   11# RtlUserThreadStart in ntdll
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.315] [info]    ffmpeg[record700.mxf] Uninitialized.

TomKaltz commented 6 years ago

@ronag the parallel conversion is significantly more performant. Is there any downside to having the transform output locked to yuva422p in this way?

premultiply commented 6 years ago

Found out that the current implementation is wrong, as it assumes the BGRA input from the mixer to be TV range and not full range. This gives wrong levels in the output.

The complete filter chain for builds before https://github.com/CasparCG/server/commit/0d721847b49d022f7db09f48e92d8732b0db19c8 is ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p

premultiply commented 6 years ago

@TomKaltz Thanks for testing! The problem in your case is that dnxhd requires 10 bit, which zscale does not support in this conversion path. But this also hints at why dnxhd performance is low: swscale needs to upscale from 8 to 10 bit before writing to the dnxhd encoder.

premultiply commented 6 years ago

Some sort of XDCAM HD422 flavor (wrong audio track configuration) should give much better performance: ADD 1-777 FILE test.mxf -codec:v mpeg2video -codec:a pcm_s24le -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p -b:v 50M -maxrate:v 50M -bufsize:v 3835k -minrate:v 50M -profile:v 0 -level:v 2 -flags:v ilme+ildct