Test Report

MouriNaruto commented 6 years ago

Today, I tested this fork.

I am using a 2160p 120fps H.264/AVC 8-bit video to test. My CPU is an Intel Xeon E3-1230 V2, the GPU is an NVIDIA GeForce GTX 760, and the memory is 2 x 8GB DDR3 1333.

I only adjusted the Windows SDK version to 10.0.10240.0 and used the merged FFmpeg dynamic-link library that I compiled myself (https://github.com/M2Team/FFmpegUniversal).

I found a few things to report.

[Screenshot: unable to replay video]

An error occurs when you try to replay the video after it has finished playing.

[Screenshot: performance]

Even leaving aside the 120fps target, the frame rate curve is not smooth enough.

[Screenshot: GPU]

The GPU spends a lot of time on copying.

[Screenshot: CPU]

CPU usage.

I hope my report can help improve this fork.

Mouri

lukasf commented 6 years ago

Hi Mouri, thank you for taking time to test the fork. I have just fixed the first issue (error on file restart).

About the performance: I think you can't avoid the GPU copy operations. When you do software decode, the result is stored in RAM and must be copied into GPU memory. At 4K 120Hz that is a lot of copying, which can only be avoided when using hardware decoding. And if decoding does not happen in real time, it is no surprise that FPS is unstable. Some scenes are more complex than others, so on some you reach higher fps than on others. Fps will only be stable if you reach the desired playback speed.
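To put a rough number on "a lot of copying", here is a back-of-the-envelope sketch. It assumes that 2160p means 3840x2160 and that decoded frames are 8-bit 4:2:0 (12 bits per pixel on average, as with IYUV or NV12). The function name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second needed to move decoded 8-bit 4:2:0 frames.
// 4:2:0 stores one full-resolution luma plane plus two quarter-resolution
// chroma planes, which averages 1.5 bytes per pixel.
constexpr std::uint64_t UploadBytesPerSecond(std::uint64_t width,
                                             std::uint64_t height,
                                             std::uint64_t fps)
{
    return width * height * 3 / 2 * fps;
}

// Assumed test case from this thread: 3840x2160 at 120fps.
static_assert(UploadBytesPerSecond(3840, 2160, 120) == 1492992000ULL,
              "roughly 1.5 GB/s of raw frame data");
```

Under these assumptions the raw frame data alone is roughly 1.5 GB/s, before any format conversion.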

To improve performance you can try to set Config.VideoOutputAllowIyuv = true (see new toggle switch in updated sample). This should avoid sws_scale calls, freeing up CPU resources. Iyuv has some disadvantages, that is why it is not enabled by default. But if you need every fps, this is the right setting for you.

lukasf commented 6 years ago

When using Iyuv, I get only ~0.3% CPU usage in FFmpegInterop lib, the rest is external code...

MouriNaruto commented 6 years ago

Hi lukasf, I have tested it just now. The replay bug has been solved. But the performance does not seem better than before.

These are my settings.

[Screenshot: performance2] This is the performance graph.

Mouri

lukasf commented 6 years ago

Then there is not much we can do. At least, I don't have any idea right now. And as you can see, there is almost zero CPU time spent in our code anymore with iyuv. Maybe a newer ffmpeg version could improve performance, but that could require changes in the code. Plus, the last stable version 3.4 is pretty old, I'd rather wait for 3.5 before doing the work.

brabebhin commented 6 years ago

I'd say this is pretty impressive for software decoding...

MouriNaruto commented 6 years ago

https://github.com/FFmpeg/FFmpeg/tree/release/3.4 This is the source code of FFmpeg I used, updated 12 days ago. I found some API changes for FFmpeg 3.5 when I read the master branch's FFmpeg doxygen.

Yes, I agree with you in most cases.

I would also like to know what would happen if I implemented the Media Foundation interface directly. I would be glad if someone could answer, because one of my friends suggested that I implement the Media Foundation interface. (He thinks it may solve my problem.)

MouriNaruto commented 6 years ago

@mcosmin222 I know that the video I'm testing can be played using PotPlayer smoothly.

But I hope FFmpegInterop can reach this goal, because I have used many media player applications built on the Windows Runtime and found their software decoding performance very bad. They use more CPU and GPU but still do not play smoothly.

Recently, I tried modifying the FFmpegInterop/master branch: I removed all hardware decoders and used only the software decoder, because a hardware decoder is useless when playing a video that uses a future mainstream resolution or bit rate.

If my work on the software decoders solves my problem in the future, I want to contribute it to this fork or to master.

Good luck to you and me. Thank you, @lukasf and @mcosmin222 .

Mouri

lukasf commented 6 years ago

I hope that there is not much additional overhead by using MediaStreamSource. It should be possible to more or less directly map MediaStreamSample to IMFSample. But you will only know if you try it. It would sure be interesting to see how performance is with a direct MF implementation, compared to MediaStreamSource.

You would need to implement the IMFByteStreamHandler and register your class with the MediaExtensionManager. I can tell you that it is quite some work, the MF interfaces are a bit cumbersome. I have done it a few times for audio decoders (FLAC, OGG). If you try it, please let us know how it goes.

MouriNaruto commented 6 years ago

Thank you. But I am curious how to register my handler so that it is used for all file formats. (FFmpeg can report its supported formats, but the information is not complete.)

lukasf commented 6 years ago

You can register the same MF class for multiple file extensions and mime types. You'd have to find out which file types are supported by ffmpeg and what their extensions and mime types are. But for a performance test, it would be sufficient to just register it for video/mp4. The rest can be done later, if you see that it really has performance benefits. I am not 100% sure, but it might be that MF will try your ByteStreamHandler if it does not find another codec to handle it, even if you did not register that file type. But if you want to override existing codecs, you definitely must register for all those common file types.

There is one thing I just remembered: Some videos (especially HEVC but sometimes also H264) have special color range. Those are rendered incorrectly in most players such as MPC-HC (which uses ffmpeg/LAV). We pass the required info to MF to allow color range conversion and correct display of those files. I have noticed about 10% higher CPU usage with this conversion (which is surprisingly high). It might be that your file also has this color space. You can search for MF_MT_VIDEO_NOMINAL_RANGE and disable that line, and see if it improves performance. It only has an effect if you use Iyuv, so enable the Iyuv switch and disable that code line, then check if performance is better. It probably won't be enough to get you to 120fps, still, it could be an improvement (if your file uses this color range).
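For readers wondering what a color range conversion involves per sample, here is a minimal sketch. It assumes the "special color range" means full-range (0..255) as opposed to limited-range (16..235) video, which the thread does not state explicitly, and it shows only the limited-to-full direction for 8-bit luma. The function name is made up, and this is not how Media Foundation implements the conversion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Expand an 8-bit limited-range luma value (16..235) to full range (0..255).
// Values outside the nominal range are clamped first.
inline std::uint8_t ExpandLumaToFullRange(std::uint8_t y)
{
    const int clamped = std::min(235, std::max(16, static_cast<int>(y)));
    // Scale the 219-step limited range onto 255 steps, rounding to nearest.
    return static_cast<std::uint8_t>(((clamped - 16) * 255 + 109) / 219);
}
```

A mapping like this has to be applied to every sample of every frame, in whichever direction the renderer needs.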

MouriNaruto commented 6 years ago

Thank you very much. I will try it soon.

Today, I tried to optimize your work in my repository. I have found some ideas, and I hope I can help.

1. About the NativeBufferFactory::CreateNativeBuffer method: the less often it is invoked, the better the performance, because even creating a Windows Runtime object written with WRL has a visible overhead.
2. There is also a visible overhead in the UncompressedVideoSampleProvider::SetSampleProperties method, because the Windows Runtime implicitly converts C/C++ variables to a reference or a Windows Runtime object. Here is my ugly code fragment. (I think you can do it better.)

// Property keys and boxed values are created once and reused for every sample,
// instead of being converted implicitly on each call.
Guid g_MFMTVideoChromaSiting(MF_MT_VIDEO_CHROMA_SITING);

// Pre-boxed chroma siting values.
IBox<uint32>^ g_MFVideoChromaSubsampling_MPEG2 = ref new Box<uint32>(MFVideoChromaSubsampling_MPEG2);
IBox<uint32>^ g_MFVideoChromaSubsampling_MPEG1 = ref new Box<uint32>(MFVideoChromaSubsampling_MPEG1);
IBox<uint32>^ g_MFVideoChromaSubsampling_DV_PAL = ref new Box<uint32>(MFVideoChromaSubsampling_DV_PAL);
IBox<uint32>^ g_MFVideoChromaSubsampling_Cosited = ref new Box<uint32>(MFVideoChromaSubsampling_Cosited);

// Pre-boxed boolean values.
IBox<int>^ g_TrueValue = ref new Box<int>(TRUE);
IBox<int>^ g_FalseValue = ref new Box<int>(FALSE);

Guid g_MFSampleExtension_Interlaced(MFSampleExtension_Interlaced);
Guid g_MFSampleExtension_BottomFieldFirst(MFSampleExtension_BottomFieldFirst);
Guid g_MFSampleExtension_RepeatFirstField(MFSampleExtension_RepeatFirstField);

HRESULT UncompressedVideoSampleProvider::SetSampleProperties(MediaStreamSample^ sample)
{
    MediaStreamSamplePropertySet^ ExtendedProperties = sample->ExtendedProperties;

    if (m_interlaced_frame)
    {
        ExtendedProperties->Insert(g_MFSampleExtension_Interlaced, g_TrueValue);
        ExtendedProperties->Insert(g_MFSampleExtension_BottomFieldFirst, m_top_field_first ? g_FalseValue : g_TrueValue);
        ExtendedProperties->Insert(g_MFSampleExtension_RepeatFirstField, g_FalseValue);
    }
    else
    {
        ExtendedProperties->Insert(g_MFSampleExtension_Interlaced, g_FalseValue);
    }

    switch (m_chroma_location)
    {
    case AVCHROMA_LOC_LEFT:
        ExtendedProperties->Insert(g_MFMTVideoChromaSiting, g_MFVideoChromaSubsampling_MPEG2);
        break;
    case AVCHROMA_LOC_CENTER:
        ExtendedProperties->Insert(g_MFMTVideoChromaSiting, g_MFVideoChromaSubsampling_MPEG1);
        break;
    case AVCHROMA_LOC_TOPLEFT:
        if (m_interlaced_frame)
        {
            ExtendedProperties->Insert(g_MFMTVideoChromaSiting, g_MFVideoChromaSubsampling_DV_PAL);
        }
        else
        {
            ExtendedProperties->Insert(g_MFMTVideoChromaSiting, g_MFVideoChromaSubsampling_Cosited);
        }
        break;
    default:
        break;
    }

    return S_OK;
}

Mouri

MouriNaruto commented 6 years ago

I also changed some methods in the MediaSampleProvider class because there is a visible overhead. (I observed all the overheads I mentioned myself in the Visual Studio Performance Profiler.)

Here are the method declarations I changed.

void QueuePacket(AVPacket& packet);
void PopPacket(AVPacket& packet);
HRESULT GetNextPacket(AVPacket& avPacket, LONGLONG & packetPts, LONGLONG & packetDuration);
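The declarations above take the packet by reference. Below is a minimal sketch of reference versus pointer parameters in plain C++. The Packet struct is a hypothetical stand-in for AVPacket, and the pointer variant is only an assumption about the earlier signatures, which this thread does not show.

```cpp
#include <cassert>

// Hypothetical stand-in for AVPacket, so the sketch compiles without FFmpeg.
struct Packet
{
    long long pts;
};

// Pointer parameter: the caller passes an address explicitly.
inline void SetPtsByPointer(Packet* packet, long long pts)
{
    packet->pts = pts;
}

// Reference parameter: same effect on the caller's object.
// Neither variant copies the Packet.
inline void SetPtsByReference(Packet& packet, long long pts)
{
    packet.pts = pts;
}
```

Both functions modify the caller's object in place. Compilers typically implement a reference parameter as an address, so a measurable difference between the two would be surprising.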

lukasf commented 6 years ago

Frankly, this all sounds like unnecessary micro-optimizations to me. In your last profiler result, 99.6% of CPU time was spent outside our library. Now you are really trying to optimize those 0.4%? Sorry, I think you are on the wrong track. You can spend weeks doing micro-optimizations on those 0.4% and you will never get any noticeable improvement.

When you want to optimize performance, you need to find the real performance bottlenecks and tackle them. And the bottlenecks are obviously not in our library data flow. If other players using ffmpeg/LAV are really faster, then it could be the ffmpeg build, the way we configure/use ffmpeg, MediaFoundation, the UWP rendering part, or the MediaStreamSource overhead. But it is clear from your results that it is not our internal data handling that is holding back performance. In older versions, we had a lot of unnecessary buffer copies and other slowdowns. It is very satisfying to see that this has all been solved now.

Still some info on the three points you mentioned:

1. This is true, but MediaStreamSource requires use of IBuffer, so we must instantiate a WinRT class. We could add a pool of IBuffers which can be re-used (or even a pool of MediaStreamSamples). But I guess it is quite a lot of work. So personally, I would not put work in that area right now, given the good performance numbers. If you want to do it and you can give proof that it has noticeable impact (profile before and after), I will happily merge it.
2. This makes the code a bit ugly. The Guid redefinitions are definitely unnecessary, you are just creating a second copy of them. Using fixed IBox<> values probably has a very small performance benefit. I will think about adding it.
3. As far as I know, reference and pointer is internally exactly the same to the compiler. It is only a syntactic difference and should even lead to the exact same binary code. A small google search confirmed my assumption so far. Please add more info if you are really sure that this helps.
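The buffer pool idea from the first reply above could look roughly like this. It is a minimal single-threaded sketch in plain C++, not the library's code: Buffer is a hypothetical stand-in for a native IBuffer implementation, and the class and method names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a native buffer; not the library's IBuffer.
using Buffer = std::vector<unsigned char>;

// Hands out previously returned buffers instead of allocating a new one
// for every sample.
class BufferPool
{
public:
    // Take a buffer resized to `size` bytes, reusing a free one if available.
    std::unique_ptr<Buffer> Acquire(std::size_t size)
    {
        std::unique_ptr<Buffer> buffer;
        if (!m_free.empty())
        {
            buffer = std::move(m_free.back());
            m_free.pop_back();
        }
        else
        {
            buffer = std::make_unique<Buffer>();
            ++m_allocations;
        }
        buffer->resize(size);
        return buffer;
    }

    // Give a buffer back so a later Acquire can reuse it.
    void Release(std::unique_ptr<Buffer> buffer)
    {
        m_free.push_back(std::move(buffer));
    }

    // Number of buffers that had to be newly allocated so far.
    std::size_t Allocations() const { return m_allocations; }

private:
    std::vector<std::unique_ptr<Buffer>> m_free;
    std::size_t m_allocations = 0;
};
```

A real pool would also need locking and a reliable signal for when the media pipeline is done with a buffer, which is probably where most of the work lies.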

lukasf commented 6 years ago

Please don't get me wrong, I really appreciate that you want to help improving performance. It's just that I do not think that these are the areas which really need improvements right now.

If you want to work on the library, I can add you as a contributor of this fork. Then you can create your own branches, and create PRs if you think you have improvements. Just let me know.

MouriNaruto commented 6 years ago

Thank you, lukasf.

As for the external code, I remember that FFmpeg accounted for only about 60%~70% of the CPU time. If we keep using the default FFmpeg compilation options, it seems we can't get a visible improvement.

About the optimizations I mentioned before: the FPS curve on my device seems smoother than before, which is why I recommended them. (Maybe it's just psychological comfort.)

Mouri

MouriNaruto commented 6 years ago

I want to improve this library, but I don't have more ideas now because I have basically said everything I had in mind. I think I should read the FFmpeg documentation carefully; I may find a good idea there.

brabebhin commented 6 years ago

I think part of the overhead comes from the winRT sandboxing model, particularly from IO through the runtime broker.

It would be interesting to see what PotPlayer really is (maybe it is a packaged desktop app) and how it does this. Microsoft states in the docs that if you want as little overhead as possible, you need to go with MF directly. I would advise against it, unless you really need that 4K at 120fps.

MouriNaruto commented 6 years ago

I almost forgot to say: IYUV mode uses more memory, but its performance is the same as with the software scaler.

MouriNaruto commented 6 years ago

@mcosmin222 PotPlayer is the only desktop app that can play my test video smoothly on my PC. MPC-HC's performance looked like this fork's when I tested it a few years ago. (So I replaced MPC-HC with PotPlayer.) (The test video is from 2013.)

MouriNaruto commented 6 years ago

I have been wondering why PotPlayer can decode that video with 70%~90% CPU usage. I wanted to find the answer, and this is why I wrote so many replies in this issue.

MouriNaruto commented 6 years ago

Anyway, thank you everyone. I think this discussion should end now. I will open a new issue in the future if I have new ideas.

lukasf commented 6 years ago

Just for information: I have tested current PotPlayer, MPC-HC and FFmpegInterop with a few 4K files I had here. I got mixed results. For one file PotPlayer was noticeably faster than MPC-HC or FFmpegInterop. For other files, it was slower. It might depend on the pixel format used. The file where PotPlayer was faster had 10bit pixel format, the others had 8bit. Not sure why PotPlayer is faster, in settings 10bit was disabled so I guess it does downscale output to 8bit (but it's not possible to check the actual output format in the player). MPC-HC did output 10bit, FFmpegInterop downscaled to 8bit, but both were slower than PotPlayer. But then again, for two other files PotPlayer was slower, for one it was about same speed. They seem to do some things different, sometimes better, sometimes worse.

FFmpegInterop was usually on par or sometimes even faster than MPC-HC. FFmpegInterop seems to have higher CPU usage for a few seconds, then it goes to similar level as MPC-HC. It might be due to buffering and multi threaded decoding. I just did some experiments with adding direct 10bit output to avoid scaler, but I was not successful. Generally, I am satisfied with the performance I see. Having 10bit output would be nice, but it's not a top priority for me right now.
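As an illustration of what "downscaled to 8bit" means for a single sample, here is a naive sketch. It simply truncates. This is not how sws_scale implements the conversion (a real scaler may round or dither), and the function name is made up.

```cpp
#include <cassert>
#include <cstdint>

// Reduce a 10-bit sample (0..1023) to 8 bits (0..255) by dropping the two
// least significant bits. Plain truncation, no rounding or dithering.
inline std::uint8_t TenBitToEightBit(std::uint16_t sample10)
{
    return static_cast<std::uint8_t>((sample10 & 0x3FF) >> 2);
}
```

Four adjacent 10-bit levels collapse into one 8-bit level, which is the quality loss a direct 10bit output would avoid.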

Feel free to open a new issue if you find ways to improve performance or find other things.

lukasf commented 6 years ago

And one more small update: Got 10bit working :) Performance is even worse, due to necessary sws_scale calls. Still adding this for those who want the extra quality (and have a 10/12bit display).

ffmpeginteropx / FFmpegInteropX

Test Report #3