PCSX2 / pcsx2

PCSX2 - The Playstation 2 Emulator
https://pcsx2.net
GNU General Public License v3.0

GSdx minor optimization #2310

Closed gregory38 closed 2 years ago

gregory38 commented 6 years ago

Please find below some ideas to improve GSdx.

Idea 1: On the SW renderer, the final image is uploaded to the GPU. The current code relies on an external buffer which is copied to kernel space, then to the GPU through the "Update" call. Instead we should use a map/unmap pattern: it avoids a memcpy of the image, as we write the data directly to kernel space. (Note it might hurt perf if the kernel space is actually already distant GPU memory.)

Idea 2: Rendering threads are connected to the main thread with ring buffers. Currently we send scanlines one by one through the ring buffer. Maybe it would be more efficient to pack them into a small array and send the array (of scanlines) through the ring buffer instead.
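
For instance, a minimal sketch (RingBuffer, Scanline and the batch size are illustrative stand-ins, not the actual GSdx types):

#include <cstddef>

// Stand-in for whatever one queue item is today (one scanline).
struct Scanline { int left, top, right; /* edge/coverage data... */ };

// Stand-in for the existing single-producer ring buffer.
template<class T> struct RingBuffer { void Push(const T& item); };

// A fixed-size batch sent as a single ring-buffer item.
struct ScanlineBatch
{
    static const size_t N = 16; // bigger batches = fewer queue operations, more latency
    Scanline lines[N];
    size_t count = 0;
};

void SubmitScanline(RingBuffer<ScanlineBatch>& rb, ScanlineBatch& batch, const Scanline& sl)
{
    batch.lines[batch.count++] = sl;

    if(batch.count == ScanlineBatch::N)
    {
        rb.Push(batch); // one synchronization point covers N scanlines

        batch.count = 0;
    }
}

// A partially filled batch must still be flushed at the end of a primitive.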

Idea 3: GSDevice contains a pool of unused textures to avoid slow reallocation. Currently the pool contains input textures / color buffers / depth buffers. Input textures are rather small, however color/depth buffers can be huge due to upscaling. The idea is to split the pool by GSTexture type, so we can have a big pool for small textures and another, small pool for the large color/depth buffers. It will typically help to reduce GSdx memory spikes.
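
A minimal sketch of the split (the two pools and their caps are illustrative; GetType()/RenderTarget/DepthStencil follow the existing GSTexture interface):

void GSDevice::Recycle(GSTexture* t)
{
    if(t == NULL) return;

    t->last_frame_used = m_frame;

    // Color/depth targets are huge when upscaling; plain input textures are small.
    bool large = t->GetType() == GSTexture::RenderTarget || t->GetType() == GSTexture::DepthStencil;

    auto& pool = large ? m_target_pool : m_texture_pool; // hypothetical split pools
    size_t cap = large ? 32 : 300; // keep few big targets, many small textures

    pool.push_front(t);

    while(pool.size() > cap)
    {
        delete pool.back();

        pool.pop_back();
    }
}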

Idea 4: Palettes are attached to a data texture; what I mean is that you always have 2 textures (a texture that contains the data, and a texture that contains the palette). It doesn't work great when a game uses multiple palettes for the same data texture (we re-upload the palette texture => ZoE). An idea would be to decorrelate the palette from the data texture. For example, we could imagine a hash table that contains all palettes, so we can upload the data once and future accesses are only a rebind of the palette. It would be nice to have a fast path that avoids the hash lookup, as most of the time the data texture will use the same palette.

Idea 5: Palettes are handled with a 2D texture (size 256x1). It would be better to replace it with a big texture buffer (potentially big enough to contain all the game's palettes; a palette is 1KB of data). A texture buffer is a linear piece of memory, which means that we can write data from the CPU directly, without DMA (real zero copy). We need to ensure that the buffer is allocated in VRAM (not RAM) to ensure fast texel fetches from the shader. However it means that we need to implement the synchronization manually (especially if we update the data; if the buffer is big enough, syncing will be useless).
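
A hedged GL sketch (variable names and sizes are illustrative; whether the driver actually places a persistently mapped buffer in VRAM is exactly the open question above):

const size_t PAL_BYTES  = 256 * 4;           // one full palette = 1KB
const size_t POOL_BYTES = 65536 * PAL_BYTES; // room for 64k palettes

GLuint pal_buffer;
glGenBuffers(1, &pal_buffer);
glBindBuffer(GL_TEXTURE_BUFFER, pal_buffer);

// Immutable storage, persistently mapped: the CPU writes CLUT data straight
// into the mapping, without any Update()/staging copy.
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_TEXTURE_BUFFER, POOL_BYTES, NULL, flags);
uint8_t* map = (uint8_t*)glMapBufferRange(GL_TEXTURE_BUFFER, 0, POOL_BYTES, flags);

// Expose the buffer to the shader as a samplerBuffer.
GLuint pal_view;
glGenTextures(1, &pal_view);
glBindTexture(GL_TEXTURE_BUFFER, pal_view);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA8, pal_buffer);

// Uploading palette i is then a plain CPU write (a fence is still needed
// before a draw samples freshly written data).
memcpy(map + i * PAL_BYTES, clut, PAL_BYTES);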

Idea 6: During rendering, a game can reuse an old target (frame buffer) as an input texture. The target is often downscaled/rescaled to ensure correct texture coordinates, which are relative to the size of the texture. Rescaling is slow and reduces quality, so we should try to avoid it. The proposal is to track the size of the valid texels in the target (it could be based on the FBW and the end page block) and copy only the useful data. For example, let's say the GS input texture is 1024x1024 and the rendered data in the target is 640x480 with an upscale factor of 2. We could create a 2048x2048 input texture with the help of sparse textures and copy only the 1280x960 valid texels.
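
A sketch with ARB_sparse_texture, using the sizes from the example (target_tex stands for the existing upscaled target; page-size rounding and error handling omitted):

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, 2048, 2048); // 2048x2048 virtual, no physical pages yet

// Commit physical memory only for the 1280x960 valid region (in real code the
// region must be rounded up to GL_VIRTUAL_PAGE_SIZE_X/Y_ARB).
glTexPageCommitmentARB(GL_TEXTURE_2D, 0, 0, 0, 0, 1280, 960, 1, GL_TRUE);

// Copy only the useful texels from the target: no rescaling pass, no quality loss.
glCopyImageSubData(target_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
                   tex, GL_TEXTURE_2D, 0, 0, 0, 0,
                   1280, 960, 1);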

iMineLink commented 6 years ago

I'm totally positive about idea 3. Though it may be easier to keep track of a (putative) upper bound on the pool's RAM/VRAM usage, and replace the existing recycling logic, which does not take that information into account, i.e.:

void GSDevice::Recycle(GSTexture* t)
{
    if(t)
    {
        t->last_frame_used = m_frame;

        m_pool.push_front(t);

        //printf("%d\n",m_pool.size());

        while(m_pool.size() > 300)
        {
            delete m_pool.back();

            m_pool.pop_back();
        }
    }
}

with something that prioritizes clearing bigger textures from the pool first, and smaller ones only if needed (keeping a maximum number of total textures in the pool to avoid the logic slowing down operations):

void GSDevice::Recycle(GSTexture* t)
{
    if(t)
    {
        t->last_frame_used = m_frame;

        m_pool.push_front(t);

        // MemUsage computes an upper bound memory consumption for the texture
        m_mem_usage += MemUsage(t);

        //printf("%d\n",m_pool.size());

        double mem_cutoff = 0.25 * AVAILABLE_MEM;

        while(m_mem_usage > 0.5 * AVAILABLE_MEM)
        {
            bool any_deleted = false;

            for(auto i = m_pool.begin(); i != m_pool.end(); )
            {
                GSTexture* tt = *i;

                if(MemUsage(tt) > mem_cutoff)
                {
                    m_mem_usage -= MemUsage(tt);

                    delete tt;

                    i = m_pool.erase(i);

                    any_deleted = true;

                    if (m_mem_usage <= 0.5 * AVAILABLE_MEM)
                    {
                        break;
                    }
                }
                else 
                {
                    ++i;
                }
            }

            if (!any_deleted)
            {
                mem_cutoff /= 2;
            }
        }
    }
}

Though the nested loop is not nice, the GSFastList implementation should make iterating the pool really fast.

rcaridade145 commented 6 years ago

Idea 1 - for something like an APU it would be a big improvement, no? Idea 2 has the potential to improve the situation for which you @gregory38 made those HLE shaders?

jobs-git commented 6 years ago

There is not much of a memory issue in pcsx2, so Idea 3 can be postponed.

I'd go for implementing

Idea 1 + Idea 2

since most of pcsx2's issues are on the CPU and GPU side; nothing beats performance.

As WinXP beat WinVista: that was a performance problem. Linux beat M$ in servers and supercomputers: still performance.

lightningterror commented 6 years ago

There is not much of a memory issue in pcsx2, so Idea 3 can be postponed.

It will be helpful for games that have high memory usage / hit the GSdx out-of-memory issue. It's also good for upscaling: since the memory requirements will be lower, you can crank up the upscaling a bit more.

So yeah, I'd like to see this improvement. It's mandatory.

FlatOutPS2 commented 6 years ago

There is not much of a memory issue in pcsx2, so Idea 3 can be postponed.

This is not a discussion about which of these ideas should be implemented, or when they should be done. The issue is meant to be used for discussing how to implement them. All of these ideas should be implemented as soon as someone comes up with a good way of doing it.

gregory38 commented 6 years ago

On idea 1, I don't know how to implement zero copy for an APU.

So far we do (for openGL): write the image to an external CPU buffer, copy it into the driver's staging memory with "Update", and let the driver upload it to the GPU.

The new code will do: map the texture, write the image directly into the mapped (kernel/driver) memory, and unmap.

Even without zero copy, I think the remaining CPU overhead should be small (well, the current overhead should be low anyway).

On idea 3, yes we could count the memory but it isn't easy. We don't know if memory is allocated in RAM/VRAM/both/neither! That being said, there is maybe a limitation in my idea: we need to check textures that are created from targets. Those could be huge too (note: maybe we should crop them to the valid data and remove the downscaling).

mirh commented 6 years ago

On idea 1, I don't know how to implement zero copy for an APU.

Links to the rescue! Though, I guess we aren't going down the hard way (be it a heterogeneous API, or CL_MEM_ALLOC_HOST_PTR on plain OpenCL).. and I'm not sure if something like that is even possible in OpenGL? EDIT: maybe? (Also, please note these slides refer to the very first APU architectures, which were quite far from HSA.. or even "coherency" at all, at times. Still, I guess these tips might be a first start. Maybe they also apply to Intel's.) EDIT1-bis: more info (it should indeed also work for Intel, though I guess it's still not optimal). EDIT1-quater: it seems Intel's OpenCL 1.2-gen CPU+GPU combos have more capabilities (read: more to gain from this) than similarly spec'ed AMD APUs. EDIT1-ter: found the optimal one (and lol, seeing my very own netbook CPU "beating" high-end cards). EDIT2: also Vulkan, btw (maybe DX too?)

EDIT3: and some last ideas. Though in retrospect, I'm not sure if there's really that much of a point when OpenGL is good® only on the chips of the only vendor without an x86 CPU.

On idea 3, yes we could count the memory but it isn't easy.

Weren't there some extensions exposing it? (Did we already discuss this? I have some déjà vu.) Also, could this be of some help? EDIT4: this for DX?

iMineLink commented 6 years ago

Idea 2: Multiple buffering with zero copy seems a killer (less locking), but pinning DMA-accessible memory should be used wisely imho, as a memory spike there can make the OS unstable. Having many small pinned buffers though may decrease pressure on the CPU caches/buses, which is best for GSdx's needs.

Idea 3: Yes, we were already discussing this, and I personally would go for the upper bound (i.e. textures in RAM and VRAM, no compression) and see if it makes sense w.r.t. the current 300-texture limit. And if available, the driver extension may be used to improve precision (aka lowering the bound).

EDIT: Regarding idea 2, the last link @mirh shared highlights in the reported results that "interesting that glBuffer*Data with orphaning seems to be even comparable to PBM. So old code that uses this approach might be still quite fast!", and that "using glMapBuffer without orphaning is the slowest approach". So maybe it could be worth a try using the glBuffer*Data-with-orphaning technique (I don't even know what orphaning means in this context, to be honest).

rcaridade145 commented 6 years ago

Idea 1 - https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_pinned_memory.txt. @mirh, on that link you provided, page 34 seems more appropriate: https://developer.amd.com/wordpress/media/2013/06/1004_final.pdf#page=34. This is the same as ARB_buffer_storage with MAP_PERSISTENT_BIT, no?

Idea 3 - Ideally this could be implemented like a generational garbage collector. The main issue seems to be that GSdx is creating too many and too big textures compared to what would be necessary (that's the idea I got after reading the blog post about games using the channel shuffle effect). If we consider the age to be a couple of frames, could we automatically recycle those with age > tex.age?

mirh commented 6 years ago

This is the same as ARB_buffer_storage with MAP_PERSISTENT_BIT, no?

Supposedly buffer_storage should do the same, yes (there was a Dolphin post talking about that, also a link of mine there). EDIT: or maybe not? EDIT2: pinned_memory is reportedly faster (though, I guess it never hurts to test it yourself, especially since it wouldn't hurt if the same code didn't screw dGPUs).

I'm not totally clear on whether that is already "zero copy" enough.

iMineLink commented 6 years ago

Idea 3: what should be the difference in the recycling logic for GSTextures generated from targets? Also, they are not as frequent as those generated from sources, iirc.

Idea 1: let's say we choose the zeroest-copy method of them all, how much of such an approach can we push to the HW renderer?

mirh commented 6 years ago

Not sure how all the synthetic benchmarks around correlate with pcsx2 (I'm not a dev! \^^), but in the rosiest situation unified direct memory access (Unified Memory Space, or whatever you want to call it) can give you a 300-fold improvement. "The dispatch time of HSA runtime is close to zero."

Again though, I don't know which GSdx limits made all the current "workarounds" necessary, whether maybe "normal" SVM couldn't already be enough to solve them, or whether any of this would even be possible to apply to OpenGL.

gregory38 commented 6 years ago

The first proposal is about the SW renderer. We already use persistent buffers (which are zero-copy compatible). However I don't think texture data can be uploaded as zero-copy. Buffers are linear, so you can easily write data from the CPU. However the texture layout depends on the hardware, so you can't directly write texture data from the CPU. Note: maybe we could use a linear buffer instead of a texture, but it would require doing the filtering of the data manually. In summary, the code is already done (the HW renderer uses the map/unmap pattern); we only need to add it for the upload of the SW-rendered image.

gregory38 commented 6 years ago

For idea 3, we already drop old textures/targets based on age. However channel shuffle will trigger hundreds of allocations in the same frame, when technically we only need 5-10 for the target. We only want to avoid a crash due to a RAM/VRAM spike; we don't need a perfect tuning of memory allocation.

rcaridade145 commented 6 years ago

In those cases most VMs cache the objects and reuse them, like String.intern in Java: https://en.m.wikipedia.org/wiki/String_interning. In this case this seems difficult.

gregory38 commented 6 years ago

@rcaridade145 what do you mean, and for which idea?

rcaridade145 commented 6 years ago

For idea 3. In VMs like Java, if we wish, we can intern an object, making sure there is only one copy no matter how many instances we create. On another point, if we have a memory area that can be divided to allocate X textures, we could stop allocating and deallocating, at least in some circumstances.

gregory38 commented 6 years ago

Well, that's mostly the current code behaviour. Instead of allocating a new texture, we fetch it from the (already allocated) texture pool.

rcaridade145 commented 6 years ago

@gregory38 and is that memory allocated once, or all the time? Would it be possible/desirable to pre-allocate space for N textures, map the memory area and then use something like sparse textures? https://pt.slideshare.net/CassEveritt/approaching-zero-driver-overhead page 54

gregory38 commented 6 years ago

What we do is: allocate a texture on demand, recycle it into the pool when it isn't used anymore, and serve future allocations from the pool (old entries are eventually deleted based on age).

Currently by texture I mean input textures and the output framebuffer. Technically it could be a good idea to sparsely allocate arrays of input textures of all standard sizes/formats, but it is complicated and potentially the overhead is bigger than a basic pool. However a sparse render target would be nice (because 70%-90% of the allocated memory is useless). I think I didn't look further in the past due to limited support (so far only hardware GL3 features are used; Linux free drivers have no support for sparse textures as of today). It could be a good idea to reduce memory overhead on recent GPUs.

gregory38 commented 6 years ago

Hum, potentially sparse textures should help to better handle the conversion of an old frame buffer into a texture. Currently the code downscales the framebuffer, whereas we could use a sparse texture based on the upscaling factor and only tex-copy the valid pixels of the framebuffer.

rcaridade145 commented 6 years ago

Linux free drivers have no support for sparse textures as of today

Phoronix says RadeonSI has the extension since 17.1

https://www.phoronix.com/scan.php?page=news_item&px=RadeonSI-Sparse-Buffer-Lands

gregory38 commented 6 years ago

Hum, I need to check the mesa source code. Be aware that a sparse buffer isn't a sparse texture.

rcaridade145 commented 6 years ago

Hum, I need to check the mesa source code. Be aware that a sparse buffer isn't a sparse texture.

I'm sorry, you are totally correct.

gregory38 commented 6 years ago

Based on this interesting discussion, I added ideas 4/5. They are more complex to implement but I think they should give a nice perf boost for 8-bit texture handling (in particular Zone of the Enders).

FlatOutPS2 commented 6 years ago

@lightningterror Do you plan to add all this to 1.6? It doesn't seem very likely that'll happen before 1.6 (unless 1.6 is postponed to 2020).

lightningterror commented 6 years ago

(unless 1.6 is postponed to 2020).

Uh, I thought it would've been good to include those changes, but I see what you mean.

gregory38 commented 6 years ago

Note idea 1 is easy to do; you can use this example (taken from the HW texture cache):

GSTexture::GSMap m;

if(m_texture->Map(m, &r, layer))
{
    (mem.*rtx)(off, r, m.bits, m.pitch, m_TEXA);

    m_texture->Unmap();
}
else
{
    (mem.*rtx)(off, r, buff, pitch, m_TEXA);

    m_texture->Update(r, buff, pitch, layer);
}

We should do something similar to present the SW renderer frame.
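
Something like this (a sketch only: CopyFrame stands for whatever routine currently writes the final SW frame into the external buffer, and the exact Map() signature may differ):

GSTexture::GSMap m;
GSVector4i r(0, 0, w, h);

if(m_texture->Map(m, &r))
{
    CopyFrame(m.bits, m.pitch); // write the frame straight into driver memory

    m_texture->Unmap();
}
else
{
    CopyFrame(buff, pitch); // fallback: external buffer + extra copy

    m_texture->Update(r, buff, pitch);
}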

iMineLink commented 6 years ago

I implemented more or less idea 4 (not bug-free, but a proof of concept) and the problem with many games, and in particular ZoE, is the huge number (tens of thousands) of different palettes being used. There is almost no sharing of the same palette by different Source(s) though, so the effort of keeping many palettes (a texture object for each one of them) in memory slows down all the operations. Allocating a texture for each one of them seems to be the performance killer, at least in the benchmarks I did.

I think that implementing idea 5, i.e. a big palette, let's call it a canvas (a single texture object), to be used as a linear buffer for all the palettes (or a big enough number of them), may also unlock the capability to adopt idea 4.

gregory38 commented 6 years ago

What do you mean by memory? You have 2 memory areas, the CPU and the GPU. You need a RAM copy, likely stored in a kind of hash to easily look up the GPU object. And you need the GPU object. A palette is 1KB, so even 100k palettes should be doable (i.e. 100MB of data).

iMineLink commented 6 years ago

By allowing 100k palettes (more or less) I had serious slowdowns. I only did quick tests; maybe this weekend I can do more and propose a pull request to discuss. From what I experienced, there might have been two issues:

1) Too many texture objects in memory at the same time, so all the operations which are O(n) in the number of textures are slow. There is no real memory size problem, neither CPU- nor GPU-side, but having too many objects around is a killer.

2) Maybe this is the real reason: having a palette map of 100k objects makes searching for the one with the correct size/CLUT content very hard and slow. If this is the cause of the slowdown, it could be worth finding a smart hashing technique or search procedure.

But still, I think that instead of using 100k surfaces it could be more elegant and maybe more performant to use a single linear buffer: still, the searching problem would remain.

gregory38 commented 6 years ago

You shouldn't have O(n) operations. Yes, the search in a big hash isn't a good idea. You should store somewhere in the main texture a copy of the palette. The rationale is that most of the time (except ZoE), you should use the same palette for a given texture (note we have SSE/AVX memcmp for the CLUT). In case of a bad memcmp, search the hash. Potentially the hash isn't well designed for the data pattern of the palettes.
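
For instance (a sketch; the m_last_* fields and the pool are hypothetical):

// Fast path: compare against the palette this texture used last time.
GSTexture* Source::LookupPalette(uint32_t pal, const uint32_t* clut)
{
    if(memcmp(m_last_clut, clut, pal * 4) == 0)
        return m_last_palette; // common case: same palette as the last draw

    return m_pool->Fetch(pal, clut); // slow path: search the palette hash
}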

Note for 5: you should use a texture buffer instead of a big 1D texture, because textures are limited in size and uploading data is slow. However I don't know how costly the binding of a subpart of the buffer as an input texture is.

gregory38 commented 6 years ago

Hum, how many hashes did you use? 1 or 2 (entries of 64B and 1024B)?

iMineLink commented 6 years ago

Well, 2: I used to check the pal value (so 64B vs 1024B), then a CLUT hash (very naive: the sum of the first 8 CLUT values in a uint64). After the hash check, a compare64 of the whole CLUT content. Still, I don't have the code at hand, but I think I used a linear search instead of a hash map, with both FastList and std::vector. Maybe, the lists being long, a hash map outperforms those: I need to check that against the compare64 overhead.

gregory38 commented 6 years ago

There are too many elements for a linear search. A hash map is mandatory.

iMineLink commented 6 years ago

I will try a hash map indeed, maybe with smarter CLUT hashing (using techniques like the one reported in https://codereview.stackexchange.com/a/172095). This would reduce hit time. But when I talked about O(n) operations I was not only referring to the palette map search, but also to the driver overhead, which I think is higher when allocating the 100001st texture rather than the 1st one: in this case even if we approach a hit time of ~0ms, allocating EVERY new texture in GSdx would require much more time, because of the many palette textures being stored.

rcaridade145 commented 6 years ago

I've seen more and more projects relying on https://github.com/Cyan4973/xxHash for hashing
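
If it were adopted here, hashing a whole CLUT would be a one-liner (XXH64 is the actual xxHash entry point; the 64B/1024B sizes follow the two palette formats discussed above):

#include "xxhash.h"

uint64_t HashClut(const uint32_t* clut, bool pal8)
{
    size_t bytes = pal8 ? 1024 : 64; // 256-entry vs 16-entry palette

    return XXH64(clut, bytes, 0); // seed 0
}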

gregory38 commented 6 years ago

Hum, I see what you mean. The driver has a kind of hash to map a GL handle (int) to the real GL object (pointer). It is likely not optimized for a huge number of objects. Yes, it would be easier for the driver to only manage a single huge buffer.

As far as I understand, the unordered map doesn't invalidate the underlying data (even in case of rehashing). So potentially we could just store a pointer to the last data (only useful if the hash load is bad, i.e. multiple palettes per bucket). Otherwise, if the load is good, a hash lookup might be fast enough.

iMineLink commented 6 years ago

Do not worry about the fast path, as the Source object already contains a reference to its palette and a copy of the CLUT of the current palette (to be checked against the CLUT of the renderer). So the fast path is simply: do not change the palette reference. In case of a CLUT mismatch, query the pool, which then can do: look up candidates by pal value and CLUT hash, run the full CLUT compare on a candidate, and return the existing reference on a hit.

In any miss case, create the new palette and add it to the pool (and/or update the pool data structures), then return the reference.

Right now the solution I developed differs only in linearly searching an array (with pal and CLUT hash comparison first, then if hit compare64, if hit return) instead of using hash techniques. Let's say we reduce hit time drastically: I will check more thoroughly by the end of the week whether the driver can handle the huge amount of texture objects without losing performance.

Otherwise, if we need a single buffer, we can exploit the fact that the palettes come in only two different sizes, so use either the single big (not linear) texture or an array texture / texture atlas for the palettes.

EDIT: For the record, I wrapped GSdx palette texture objects in a container class which has the reference to the texture object, the pal value, the copy of the CLUT with which the palette was created, and a ref counter. Pointers to objects of this class are what is saved into the pool. On fetch from the pool, the pointer to the texture object is extracted from the container and returned.
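
A sketch of that container and of the pool lookup it enables (class and field names are illustrative; HashClut is the hypothetical helper sketched earlier):

#include <cstring>
#include <unordered_map>

struct PaletteEntry
{
    GSTexture* texture;   // the palette texture object
    uint32_t   pal;       // 16 or 256 entries
    uint32_t   clut[256]; // copy of the CLUT the texture was created from
    int        ref_count; // how many Sources currently reference it
};

struct PalettePool
{
    std::unordered_multimap<uint64_t, PaletteEntry*> m_map;

    GSTexture* Fetch(uint32_t pal, const uint32_t* clut)
    {
        uint64_t h = HashClut(clut, pal == 256);

        auto range = m_map.equal_range(h);

        for(auto it = range.first; it != range.second; ++it)
        {
            PaletteEntry* e = it->second;

            // full compare only on a hash hit (the compare64/SSE memcmp above)
            if(e->pal == pal && memcmp(e->clut, clut, pal * 4) == 0)
            {
                e->ref_count++;

                return e->texture; // hit: rebind only, no re-upload
            }
        }

        return NULL; // miss: the caller creates the palette texture and inserts it
    }
};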

gregory38 commented 6 years ago

If I were you, I would use a texture buffer rather than an atlas or texture array. Textures must remain small enough so the texture unit can read/interpolate the data. For palettes we don't need any interpolation, so you can enjoy a huge 1D buffer. Besides, texture buffers allow direct writes from the host (aka zero copy). An atlas or texture array will need an extra uniform parameter, whereas with a buffer you just need to bind a portion of it as a texture.
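
Binding one palette out of such a buffer is a single call (reusing the pal_buffer/pal_view names from the sketch under idea 5; the offset must respect GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT):

glBindTexture(GL_TEXTURE_BUFFER, pal_view);
glTexBufferRange(GL_TEXTURE_BUFFER, GL_RGBA8, pal_buffer,
                 i * PAL_BYTES, // byte offset of palette i in the big buffer
                 PAL_BYTES);    // a 1KB window = one 256-entry palette

// The shader then reads it with texelFetch() on a samplerBuffer; no extra uniform needed.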

mirh commented 6 years ago

Ehrm.. I would not like to disrupt any of you gentlemen with my usual random rambling. Though, maybe first it wouldn't hurt to set up some dumb benchmark for "zero-copy techniques", since there seem to be so many, with such different caveats (AMD_pinned_memory, ARB_buffer_storage, INTEL_map_texture, GpuMemoryBuffer - not to mention the HSA mess ofc). EDIT: the intel extension should suck.

gregory38 commented 6 years ago

Well, they are mostly the same: just a way to write data from the CPU to GPU-accessible memory. You have the issue of tiling for 2D textures. But palettes are 1D, and even better there is no filtering. Palettes are small, so they should fit in caches. Is there any other caveat?

mirh commented 6 years ago

Well, for starters, I cannot understand why Chrome would go to all those difficulties instead of just using normal OpenGL calls. Moreover, as linked in my edit, at least amd_pinned is reported as faster than buffer_storage (interesting fact about the former: it's somewhat soon® to be also supported on intel). EDIT: well, it turns out that again never went anywhere, and the extension is only now supported in iris.

gregory38 commented 6 years ago

You shouldn't have any speed diff between amd_pinned and buffer_storage. Performance will be different if you allocate different pieces of memory, because you can allocate RAM that can be read by the GPU (aka GART), or you can allocate VRAM that can be written by the CPU. The easy case is RAM == VRAM. But if you have a PCIe bus in the middle, it becomes more complex: you need synchronization. Besides, plenty of small writes over PCIe are likely slower than one big DMA.

Anyway, what do you mean by difficulties? glCopyTexImage2d is OpenGL. And you can allocate a DMA buffer with OpenGL (a PBO for example). Actually we do one-copy for hardware textures and the proposal is to do the same for the SW renderer.

Palettes are another story: we don't need tiling, and palettes are hopefully small enough. So it would make sense to use 0-copy here.

mirh commented 6 years ago

I linked the one-copy article because it seemed to have some findings that supplemented the previous zero-copy one. In there, glTexImage2D was deprecated in favour of VGEM (which they then seemingly replaced with something more linux-y - still, the point is that the copy is handled at a lower level than OpenGL).

@Anti-Ultimate can you detail the perf differences with pinned memory a bit more?

gregory38 commented 6 years ago

Well, Chrome has 3 processes, versus OpenGL which works on 1 thread of 1 process. So I think they deal with other issues.

refractionpcsx2 commented 2 years ago

Closing, as most of this really isn't relevant today, but here are some notes from Sten about the remaining ones, which I believe he already plans to deal with (any remaining that need addressing):

1 - We already stream to the GPU buffer iirc.
2 - Not sure about this one, but Tellow redid the allocations so it should be fairly quick now.
3 - Probably not relevant since Sten's target size stuff; VRAM is less of an issue, plus the hash cache keeps non-scaled sources alive longer.
5 - Some GPUs have very limited texture buffer sizes, and it probably won't make much difference.
6 - Instead of copying, do shader filtering with a clamped rectangle on the target texture.