OGRECave / ogre-next

aka ogre v2 - scene-oriented, flexible 3D C++ engine
https://ogrecave.github.io/ogre-next/api/latest

[Question] Billboards2 #440

Closed kariem2k closed 5 months ago

kariem2k commented 5 months ago

Hello,

Thank you very much for doing the new particle system and Billboards 2. I am evaluating Ogre to do something like They Are Billions (tens of thousands of sprites rendered on the screen). My original plan was to write my own system that runs entirely on the GPU, but then I saw the new Billboards stuff. I wanted your opinion if Billboards2 can be used for such purpose? I know I will have to change it (extend it) to support animated sprites and maybe stop using the CpuData completely, since most of the simulation will be done in compute shaders. I was also going to use the depth buffer of the last frame for collision detection with the environment. I know there are several limitations to the new particle system, but my tiny brain can't think of the consequences of these limitations for what I am trying to achieve.

Thank you

darksylinc commented 5 months ago

Hi!

I don't know if billions, but you can give it a try!

I wanted your opinion if Billboards2 can be used for such purpose?

The new particle system (Billboards2 are just particle systems with a few things disabled) was designed with SIMD and multithreading performance in mind, and thus it is very suitable for GPU simulation (like you said, what happens in ParticleCpuData would have to be moved to a GPU buffer touched by a compute shader instead).

All particle systems basically boil down to this (simplified):

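for( particleSystem : mActiveParticleSystems )
{
      // Each worker thread updates its own contiguous slice of the particles
      particlesPerThread = particleSystem->numParticles / numThreads;
      threadStart = threadIdx * particlesPerThread;
      for( i = threadStart; i < threadStart + particlesPerThread; ++i )
              update( particleSystem->mParticleCpuData, i );
}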

The affectors and emitters were also deliberately written so that they would be easy to move to compute shaders, and they follow the same logic as the code snippet above.
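
To make the per-thread slicing concrete, here is a minimal C++ sketch of the loop described above. `ParticleData`, `updateParticle` and `updateSlice` are hypothetical names made up for illustration (they are not OgreNext's ParticleCpuData or its update code), and the rounding-up of the slice size is my own assumption about how a remainder could be handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-particle data (NOT OgreNext's ParticleCpuData).
struct ParticleData
{
    std::vector<float> posX;
    std::vector<float> velX;
};

// Trivial Euler step, used only so the sketch has something to update.
inline void updateParticle( ParticleData &data, std::size_t i, float dt )
{
    data.posX[i] += data.velX[i] * dt;
}

// Work done by ONE worker thread: it updates only its own contiguous slice.
// Every thread calls this with its own threadIdx in [0; numThreads).
inline void updateSlice( ParticleData &data, std::size_t threadIdx, std::size_t numThreads,
                         float dt )
{
    assert( numThreads > 0u && threadIdx < numThreads );
    const std::size_t numParticles = data.posX.size();
    // Round up so that the last slices absorb the remainder (assumption of this sketch).
    const std::size_t perThread = ( numParticles + numThreads - 1u ) / numThreads;
    const std::size_t start = std::min( threadIdx * perThread, numParticles );
    const std::size_t end = std::min( start + perThread, numParticles );
    for( std::size_t i = start; i < end; ++i )
        updateParticle( data, i, dt );
}
```

Because the slices do not overlap, no locking is needed between threads; a compute shader version would follow the same pattern, with the thread index coming from the dispatch instead.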

Trade offs

Particle Systems are about trading off between the following:

  1. Flexibility
  2. Correctness (i.e. sorting for transparency)
  3. Performance

Usually with more flexibility you get more correctness, which is what ParticleFX1 does.

The new PFX2 system is not very good at transparency sorting (i.e. correctness), but it is as flexible as PFX1 when it comes to scripting.

That's why I wrote the alpha hashing transparency, since it doesn't depend on the order in which transparent particles are drawn. The documentation has new sections explaining it.

Sorting has many challenges:

  1. Sorting different ParticleSystemDef is basically almost a lost cause. If you have smoke (i.e. ParticleSystemDef Smoke) and fire (i.e. ParticleSystemDef Fire) then all particle systems emitting fire would have to be behind all the particle systems emitting smoke, or vice versa, but not half and half (unless you clone multiple ParticleSystemDef to achieve such a desired effect).
  2. If you are planning on having billions of particles, sorting is expensive anyway.

The only thing I can say is give it a try, and try to get a basic proof of concept that is similar to what you have in mind.

The only thing I can suggest is that if you plan on literal billions, calling systemDef->setParticleQuota( 1000000000u ) is probably a very bad idea.

You may get better results by cloning the systemDef multiple times (via systemDef->clone()) and calling systemDef->setParticleQuota with smaller numbers.

You should also experiment to see whether calling systemDef->setParticleQuota( 65535u / 6u ), which allows OgreNext to use 16-bit index buffers, and then making 1000000000u / (65535u / 6u) clones performs better than calling something like systemDef->setParticleQuota( 1000000u ) and then making 1000000000u / 1000000u clones. That is something you will have to measure.

The bigger the quota, the more contiguous RAM and VRAM we will need (which can be a problem if there is no contiguous chunk available to satisfy our request). Billions of particles will need literal GBs of both RAM and VRAM, so be mindful of that.
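
As a quick sanity check on the two quota choices above, the arithmetic can be sketched like this. `numClonesNeeded` is a hypothetical helper written for this example, not an OgreNext function, and it rounds up so that the clones together cover the whole particle count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of OgreNext): how many ParticleSystemDef clones
// are needed so that numClones * quotaPerClone >= totalParticles.
inline std::uint64_t numClonesNeeded( std::uint64_t totalParticles, std::uint64_t quotaPerClone )
{
    assert( quotaPerClone > 0u );
    // Integer division rounded up.
    return ( totalParticles + quotaPerClone - 1u ) / quotaPerClone;
}
```

With a quota of 65535u / 6u = 10922 per clone, one billion particles needs 91559 clones, whereas a quota of 1000000u needs 1000 clones.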

kariem2k commented 5 months ago

Thank you very much for the detailed reply! Yes, I will give it a try. I did not mean billions of billboards (sorry for not being clear). I wanted to mimic the crowd behavior of a game called "They Are Billions". It is an orthographic strategy game with a fixed camera view, I believe. I think it will be tens of thousands of sprites.

https://www.youtube.com/watch?v=QX1rPWPN4DI

kariem2k commented 5 months ago

I could not find where the billboards are destroyed. As I understand it, the particles are not associated with their particle system, so destroying the system will not destroy the particles. Unless the billboards are handled differently?

darksylinc commented 5 months ago

Hi!

You can destroy a billboard ("release" is a more correct term) via BillboardSet::deallocBillboard.

Note that if you look at the code, all it does is hide the billboard and do some maintenance:

void BillboardSet::deallocBillboard( Billboard billboard )
{
    billboard.setVisible( false );
    deallocParticle( billboard.mHandle );

    mVaoPerLod[0].back()->setPrimitiveRange( 0u,
                                             static_cast<uint32>( getParticlesToRenderTighter() * 6u ) );
}

The reason for this is how particles work. We assume that:

  1. Most particles are (more or less) requested and released in FIFO order, i.e. the oldest particle created is the first to be destroyed.
  2. Dead particles are relatively cheap to process in the vertex shader.

Let's say you set a Quota of 16.

  1. We call allocBillboard and receive a handle of 0.
  2. We call allocBillboard again and receive a handle of 1.
  3. We call allocBillboard again and receive a handle of 2.
  4. We call allocBillboard again and receive a handle of 3.

We will render all 4 particles in range [0; 3].

Now you do the following:

  1. We deallocBillboard for handle 0.

We will render all 3 particles in range [1; 3].

  1. We deallocBillboard for handle 2. We are not respecting FIFO order (we should've deallocated 1 first).

What happens is that we will still render 3 particles in range [1; 3]. But inside the vertex shader, we will turn particle 2 into degenerate triangles, which means it wastes vertex shader work but not pixel shader work.

Thus we try to be smart and don't do unnecessary work when we don't have to, but there will be cases where we will have to do it. As long as all live particles are in one contiguous chunk, no work is wasted.

Worst case scenario, the vertex shader will have to do work for 15 particles (almost the entire quota) just to actually render 2. But we started with the assumption that this scenario is unlikely, and even if it were to happen, it is an acceptable scenario.
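
The walkthrough above can be reproduced with a small toy model. This is only a sketch of the idea, assuming handles are handed out in increasing order and ignoring wrap-around; `ToyBillboardPool` and its members are hypothetical names and do not reflect OgreNext's actual bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of the "tight render range" idea (NOT OgreNext's implementation).
// Handles are handed out in increasing order; wrap-around is not modelled.
class ToyBillboardPool
{
    std::vector<bool> mAlive;
    std::size_t mNext;

public:
    explicit ToyBillboardPool( std::size_t quota ) : mAlive( quota, false ), mNext( 0u ) {}

    std::size_t alloc()
    {
        assert( mNext < mAlive.size() );  // quota exhausted otherwise
        mAlive[mNext] = true;
        return mNext++;
    }

    void dealloc( std::size_t handle )
    {
        assert( handle < mNext && mAlive[handle] );
        mAlive[handle] = false;  // the particle is only hidden, the slot stays
    }

    // Particles that actually produce pixels.
    std::size_t numAlive() const
    {
        std::size_t count = 0u;
        for( std::size_t i = 0u; i < mNext; ++i )
            count += mAlive[i] ? 1u : 0u;
        return count;
    }

    // Particles the vertex shader must process: everything between the first
    // and the last live handle, inclusive. Dead ones in between are degenerate.
    std::size_t numProcessed() const
    {
        std::size_t first = 0u;
        while( first < mNext && !mAlive[first] )
            ++first;
        if( first == mNext )
            return 0u;
        std::size_t last = mNext - 1u;
        while( !mAlive[last] )
            --last;
        return last - first + 1u;
    }
};
```

Allocating four billboards, then releasing handles 0 and 2, leaves `numProcessed()` at 3 while `numAlive()` is 2: one particle costs vertex shader work without being drawn.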

As for the BillboardSet, you can destroy it with SceneManager::destroyBillboardSet2 (this function was added today).

kariem2k commented 5 months ago

Oh thank you very much. I would not have guessed that destroying billboardsets will do that for the particles. Are these ranges reused again when creating a new billboardset?

darksylinc commented 5 months ago

Hi!

ParticleSystemDef (the underlying system) uses a circular buffer.

Which means that if you do alloc -> dealloc -> alloc, the first alloc uses range [0; 0], but by the end we use range [1; 1].

Now suppose the current internal pointer is at idx = 14 (quota = 16), with no live particles, and you call alloc() 5 times.

We'll end up using the range [14; 15] (2 particles) and range [0; 2] (3 particles). That's a circular buffer. It also means for CPU simulation we will have to iterate twice (once for each range).

But for GPU rendering, we make sure to always copy it contiguously so that the GPU sees range [0; 4] in one single draw call (I don't remember if we mix [14; 15] + [0; 2] or we can get [0; 2] + [14; 15] since without sorting we don't really care, and with sorting the order would look random).

However if you plan on moving the CPU simulation to a compute shader, depending on how you do it, you'll likely need to iterate twice too.
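
The split into one or two ranges described above can be sketched as follows. `liveRanges` and `Range` are hypothetical names for this example; the sketch only assumes a ring of `quota` slots with `count` live particles starting at `start`, and says nothing about how OgreNext stores this internally.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive [first; last] index range, matching the notation used above.
using Range = std::pair<std::size_t, std::size_t>;

// Returns the one or two contiguous ranges that cover `count` live slots
// starting at `start` in a circular buffer of `quota` slots.
inline std::vector<Range> liveRanges( std::size_t start, std::size_t count, std::size_t quota )
{
    assert( quota > 0u && start < quota && count <= quota );
    std::vector<Range> ranges;
    if( count == 0u )
        return ranges;

    const std::size_t untilEnd = quota - start;  // slots left before wrapping
    if( count <= untilEnd )
    {
        ranges.emplace_back( start, start + count - 1u );
    }
    else
    {
        ranges.emplace_back( start, quota - 1u );           // tail of the buffer
        ranges.emplace_back( 0u, count - untilEnd - 1u );  // wrapped part
    }
    return ranges;
}
```

A simulation pass (on the CPU, or one compute dispatch per range) would then loop over the returned ranges, which is the "iterate twice" case whenever the live particles wrap around.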

darksylinc commented 5 months ago

I forgot to comment on this!

Thank you very much for the detailed reply! Yes, I will give it a try. I did not mean billions of billboards (sorry for not being clear). I wanted to mimic the crowd behavior of a game called "They Are Billions". It is an orthographic strategy game with a fixed camera view, I believe. I think it will be tens of thousands of sprites.

Ahh hahaha!!! Now I get it. Sorry, I misread "tens of thousands of sprites" as thousands of millions (i.e. a billion).

I wonder if the current implementation would be enough. 20k sprites isn't that much and it will weigh more heavily on GPU overdraw than on anything else, unless you have some serious physics going on (particularly in the case of the zombies in that game, they're so close together that you need a good partition scheme or else you run into serious O(N²) collision problems).

But beware we rely on CPU multithreading a lot, so a 2-core CPU will not perform as well as a 6-core CPU.
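
As one possible example of the partition scheme mentioned above, here is a minimal uniform-grid sketch that only tests sprites in the same or adjacent cells instead of all N² pairs. A uniform grid is my own choice for illustration (the thread does not prescribe a scheme), and `Sprite` and `countClosePairs` are hypothetical names unrelated to OgreNext.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical 2D sprite position, for illustration only.
struct Sprite
{
    float x;
    float y;
};

// Counts pairs of sprites closer than `radius`. The cell size equals the
// radius, so any close pair lies in the same cell or in a neighbouring one.
inline std::size_t countClosePairs( const std::vector<Sprite> &sprites, float radius )
{
    assert( radius > 0.0f );
    auto cellOf = [radius]( float v ) {
        return static_cast<std::int64_t>( std::floor( v / radius ) );
    };
    // Packs two cell coordinates into one key (assumes they fit in 32 bits each).
    auto keyOf = []( std::int64_t cx, std::int64_t cy ) {
        return ( static_cast<std::uint64_t>( cx ) << 32 ) |
               ( static_cast<std::uint64_t>( cy ) & 0xffffffffu );
    };

    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    for( std::size_t i = 0u; i < sprites.size(); ++i )
        grid[keyOf( cellOf( sprites[i].x ), cellOf( sprites[i].y ) )].push_back( i );

    std::size_t pairs = 0u;
    for( std::size_t i = 0u; i < sprites.size(); ++i )
    {
        const std::int64_t cx = cellOf( sprites[i].x );
        const std::int64_t cy = cellOf( sprites[i].y );
        for( std::int64_t dy = -1; dy <= 1; ++dy )
        {
            for( std::int64_t dx = -1; dx <= 1; ++dx )
            {
                const auto it = grid.find( keyOf( cx + dx, cy + dy ) );
                if( it == grid.end() )
                    continue;
                for( const std::size_t j : it->second )
                {
                    if( j <= i )
                        continue;  // count each pair only once
                    const float ddx = sprites[j].x - sprites[i].x;
                    const float ddy = sprites[j].y - sprites[i].y;
                    if( ddx * ddx + ddy * ddy < radius * radius )
                        ++pairs;
                }
            }
        }
    }
    return pairs;
}
```

With evenly spread sprites the cost grows roughly linearly with the sprite count; a dense crowd packed into a few cells still degrades toward the quadratic case, so the cell size would need tuning for a real game.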

kariem2k commented 5 months ago

Thank you very much for this valuable information