chrxh / alien

ALIEN is a CUDA-powered artificial life simulation program.
https://alien-project.org
BSD 3-Clause "New" or "Revised" License

question about HashMap #34

Closed geotyper closed 2 years ago

geotyper commented 3 years ago

As I understand it, these are the steps to initialize a HashMap; I added the following code to the project:

  1. Declare `DynamicMemory dynamicMemoryArbiter;` and call `dynamicMemoryArbiter.init(100000000);` in SimulationData.
  2. Before running the kernel, call `dynamicMemoryArbiter.reset();`.
  3. In ParticleProccesor, in the private section, add `__shared__ HashMap<int2, Arbiter, HashFunctor<int2>> arbiterMap;`, where Arbiter is a simple class.
  4. First lines of the kernel init block:

    _particleBlock = calcPartition(
            data.entities.particlePointers.getNumEntries(), threadIdx.x + blockIdx.x * blockDim.x, blockDim.x * gridDim.x);
    if (threadIdx.x + blockIdx.x * blockDim.x == 0)  // try to init once for all threads
            arbiterMap.init_block(100000, _data->dynamicMemoryArbiter);
    __syncthreads();
  5. In the kernel threads:

    int2 arbKey;
    arbKey.x = particle->id;
    arbKey.y = otherParticle->id;
    Arbiter arbiter(arbKey.x, arbKey.y, normal, contact);
    bool find = arbiterMap.insertOrAssign(arbKey, arbiter);

All the code works, but the last line (the insert into the map) fails with an error. Do you have any idea where the error is, or am I using HashMap in the wrong way?
Thanks 
chrxh commented 3 years ago

Hi :)

to 3: Maybe I misunderstood: are you using __shared__ for class members? This is not allowed in CUDA (more information here: https://stackoverflow.com/questions/12707478/why-cant-member-variables-be-shared).

to 4: Here could be the actual problem: arbiterMap is declared per block (with __shared__) but not initialized in each block, because with `if (threadIdx.x + blockIdx.x * blockDim.x == 0)` only the first thread of the first block is selected. This means that for all blocks except the first one, arbiterMap is not initialized.

It's necessary that one thread from each block performs the initialization, so `if (threadIdx.x == 0)` would be required. But you don't even need this here, because HashMap::init_block already does it: the method itself selects the first thread and can simply be called from all threads in a block. (That is the reason I named it with the postfix `_block`.)
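A sketch of the pattern described above, using the names from this thread (HashMap, init_block, dynamicMemoryArbiter); the real signatures in the repository may differ:

```cuda
// Hypothetical sketch, not from the repository: initializing a __shared__
// hash map once per block. init_block is assumed to internally let only
// one thread of the block do the allocation and to synchronize afterwards.
__global__ void processCollisions(SimulationData data)
{
    __shared__ HashMap<int2, Arbiter, HashFunctor<int2>> arbiterMap;

    // Called by all threads of the block, not guarded by a thread-id check:
    arbiterMap.init_block(100000, data.dynamicMemoryArbiter);

    // ... per-thread work using arbiterMap ...
}
```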

geotyper commented 3 years ago

Hi, first of all sorry for my English. Let me step back through your code (I try to understand the code before changing it):

class ParticleMap : public BasicMap<Particle*>
{
public:
    __device__ __inline__ void set_block(int numEntities, Particle** entities)
    {
        if (0 == numEntities) {
            return;
        }

        __shared__ int* entrySubarray;
        if (0 == threadIdx.x) {
            entrySubarray = _mapEntries.getNewSubarray(numEntities);
        }
        __syncthreads();

        auto partition = calcPartition(numEntities, threadIdx.x, blockDim.x);
        for (int index = partition.startIndex; index <= partition.endIndex; ++index) {
            auto const& entity = entities[index];
            int2 posInt = {floorInt(entity->absPos.x), floorInt(entity->absPos.y)};
            mapPosCorrection(posInt);
            auto mapEntry = posInt.x + posInt.y * _size.x;
            _map[mapEntry] = entity;
            entrySubarray[index] = mapEntry;
        }
        __syncthreads();
    }

    __device__ __inline__ Particle* get(float2 const& pos) const
    {
        int2 posInt = { floorInt(pos.x), floorInt(pos.y) };
        mapPosCorrection(posInt);
        auto mapEntry = posInt.x + posInt.y * _size.x;
        return _map[mapEntry];
    } 

As I understand it, get() returns the particle pointer assigned to pos, and since the (float) pos is converted to int, there can't be two particles at one (int) coordinate. To avoid two particles under one map key, the particle radius needs to be > 1.

  for (int particleIndex = _particleBlock.startIndex; particleIndex <= _particleBlock.endIndex; ++particleIndex) {
        Particle* particleCenter = _data->entities.particlePointers.at(particleIndex);
        for (float dx = -3.0f; dx < 3.1f; dx += 1.0f) {
            for (float dy = -3.0f; dy < 3.1f; dy += 1.0f) {
                Particle* otherParticle = _data->particleMap.get(particleCenter->absPos + float2{dx, dy});
                if (otherParticle && otherParticle != particleCenter) {

In this part we can retrieve all particles in the area near particleCenter (there is a risk of getting a negative index near the 0 coordinate, but I think mapPosCorrection helps us here).

Then, if the task is to find colliding particles (mapDistance2 is mapDistance without mapDisplacementCorrection):

 if (_data->particleMap.mapDistance2(particleCenter->absPos, otherParticle->absPos) > 2.0f) {
     continue;
 }
 SystemDoubleLock lock;
 lock.init(&particleCenter->locked, &otherParticle->locked);
 lock.tryLock();
 if (!lock.isLocked()) {
     continue;
 }

After this we can add the colliding pairs to the arbiters array (particle radius = 1 to simplify):

ArbiterS* arbiter = _data->entities.arbiters.getNewElement();
int2 arbKey{particleCenter->id, otherParticle->id};
if (particleCenter->id > otherParticle->id) {
    arbKey.x = otherParticle->id;
    arbKey.y = particleCenter->id;
}
arbiter->a_i = arbKey.x;
arbiter->b_i = arbKey.y;

printf("Collision %u, %u \n", arbKey.x, arbKey.y);
lock.releaseLock();

And after running the kernel above we receive an arbiters array with the colliding pairs. If two particles collide in one timestep, in theory we should receive 2 pairs, but in practice (printing from the CUDA kernel or looking at the array size):

Collision 4, 11
Collision 4, 11
Collision 4, 11
Collision 4, 11
Collision 4, 11
Collision 4, 11
Collision 4, 11
Collision 4, 11

Can you help me find the error?

p.s.

__inline__ __host__ __device__ void mapPosCorrection(int2& pos) const
    {
        pos = {((pos.x % _size.x) + _size.x) % _size.x, ((pos.y % _size.y) + _size.y) % _size.y};
    }

What does this function do? (It looks like some kind of normalization.)

geotyper commented 3 years ago

Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23
Collision 1, 23

Sometimes the number of collision reports (for a collision of only two particles) is even bigger than in the previous example.

If I make the search area smaller:

for (float dx = -1.0f; dx < 1.1f; dx += 1.0f) {
            for (float dy = -1.0f; dy < 1.1f; dy += 1.0f) {
the result is also smaller and closer to normal, but still two times too high:
Collision 18, 19 
Collision 18, 19 
Collision 18, 19 
Collision 18, 19 

As an idea, I looked only at the left and lower neighbor cells:

for (float dx = -1.0f; dx < 0.0f; dx += 1.0f) {
            for (float dy = -1.0f; dy < 0.0f; dy += 1.0f) {

Time step: 1307 Collision 4, 16

But this doesn't find all colliding pairs, only part of them.

chrxh commented 3 years ago

Hi

> first of all sorry for my English.

No problem! I can recommend deepl.com for translating whole sentences.

> As I understand it, get() returns the particle pointer assigned to pos, and since the (float) pos is converted to int, there can't be two particles at one (int) coordinate.

That is right. If two particles come closer than 1, an entry in the particle map will be overwritten. But this is not such a big problem, because if both particles check the map, then one of them will detect the other and can perform the collision. And if more than 2 particles overlap, the worst that can happen is that a collision is possibly not detected. ParticleMap is only a cheap and fast solution, but it suffices in most cases here. A more accurate solution would be a quadtree or something like that. I also tried something like HashMap<int2, List<Particle*>> for the collision detection. I think it worked, but it was much slower and the result was nearly the same.

> In this part we can retrieve all particles in the area near particleCenter.

Yes.

> Can you help me find the error?

Maybe your code is being executed unnecessarily by other GPU threads? To check this, you can print the thread and block id, like printf("Block: %d, Thread: %d, Collision %u, %u\n", blockIdx.x, threadIdx.x, arbKey.x, arbKey.y); Independently of that, what I often did at the beginning was to set numBlocks=1 and numThreads=1, so you can exclude multithreading problems for the moment and make the algorithm work. After that, one can increase the number of threads.

> What does this function do?

I needed this function because if, for example, a particle runs out of the right boundary, it should reappear on the left. An example: if the world size is 100 x 100 and a particle has position (320, 105), it should actually be at (20, 5). (-> Torus topology) The reason I had to write ((pos.x % _size.x) + _size.x) % _size.x instead of simply pos.x % _size.x was to account for negative values. For example: position (-10, -20) should be (90, 80) in world coordinates.

chrxh commented 2 years ago

I close this issue because the code is now completely rewritten. A short architectural overview is given in https://alien-project.gitbook.io/docs/under-the-hood/architecture.