HPCE / hpce-2017-cw6

2 stars 17 forks source link

UpdateCountHash not thread safe? #34

Closed nicholzk closed 6 years ago

nicholzk commented 6 years ago

Calling the function UpdateCountHash in parallel will lead to a double free error. I've isolated the problem down to this bit of code:

res->Nb = (smultin_CellType*)util_Realloc (res->Nb, (res->NbSize + 1) * sizeof (smultin_CellType));

where the util_Realloc is causing the memory issue.

Is this function not thread safe, or am I just doing something wrong with the pointer res/misunderstanding something?

giuliojiang commented 6 years ago

I'm making lots of parallel calls to UpdateCountHash, I'm not doing anything to lock res. Results seem to have some slight variation from run to run, but I haven't had any crash or corruption so far.

But I did have some suspect that it might not be thread safe...

m8pple commented 6 years ago

util_Realloc itself is thread-safe, as it's only side-effect is to call realloc, which is explicitly documented to be thread-safe.

Looking at UpdateCountHash, I don't see anything dangerous, so I think it is thread-safe in the sense that you can call UpdateCountHash in parallel with itself as long as each call uses different input arguments.

My guess is that you are trying to call it in parallel, but one of the data-structures is actually being shared between both calls. In particular, I would wonder about the res parameter - does each call operate on a different res, or are they operating on the same one? This is a more complicated version of the problem where a memory location is shared in a read-write way between multiple tasks, causing a data race between the two.

So my guess is that two tasks are trying to resize the same data structure at the same time, which is quite likely as the counts grow initially. That also suggests that there is a problem with races on the actual Nb and Count arrays in UpdateCountHash, except they don't cause visible crashes.

I'm suspecting that the problem is actually at a much higher-level, but only becomes apparent at the low-level. Where is that initial res structure being created, and how many parallel tasks is it being passed to? Can you create multiple res structures, and give one to each task?

nicholzk commented 6 years ago

Thanks for the replies! Looking at my code again, it's clear that I was sharing the res pointer between multiple calls, resulting in the crash.