OpenHFT / Java-Thread-Affinity

Bind a java thread to a given core
http://chronicle.software/products/thread-affinity/
Apache License 2.0
1.78k stars 361 forks source link

Reproduction of and fix for #81 #82

Closed nicktindall closed 2 years ago

nicktindall commented 3 years ago

See the issue for details on the problem.

This should be functionally equivalent to the old way, except it will guarantee mutual exclusion in the presence of a lock.

nicktindall commented 2 years ago

That's an interesting point. I wonder why the file lock version was implemented now. There was a non lock version as well but it wasn't used.

minborg commented 2 years ago

I think it would make sense to have a library that provides inter-process locking of various sorts. In Map, for example, we would need a lock that will release when the holding process dies. So, in this case, we might lock that portion of the file and CAS to acquire. If the lock remains acquired for a longer period of time, we can check if the file lock is still there. If it's not, then the original process died and we may safely assume the lock is implicitly released and can be re-acquired by another process. We could then get rid of unsafe "wait-for-15-seconds-and-then-just-grab-the-lock" solutions and we can determine process deaths much faster. So if a process dies, we can potentially be back in 10 ms rather than 10 seconds.

nicktindall commented 2 years ago

I updated the PR to not use locks, the process looks like

to acquire

1: if (lock file exists) and (lock file contains a live PID)
2:    fail to acquire
3: delete lock file (to cover the dead-PID case)
4: open file channel, creating lock file or fail if it already exists (atomic)
5: write PID to it

to release

1: close file channel
2: delete file

It doesn't seem like it should be atomic to me, but the tests indicate it is. The issue I thought would exist is when a orphaned lock was left and P1 and P2 attempt to acquire (using line numbers above)....

P2: 1
P1: 1
P1: 3
P1: 4
P2: 3 <--- the delete must fail here because P1 has the file open? 

But it seems to work. Or at least works a lot better than what was there (and is more tested)

nicktindall commented 2 years ago

I think it would make sense to have a library that provides inter-process locking of various sorts. In Map, for example, we would need a lock that will release when the holding process dies. So, in this case, we might lock that portion of the file and CAS to acquire. If the lock remains acquired for a longer period of time, we can check if the file lock is still there. If it's not, then the original process died and we may safely assume the lock is implicitly released and can be re-acquired by another process. We could then get rid of unsafe "wait-for-15-seconds-and-then-just-grab-the-lock" solutions and we can determine process deaths much faster. So if a process dies, we can potentially be back in 10 ms rather than 10 seconds.

I agree this would be useful, but for this particular case I think the simplest thing should be used because this library is very low level.

minborg commented 2 years ago

The PID may appear to be sufficiently random for this solution to work reasonably. However, if a process is running under Docker, it is likely that all instances will have the same PID (e.g. 5). In the future, we should consider having more unique data in the lock key.

nicktindall commented 2 years ago

Hi Nick - there is indeed a race condition where you suspected: the delete (P2: 3) is not safe (tested on Linux).

The following strategy fixes this, and also addresses Per's issue re Docker & PIDs. The tweak is to create and lock a file off to one side, before trying to (atomically) move it in to the shared position.

  1. create a temporary file (eg by Linux mktemp)
  2. lock the file
  3. atomically rename to the shared lock file eg "cpuN.lock" (eg using Linux renameat2 with RENAME_NOREPLACE). This will either fail if the file already exists, or perform the equivalent of an atomic-create-and-lock-file
  4. if succeeded you now have the lock. If failed, see 3a-d below
  5. ... use
  6. delete the lock file when done

If step 3 fails you either legitimately don't have the lock, or a process with the lock died without correctly cleaning up. If the latter, the OS will have released the lock, allowing the following to safely inherit where appropriate (note no PID check or timeout/forcing): 3a. open the lock file normally for r/w. if this fails go back to step 1 3b. try to acquire the lock 3c. if the lock acquisition (3b) worked then the old process detached incorrectly. You have inherited the lock. Clean up any tmp files. Use the shared resource. delete the lock file when done 3d. if the lock acquisition (3b) was denied then the lock is legitimately held by someone else (either normally or by inheritance)

I think there's still a race condition there

P1: 1, 2, 3, 4, 5
P2: 1, 2, 3, 3a
P1: 6
P2: 3b,  3c... (it has a lock on a non-existent file)
P3: 1, 2, 3, 4, 5 (it's able to acquire the lock because there was no lockfile)

I think this might work for acquireLock()

  1. Create new or open existing lock file (covers the case where a process doesn't clean up)
  2. Acquire exclusive lock on lock file
  3. If acquisition fails another process has the lock, return false
  4. If acquisition succeeds, check that the lock file exists, if it doesn't someone else has deleted it, go to 1 (covers off the case where the file was deleted between open and acquire lock)
  5. Write metadata to lock file, do work
  6. Delete the lock file
  7. Release the lock, close the channel

This also means we can clean up orphans on isFree()

  1. Open existing lock file, if it doesn't exist return true
  2. Acquire shared lock on the lock file. if that fails, someone else has an exclusive lock, return true
  3. If acquire succeeds, it was an orphaned file, delete it, return true
  4. Release lock

I think mutual exclusion is guaranteed because

  1. Nobody deletes without first having a lock
  2. After the locker acquires the lock, the first thing they do is check that the file exists and release the lock if it doesn't.
rogersimmons commented 2 years ago

Thanks Nick - looks good