Closed chlowell closed 1 year ago
I converted this to a draft because I found a race on Linux that allows two processes to hold the lock simultaneously:
A opens the file
A locks the file
B opens the file
A unlocks the file
A removes the file
B locks the (removed) file
- this works, I guess because the inode remains until all FDs are closed; B's is open
C opens the file, creating it because no file with the name exists (new inode)
C locks the file
Now B and C hold the lock so far as this library is concerned and both may write the cache simultaneously. The first solution that comes to mind is, don't remove the file after unlocking it.
@bgavrilMS, it looks like MSAL.NET and Python have the same race, is it a known one? The worst thing that could happen is two processes corrupting the cache with overlapping writes, but I imagine that would be extremely unlikely in real usage.
... I found a race on Linux that allows two processes to hold the lock simultaneously:
A opens the file A locks the file B opens the file A unlocks the file A removes the file B locks the (removed) file - this works, I guess because the inode remains until all FDs are closed; B's is open C opens the file, creating it because no file with the name exists (new inode) C locks the file
Now B and C hold the lock so far as this library is concerned and both may write the cache simultaneously. The first solution that comes to mind is, don't remove the file after unlocking it.
@bgavrilMS, it looks like MSAL.NET and Python have the same race, is it a known one? The worst thing that could happen is two processes corrupting the cache with overlapping writes, but I imagine that would be extremely unlikely in real usage.
Thanks for the convincing description, Charles! You confirmed my long time suspicion.
Our existing approach happened to be "create a lock file, lock it, and then delete it" since day 1, and MSAL .Net, Python and Java even spent significant effort to keep the implementation in-sync. But,
(1) The "locking on a file handle" operation should probably only be used on a stable file which will not be constantly removed and recreated. (2) Alternatively, we can abandon the "locking on a file handle" completely, and use "exclusive file creation (and later removing it)". At one point, a teammate on MSAL Java experimented with this approach and reported better performance.
Let's revisit this and choose either (1) or (2), but not both. I would prefer (2), because it can even remove our dependency on a file locking library.
@rayluo - do you have customer feedback related to performance? This is public client, you can at most have 5-6 apps performing very fast operations on the file, which is max ~200 KB (1 entry is about 7 KB and the biggest token cache I know of was reported by a consultant with accounts in some 30 companies), but generally the file is <20KB.
The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine tuning is extremely expensive. We can reconsider if a P1 bug occurs.
@chlowell - the solution to the race is to lock the file first, then to read / write / delete it.
(1) The "locking on a file handle" operation should probably only be used on a stable file which will not be constantly removed and recreated.
Right. The current implementation is reliable under contention only if the filename consistently maps to the same underlying object; it doesn't on Linux when the file is recreated.
(2) Alternatively, we can abandon the "locking on a file handle" completely, and use "exclusive file creation (and later removing it)".
That solves this race, however it makes the file's presence the locked state, so failing to remove the file would cause deadlock.
the solution to the race is to lock the file first, then to read / write / delete it.
The lock only prevents other processes from applying another lock on the same file. It doesn't matter whether A unlocks or removes first because A's lock doesn't prevent B opening the file.
The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine tuning is extremely expensive. We can reconsider if a P1 bug occurs.
I think it's reasonable to treat this as a known issue or out of scope. It does make the lock unreliable on Linux under even modest contention, but if that were common in real usage, I imagine someone would have noticed a perf impact from all the cache misses, and I think it's fair to say this library is intended only for low/no-contention scenarios like helping CLI tools avoid authenticating every time they run.
The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine tuning is extremely expensive. We can reconsider if a P1 bug occurs.
Totally agreed. Heck, even writing comments in this issue turns out to be a mentally demanding task. :-) Locking is hard.
For the sake of completeness, when/if we would eventually make that move, the change might be smaller than we might think, because MSAL Java EX has already been doing the "atomic file system operations creation + file locks", and MSAL Python followed (not sure about MSAL .Net). It is just that, for backward compatibility in terms of behaviors, we chose to still proceed with normal file locking after file exclusive creation fails, and that middle ground does not utilize the benefits of either approach to the fullest.
By the way, driven by a different initiative (which is to make our SDK SAW-ready by reducing 3rd-party library dependency), I have experimented removing the file lock and solely relying on exclusive file creation. It passed local stress test.
(2) Alternatively, we can abandon the "locking on a file handle" completely, and use "exclusive file creation (and later removing it)".
That solves this race, however it makes the file's presence the locked state, so failing to remove the file would cause deadlock.
Good point. Back then we contemplated writing a timestamp into the lock file and then the next lock seeker would remove a stale lock file. That approach was not implemented/needed in the current middle ground. But thanks for reminding me that I would need to do that in my SAW experiment above.
@rayluo - do you have customer feedback related to performance?
Not much. Over the years, there have been roughly 8 issues reported to Azure CLI. Most of them were various errors during lock acquisition, which could arguably mean the lock mechanism was doing it job. Only one of them was a json decode error which - now that I think about it - could mean the lock acquisition succeeded but a different process still raced to mess up the file content.
It does make the lock unreliable on Linux under even modest contention
Coincidentally, the latest issue on the list above (https://github.com/Azure/azure-cli/issues/26109
) was on Linux. The exclusive file creation failed but at least those failures were easily detectable, but then the subsequent file lock acquisition somehow stuck. That customer had to revert to an ADAL-powered old version Azure CLI. :-(
Really just a thin wrapper around
flock
. It's internal because applications don't need to use it directly. (They use theCache
type instead, to be added in future PR.)To show how this fits into the bigger picture, here's how I've organized the packages:
Closes #5