The file_cache class in Cacheddownloader has a method that manages the oldEntries (such which are still in use, but there are newer versions for the object).
The code looks like this
func (c *FileCache) updateOldEntries(logger lager.Logger, cacheKey string, entry *FileCacheEntry) {
if entry != nil {
if !entry.inUse() && entry.ExpandedDirectoryPath != "" { //AAAAAAAAAAAA
// put it in the oldEntries Cache since somebody may still be using the directory
c.OldEntries[cacheKey+entry.ExpandedDirectoryPath] = entry
} else {
// We need to remove it from oldEntries
delete(c.OldEntries, cacheKey+entry.ExpandedDirectoryPath)
}
}
}
The problem is that if entry.inUse() == true and entry.ExpandedDirectoryPath != "" then else clause is executed, and it is deleted from OldEntries.
if someone wants to remove the older version of the entry - the first if will not match, and i the second - the entry will not be found as it was already deleted
Steps to Reproduce
It was not caught by the existing tests, because they do not test the only way how AddDirectory is used (as of now) in the code. And that is:
Call GetDirectory, which returns empty cache or the existing entry. And in both cases increments the directoryInUseCount
Call AddDirectory, which does not manage directoryInUseCount. What it does is that if it was an existing entry, it decrements the usage of the old one. And then calls the flawed method to decide what to do
So in essence this is the scenario where a multiple containers use some dependency, this dependency is updated (in our case one buildpack) and then some of the containers is updated, thus - triggering the addition of a new version of the Cache Entry.
Afterwards all of the other containers are not able to stop any more
Additional Text Output, Screenshots, contextual information (optional)
The impact of this in our usecase is following:
During the update of the CF-Deployment, a new version of the buildpack-lifecycle comes, which is used by most apps
Afterwards, a lot of internal applications are redeployed with their new versions
All of them fail to stop, thus decreasing the free capacity left for other applications
In some cases this exhausts all the available free memory and applications can no longer be re/started
In all cases, restarting the reps that hold such unstoppable containers leads to waiting up to the rep_evacuation_timeout property (set to 15 min in our env) thus - increasing 2-3x the time for updating the diego-cells
Enter an issue title
Fix Cacheddownloader's directly handling glitches
Summary
The file_cache class in Cacheddownloader has a method that manages the oldEntries (such which are still in use, but there are newer versions for the object). The code looks like this
The problem is that if
entry.inUse() == true
andentry.ExpandedDirectoryPath != ""
thenelse
clause is executed, and it is deleted from OldEntries.Later, in
CloseDirectory
if someone wants to remove the older version of the entry - the first
if
will not match, and i the second - the entry will not be found as it was already deletedSteps to Reproduce
It was not caught by the existing tests, because they do not test the only way how
AddDirectory
is used (as of now) in the code. And that is:directoryInUseCount
directoryInUseCount
. What it does is that if it was an existing entry, it decrements the usage of the old one. And then calls the flawed method to decide what to doSo in essence this is the scenario where a multiple containers use some dependency, this dependency is updated (in our case one buildpack) and then some of the containers is updated, thus - triggering the addition of a new version of the Cache Entry. Afterwards all of the other containers are not able to stop any more
Diego repo
https://github.com/cloudfoundry/cacheddownloader
Environment Details
cf-deployment to v30.5.0
Additional Text Output, Screenshots, contextual information (optional)
The impact of this in our usecase is following:
rep
s that hold such unstoppable containers leads to waiting up to therep_evacuation_timeout
property (set to 15 min in our env) thus - increasing 2-3x the time for updating the diego-cells