cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0
201 stars 212 forks source link

Cacheddownloader: Glitch in handling cache entries #773

Closed vlast3k closed 1 year ago

vlast3k commented 1 year ago

Enter an issue title

Fix Cacheddownloader's directly handling glitches

Summary

The file_cache class in Cacheddownloader has a method that manages the oldEntries (such which are still in use, but there are newer versions for the object). The code looks like this

func (c *FileCache) updateOldEntries(logger lager.Logger, cacheKey string, entry *FileCacheEntry) {
    if entry != nil {
        if !entry.inUse() && entry.ExpandedDirectoryPath != "" { //AAAAAAAAAAAA
            // put it in the oldEntries Cache since somebody may still be using the directory
            c.OldEntries[cacheKey+entry.ExpandedDirectoryPath] = entry
        } else {
            // We need to remove it from oldEntries
            delete(c.OldEntries, cacheKey+entry.ExpandedDirectoryPath)
        }
    }
}

The problem is that if entry.inUse() == true and entry.ExpandedDirectoryPath != "" then else clause is executed, and it is deleted from OldEntries.

Later, in CloseDirectory

    entry := c.Entries[cacheKey]
    if entry != nil && entry.ExpandedDirectoryPath == dirPath {
...
        entry.decrementDirectoryInUseCount()
        return nil
    }

    entry = c.OldEntries[cacheKey+dirPath]
    if entry == nil {
        return EntryNotFound
    }

if someone wants to remove the older version of the entry - the first if will not match, and i the second - the entry will not be found as it was already deleted

Steps to Reproduce

It was not caught by the existing tests, because they do not test the only way how AddDirectory is used (as of now) in the code. And that is:

  1. Call GetDirectory, which returns empty cache or the existing entry. And in both cases increments the directoryInUseCount
  2. Call AddDirectory, which does not manage directoryInUseCount. What it does is that if it was an existing entry, it decrements the usage of the old one. And then calls the flawed method to decide what to do

So in essence this is the scenario where a multiple containers use some dependency, this dependency is updated (in our case one buildpack) and then some of the containers is updated, thus - triggering the addition of a new version of the Cache Entry. Afterwards all of the other containers are not able to stop any more

Diego repo

https://github.com/cloudfoundry/cacheddownloader

Environment Details

cf-deployment to v30.5.0

Additional Text Output, Screenshots, contextual information (optional)

The impact of this in our usecase is following:

  1. During the update of the CF-Deployment, a new version of the buildpack-lifecycle comes, which is used by most apps
  2. Afterwards, a lot of internal applications are redeployed with their new versions
  3. All of them fail to stop, thus decreasing the free capacity left for other applications
  4. In some cases this exhausts all the available free memory and applications can no longer be re/started
  5. In all cases, restarting the reps that hold such unstoppable containers leads to waiting up to the rep_evacuation_timeout property (set to 15 min in our env) thus - increasing 2-3x the time for updating the diego-cells
mariash commented 1 year ago

Thank you for finding this issue, totally looks like a problem.