UCL-INGI / ICTV

ICTV is a simple content management system for digital signage on multiple screens.
Other
15 stars 5 forks source link

The asset caching mechanism should update assets based on HTTP headers #6

Open mpiraux opened 6 years ago

mpiraux commented 6 years ago

Currently, the caching mechanism is not updating assets in cache at any time. It should use the information exposed in HTTP headers (Last-Modified, Cache-Control, etc) to update assets in cache.

We have to define a policy for this update mechanism to trigger. Sending an HTTP request every time the PluginManager sees a slide referencing an external asset seems a bit heavy, but it is the policy that ensures the most consistent cache.

Drumor commented 6 years ago

Before we discussed about this case, I wrote this in the cache manager

def cache_file_at_url(self, url):
    """ Caches the remote file or retrieves its cached version if one exists. Returns None if an error occurs. """
    filename, extension = os.path.splitext(url)
    cache_asset_hash = hash(str(self.channel_id) + filename)
    asset = Asset.selectBy(plugin_channel=self.channel_id, filename=filename, is_cached=True).getOne(None)
    if asset is not None:
        # image comparaison on
        import urllib.request,web,math
        original = str(url)
        cache_file = ((str(web.ctx.homedomain)+ '/' + asset.path) if asset.path is not None else str(web.ctx.homedomain)+ '/cache/' + str(
            asset.id))
        from PIL import Image
        original_file = io.BytesIO(urllib.request.urlopen(original).read())
        cache_file_open = io.BytesIO(urllib.request.urlopen(cache_file).read())
        image1 = Image.open(original_file)
        image2 = Image.open(cache_file_open)
        h1 = image1.histogram()
        h2 = image2.histogram()
        rms = math.sqrt(sum(map(lambda a, b: (a - b) ** 2, h1, h2)) / len(h1))

        print(rms)
        if not rms == 0:
            asset = None

    if asset is None:
        if CacheManager._get_lock(cache_asset_hash, blocking=False):
            try:
                asset = Asset(plugin_channel=self.channel_id, filename=filename, user=None,
                              extension=extension if extension is not None and len(extension) > 0 else None,
                              is_cached=True)
                self.download_manager.enqueue_asset(asset)
            except (URLError, OSError):
                logger.warning('Exception encountered when attempting to cache file at url %s', url, exc_info=True)
                CacheManager._release_lock(cache_asset_hash)
                return None
            CacheManager._release_lock(cache_asset_hash)
        else:
            CacheManager._get_lock(cache_asset_hash, blocking=True)
            CacheManager._release_lock(cache_asset_hash)
            asset = Asset.selectBy(plugin_channel=self.channel_id, filename=filename, is_cached=True).getOne(None)

    return asset

_name_to_lock = {}
_master_lock = Lock()  # The lock of all locks

It's probably awful and works for image only. I'll change this code and work with your option!

Drumor commented 6 years ago

For above example, it's not fully functional as it doesn't dereference/rereference the asset.

mpiraux commented 6 years ago

For above example, it's not fully functional as it doesn't dereference/rereference the asset.

It has several flaws actually. First, it assumes that assets are images, this is not true as it can be videos too. Second, it downloads the whole file at the url and then performs a comparison. There is no gain in spending CPU time to compare the files if it has already been downloaded. The real cost we want to avoid is the download of the remote file. I would expect that using HTTP caching mechanism we can avoid downloading files entirely if not needed.

Regarding #12, this piece of code will be executed each time the PluginManager looks at a slide containing an URL. Opening the remote file and the local file (!) via HTTP(S) there will impact the slides generation latency quite heavily. I think the DownloadManager is more suited to handle the task. It could be extended to refresh an asset using the HTTP If-Modified-Since header.

I implemented this inside the client cache_daemon actually, have a look at these: https://github.com/UCL-INGI/ICTV/blob/master/ictv/client/cache_daemon.py#L134 https://github.com/UCL-INGI/ICTV/blob/master/ictv/client/cache_daemon.py#L288

Drumor commented 6 years ago

First part was a first inspiration and can be cancelled. For #12 , i'll adapt it with your remarks.