algoo / preview-generator

generates previews of files with cache management
https://pypi.org/project/preview-generator/
MIT License
228 stars 50 forks source link

Study external python modules for managing the previews cache #207

Open grignards opened 3 years ago

grignards commented 3 years ago

Currently previews are persisted on the hard drive without any possibility to expire them or manage the disk space they use. It could be useful to manage the previews as a cache with methods to ease their management.

As python disk caching modules do exist it would be interesting to see if one meets our needs or at least be an inspiration for better caching in preview generator:

Foreseen modules:

inkhey commented 3 years ago

Just checked usage of cache in preview_generator, it's used:

So has cache is used in builder in order to use intermédiate version of the preview to generate a last one, we do need a cache mechanism that builder are aware of, in order to be able to generate those "multistep" preview.

Another point: Currently there is only path, that both allow to "check if exist in cache" and "get the file itself", in order to allow later more storage mecanism, it may make sense to split those 2 id :

grignards commented 3 years ago

diskcache seems to be the only serious module that could be used in preview generator:

It would lead to separate the "internal" cache from what is returned by preview generators: the preview files. The current get_xxx_preview methods could be kept and implemented by get_xxx_preview_stream (from the cache) + writing to disk.

def get_jpeg_preview_stream(file_path, id) -> io.BytesIO:
  try:
    return cache[id]
  except KeyError:
    with get_temp_jpeg_preview_file(file_path) as preview:
      cache.set(id, preview)
      return io.BytesIO(preview)

def get_jpeg_preview(file_path, id) -> str:
  contents = get_jpeg_preview_stream(file_path, id)
  preview_file_path = output_path + id
  with open(preview_file_path, "w") as f:
    f.write(contents)
  return preview_file_path

But we should encourage people to use the stream version if suitable as it has better performances.

inkhey commented 3 years ago

diskcache seems to be the only serious module that could be used in preview generator:

* supports caching with arbitrary keys

* `Cache` objects (many modules do have a "decorator-only" approach).

* already uses sqlite to keep the mapping + files for "big" (tun-able threshold) data.

* has extensible storage possibilities (by inheriting `Disk` class)

It would lead to separate the "internal" cache from what is returned by preview generators: the preview files. The current get_xxx_preview methods could be kept and implemented by get_xxx_preview_stream (from the cache) + writing to disk.

I'm really not sure about the idea of returning a stream. It seems to me that it fit not really correctly in your example.

def get_jpeg_preview_stream(file_path, id) -> io.BytesIO:
  try:
    return cache[id]
  except KeyError:
    with get_temp_jpeg_preview_file(file_path) as preview:
      cache.set(id, preview)
      return io.BytesIO(preview)

Doesn't converting to BytesIO mean storing the full preview in ram here ?

def get_jpeg_preview(file_path, id) -> str:
  contents = get_jpeg_preview_stream(file_path, id)
  preview_file_path = output_path + id
  with open(preview_file_path, "w") as f:
    f.write(contents)
  return preview_file_path

I see 2 problems here:

grignards commented 3 years ago

The pseudo-code I wrote was more to illustrate my ideas rather than be totally correct, I didn't reproduce all code, locking for instance. I think we can avoid keeping the entire preview in memory, but I do not see the use-case for having a huge (in terms of memory usage) preview which means "relatively small" for me (not like preview generator was branded as a universal file format conversion tool). And most of our builders do use a program which will load the entire file in memory.