EpicWink / proxpi

PyPI caching proxy
MIT License

long-term use of proxpi on a low-resource server #46

Open berouques opened 3 months ago

berouques commented 3 months ago

dear EpicWink, i am considering using your caching proxy proxpi on my field (semi-portable) NAS. the server is solely for my use, and it has a pretty weak CPU and 1.5GB of RAM.

i wonder if your caching proxy is suitable for long-term cache storage in case of internet disruptions. for instance, if i set PROXPI_INDEX_TTL=6570000 (several years) and accumulate cache over a few months (and after numerous power cycles), will the server operate normally? will i be able to use the cached files for up to a year?

i would appreciate any information or recommendations you could provide. best regards, ~le berouque

EDIT: added remarks about "semi-portable" and "power cycles"

berouques commented 3 months ago

well...

i cached some packages and rebooted my "field" NAS (with proxpi configured as a service).

Before the reboot:

After reboot:

as part of the experiment, I am going to run the script "pypi_warming_up.ps1" on my win10 PC, which installs the "top 8000 most popular python packages" according to hugovk. pip is configured to query the NAS first and, if that times out after 120 seconds, to fall back to the PyPI repo.

staying tuned!

EpicWink commented 3 months ago

There are two caches:

The index cache has a TTL, which invalidates the cache on next access (i.e. a download attempt from a client like pip). This has no memory bound, so for a sufficiently large TTL it can cause a MemoryError. This cache is not saved to disk, so it is wiped on server restart.

The files cache has a configurable max disk usage, and should use very little memory. It is also resilient to server restarts.
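
As a rough illustration of the index cache behaviour described above (not proxpi's actual code), a TTL cache that is checked on access might look like the sketch below. The class name `TTLIndexCache`, the `fetch` callable and the injectable `clock` are all made up for the example.

```python
import time
import typing as t


class TTLIndexCache:
    """Toy in-memory cache whose entries expire ``ttl`` seconds after being fetched."""

    def __init__(
        self,
        fetch: t.Callable[[str], t.List[str]],  # hypothetical upstream lookup
        ttl: float,
        clock: t.Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        # project name -> (time fetched, list of file names)
        self._entries: t.Dict[str, t.Tuple[float, t.List[str]]] = {}

    def get(self, name: str) -> t.List[str]:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: serve from memory
        # Missing or expired: refetch on this access and overwrite the entry.
        files = self._fetch(name)
        self._entries[name] = (now, files)
        return files
```

Entries in this sketch are only replaced on access and never removed, so the dictionary grows with every distinct project requested, and because it lives only in process memory a restart begins from empty.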

If you need a persistent cache that survives server restarts, proxpi is not what you need; it's designed as an optimistic proxy foremost. You could check out some of the alternatives, especially devpi.

[^1]: aka packages

berouques commented 3 months ago

the alternatives don't work for me because they are performance-oriented, and i need data availability — performance doesn't really bother me. when there is no internet access, slow index searches are not the biggest problem. on the other hand, NAS resources consumption does bother me.

please tell me, are there architectural obstacles in your application to saving the index on disk and loading it when needed? i want to know this before making changes to the code.

EpicWink commented 3 months ago

> are there architectural obstacles in your application to saving the index on disk and loading it when needed?

I'm not sure. If you keep the API of proxpi._cache._IndexCache the same, and replace the _index and _packages dict attributes with some file-based storage, it should work. You will of course need to change the eviction to not just be a time-to-live.
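
A minimal sketch of what "some file-based storage" could look like, assuming the attribute only needs dict-style set, get and membership access. `DiskBackedIndex` is a hypothetical name, the standard-library `shelve` module is just one possible backend, and nothing here reflects the real `_IndexCache` internals; eviction and thread-safety are left out entirely.

```python
import shelve
import typing as t


class DiskBackedIndex:
    """Dict-like store that keeps entries in a file via ``shelve`` (illustrative only)."""

    def __init__(self, path: str):
        self._path = path  # base file name; the dbm backend may add a suffix

    def __setitem__(self, name: str, files: t.List[str]) -> None:
        # Open, write, and close on every call so data reaches disk promptly.
        with shelve.open(self._path) as db:
            db[name] = files

    def __getitem__(self, name: str) -> t.List[str]:
        with shelve.open(self._path) as db:
            return db[name]  # raises KeyError if absent, like a dict

    def __contains__(self, name: str) -> bool:
        with shelve.open(self._path) as db:
            return name in db
```

Reopening the file for every operation is slow but keeps memory use low and means a power cut loses at most the write in progress, which seems closer to the availability-over-performance goal stated earlier in the thread.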

I won't merge any PR that makes this change (unless you create a subclass of _IndexCache, and enable it via a configuration flag) as the in-memory cache is necessary for simplicity (and therefore reliability of the code) and performance. I'm happy to help and answer questions if you simply want to make a fork to suit your requirements.

berouques commented 2 months ago

Good afternoon. In the server.py file you wrote:

    @app.route("/index/<package_name>/<file_name>")
    def get_file(package_name: str, file_name: str):
        ...
        if scheme and scheme != "file":
            return flask.redirect(path)

However, this does not allow you to use the app on Windows. I rewrote this part like this:

    if scheme in ['http', 'https', 'ftp']:
        return flask.redirect(path)

So it works now in Windows as well, but I am not sure if this could cause any problems?

EpicWink commented 2 months ago

`if scheme and scheme != "file":` is intended to treat all URLs as redirect targets, and all paths as files to be served.
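
For illustration, the Windows failure can be reproduced with the standard library alone: `urllib.parse.urlparse` reads the drive letter of a Windows path as a URL scheme, so the original check sends the path to the redirect branch. The two helper functions and the example path and URL below are made up for the demonstration.

```python
import urllib.parse


def is_redirect_original(path: str) -> bool:
    """The check quoted above: any scheme other than "file" means redirect."""
    scheme = urllib.parse.urlparse(path).scheme
    return bool(scheme and scheme != "file")


def is_redirect_allowlist(path: str) -> bool:
    """The rewritten check: only known URL schemes mean redirect."""
    return urllib.parse.urlparse(path).scheme in ["http", "https", "ftp"]


windows_path = r"C:\cache\proxpi-1.0.tar.gz"  # hypothetical cached file
url = "https://example.com/packages/proxpi-1.0.tar.gz"  # hypothetical upstream URL

# The drive letter is parsed as a (lower-cased) scheme.
print(urllib.parse.urlparse(windows_path).scheme)  # prints "c"

print(is_redirect_original(windows_path))   # True: wrongly treated as a URL
print(is_redirect_allowlist(windows_path))  # False: served as a file
print(is_redirect_original(url), is_redirect_allowlist(url))  # True True
```

The allowlist avoids the drive-letter problem, but it still relies on parsing a string to decide what kind of value it is.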

I think a better solution is to return a different type (eg pathlib.Path) rather than requiring the server to always parse a string:

Diff:

```diff
diff --git a/src/proxpi/_cache.py b/src/proxpi/_cache.py
index 3e09f51..85049c5 100644
--- a/src/proxpi/_cache.py
+++ b/src/proxpi/_cache.py
@@ -7,6 +7,7 @@ import abc
 import time
 import shutil
 import logging
+import pathlib
 import tempfile
 import warnings
 import functools
@@ -719,13 +720,13 @@ class _FileCache:
             return True  # default to original URL (due to timeout or HTTP error)
         return False

-    def _get_cached(self, url: str) -> t.Union[str, None]:
+    def _get_cached(self, url: str) -> t.Union[pathlib.Path, None]:
         """Get file from cache."""
         if url in self._files:
             file = self._files[url]
             assert isinstance(file, _CachedFile)
             file.n_hits += 1
-            return file.path
+            return pathlib.Path(file.path)
         return None

     def _start_downloading(self, url: str):
@@ -751,7 +752,7 @@ class _FileCache:
                 os.unlink(file.path)
                 existing_size -= file.size

-    def get(self, url: str) -> str:
+    def get(self, url: str) -> t.Union[str, pathlib.Path]:
         """Get a file using or updating cache.

         Args:
@@ -884,7 +885,7 @@ class Cache:
                 raise exc
         return files

-    def get_file(self, package_name: str, file_name: str) -> str:
+    def get_file(self, package_name: str, file_name: str) -> t.Union[str, pathlib.Path]:
         """Get a file.

         Args:
diff --git a/src/proxpi/server.py b/src/proxpi/server.py
index 1124eca..69c754b 100644
--- a/src/proxpi/server.py
+++ b/src/proxpi/server.py
@@ -4,8 +4,8 @@ import os
 import gzip
 import zlib
 import logging
+import pathlib
 import typing as t
-import urllib.parse

 import flask
 import jinja2
@@ -203,8 +203,7 @@ def get_file(package_name: str, file_name: str):
     except _cache.NotFound:
         flask.abort(404)
         raise
-    scheme = urllib.parse.urlparse(path).scheme
-    if scheme and scheme != "file":
+    if not isinstance(path, pathlib.Path):
         return flask.redirect(path)
     return flask.send_file(path, mimetype=_file_mime_type)
```

See #48