dragonflyoss / Dragonfly2

Dragonfly is an open source P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.
https://d7y.io
Apache License 2.0

dragonfly client seems to not utilize huggingface caches #3636

Open Yeosangho opened 2 days ago

Yeosangho commented 2 days ago

Bug report:

Hello, I am a first-time user of Dragonfly. We are planning to use Dragonfly to set up a proxy service for Hugging Face. Following the documentation, I succeeded in fetching Hugging Face repos through the Dragonfly client.

However, after deleting the local Hugging Face cache (rm -rf ~/.cache/huggingface) and re-running the test code, I confirmed that the download speed is almost the same as the initial repository download. The strange thing is that on the Dragonfly client (i.e., peer), a cache for the repository is created in /var/lib/dragonfly/contents/task with each download.

So I wonder why the repository cache is not being used.

How to reproduce it:

Cluster setup: I followed the Hugging Face integration manual (https://d7y.io/docs/next/operations/integrations/hugging-face/).

Helm values and code:

values.yaml

manager:
  image:
    repository: dragonflyoss/manager
    tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

scheduler:
  image:
    repository: dragonflyoss/scheduler
    tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedClient:
  replicas: 1
  hostNetwork: false
  image:
    repository: dragonflyoss/client
    tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    seedPeer:
      # -- enable indicates whether to enable the seed peer.
      enable: true
    download:
      server:
        # -- socketPath is the unix socket path for dfdaemon GRPC service.
        socketPath: /var/run/dragonfly/dfdaemon.sock
      # -- rateLimit is the default rate limit of the download speed in GiB/MiB/KiB per second; default is 50GiB/s.
      rateLimit: 50GiB
      # -- pieceTimeout is the timeout for downloading a piece from the source.
      pieceTimeout: 30s
      # -- concurrentPieceCount is the number of concurrent pieces to download.
      concurrentPieceCount: 15
    proxy:
      server:
        port: 4001
      registryMirror:
        addr: http://cdn-lfs-us-1.hf.co
      rules:
        - regex: ".*"
          useTLS: true

client:
  enabled: true
  image:
    repository: dragonflyoss/client
    tag: latest
  hostNetwork: true
  metrics:
    enable: true

  config:
    verbose: true
    seedPeer:
      # -- enable indicates whether to enable the seed peer.
      enable: true
    download:
      server:
        # -- socketPath is the unix socket path for dfdaemon GRPC service.
        socketPath: /var/run/dragonfly/dfdaemon.sock
      # -- rateLimit is the default rate limit of the download speed in GiB/MiB/KiB per second; default is 50GiB/s.
      rateLimit: 50GiB
      # -- pieceTimeout is the timeout for downloading a piece from the source.
      pieceTimeout: 30s
      # -- concurrentPieceCount is the number of concurrent pieces to download.
      concurrentPieceCount: 15
    proxy:
      server:
        port: 4001
      registryMirror:
        addr: http://cdn-lfs-us-1.hf.co
      rules:
        - regex: ".*"
          useTLS: true

hf_download.py (I changed the CDN address because the example repo resides on a different Hugging Face CDN.)

import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse
from huggingface_hub import hf_hub_download
from huggingface_hub import configure_http_backend
import urllib3
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

class DragonflyAdapter(HTTPAdapter):
    def get_connection(self, url, proxies=None):
        # Change the schema of the LFS request to download large files from https:// to http://,
        # so that Dragonfly HTTP proxy can be used.
        if url.startswith('https://cdn-lfs-us-1.hf.co'):
            url = url.replace('https://', 'http://')
        return super().get_connection(url, proxies)

    def send(self, request, **kwargs):  # get_connection does not convert https to http, so I also override send().
        if request.url.startswith('https://'):
            #if 'cdn-lfs.hf.co' in request.url:
            if 'cdn-lfs-us-1.hf.co' in request.url:  # this URL is the CDN of the current target Hugging Face repo.
                logger.debug(f"Original URL: {request.url}")
                request.url = request.url.replace('https://', 'http://')
                logger.debug(f"Converted URL: {request.url}")
        return super().send(request, **kwargs)

    def add_headers(self, request, **kwargs):
        super().add_headers(request, **kwargs)
        if request.url.find('example.com') != -1:
            request.headers["X-Dragonfly-Registry"] = 'https://example.com'

def backend_factory() -> requests.Session:
    session = requests.Session()
    session.mount('http://', DragonflyAdapter())
    session.mount('https://', DragonflyAdapter())
    session.proxies = {
        'http': 'http://192.168.2.41:4001',
        #'https': 'http://192.168.2.41:4001'  # I disabled the https case because https is converted to http by the overridden send method.
    }
    session.verify = False
    return session

configure_http_backend(backend_factory=backend_factory)

try:
    hf_hub_download(
        repo_id="OuteAI/OuteTTS-0.1-350M",
        filename="model.safetensors",
        local_files_only=False
    )
except Exception as e:
    print(f"Download error: {e}")

To connect to the client, we apply the following Service spec for the kind cluster.

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-4001
      nodePort: 30950
      port: 4001
  selector:
    app: dragonfly
    component: client
    release: dragonfly

Lastly, we run the code with an increased read timeout to avoid read timeouts during the Hugging Face file download.

HF_HUB_DOWNLOAD_TIMEOUT=1000 python hf_download.py 

Test results.

hf_hub_download without proxy: 65 seconds (I deleted the session proxies in the code)


hf_hub_download with proxy, first trial (i.e., supposed not to use the cache): 41 seconds


After rm -rf ~/.cache/huggingface, hf_hub_download with proxy, second trial (i.e., supposed to use the cache): 40 seconds


In addition, I tested cases with concurrentPieceCount set to 1 in both the seed client and the client, to isolate the impact of the cache.

hf_hub_download with proxy, first trial (i.e., supposed not to use the cache), concurrentPieceCount 1: 231 seconds


After rm -rf ~/.cache/huggingface, hf_hub_download with proxy, second trial (i.e., supposed to use the cache), concurrentPieceCount 1: 202-236 seconds


I'm looking forward to your answer! 👍

Environment:

gaius-qi commented 1 day ago

@Yeosangho Please provide the client logs (/var/log/dragonfly/dfdaemon/*.log), thanks.

Yeosangho commented 1 day ago

Hello @gaius-qi, first of all, thank you for your reply and attention.

When I ran hf_hub_download for the first time, it logged the following info (I generated the log files as you suggested, i.e., cat /var/log/dragonfly/dfdaemon/*.log): kind-worker2-client(first trial).log kind-worker-client(first trial).log

After that, I removed the caches in the local environment that runs the Python code and ran the same code again. These are the logs for the second trial (they include the logs of the first trial!): kind-worker2-client(second trial).log kind-worker-client(second trial).log

Yeosangho commented 1 day ago

I finally discovered why the cache is not being used with hf_hub_download.

The issue occurs because download requests from hf_hub_download are redirected from huggingface.co to its CDN. All redirected requests are recognized as different tasks by Dragonfly, even when they access files from the same repository, which prevents the cache from being used.
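To illustrate my hypothesis (a toy sketch only, not Dragonfly's actual task-ID algorithm): the redirected CDN URLs typically carry expiring signed query parameters, so any identifier derived from the full redirected URL will differ between downloads of the same file. The URLs below are made up for illustration.

import hashlib

def task_key(url: str) -> str:
    # Hypothetical task key computed over the whole URL, including query parameters.
    return hashlib.sha256(url.encode()).hexdigest()[:16]

first = "http://cdn-lfs-us-1.hf.co/repos/aa/bb/model.safetensors?Expires=1111&Signature=aaa"
second = "http://cdn-lfs-us-1.hf.co/repos/aa/bb/model.safetensors?Expires=2222&Signature=bbb"

# Same file, but the signed query parameters change on every redirect,
# so the key never matches the previous download and the cache is missed.
print(task_key(first) == task_key(second))  # False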

I was able to confirm this behavior when I modified both the Hugging Face CDN address and huggingface.co to route through the Dragonfly proxy for closed-network testing.

In this case, the redirection from huggingface.co to the CDN happens inside the Dragonfly proxy, and the user receives a 200 response with the huggingface.co URL. Since this doesn't match the response hf_hub_download expects (a 301 redirect to the CDN address), the download fails through hf_hub_download, but the files can still be fetched with curl or plain requests calls. For example, a file can be downloaded with the following curl command:

curl -O -k -v -x {dragonfly proxy url} http://huggingface.co/OuteAI/OuteTTS-0.1-350M/model.safetensors

With this approach, Dragonfly's caching and relay functions work properly. However, the main issues are that Hugging Face models are typically accessed through the huggingface_hub Python library rather than curl or requests, and that using curl/requests requires the user to know the exact filename stored in the repository. This makes the approach very poor from a user-experience perspective.
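For reference, this is a minimal way to observe the difference described above (a hypothetical snippet; the repo, file, and proxy address are taken from the earlier examples in this issue and may need adjusting):

import requests

url = "http://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/model.safetensors"
proxies = {"http": "http://192.168.2.41:4001"}

# Disable redirect following so the raw response is visible.
resp = requests.get(url, proxies=proxies, allow_redirects=False, stream=True)

# Directly against huggingface.co this should be a redirect with a Location header
# pointing at the CDN; through the Dragonfly proxy in the closed network, the proxy
# follows the redirect itself and answers 200 with no Location header.
print(resp.status_code, resp.headers.get("Location"))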

To summarize, there are two issues with Dragonfly regarding huggingface_hub's CDN data downloads:

  1. huggingface_hub's CDN redirect process prevents Dragonfly from utilizing its cache system.
  2. When Dragonfly is used as a proxy for Hugging Face in a closed-network environment, it processes huggingface.co's CDN redirection internally and returns a 200 response to the client, which makes it impossible to download models with huggingface_hub. Models can still be downloaded with curl or requests, and Dragonfly's cache works in that scenario, but it results in a poor developer experience.

In our situation, the second issue, regarding closed-network usage, needs to be addressed first.

Yeosangho commented 1 day ago

Here is the code I use to work around the problem. It uses the cache and also proxies Hugging Face repo downloads inside a closed network.

import requests
import json
import logging
import os
from tqdm import tqdm

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def get_repo_files(repo_id, revision="main"):
   url = f"http://huggingface.co/api/models/{repo_id}/tree/{revision}"

   proxies = {
       "http": "http://192.168.2.41:4001",
       "https": "http://192.168.2.41:4001"
   }

   headers = {
       "Accept": "application/json",
       "User-Agent": "python-requests/2.31.0"
   }

   try:
       response = requests.get(
           url,
           proxies=proxies,
           headers=headers,
           verify=False
       )

       logger.debug(f"Response status: {response.status_code}")

       if response.status_code == 200:
           files = response.json()
           for file in files:
               print(f"Type: {file.get('type')}")
               print(f"Path: {file.get('path')}")
               print(f"Size: {file.get('size', 'N/A')} bytes")
               print(f"Last Modified: {file.get('lastModified', 'N/A')}")
               print("---")
           return files
       else:
           logger.error(f"Error: {response.status_code}")
           return None

   except Exception as e:
       logger.error(f"Request failed: {e}")
       return None

def download_file(repo_id, file_path, output_dir=None):
   """
   Download a specific file from the repository
   """
   if output_dir:
       os.makedirs(output_dir, exist_ok=True)

   url = f"http://huggingface.co/{repo_id}/resolve/main/{file_path}"

   proxies = {
       "http": "http://192.168.2.41:4001",
       "https": "http://192.168.2.41:4001"
   }

   try:
       response = requests.get(
           url,
           proxies=proxies,
           verify=False,
           stream=True
       )

       if response.status_code == 200:
           file_size = int(response.headers.get('content-length', 0))

           # Get filename from file_path
           filename = os.path.basename(file_path)
           if output_dir:
               filename = os.path.join(output_dir, filename)

           with open(filename, 'wb') as f, tqdm(
               desc=filename,
               total=file_size,
               unit='iB',
               unit_scale=True,
               unit_divisor=1024,
           ) as pbar:
               for data in response.iter_content(chunk_size=1024):
                   size = f.write(data)
                   pbar.update(size)

           logger.info(f"Successfully downloaded: {filename}")
           return filename
       else:
           logger.error(f"Error downloading file: {response.status_code}")
           return None

   except Exception as e:
       logger.error(f"Download failed: {e}")
       return None

def download_repo_files(repo_id, output_dir=None, file_types=None):
   """
   Download all files from the repository
   file_types: List of file extensions to download (e.g., ['.bin', '.json'])
   """
   files = get_repo_files(repo_id)
   if not files:
       return

   downloaded_files = []
   for file in files:
       file_path = file.get('path')

       # Skip if file_types is specified and file doesn't match
       if file_types:
           if not any(file_path.endswith(ft) for ft in file_types):
               continue

       result = download_file(repo_id, file_path, output_dir)
       if result:
           downloaded_files.append(result)

   return downloaded_files

# Download all files
download_repo_files("tiiuae/falcon-rw-1b", "output_dir")

Below are tests that proxy Hugging Face through a single seed client (i.e., without any peer clients). The measured time of each trial only covers the model download (2.44 GB), although the other files in the repo were downloaded as well.

First trial in the closed network: 150 seconds


Second trial in the closed network: 23 seconds

Yeosangho commented 1 day ago

Additionally, Dragonfly can replace JFrog's Hugging Face proxy functionality! I was looking for a tool for exactly this purpose. If you've researched JFrog, you'll know that its Hugging Face repository feature only works with an expensive commercial license.

If anyone reading this is facing the same concerns I did, I recommend configuring Dragonfly with a single seed client, not deploying any peer clients, and using the seed client as the proxy instead of the peers.

This setup allows you to use Dragonfly's Hugging Face proxy functionality without the P2P feature.
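As a rough sketch of that setup (using only keys that already appear in the values.yaml above):

seedClient:
  replicas: 1
  config:
    seedPeer:
      # -- enable indicates whether to enable the seed peer.
      enable: true

client:
  # -- disable the peer client; the seed client alone serves as the proxy.
  enabled: false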

Furthermore, Dragonfly can serve both the repo data and the Hugging Face API results if they are already cached on the client, even when the client cannot reach huggingface.co or its CDN.