Yeosangho opened 2 days ago
@Yeosangho Please provide the client logs (/var/log/dragonfly/dfdaemon/*.log), thanks.
Hello @gaius-qi, first of all, thank you for your reply and attention.
When I run hf_hub_download for the first time, it logs the following info. (I made the log files as you suggested, i.e., cat /var/log/dragonfly/dfdaemon/*.log.) kind-worker2-client(first trial).log kind-worker-client(first trial).log
After that, I removed the caches in the local environment that runs the Python code and ran the same code again. These are the logs for the second trial (they also include the logs of the first trial!): kind-worker2-client(second trial).log kind-worker-client(second trial).log
I finally discovered why the cache is not being used by hf_hub_download.
The issue occurs because download requests from hf_hub_download are redirected from huggingface.co to its CDN. All redirected requests are recognized as different tasks by Dragonfly, even when they access files from the same repository, which prevents cache usage.
I was able to confirm this behavior when I modified both Hugging Face's CDN address and huggingface.co to route through the Dragonfly proxy for closed-network testing.
In this case, the redirection from huggingface.co to the CDN occurs inside the Dragonfly proxy, and users receive a 200 response code for the huggingface.co URL. Since this doesn't match the response hf_hub_download expects (a 301 response redirecting to the CDN address), downloading files through hf_hub_download fails, but the files can still be downloaded with curl or requests calls. For example, files can be downloaded using the following curl command:
curl -O -k -v -x {dragonfly proxy url} http://huggingface.co/OuteAI/OuteTTS-0.1-350M/model.safetensors
With this approach, Dragonfly's caching and relay functions work properly. However, the main issues are that Hugging Face models are typically accessed through the huggingface_hub Python library rather than curl or requests, and using curl/requests requires users to know the exact filenames stored in the repository. This makes the approach very poor from a user-experience perspective.
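For reference, here is a minimal sketch (an assumed illustration, not part of my test code) of the redirect that hf_hub_download normally follows; the CDN Location header typically carries per-request signed query parameters, which appears to be why Dragonfly treats each redirected URL as a separate task:

import requests

# Assumed illustration: request the file without following redirects so the CDN hop is visible.
resp = requests.get(
    "https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/model.safetensors",
    allow_redirects=False,
)
print(resp.status_code)               # redirect status returned by huggingface.co
print(resp.headers.get("Location"))   # CDN URL, usually with signed query parameters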
To summarize, there are two issues with Dragonfly regarding huggingface_hub's CDN data downloads:
1. Requests redirected to the CDN are recognized as different tasks, so the cache is not reused even for the same file in the same repository.
2. In a closed network where both huggingface.co and the CDN are routed through the Dragonfly proxy, the redirect is resolved inside the proxy and the client receives a 200 response for the huggingface.co URL, which hf_hub_download does not handle.
In our situation, the second issue regarding closed-network usage needs to be addressed first.
Here is the code I use to bypass the problem for now. It uses the cache and also proxies Hugging Face repo downloads inside a closed network.
import requests
import json
import logging
import os
from tqdm import tqdm

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


def get_repo_files(repo_id, revision="main"):
    """List the files in a repository via the Hugging Face tree API, routed through the proxy."""
    url = f"http://huggingface.co/api/models/{repo_id}/tree/{revision}"
    proxies = {
        "http": "http://192.168.2.41:4001",
        "https": "http://192.168.2.41:4001"
    }
    headers = {
        "Accept": "application/json",
        "User-Agent": "python-requests/2.31.0"
    }
    try:
        response = requests.get(
            url,
            proxies=proxies,
            headers=headers,
            verify=False
        )
        logger.debug(f"Response status: {response.status_code}")
        if response.status_code == 200:
            files = response.json()
            for file in files:
                print(f"Type: {file.get('type')}")
                print(f"Path: {file.get('path')}")
                print(f"Size: {file.get('size', 'N/A')} bytes")
                print(f"Last Modified: {file.get('lastModified', 'N/A')}")
                print("---")
            return files
        else:
            logger.error(f"Error: {response.status_code}")
            return None
    except Exception as e:
        logger.error(f"Request failed: {e}")
        return None


def download_file(repo_id, file_path, output_dir=None):
    """
    Download a specific file from the repository
    """
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    url = f"http://huggingface.co/{repo_id}/resolve/main/{file_path}"
    proxies = {
        "http": "http://192.168.2.41:4001",
        "https": "http://192.168.2.41:4001"
    }
    try:
        response = requests.get(
            url,
            proxies=proxies,
            verify=False,
            stream=True
        )
        if response.status_code == 200:
            file_size = int(response.headers.get('content-length', 0))
            # Get filename from file_path
            filename = os.path.basename(file_path)
            if output_dir:
                filename = os.path.join(output_dir, filename)
            # Stream the file to disk with a progress bar
            with open(filename, 'wb') as f, tqdm(
                desc=filename,
                total=file_size,
                unit='iB',
                unit_scale=True,
                unit_divisor=1024,
            ) as pbar:
                for data in response.iter_content(chunk_size=1024):
                    size = f.write(data)
                    pbar.update(size)
            logger.info(f"Successfully downloaded: {filename}")
            return filename
        else:
            logger.error(f"Error downloading file: {response.status_code}")
            return None
    except Exception as e:
        logger.error(f"Download failed: {e}")
        return None


def download_repo_files(repo_id, output_dir=None, file_types=None):
    """
    Download all files from the repository
    file_types: List of file extensions to download (e.g., ['.bin', '.json'])
    """
    files = get_repo_files(repo_id)
    if not files:
        return
    downloaded_files = []
    for file in files:
        file_path = file.get('path')
        # Skip if file_types is specified and the file doesn't match
        if file_types:
            if not any(file_path.endswith(ft) for ft in file_types):
                continue
        result = download_file(repo_id, file_path, output_dir)
        if result:
            downloaded_files.append(result)
    return downloaded_files


# Download all files
download_repo_files("tiiuae/falcon-rw-1b", "output_dir")
Below are tests that proxy Hugging Face through a single seed client (i.e., without any peer clients). The measured time for each trial counts only the model download (2.44 GB), although other files in the repo were downloaded as well.
First trial in closed network: 150 seconds
Second trial in closed network: 23 seconds
Additionally, Dragonfly can replace JFrog's Hugging Face proxy functionality! I had also been looking for a tool for this specific purpose. If you've researched JFrog, you'll know that its Hugging Face repository feature only works with expensive commercial licenses.
If any of you reading this are facing similar concerns to mine, I recommend configuring Dragonfly with a single seed client, not deploying any peer clients, and using the seed client itself as the proxy.
This setup allows you to use Dragonfly's Hugging Face proxy functionality without the P2P feature.
Furthermore, Dragonfly can serve both the repo data and the Hugging Face API results if they are already cached on the client, even when the client cannot reach huggingface.co or its CDN.
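If you go this route, a hedged sketch of the idea (the seed-client address is a placeholder; port 4001 matches the proxy port used in the code above) is to build a shared requests session pointed at the seed client and reuse it for both the API and file requests:

import requests

# Hedged sketch: a shared session that routes all traffic through the seed client's proxy.
def make_seed_client_session(proxy_url="http://<seed-client-ip>:4001"):
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    session.verify = False  # matches the verify=False used in the helpers above
    return session

session = make_seed_client_session("http://192.168.2.41:4001")
resp = session.get("http://huggingface.co/api/models/tiiuae/falcon-rw-1b/tree/main")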
Bug report:
Hello, I am a first-time user of Dragonfly. We are planning to use Dragonfly to set up a proxy service for Hugging Face. Following the manual, I succeeded in fetching Hugging Face repos through the Dragonfly client.
However, after deleting the Hugging Face repository cache locally (rm -rf ~/.cache/huggingface) and re-running the test code, I confirmed that the download speed is almost the same as the initial repository download. The strange thing is that on the Dragonfly client (i.e., the peer), a cache for the repository is created in /var/lib/dragonfly/contents/task with each download.
So I wonder why it does not use the cached repo.
How to reproduce it:
Cluster setting: I followed the Hugging Face integration manual (https://d7y.io/docs/next/operations/integrations/hugging-face/)
Helm values and code:
values.yaml
hf_download.py (I changed the CDN address because the example repo resides on a different Hugging Face CDN.)
To connect to the client, we apply a Service spec for the kind cluster.
Lastly, we run the code with an increased read timeout to avoid read timeouts during the Hugging Face file download (a rough sketch of this setup is shown after this list).
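For context, here is a minimal sketch of what hf_download.py roughly looks like under the setup from the linked manual. This is an assumed illustration, not the exact script: the proxy address matches the one used earlier in this thread, and the repo, filename, and timeout values are placeholders.

import requests
from huggingface_hub import configure_http_backend, hf_hub_download

# Route huggingface_hub's HTTP traffic through the Dragonfly proxy.
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies = {
        "http": "http://192.168.2.41:4001",
        "https": "http://192.168.2.41:4001",
    }
    session.verify = False
    return session

configure_http_backend(backend_factory=backend_factory)

hf_hub_download(
    repo_id="tiiuae/falcon-rw-1b",    # placeholder repo
    filename="pytorch_model.bin",      # placeholder filename
    etag_timeout=60,                   # one place a timeout can be raised; the exact mechanism used in the test may differ
)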
Test results.
hf_hub_download without proxy: 65 seconds (I deleted the session proxies in the code)
hf_hub_download with proxy, first trial (i.e., supposed not to use the cache): 41 seconds
After rm -rf ~/.cache/huggingface, hf_hub_download with proxy, second trial (i.e., supposed to use the cache): 40 seconds
In addition, I tested further cases with concurrentPieceCount set to 1 in both the seed client and the client to check the impact of the cache alone.
hf_hub_download with proxy, first trial (i.e., supposed not to use the cache), concurrentPieceCount 1: 231 seconds
After rm -rf ~/.cache/huggingface, hf_hub_download with proxy, second trial (i.e., supposed to use the cache), concurrentPieceCount 1: 202-236 seconds
I'm looking forward to your answer!!! 👍
Environment:
uname -a: Linux vnode1.pnode3.idc1.ten1010.io 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux