cjac opened this issue 3 weeks ago
Both conda-forge and nvidia channels should be available by CDN via Cloudflare. I'm curious why, in this case, it appears to be going to Anaconda.org directly.
It might just be as easy as specifying a mirror in the call to mamba/conda.
https://github.com/cjac/initialization-actions/blob/rapids-20240806/rapids/rapids.sh#L473
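For instance, a minimal sketch of pinning an explicit mirror in the conda call — the `MIRROR_BASE` host here is a hypothetical placeholder, not a real mirror:

```shell
# Hypothetical regional mirror base URL -- replace with a real mirror if one exists.
MIRROR_BASE="https://conda-mirror.example.internal"

# --override-channels makes conda use only the listed channels,
# so no request falls through to conda.anaconda.org.
/opt/conda/miniconda3/bin/conda create -n dask-rapids -y \
  --override-channels \
  -c "${MIRROR_BASE}/conda-forge" \
  -c "${MIRROR_BASE}/nvidia" \
  -c "${MIRROR_BASE}/rapidsai" \
  'cuda-version>=12,<13' 'rapids>=24.08' 'python>=3.11'
```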
What I mean is this should already be happening by default. For example note the last line in the output below
$ curl -I https://conda.anaconda.org/conda-forge
HTTP/2 302
date: Sat, 26 Oct 2024 01:49:17 GMT
content-type: text/html; charset=utf-8
location: https://anaconda.org/conda-forge/repo?type=conda&label=main
…
server: cloudflare
The fact that the query above is not getting through suggests there is some other kind of network issue. Not sure if that is somewhere within CI or some other infrastructure between that build and the CDN (like some security protocol?).
It might be worth trying some simple network diagnostics at this point, outside of conda, to isolate issues like this.
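For example, a few basic checks outside of conda can separate DNS, TLS, and HTTP failures — the host name is taken from the error output below, and the tools are assumed to be available on the node:

```shell
HOST=conda.anaconda.org

# 1. DNS: does the name resolve at all?
getent hosts "$HOST" || echo "DNS lookup failed"

# 2. HTTPS: can we complete a TLS handshake and get a status code?
curl -sS -o /dev/null -w 'HTTP status: %{http_code}\n' \
  "https://${HOST}/conda-forge/linux-64/repodata.json" || echo "HTTPS request failed"

# 3. Routing: where does traffic stall if the above hangs?
traceroute -m 15 "$HOST" 2>/dev/null || echo "traceroute unavailable"
```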
Looking for: ["cuda-version[version='>=12,<13']", "rapids[version='>=24.08']", "dask[version='>=2024.7']", 'dask-bigquery', 'dask-ml', 'dask-sql', 'cudf', 'numba', "python[version='>=3.11']"]
conda-forge/linux-64
+ sync
+ [[ 1 == \0 ]]
+ test -d /opt/conda/miniconda3/envs/dask-rapids
+ /opt/conda/miniconda3/bin/conda config --set channel_priority flexible
+ for installer in "${mamba}" "${conda}"
+ /opt/conda/miniconda3/bin/conda create -m -n dask-rapids -y --no-channel-priority -c conda-forge -c nvidia -c rapidsai 'cuda-version>=12,<13' 'rapids>=24.08' 'dask>=2024.7' dask-bigquery dask-ml dask-sql cudf numba 'python>=3.11'
real 1m19.604s
user 0m0.326s
sys 0m0.048s
+ retval=1
+ cat /mnt/shm/install.log
Collecting package metadata (current_repodata.json): ...working... failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/current_repodata.json>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https://conda.anaconda.org/conda-forge/linux-64'
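Since the error message itself suggests retrying, one hedged mitigation (not a fix for the underlying network issue) is to raise conda's retry and timeout settings before the install. These are standard conda config keys; the values are illustrative:

```shell
# Give flaky connections more chances and more time before conda gives up.
conda config --set remote_max_retries 5
conda config --set remote_connect_timeout_secs 30
conda config --set remote_read_timeout_secs 120
```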
Hello folks, it looks like this is becoming a problem. I'm sorry for swamping your service. Let's get a regional conda mirror set up as part of the product I'm producing. Can you please direct me to the best instructions on mirroring the full conda archive? I will work on bringing up a load balancer to direct the traffic to our local mirror and take that load off of your infrastructure.
Were you able to run the command suggested above ( https://github.com/conda/infrastructure/issues/1051#issuecomment-2439170923 )?
It would be good to know if Cloudflare (the CDN provider used for conda-forge) is actually used in your case or not
oops! Sorry, I think I missed that.
Oh, sorry! I didn't know you were asking me to run that command from the context of one of the cluster nodes being installed to. Here is that output now.
cjac@cluster-1718310842-m:~$ curl -I https://conda.anaconda.org/conda-forge
HTTP/2 302
date: Thu, 31 Oct 2024 23:09:00 GMT
content-type: text/html; charset=utf-8
location: https://anaconda.org/conda-forge/repo?type=conda&label=main
cf-ray: 8db74f6c1bcb3101-LAX
cf-cache-status: DYNAMIC
strict-transport-security: max-age=15552000
content-security-policy: frame-ancestors 'self';
referrer-policy: no-referrer
x-content-type-options: nosniff
x-download-options: noopen
set-cookie: __cf_bm=.Is3CsF554BOaHnScWmISSkVQpl6Bnrsas5J5UFGXA0-1730416140-1.0.1.1-frOx3IudLF.K9RCGwdQgrurX.DlFsI1LpQNoPNEVzapNXoP9UU6rFC_QbyLo8sSWoJo_WsjrXuKfy9c8eZNFr2JQAS9.bH7bdHdxG0ZAoGw; path=/; expires=Thu, 31-Oct-24 23:39:00 GMT; domain=.anaconda.org; HttpOnly; Secure; SameSite=None
server: cloudflare
Do these channels make a difference? Are those mirrored as well?
-c conda-forge -c nvidia -c rapidsai
This looks like it might be what I need:
Sorry for being unclear. Thanks for the info! 🙏
OK, so you are able to reach the CDN through curl. I would think conda should as well. In other words, it doesn't look like a networking issue.
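One way to see whether conda's own HTTP stack (as opposed to curl) is failing is a verbose single-channel query — `-vv` turns on debug logging, and the package spec is just an example:

```shell
# Query only conda-forge, with verbose logging, to watch conda's own requests.
conda search --override-channels -c conda-forge -vv 'python>=3.11' 2>&1 | head -n 40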
Both conda-forge and nvidia are on the CDN. Currently rapidsai is not, but we plan to fix that: https://github.com/conda/infrastructure/issues/1055
Let's see if someone can help before going down the mirroring route
@jezdez could you please help us look into this?
Okay. I started down the mirroring route because it might be faster to have a local copy. Let me compare and let you know whether it's too much effort to maintain a mirror for use with my reproduction environment.
I've got a couple of files in my example. sync-mirror.sh is run on an instance created using create-conda-mirror.sh.
Please pardon the mess. I re-used some code I was using for a different purpose. The docs that I read about mirrors suggested that attaching GPUs to the mirror host might help accelerate things, too, so I used the latest rapids image and attached 4x T4s.
wow. It looks like I got cut off.
root@dpgce-conda-mirror-us-west4:~# links https://conda.anaconda.org/defaults/linux-64/repodata.json
+ /opt/conda/miniconda3/bin/conda-mirror -v --upstream-channel=conda-forge --upstream-channel=rapidsai --upstream-channel=nvidia --upstream-channel=defaults --platform=linux-64 --temp-directory=/mnt/shm --target-directory=/var/www/html --num-threads=7
Log level set to WARNING
Traceback (most recent call last):
File "/opt/conda/miniconda3/lib/python3.11/site-packages/conda_mirror/conda_mirror.py", line 635, in get_repodata
resp.raise_for_status()
File "/opt/conda/miniconda3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/repodata.json
It looks like I was attempting to mirror portions of the repo that I don't need and won't help our cache.
The current implementation looks promising. The first run resulted in a mirror of ~120 GB; I think it may have been the nvidia channel alone. I attempted to pass multiple instances of the --upstream-channel argument, and it took only the last one.
After learning from this mistake, I split the previous simple (and incorrect) single conda-mirror call into concurrent per-channel conda-mirror calls, each in its own screen tab. Since this is a long-running process, it's best not to have it fail when the terminal is detached. Once all of the tabs have completed, the screen session will terminate and return control to the sync-mirror.sh shell process.
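A sketch of that per-channel layout, reusing the conda-mirror flags shown earlier in the thread — the screen session name, per-channel target directories, and thread count are illustrative:

```shell
# One detached screen session; one window per channel so each sync
# survives terminal disconnects and can fail independently.
CHANNELS="conda-forge rapidsai nvidia"
screen -dmS conda-mirror
for ch in $CHANNELS; do
  screen -S conda-mirror -X screen -t "$ch" \
    /opt/conda/miniconda3/bin/conda-mirror -v \
      --upstream-channel="$ch" \
      --platform=linux-64 \
      --temp-directory=/mnt/shm \
      --target-directory="/var/www/html/$ch" \
      --num-threads=2
done
```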
I am about 20 minutes into this latest run. It picked up in the mirroring where it had left off despite the deletion of the previous VM that had been running it. I increased the memory and CPU count so that it can accommodate three concurrent conda-mirror processes. Here's a snapshot of disk usage.
root@dpgce-conda-mirror-us-west4:~# df -h /var/www/html
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 15T 130G 15T 1% /var/www/html
This question moved to a different forum
https://conda.anaconda.org/main/linux-64/repodata.json is the correct repodata URL for Anaconda Distribution
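A quick way to confirm the corrected URL resolves (expecting a 200, versus the 404 that `defaults` returned):

```shell
# "main" is the channel behind Anaconda Distribution; "defaults" is a
# client-side alias, not a channel that exists on conda.anaconda.org.
curl -sS -o /dev/null -w '%{http_code}\n' \
  https://conda.anaconda.org/main/linux-64/repodata.json
```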
@cjac I'm not aware of any throttling from GCP. The original issue seems to have been a transient connection error, is this really still happening from GCP? The channels are hosted on Cloudflare CDN.
For the other questions, if this relates to commercial support for GCP related services, this isn't the right repo to raise an issue, please reach out through your Anaconda support channels instead.
I have not tried to reproduce the issue yet. I'm going to finish building a mirror and use a locally mounted filesystem with the packages on it to provide the conda-forge, rapidsai and nvidia channels.
Once the mirror is up, probably by Monday, I will try the build of the rapids image again, this time using file:///var/www/html/«channel» instead of https://conda.anaconda.org/«channel».
I can then share the example instruction on how to build and utilize a conda mirror, and close this issue.
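Assuming the mirror lands under /var/www/html as described above, the rebuilt install call might look like this (package list abbreviated from the earlier conda create invocation):

```shell
# Point conda only at the local file:// channels so nothing reaches
# conda.anaconda.org during the image build.
conda create -n dask-rapids -y --override-channels \
  -c file:///var/www/html/conda-forge \
  -c file:///var/www/html/nvidia \
  -c file:///var/www/html/rapidsai \
  'cuda-version>=12,<13' 'rapids>=24.08' 'python>=3.11'
```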
The mirror has been built, but it seems conda does an extra write of ~15 GB to the temp directory, much of which could be skipped when the source is on a file:// path.
In any case, the code which I used to build the anaconda mirror can be found here:
https://github.com/cjac/dataproc-repro/blob/conda-mirror-20241031/lib/mirror/sync-conda.pl
On a 96-core machine, I believe it could mirror the channels we use in about 8 hours.
What is the idea?
Hello folks,
I've been maintaining the github.com/GoogleCloudDataproc/initialization-actions repository for a while now, and I'm seeing some flaky tests. The tests install dask from conda.anaconda.org. Would we be able to avoid this by using a regional GCP mirror of the conda packages? How complex is it to maintain a mirror with CVE updates?
Why is this needed?
Reduce load on the global mirrors and keep installer traffic local to GCP.
What should happen?
A mirror with CVE updates, created for each GCP region.
Additional Context
Tests were run during work on this pull request.
https://github.com/GoogleCloudDataproc/initialization-actions/pull/1219