[gcp-dataproc] Dataproc open source component integration test flakiness - regional mirroring?

cjac commented 3 weeks ago

Checklist

[x] I added a descriptive title
[x] I searched open requests and couldn't find a duplicate

What is the idea?

Hello folks,

I've been maintaining the github.com/GoogleCloudDataproc/initialization-actions repository for a while now, and I'm seeing some flaky tests. The tests install dask from conda.anaconda.org. Could we avoid this by using a regional GCP mirror of the conda packages? How complex is it to maintain a mirror with CVE updates?

+ /opt/conda/default/bin/mamba create -m -n dask -y --no-channel-priority -c conda-forge -c nvidia 'cuda-version>=12,<=12.5' 'dask>=2024.5' dask-bigquery dask-ml dask-sql python=3.10
Download error (28) Timeout was reached [https://conda.anaconda.org/conda-forge/noarch/repodata.json.zst]
Failed to connect to conda.anaconda.org port 443 after 262119 ms: Couldn't connect to server

Why is this needed?

Reduce load on the global mirrors and keep the installer's resources local to GCP.

What should happen?

A mirror with CVE updates is created for each GCP region.

Additional Context

Tests were run during work on this pull request.

https://github.com/GoogleCloudDataproc/initialization-actions/pull/1219

jakirkham commented 3 weeks ago

Both conda-forge and nvidia channels should be available by CDN via Cloudflare. Am curious why in this case it appears to be going to Anaconda.org directly?

cjac commented 3 weeks ago

It might just be as easy as specifying a mirror in the call to mamba/conda.

https://github.com/cjac/initialization-actions/blob/rapids-20240806/rapids/rapids.sh#L473

On Fri, Oct 25, 2024, 18:21 jakirkham @.***> wrote:

Both conda-forge and nvidia channels should be available by CDN via Cloudflare. Am curious why in this case it appears to be going to Anaconda.org directly?


jakirkham commented 3 weeks ago

What I mean is that this should already be happening by default. For example, note the last line in the output below:

$ curl -I https://conda.anaconda.org/conda-forge 
HTTP/2 302 
date: Sat, 26 Oct 2024 01:49:17 GMT
content-type: text/html; charset=utf-8
location: https://anaconda.org/conda-forge/repo?type=conda&label=main
…
server: cloudflare

The fact that the query above is not getting through suggests there is some other kind of network issue. Not sure if that is somewhere within CI or some other infrastructure between that build and the CDN (like some security protocol?)

It might be worth trying some simple network diagnostics at this point outside of Conda to isolate issues like this
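
One way to run such a check outside of conda is to time a plain HEAD request with curl. The following is a minimal sketch, not something posted in this thread: the `probe` helper name and the 30-second default timeout are assumptions, and the URL in the usage comment is the one from the error message earlier in this issue.

```shell
#!/usr/bin/env bash
# probe URL [TIMEOUT_SECONDS]
# Hypothetical helper: send one HEAD request and print the HTTP status plus a
# timing breakdown, so DNS, TCP connect, and TLS delays can be told apart.
probe() {
  local url="$1" timeout="${2:-30}"
  curl --silent --head --output /dev/null \
    --connect-timeout "${timeout}" --max-time "${timeout}" \
    --write-out 'url=%{url_effective} http=%{http_code} dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' \
    "${url}"
}

# Example (run from the cluster node that shows the failure):
#   probe https://conda.anaconda.org/conda-forge/noarch/repodata.json.zst
```

A result of `http=000` means curl received no HTTP response at all. A large `dns` or `connect` value would suggest looking at name resolution or routing from the node rather than at conda itself.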

cjac commented 3 weeks ago

Looking for: ["cuda-version[version='>=12,<13']", "rapids[version='>=24.08']", "dask[version='>=2024.7']", 'dask-bigquery', 'dask-ml', 'dask-sql', 'cudf', 'numba', "python[version='>=3.11']"]

conda-forge/linux-64      
+ sync
+ [[ 1 == \0 ]]
+ test -d /opt/conda/miniconda3/envs/dask-rapids
+ /opt/conda/miniconda3/bin/conda config --set channel_priority flexible
+ for installer in "${mamba}" "${conda}"
+ /opt/conda/miniconda3/bin/conda create -m -n dask-rapids -y --no-channel-priority -c conda-forge -c nvidia -c rapidsai 'cuda-version>=12,<13' 'rapids>=24.08' 'dask>=2024.7' dask-bigquery dask-ml dask-sql cudf numba 'python>=3.11'

real    1m19.604s
user    0m0.326s
sys     0m0.048s
+ retval=1
+ cat /mnt/shm/install.log
Collecting package metadata (current_repodata.json): ...working... failed

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/current_repodata.json>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https://conda.anaconda.org/conda-forge/linux-64'
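
The error text above suggests that a simple retry is often enough. A minimal sketch of a retry wrapper with exponential backoff, assuming bash. The `retry` function name and its argument order are hypothetical and are not part of conda or of the initialization-actions scripts.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, doubling the sleep between attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "${delay}"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

Prefixing the `conda create` call from the log with `retry 5 10` would try it up to five times, waiting 10, 20, 40, and 80 seconds between attempts. A wrapper like this only hides transient failures; it does not explain why the connection fails in the first place.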

cjac commented 3 weeks ago

Hello folks, it looks like this is becoming a problem. I'm sorry for swamping your service. Let's get a regional conda mirror set up as part of the product I'm producing. Can you please direct me to the best instructions on mirroring the full conda archive? I will work on bringing up a load balancer to direct the traffic to our local mirror and take that load off of your infrastructure.

jakirkham commented 3 weeks ago

Were you able to run the command suggested above ( https://github.com/conda/infrastructure/issues/1051#issuecomment-2439170923 )?

It would be good to know if Cloudflare (the CDN provider used for conda-forge) is actually used in your case or not

cjac commented 3 weeks ago

oops! Sorry, I think I missed that.

cjac commented 3 weeks ago

curl -I https://conda.anaconda.org/conda-forge

Oh, sorry! I didn't know you were asking me to run that command from the context of one of the cluster nodes being installed to. Here is that output now.

cjac@cluster-1718310842-m:~$ curl -I https://conda.anaconda.org/conda-forge 
HTTP/2 302 
date: Thu, 31 Oct 2024 23:09:00 GMT
content-type: text/html; charset=utf-8
location: https://anaconda.org/conda-forge/repo?type=conda&label=main
cf-ray: 8db74f6c1bcb3101-LAX
cf-cache-status: DYNAMIC
strict-transport-security: max-age=15552000
content-security-policy: frame-ancestors 'self';
referrer-policy: no-referrer
x-content-type-options: nosniff
x-download-options: noopen
set-cookie: __cf_bm=.Is3CsF554BOaHnScWmISSkVQpl6Bnrsas5J5UFGXA0-1730416140-1.0.1.1-frOx3IudLF.K9RCGwdQgrurX.DlFsI1LpQNoPNEVzapNXoP9UU6rFC_QbyLo8sSWoJo_WsjrXuKfy9c8eZNFr2JQAS9.bH7bdHdxG0ZAoGw; path=/; expires=Thu, 31-Oct-24 23:39:00 GMT; domain=.anaconda.org; HttpOnly; Secure; SameSite=None
server: cloudflare

cjac commented 3 weeks ago

Do these channels make a difference? Are those mirrored as well?

-c conda-forge -c nvidia -c rapidsai

cjac commented 3 weeks ago

This looks like it might be what I need:

https://pypi.org/project/conda-mirror/

jakirkham commented 3 weeks ago

Sorry for being unclear. Thanks for the info! 🙏

Ok so you are able to reach the CDN through curl. Would think conda should as well. IOW it doesn't look like a networking issue

Both conda-forge and nvidia are on the CDN

Currently rapidsai is not, but we plan to fix that: https://github.com/conda/infrastructure/issues/1055

Let's see if someone can help before going down the mirroring route

@jezdez could you please help us look into this?

cjac commented 3 weeks ago

okay. I started the mirroring route because it might be faster to have a local copy. Let me compare and let you know whether it's too much effort to maintain a mirror for use with my reproduction environment.

I've got a couple of files in my example. sync-mirror.sh is run on an instance created using create-conda-mirror.sh.

Please pardon the mess. I re-used some code I was using for a different purpose. The docs that I read about mirrors suggested that attaching GPUs to the mirror host might help accelerate things, too, so I used the latest rapids image and attached 4x T4s.

cjac commented 3 weeks ago

wow. It looks like I got cut off.

root@dpgce-conda-mirror-us-west4:~# links https://conda.anaconda.org/defaults/linux-64/repodata.json

+ /opt/conda/miniconda3/bin/conda-mirror -v --upstream-channel=conda-forge --upstream-channel=rapidsai --upstream-channel=nvidia --upstream-channel=defaults --platform=linux-64 --temp-directory=/mnt/shm --target-directory=/var/www/html --num-threads=7
Log level set to WARNING
Traceback (most recent call last):
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/conda_mirror/conda_mirror.py", line 635, in get_repodata
    resp.raise_for_status()
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/repodata.json

cjac commented 3 weeks ago

It looks like I was attempting to mirror portions of the repo that I don't need and won't help our cache.

The current implementation looks promising. The first attempt produced a mirror of about 120 GB, which I think may have been the nvidia channel alone: I passed the --upstream-channel argument several times, and only the last one took effect.

Having learned from this mistake, I have split the earlier single (and incorrect) conda-mirror call into concurrent conda-mirror calls, each in its own screen tab. Since this is a long-running process, it's probably best not to have it fail when a terminal is detached. Once all of the tabs have completed, the screen session terminates and returns control to the sync-mirror.sh shell process.
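
The concurrent calls described above can be sketched roughly as follows. This is not the contents of sync-mirror.sh: it uses plain background jobs and `wait` in place of screen tabs, the function names are hypothetical, and the per-channel subdirectory under the target directory is an assumption. The conda-mirror flags are the ones shown in the command earlier in this thread.

```shell
#!/usr/bin/env bash
# Assumed defaults; override through the environment as needed.
MIRROR_BIN="${MIRROR_BIN:-conda-mirror}"
TARGET_ROOT="${TARGET_ROOT:-/var/www/html}"
TEMP_DIR="${TEMP_DIR:-/mnt/shm}"

# mirror_channel CHANNEL: run one conda-mirror process for a single channel.
mirror_channel() {
  local channel="$1"
  "${MIRROR_BIN}" -v \
    --upstream-channel="${channel}" \
    --platform=linux-64 \
    --temp-directory="${TEMP_DIR}" \
    --target-directory="${TARGET_ROOT}/${channel}" \
    --num-threads=7
}

# mirror_all CHANNEL...: start one background process per channel, wait for
# all of them, and return non-zero if any of them failed.
mirror_all() {
  local pids=() channel pid status=0
  for channel in "$@"; do
    mirror_channel "${channel}" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "${pid}" || status=1
  done
  return "${status}"
}

# Example: mirror_all conda-forge rapidsai nvidia
```

Unlike a screen session, background jobs do not survive a dropped terminal on their own, so a script like this would still need to be started under screen, tmux, or nohup for a multi-hour run.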

I am about 20 minutes into this latest run. It picked up in the mirroring where it had left off despite the deletion of the previous VM that had been running it. I increased the memory and CPU count so that it can accommodate three concurrent conda-mirror processes. Here's a snapshot of disk usage.

root@dpgce-conda-mirror-us-west4:~# df -h /var/www/html
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         15T  130G   15T   1% /var/www/html

cjac commented 3 weeks ago

This question has moved to a different forum.

jezdez commented 3 weeks ago

wow. It looks like I got cut off.

root@dpgce-conda-mirror-us-west4:~# links https://conda.anaconda.org/defaults/linux-64/repodata.json

+ /opt/conda/miniconda3/bin/conda-mirror -v --upstream-channel=conda-forge --upstream-channel=rapidsai --upstream-channel=nvidia --upstream-channel=defaults --platform=linux-64 --temp-directory=/mnt/shm --target-directory=/var/www/html --num-threads=7
Log level set to WARNING
Traceback (most recent call last):
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/conda_mirror/conda_mirror.py", line 635, in get_repodata
    resp.raise_for_status()
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/repodata.json

https://conda.anaconda.org/main/linux-64/repodata.json is the correct repodata URL for Anaconda Distribution

jezdez commented 3 weeks ago

@cjac I'm not aware of any throttling from GCP. The original issue seems to have been a transient connection error. Is this really still happening from GCP? The channels are hosted on Cloudflare CDN.

For the other questions: if this relates to commercial support for GCP-related services, this isn't the right repo in which to raise an issue. Please reach out through your Anaconda support channels instead.

cjac commented 3 weeks ago

I have not tried to reproduce the issue yet. I'm going to finish building a mirror and use a locally mounted filesystem with the packages on it to provide the conda-forge, rapidsai and nvidia channels.

Once the mirror is up, probably by Monday, I will try the build of the rapids image again, this time using file:///var/www/html/«channel» instead of https://conda.anaconda.org/«channel».
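
A minimal sketch of that substitution, assuming bash and one directory per channel under the mirror root. The `channel_args` helper is hypothetical, and falling back to the upstream URL when a channel has not been mirrored is a design choice of this sketch rather than something described in this thread.

```shell
#!/usr/bin/env bash
# channel_args MIRROR_ROOT CHANNEL...
# Hypothetical helper: print "-c" and a channel URL on alternating lines.
# A channel that exists under MIRROR_ROOT is served from file://, any other
# channel falls back to https://conda.anaconda.org/.
channel_args() {
  local root="${1%/}"
  shift
  local channel
  for channel in "$@"; do
    if [[ -n "${root}" && -d "${root}/${channel}" ]]; then
      printf -- '-c\nfile://%s\n' "${root}/${channel}"
    else
      printf -- '-c\nhttps://conda.anaconda.org/%s\n' "${channel}"
    fi
  done
}

# Example:
#   mapfile -t args < <(channel_args /var/www/html conda-forge nvidia rapidsai)
#   conda create -n dask-rapids -y "${args[@]}" 'rapids>=24.08' 'dask>=2024.7'
```

Building the channel list in one place keeps the install command the same whether or not the local mirror is mounted.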

I can then share the example instruction on how to build and utilize a conda mirror, and close this issue.

cjac commented 1 week ago

The mirror has been built, but it seems conda does an extra write of ~15G to the temp directory, much of which could be skipped when the source is on a file:// path.

In any case, the code which I used to build the anaconda mirror can be found here:

https://github.com/cjac/dataproc-repro/blob/conda-mirror-20241031/lib/mirror/sync-conda.pl

On a 96-core machine, I believe that it could mirror the channels we use in about 8 hours.
