conda-incubator / conda-store

Data science environments, for collaboration. ✨
https://conda.store
BSD 3-Clause "New" or "Revised" License

[BUG] - Updating existing environment fails w/ network error from conda_store_server #744

Open gzt5142 opened 5 months ago

gzt5142 commented 5 months ago

Describe the bug

Changing the composition of an existing conda environment, built with conda-store as part of a Nebari deployment, fails with a network error.

Expected behavior

The environment should build with the updated YAML and produce the lock and log artifacts, just as if it were a new environment.

How to Reproduce the problem?

  1. Create a new environment in the global namespace with the name pangeo (this succeeds).
  2. Edit the pangeo environment to add a new package dependency in the YAML description.
  3. Launch the build
  4. Wait -- conda-store will be in building mode for about 15 minutes
  5. The build fails. (stack trace below).
  6. Use the UI to delete the pangeo environment.
  7. Create a new environment in the global namespace with the name pangeo (same as above).
  8. Launch the build
  9. Wait -- conda-store will be in building mode for about 15 minutes
  10. The build fails.

Output

The conda-store log includes the stack trace for the failure:

traceback.txt

Versions and dependencies used.

Anything else?

This installation is on a sequestered network with a strict network policy: most traffic is blocked except for specific whitelisted connections. We have opened ports and hosts to allow conda builds from specific channels. These connections and builds work well for new environments, but builds fail when we attempt to edit existing environments.

pavithraes commented 5 months ago

@gzt5142 Thank you for opening this issue, and welcome to the repository. :)

I'm not able to reproduce this issue on a regular Nebari deployment. Could you please share some more details about the whitelist of connections and a YAML spec that fails for you, so that we can try reproducing and isolating the issue?

gzt5142 commented 5 months ago

... Could you please share some more details about the whitelist of connections and a YAML spec that fails for you, so that we can try reproducing and isolating the issue?

I have been able to reproduce this problem with a minimal YAML. Start with just xarray, for example.

channels:
  - conda-forge
dependencies:
  - python=3.10.*
  - xarray

Then modify this environment to add any new dependency. When I rebuild the environment, I get the same failure after a network timeout.

channels:
  - conda-forge
dependencies:
  - python=3.10.*
  - xarray
  - cf_xarray

Furthermore, if I build an environment (call it "Test" in the "global" namespace) with that first YAML containing only xarray, then delete that environment using the conda-store UI, I can no longer build a new environment with the same name ("Test" in "global"), regardless of the YAML content. This happens even if I use the same initial YAML with just xarray.

As for the network whitelist, our security people don't really like it when we disclose configuration, so I'm reluctant to give details. We have whitelisted one or two conda channels (including conda-forge). Almost anything else will be blocked.

But the whitelist shouldn't matter in this case -- I can build a new environment without any issue; the connection to the conda channel works perfectly. The error appears only when I try to update an existing environment, or when I try to re-create an environment with the same name.

So the question I'm stuck on is: what additional network requests are made for a re-build that are not made for an initial build?

--gt

gzt5142 commented 5 months ago

@pavithraes

Could this be related to conda-lock ???

I notice that conda-lock makes a sneaky network connection to

https://raw.githubusercontent.com/regro/cf-graph-countyfair/master/mappings/pypi/grayskull_pypi_mapping.yaml

for a lookup function. If this is not done for initial builds (but is done for rebuilds), then that may cause this timeout -- assuming that raw.githubusercontent.com is not on the whitelist for network connections.

Would conda-lock be invoked for a re-creation of an environment with an old name?
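One way to test this hypothesis from the affected host is to probe each endpoint with a short timeout and see which ones the network policy lets through. The following is only a minimal sketch: the first URL is the one quoted above, the second is an assumed example of a conda-forge channel endpoint (substitute whatever is actually whitelisted), and the `probe` and `main` helpers are hypothetical names, not part of conda-store or conda-lock.

```python
import urllib.error
import urllib.request

# The first URL is the mapping file quoted in this thread. The second is an
# ASSUMED example of a channel endpoint; replace it with the whitelisted one.
URLS = [
    "https://raw.githubusercontent.com/regro/cf-graph-countyfair/master/mappings/pypi/grayskull_pypi_mapping.yaml",
    "https://conda.anaconda.org/conda-forge/noarch/repodata.json",
]


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Send a HEAD request and describe whether the host answered in time."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return f"ok ({response.status})"
    except urllib.error.HTTPError as exc:
        # The server answered, so the connection itself is not blocked.
        return f"reachable, HTTP {exc.code}"
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        return f"blocked: {exc}"


def main():
    """Print one line per URL; call this from the conda-store worker host."""
    for url in URLS:
        print(url, "->", probe(url))
```

Calling `main()` from the same host (or container) that runs the conda-store worker should show quickly whether raw.githubusercontent.com times out while the channel URL succeeds. The `opener` parameter exists only so the function can be exercised without network access.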

--gt

nkaretnikov commented 4 months ago

Thanks for the detailed report! I'll take a look at what requests are made.

nkaretnikov commented 4 months ago

@gzt5142

Do you have access to the terminal on that machine? Could you try building and rebuilding the same environment with pure conda, and a slightly edited environment as well?
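A rough sketch of that experiment, driven from Python so each step is timed, might look like the following. The environment name and the two spec file names are placeholders, and the `conda_commands` and `run_all` helpers are hypothetical; only the `conda env create`, `conda env remove`, and `conda env update` subcommands come from conda itself.

```python
import subprocess
import time


def conda_commands(name, first_spec, edited_spec):
    """Build, delete, rebuild, then update an environment with plain conda."""
    return [
        ["conda", "env", "create", "--name", name, "--file", first_spec],
        ["conda", "env", "remove", "--name", name, "--yes"],
        ["conda", "env", "create", "--name", name, "--file", first_spec],
        ["conda", "env", "update", "--name", name, "--file", edited_spec],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command and record (command, return code, seconds elapsed)."""
    results = []
    for cmd in commands:
        start = time.monotonic()
        completed = runner(cmd, capture_output=True, text=True)
        elapsed = time.monotonic() - start
        results.append((" ".join(cmd), completed.returncode, elapsed))
    return results
```

For example, `run_all(conda_commands("test", "xarray.yaml", "xarray-edited.yaml"))` would show whether the rebuild or the update step is the one that stalls or returns a non-zero code when conda-store is taken out of the picture.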

In conda-store, I don't think there's any difference server-side in terms of whether you're building from scratch or after editing.

I wonder if it's due to caching that conda does. Because you have a proxy configured on your system, as mentioned here: #767, so maybe these urls get stored in the package cache and conda cannot resolve them.

https://redacted@nexus.internal.host/repository/conda-forge/linux-64/python_abi-3.11-4_cp311.conda

gzt5142 commented 3 months ago

Because you have a proxy configured on your system, as mentioned here: #767, so maybe these urls get stored in the package cache and conda cannot resolve them.

I think there may be some misunderstanding here. We tried the proxy, and abandoned it in favor of whitelisting a conda channel (as described above) and allowing direct connections. No proxy is involved in this issue.