Rebuild for numpy 2.0; use larger runners on GPU server

h-vetinari commented 4 months ago

2.3.0 should be compatible with numpy 2.0 already; for some reason, the bot stumbles over the templated outputs here (probably an old failure before stdlib was added), so let's do it by hand.

conda-forge-webservices[bot] commented 4 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe:

It looks like the 'libtorch' output doesn't have any tests.

beckermr commented 4 months ago

@h-vetinari Can you link to the log error on the bot? I want to investigate to see why it cannot process this recipe.

h-vetinari commented 4 months ago

> @h-vetinari Can you link to the log error on the bot? I want to investigate to see why it cannot process this recipe.

This is what was on the status page:

bot error ([bot CI job](https://github.com/regro/cf-scripts/actions/runs/8962449813)): main: Traceback (most recent call last):
  File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/auto_tick.py", line 1271, in _run_migrator
    migrator_uid, pr_json = run(
                            ^^^^
  File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/auto_tick.py", line 237, in run
    migrator.run_pre_piggyback_migrations(recipe_dir, feedstock_ctx.attrs, **kwargs)
  File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/migrators/core.py", line 268, in run_pre_piggyback_migrations
    mini_migrator.migrate(recipe_dir, attrs, **kwargs)
  File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/migrators/cstdlib.py", line 206, in migrate
    sections = _slice_into_output_sections(lines, attrs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/migrators/libboost.py", line 48, in _slice_into_output_sections
    raise RuntimeError("Could not find all output sections in meta.yaml!")
RuntimeError: Could not find all output sections in meta.yaml!

h-vetinari commented 4 months ago

In the CPU megabuild, we built libtorch and several pytorches, but then failed on the py38 build with:

  CMake Error in torch/CMakeLists.txt:
    Imported target "numpy::numpy" includes non-existent path

      "/home/conda/feedstock_root/build_artifacts/libtorch_1715855634503/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.8/site-packages/numpy/_core/include"

    in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

    * The path was deleted, renamed, or moved to another location.

    * An install or uninstall procedure did not complete successfully.

    * The installation package was faulty and references files it does not
    provide.

Also, we haven't caught a 4xl runner in over 10h, so I'll try if 2xl works better...
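
On the CMake error above: one plausible reading (an assumption; the thread does not confirm it) is that the imported target points at the numpy 2.0 header layout, `numpy/_core/include`, while a py38 environment can only hold numpy 1.x, which ships its headers under `numpy/core/include`. A minimal sketch of probing both layouts instead of hard-coding one follows. `find_numpy_include` is a hypothetical helper, not something from the recipe; inside a live environment, `numpy.get_include()` is the supported way to ask.

```python
from pathlib import Path
from typing import Optional

# Header locations relative to site-packages: numpy 2.x first, then numpy 1.x.
CANDIDATES = ("numpy/_core/include", "numpy/core/include")


def find_numpy_include(site_packages: Path) -> Optional[Path]:
    """Return the first numpy header directory that exists, or None."""
    for rel in CANDIDATES:
        candidate = site_packages / rel
        if candidate.is_dir():
            return candidate
    return None
```

Called with the `site-packages` directory of the host environment, this returns whichever layout is actually present, so the same build logic would work for both numpy major versions.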

beckermr commented 4 months ago

Oh interesting. The bot issued a PR now: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/239

beckermr commented 4 months ago

Ahhh that error message is old. When I put in the PR to improve that section of the bot, I changed the error message. See the code here: https://github.com/regro/cf-scripts/blob/master/conda_forge_tick/migrators/libboost.py#L107.

conda-forge-webservices[bot] commented 4 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

h-vetinari commented 4 months ago

> Ahhh that error message is old.

That's why I had said in the OP "probably an old failure before stdlib was added", because once that string is found in the recipe, the piggyback isn't even attempted. :)

beckermr commented 4 months ago

Ahhhh thanks @h-vetinari. I didn't grok your comment fully! sorry!

h-vetinari commented 4 months ago

Thanks for the help @isuruf!

Sidenote @beckermr: I don't understand why the two jobs with the green checkmarks aren't skipped completely (as in: smithy shouldn't generate a job for them at all; all outputs are skipped for them AFAICT).

beckermr commented 4 months ago

Which ci support files?

FWIW, I've seen this happen in various feedstocks, both before and after the changes I made recently.

h-vetinari commented 4 months ago

> Which ci support files?

The two with the green check marks... because they get skipped immediately. IOW:

linux_64_blas_implgenericc_compiler_version11cuda_compilernvcccuda_compiler_version11.8cxx_compiler_version11
linux_64_blas_implgenericc_compiler_version12cuda_compilercuda-nvcccuda_compiler_version12.0cxx_compiler_version12

Neither of these variant configs should ever be created, due to

skip: true  # [cuda_compiler_version != "None" and linux64 and blas_impl != "mkl"]

which is applied for all outputs/stages.
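
To make the claim concrete, here is a rough sketch of how that selector evaluates for the two configs named above. It assumes the selector is treated as a plain Python expression over the variant keys (conda-build selectors work along these lines, but this is not conda-build's or smithy's actual code). The `cpu mkl` entry is a hypothetical variant added only for contrast, and `is_skipped` is an illustrative helper.

```python
# The selector from the recipe's `skip: true` line.
SELECTOR = 'cuda_compiler_version != "None" and linux64 and blas_impl != "mkl"'

# Variant values read off the two config names above, plus one
# hypothetical CPU/mkl variant for comparison.
VARIANTS = {
    "cuda 11.8": {"cuda_compiler_version": "11.8", "linux64": True, "blas_impl": "generic"},
    "cuda 12.0": {"cuda_compiler_version": "12.0", "linux64": True, "blas_impl": "generic"},
    "cpu mkl": {"cuda_compiler_version": "None", "linux64": True, "blas_impl": "mkl"},
}


def is_skipped(selector: str, variant: dict) -> bool:
    """Evaluate a selector expression against one variant's keys."""
    # No builtins are exposed; only the variant keys are visible.
    return bool(eval(selector, {"__builtins__": {}}, dict(variant)))


for name, variant in VARIANTS.items():
    print(name, "skipped" if is_skipped(SELECTOR, variant) else "built")
```

Both CUDA configs with `blas_impl` set to `generic` evaluate to skipped, which is why one would expect no job to be generated for them in the first place.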

isuruf commented 4 months ago

> Also, we haven't caught a 4xl runner in over 10h, so I'll try if 2xl works better...

You need to request access for those. See https://github.com/conda-forge/admin-requests/blob/main/examples/example-open-gpu-server.yml

h-vetinari commented 4 months ago

Hm, this is still failing to get the 2xl runners due to:

{
    "error": "User not authorized for the requested runners, user not present in 'users' list."
}

I don't see how https://github.com/Quansight/open-gpu-server/blob/main/access/conda-forge-users.json would distinguish user rights between different runners. Are we the first ones to try something with 2xl runners?

Edit: for posterity, this required https://github.com/conda-forge/.cirun/pull/12

h-vetinari commented 4 months ago

Alright, this is back to fully green CI, no manual builds required! 🥳

PTAL @conda-forge/pytorch-cpu :)

h-vetinari commented 4 months ago

Just noticing the following

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'bin/torch_shm_manager'
has an incorrect size.
  reported size: 43368 bytes
  actual size: 49545 bytes
Verifying transaction: ...working... done

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libc10.so'
has an incorrect size.
  reported size: 1094504 bytes
  actual size: 1158313 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libc10_cuda.so'
has an incorrect size.
  reported size: 614840 bytes
  actual size: 646881 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libcaffe2_nvrtc.so'
has an incorrect size.
  reported size: 21088 bytes
  actual size: 26289 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libshm.so'
has an incorrect size.
  reported size: 48192 bytes
  actual size: 56665 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libtorch.so'
has an incorrect size.
  reported size: 14816 bytes
  actual size: 17129 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libtorch_cpu.so'
has an incorrect size.
  reported size: 210126256 bytes
  actual size: 214128009 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libtorch_cuda.so'
has an incorrect size.
  reported size: 1343656712 bytes
  actual size: 1353944689 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libtorch_cuda_linalg.so'
has an incorrect size.
  reported size: 845552 bytes
  actual size: 901641 bytes

SafetyError: The package for libtorch located at /home/conda/feedstock_root/build_artifacts/pkg_cache/libtorch-2.3.0-cuda118_h9e56e6c_301
appears to be corrupted. The path 'lib/libtorch_global_deps.so'
has an incorrect size.
  reported size: 14984 bytes
  actual size: 17201 bytes

I think we might need to clean some build caches in the megabuild.
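
For anyone wanting to reproduce the check locally: a conda package records each file's size under `size_in_bytes` in `info/paths.json`, and the errors above are what a mismatch against the files on disk looks like. A minimal sketch of the same comparison, assuming an already-extracted package directory; `find_size_mismatches` is a hypothetical helper, not conda's implementation.

```python
import json
from pathlib import Path
from typing import List, Tuple


def find_size_mismatches(pkg_dir: Path) -> List[Tuple[str, int, int]]:
    """Return (path, reported_size, actual_size) for files that differ from the manifest."""
    manifest = json.loads((pkg_dir / "info" / "paths.json").read_text())
    mismatches = []
    for entry in manifest["paths"]:
        reported = entry.get("size_in_bytes")
        target = pkg_dir / entry["_path"]
        # Skip entries without a recorded size or without a regular file on disk.
        if reported is None or not target.is_file():
            continue
        actual = target.stat().st_size
        if actual != reported:
            mismatches.append((entry["_path"], reported, actual))
    return mismatches
```

Pointing this at the extracted `libtorch-2.3.0-cuda118_h9e56e6c_301` directory in `pkg_cache` should list the same files as the log, if the cached contents really were overwritten after the manifest was written.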

conda-forge-webservices[bot] commented 4 months ago

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe) and found some lint.

Here's what I've got...

For recipe:

Failed to even lint the recipe, probably because of a conda-smithy bug :cry:. This likely indicates a problem in your meta.yaml, though. To get a traceback to help figure out what's going on, install conda-smithy and run conda smithy recipe-lint . from the recipe directory.

conda-forge-webservices[bot] commented 4 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

h-vetinari commented 4 months ago

Luckily, the SafetyErrors are gone now too. Anything else you'd want here @hmaarrfk? 🙃

hmaarrfk commented 4 months ago

Thanks @h-vetinari, this looks great! I think it addresses most parties' concerns.

We can consider the build refactor in a separate PR.

@isuruf so this comment doesn't get lost, do take a chance to consider it if you ever happen to come back to this PR: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/238#discussion_r1606051235
