Attempt microarch build

count0 commented 4 months ago

Checklist

[ ] Used a personal fork of the feedstock to propose changes
[ ] Bumped the build number (if the version is unchanged)
[ ] Reset the build number to 0 (if the version changed)
[ ] Re-rendered with the latest conda-smithy (Use the phrase code>@<space/conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
[ ] Ensured the license file is being packaged.

conda-forge-webservices[bot] commented 4 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

stuarteberg commented 4 months ago

@conda-forge-admin please rerender

count0 commented 4 months ago

So, the CI is failing with things like

The following packages are incompatible
└─ _x86_64-microarch-level >=4  is not installable because there are no viable options
   ├─ _x86_64-microarch-level 4 would require
   │  └─ __archspec 1 cascadelake, which is missing on the system;

Is this something that can be fixed by somehow installing the archspec, or is it a no go since the builder has the wrong architecture?

stuarteberg commented 4 months ago

It seems I was mistaken about the ability to cross-compile, at least with this mechanism. We may need to ask for help or clarification.

count0 commented 4 months ago

If each builder would cross-compile to all microarchs this would certainly blow the 6h time limit... So, the only viable alternative would be to have one builder per microarch, but that would mean more than a hundred builders. So, I'm not sure this will work.

stuarteberg commented 4 months ago

Yeah, the docs do say:

Adding microarchitecture variants can result in too many entries in the build matrix. Do not overuse it.

One compromise could be to enable the multi-arch build for only recent versions of Python.

stuarteberg commented 4 months ago

Regarding how we might enable level 4 on a CI machine that doesn't have a compatible CPU, I think there is a way, albeit slightly convoluted.

Keep conda_build_config.yaml as you have it now. It will still launch a build in which microarch_level == 4.
But we won't actually use the x86_64-microarch-level package for that build, since it would require that the build machine supports that CPU.
Since we didn't install that package:
- We have to explicitly add -march=x86-64-v4 to CXXFLAGS ourselves.
- We have to explicitly add the appropriate run requirement (normally the run_exports would have done this for us)

# meta.yaml

requirements:
  build:
    ...
    - x86_64-microarch-level {{ microarch_level }}  # [unix and x86_64 and microarch_level < 4]
  run:
    ...
    # Note the leading underscore
    # See https://github.com/conda-forge/microarch-level-feedstock/blob/main/recipe/meta.yaml
    - _x86_64-microarch-level {{ microarch_level }}  # [unix and x86_64 and microarch_level == 4]

# build.sh
if [[ "${microarch_level}" == "4" ]]; then
    CXXFLAGS="${CXXFLAGS} -march=x86-64-v4"
fi

I'm not sure if that would work, but that's my best guess.

count0 commented 4 months ago

It seems to be going forward.

But there are some issues still:

For aarch64 we get:

Error: unrecognized option -march=x86-64-v4
cc1plus: error: unknown value 'x86-64-v4' for '-march'
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8.7-a armv8.8-a armv8-r armv9-a

so, we need to disable the microarchs for that arch, since it doesn't make sense. How do we do that? (Or maybe we don't need to; see below).

I'm seeing things like -march=core2 -mtune=haswell -mssse3 being added to CXXFLAGS, so maybe we do not need to add -march=x86-64-vX by hand at all?
Clang on MacOS is not cooperating. During the configure checks we get:
```
/configure: line 2052: 12563 Illegal instruction: 4  ./conftest$ac_exeext
```
So, the CGAL check is failing since the binary cannot be run. :-( Can we enforce GCC in this case? Although, the proper solution in this case is to change the test so that only successful compilation is needed, not execution.
It seems like -O2 is being appended to CXXFLAGS along the way, but we should have -O3, otherwise the whole thing is moot. Do you know where this override is coming from?

stuarteberg commented 4 months ago

I'm seeing things like -march=core2 -mtune=haswell -mssse3 being added to CXXFLAGS, so maybe we do not need to add -march=x86-64-vX by hand at all?

Right, the x86_64-microarch-level package introduces that flag (by placing a script into the build environment's etc/conda/activate.d/). The only time we need to add it by hand is when we declined to install that package in the build environment, i.e. for level 4.

So, the CGAL check is failing since the binary cannot be run. :-( Can we enforce GCC in this case? Although, the proper solution in this case is to change the test so that only successful compilation is needed, not execution.

I think we need to stick with the conda-provided compiler on osx, which means sticking with clang. I think the "proper" solution is the best option, even if we need to implement it with a patch. In fact, we already have a patch for this. See meta.yaml:

source:
  patches:
    # When cross-compiling, don't let configure run a CGAL test program.
    # Just trust the location we pass via --with-cgal
    - patches/skip-cgal-check.patch  # [build_platform != target_platform]

... so we just need to expand that condition in the "selector":

    - patches/skip-cgal-check.patch  # [build_platform != target_platform or microarch_level == 4]

(Or we could just remove the selector entirely, in which case the patch is always active and the cgal check would never be executed.)

It seems like -O2 is being appended to CXXFLAGS along the way, but we should have -O3, otherwise the whole thing is moot. Do you know where this override is coming from?

I'm not 100% certain about this, but I think it's coming from the following places on linux and mac, respectively:

This might sound hacky, but I suppose we could just strip out O2 and replace it with O3 in our own build. Apparently even the Python feedstock does this under certain configurations.

stuarteberg commented 4 months ago

BTW, is it even necessary to strip out the -O2? From what I've read, one can simply append -O3 in which case gcc (and presumably clang, too) will just ignore the earlier -O2 flag.

count0 commented 4 months ago

Right, the x86_64-microarch-level package introduces that flag (by placing a script into the build environment's etc/conda/activate.d/). The only time we need to add it by hand is when we declined to install that package in the build environment, i.e. for level 4.

Oh, I misunderstood this. Fixed now.

I fixed also the CGAL test selector.

This might sound hacky, but I suppose we could just strip out O2 and replace it with O3 in our own build.

I'm totally fine with this.

count0 commented 4 months ago

BTW, is it even necessary to strip out the -O2? From what I've read, one can simply append -O3 in which case gcc (and presumably clang, too) will just ignore the earlier -O2 flag.

This is true, but I wasn't sure where exactly -O2 was being appended... It seemed to be happening at the very end.

count0 commented 4 months ago

This is true, but I wasn't sure where exactly -O2 was being appended... It seemed to be happening at the very end.

Also, they seem to pollute CPPFLAGS and LDFLAGS as well... I'm not sure why.

stuarteberg commented 4 months ago

If this ends up working, perhaps we should open an issue on the microarch-level-feedstock. Right now that feedstock produces two packages for each arch, such as:

x86_64-microarch-level a. Introduces the -march=x86-64-v${level} flag in CFLAGS etc. b. Depends on a specific __archspec virtual package
_x86_64-microarch-level a. Adds the appropriate run_exports

...but it seems like cross-compilation would be easier if they were to split up the functionality from 1.a and 1.b. into two separate packages, so we could easily obtain the correct CFLAGS without pulling in the __archspec dependency.

Update: I opened https://github.com/conda-forge/microarch-level-feedstock/issues/5

stuarteberg commented 4 months ago

Apparently I misremembered how that CGAL patch works. Upon reading the actual patch file, I see that we need something different. That patch relies on autoconf's ability to detect cross-compilation.

(I had assumed the patch was just commenting out some lines in the .m4 file somewhere.)

count0 commented 4 months ago

OK, good catch. Let's try now.

count0 commented 4 months ago

It seems the test

if [[ "${microarch_level}" == "4" ]]; then
    CXXFLAGS="${CXXFLAGS} -march=x86-64-v4"
fi

is being triggered by the builds:

linux linux_aarch64_numpy1.22python3.10.____cpython
osx osx_arm64_numpy1.22python3.8.____cpython
osx osx_arm64_numpy1.23python3.11.____cpython

But I don't get it...

stuarteberg commented 4 months ago

But I don't get it...

I agree. I can't figure out why microarch_level is defined at all in those build variants, since the selectors in conda_build_config.yaml should be eliminating it entirely. It must be a bug in conda-build. Maybe one workaround might be the following?

microarch_level:
  - 1
  - 3  # [unix and x86_64]
  - 4  # [unix and x86_64]

stuarteberg commented 4 months ago

Meanwhile, the other builds got far enough along that the test step ran, and failed. We should disable testing entirely for level 4.

stuarteberg commented 4 months ago

We need to disable level-4 testing in both graph-tool-base and graph-tool.

https://github.com/conda-forge/graph-tool-feedstock/pull/140/commits/7285f9e5fe88a402d90cd1bfaf30a9c6ee8ac72e#diff-f3725a55bf339595bf865fec73bda8ac99f283b0810c205442021f29c06eea9aL124-L128

count0 commented 4 months ago

Will do, but the flag -march=x86-64-v4 is still seeping in for osx osx_arm64_numpy1.22python3.8.____cpython somehow...

stuarteberg commented 4 months ago

Oh we probably have to rerender after making that change to conda_build_config.yaml.

@conda-forge-admin please rerender

stuarteberg commented 4 months ago

I'm not sure what was taking the bot so long, so I ran the re-render myself.

stuarteberg commented 4 months ago

I downloaded some of the built artifacts from this PR, and tried installing them on my own machine (by following these instructions). They seem to work as expected!

For the record, here's what I did:

You can see which architecture your machine has via:

conda info | grep archspec

And you can figure out which level that corresponds to using the following table, which I generated from this file.

Click here for table of microarch names and levels

``` microarchitecture level family 0 x86_64 1 x86_64 1 nocona 1 x86_64 2 core2 1 x86_64 3 k10 1 x86_64 4 x86_64_v2 2 x86_64 5 nehalem 2 x86_64 6 westmere 2 x86_64 7 sandybridge 2 x86_64 8 ivybridge 2 x86_64 9 bulldozer 2 x86_64 10 piledriver 2 x86_64 11 steamroller 2 x86_64 12 x86_64_v3 3 x86_64 13 haswell 3 x86_64 14 broadwell 3 x86_64 15 skylake 3 x86_64 16 mic_knl 3 x86_64 17 cannonlake 3 x86_64 18 excavator 3 x86_64 19 zen 3 x86_64 20 zen2 3 x86_64 21 zen3 3 x86_64 22 x86_64_v4 4 x86_64 23 skylake_avx512 4 x86_64 24 cascadelake 4 x86_64 25 icelake 4 x86_64 26 sapphirerapids 4 x86_64 27 zen4 4 x86_64 28 ppc64le 8 ppc64le 29 power8le 8 ppc64le 30 power9le 9 ppc64le 31 power10le 10 ppc64le ```

I have access to a Mac with a skylake CPU (level 3) and a Linux machine with sapphirerapids CPU.

On the Mac, I downloaded both a level3 and level4 package and tried them each. As expected, level4 could NOT be installed on my level-3 machine:

cd conda_artifacts_20240528.12.1_osx_64_microarch_level4numpy1.26python3.12.____cpython
conda create -n test-gt -c file://$(pwd) 'graph-tool-base=2.68=*402'

...

Could not solve for environment specs
The following package could not be installed
└─ graph-tool-base ==2.68 py312hef4c000_402 is not installable because it requires
   └─ _x86_64-microarch-level 4.*  but there are no viable options

But level3 CAN be installed, and it's the package that conda seems to prefer if I ask for graph-tool-base with no specific version or build string. And the library seems to function normally once installed.

On my linux machine, I tried a level4 package, and it worked. Just to be sure it actually uses AVX-512, I ran objdump -d on libgraph_tool_inference.so, and inspected the output as described here. Indeed, it seems to contain plenty of AVX-512 instructions.

stuarteberg commented 4 months ago

PS -- Here's a useful article (assuming it's correct) detailing which consumer CPUs (including Macs) support AVX-512 and thus could theoretically benefit from a level-4 build. It's not all that many models.

count0 commented 4 months ago

This is outstanding!

I think we benefit from level 3 substantially, even if level 4 is not widely available.

So, can we merge this?

stuarteberg commented 4 months ago

So, can we merge this?

Sounds good to me…

mbargull commented 4 months ago

Adding such builds unconditionally induces unnecessary strain on the infrastructure. Could you please heed https://conda-forge.org/docs/maintainer/knowledge_base/#microarch

Before learning how to use it, please read these considerations:

Adding microarchitecture variants can result in too many entries in the build matrix. Do not overuse it.

These optimized builds should only be used when the performance improvements are significant.

Preferrably, the project should rely on runtime dispatch for arch-specific optimizations.

If the package is already too large, consider using smaller outputs for the arch-optimized variants.

and reconsider the extend of your approach?

To put things into perspective: The single run https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=945053&view=results produced 2GB of package files and occupied about 74h of overall CI time (assuming I summed everything up correctly).

stuarteberg commented 4 months ago

@mbargull Happy to reconsider. I had also been wondering if this would be too much.

Before this PR, this feedstock ordinarily built 20 variants, resulting in ~1GB of storage. After this PR, that doubled to 40 variants and ~2GB storage.

One option we were considering is to restrict the platforms that get microarch-specific builds:

osx_64 should probably only support level 3. IIUC, I doubt many level-2 Macs are still in use, and very few level-4 MacBooks were ever made. Eliminates 10 variants. [Edit: See this wonderful wikipedia page for details.]
Python 3.8 (and perhaps Python 3.9) is old, so maybe we shouldn't bother with extra optimization. Eliminates 2 (or 4 variants).

That would bring us down to 26 variants in total (~1.3 GB storage).

corneliusroemer commented 2 months ago

osx_64 should probably only support level 3. IIUC, I doubt many level-2 Macs are still in use, and very few level-4 MacBooks were ever made. Eliminates 10 variants. [Edit: See this wonderful wikipedia page for details.]

@stuarteberg, beware that Rosetta2 doesn't support AVX (at least until macOS 15=Sequoia, still in beta), so if you build just a v3 for osx-64, there will be issues for those who use an osx-64 environment with Rosetta2.

Only recently, bioconda started adding osx-arm64 builds, so almost all environments that required at least one bioconda package had to use Rosetta2 with osx-64 environments.

conda-forge / graph-tool-feedstock

Attempt microarch build #140