conda-forge / cupy-feedstock

A conda-smithy repository for cupy.
BSD 3-Clause "New" or "Revised" License

[DO NOT MERGE] Test build time reduction #178

Closed leofang closed 2 years ago

leofang commented 2 years ago

Checklist

conda-forge-linter commented 2 years ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

leofang commented 2 years ago

@conda-forge-admin, please rerender

hmaarrfk commented 1 year ago

It's not clear to me how many arches you attempted to build here.

The patch is too large to skim on GitHub. Can you summarize? Did you still try to make one feedstock for all CUDA arches?

leofang commented 1 year ago

Hi @hmaarrfk, I generated the patch from https://github.com/cupy/cupy/pull/6941, so it may be easier to refer to the summary there. Note that I haven't even split the CUDA archs yet (and CuPy still supports archs as old as cc35), as that would need additional work: specifically, each template function needs to take the CC as a template parameter, so that you can get the function pointer to the correct template specialization. It's very tedious work.

But regardless of whether the CUDA archs are split-compiled, the lesson is the same: by splitting, we don't give the compiler a chance to reuse optimizations it has already done, and we basically redo everything from scratch for each TU (translation unit).

NVCC has in fact started working on compile-time reduction; see, e.g.,

so this is another reason such manual splitting is better avoided.

hmaarrfk commented 1 year ago

OK, thank you for the pointer. I'll read the references you provided.

My idea is more the following (and NVIDIA has likely already thought about this and written it off):

  1. Compile roughly 10 builds of ~5.5 hours each, one build per architecture (this is something we can do at conda-forge easily).
  2. Combine all 10 builds.

Even if it increases the total build time,

total ~55 hours >> 8 hours (estimated)

it is something that is "possible" given our infrastructure.
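The two-step idea above might be packaged as something like the following hypothetical recipe fragment (all package and output names are invented; whether the per-arch device code can actually be combined this way is the open question in the rest of this thread):

```yaml
# Hypothetical meta-package layout: one sub-package per compute
# architecture, each built on its own CI worker (~5.5 h each, in
# parallel), pulled together by a single run requirement list.
package:
  name: cupy
  version: 12.0.0

requirements:
  run:
    - cupy-core           # arch-independent Python layer, built once
    - cupy-kernels-sm70   # per-arch compiled kernels
    - cupy-kernels-sm80
```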

hmaarrfk commented 1 year ago

I think the fundamental problem with conda-forge's infrastructure is that we have 2 threads. So even if you try to do things "concurrently", you are limited to 2. I thought CMake and Ninja already run things in 2 parallel processes when they detect they can.

leofang commented 1 year ago

That's right. My feeling is the same. Whatever I did was eventually limited by the CI env.

hmaarrfk commented 1 year ago

Right, so I think my question in the other thread (and I'm happy to move the conversation there again) is:

  1. Can we build 10 (or so) packages, one for each compute architecture, on 10 CIs?
  2. Then combine them all in a meta package?

leofang commented 1 year ago

Not without significant code refactoring and manual stitching (see how I split it in https://github.com/cupy/cupy/pull/6941 to generate hundreds of tiny TUs; the diff is really a mess), and in the end I really don't know if it would work without someone working out a prototype solution first. The project (not package) maintainers must be on board for such messy changes, and in CuPy's case I couldn't even convince myself it's worth it, not to mention the team 🙂

leofang commented 1 year ago

In your case it's worse, I'd say, because of step 2:

2. Then combine them all in a meta package?

I have no idea how this can be done. Static linking, maybe?

hmaarrfk commented 1 year ago

Alright. I thought maybe I was missing an obvious solution.

Thanks for explaining.