ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

A way to pre-build kernels for several architectures #1373

Open littlewu2508 opened 2 years ago

littlewu2508 commented 2 years ago

Hi, I'm the maintainer of the Gentoo miopen package. I'd like to know if you can also provide the source code of the pre-compiled kernels, so that we can package it and users can compile them for their specific card (the pre-compiled kernels only cover a few architectures, which leads to problems such as https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1309).

atamazov commented 2 years ago

The pre-compiled kernels are just the ordinary kernels, and all their sources are already provided in this repo. You can run MIOpen on whatever card you want and then collect the compiled kernels from the binary cache directory.
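The collection step described above could be scripted. A minimal sketch, assuming the user cache lives under `~/.cache/miopen` (the default location mentioned later in this thread) and that the destination layout is our own choice:

```python
import shutil
from pathlib import Path

def collect_kernel_caches(cache_root: Path, dest: Path) -> list[Path]:
    """Copy every .ukdb file found under the MIOpen user cache into dest.

    cache_root is assumed to be ~/.cache/miopen (the default location);
    the per-version subdirectory layout below it is preserved.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for ukdb in cache_root.rglob("*.ukdb"):
        target = dest / ukdb.relative_to(cache_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(ukdb, target)
        copied.append(target)
    return copied
```

Run the workload first so MIOpen populates the cache, then collect; an empty result simply means no kernels were compiled yet.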

junliume commented 2 years ago

@atamazov Indeed, users can generate compiled kernels on any cards they possess. However, the GPU detection check currently blocks them from building it in the first place.

Can we move "GPU DETECTION FAILED DURING CMAKE PHASE" to a warning instead?

atamazov commented 2 years ago

It should be a warning already. The error only happens if you run "make check".

littlewu2508 commented 2 years ago

The pre-compiled kernels are just the ordinary kernels, and all their sources already provided in this repo. You can run MIOpen on whatever card you want and then collect compiled kernels from the binary cache directory.

Thanks. Also, I'd like to know whether there is a way to build all these kernels into .kdb files for several architectures, even if there is no GPU available on the build machine. I suppose the ROCm team has a build tool to generate the binary cache hosted on https://repo.radeon.com/rocm/miopen-kernel/, and I'd like to compile these ahead of time during the build of MIOpen.

atamazov commented 2 years ago

@littlewu2508

whether there is a way to build all these kernels to .kdb for several architectures...

Yes, but you'll get a separate .ukdb file for each arch. I think @JehandadKhan can provide you with the details on how to do that properly (as soon as time permits).

:warning: This feature is not intended for end users, so the process (including API calls, environment settings, etc.) is subject to change without notice.

Madouura commented 1 year ago

I'm also trying to package this for nixpkgs, and would like to know how to properly generate the kdb files for all/specified architectures. The closest thing I can find is the performance database, but I'm not entirely sure that's it.

atamazov commented 1 year ago

@Madouura IIRC you can do this in a straightforward way:

/cc @JehandadKhan

Madouura commented 1 year ago

Thank you for the prompt response. I'm a bit sick ATM so I may be misunderstanding here.

Prepare the development system that is identical to the intended target system, including GPU and ROCm.

Does this mean having the actual GPU hardware, or just a setting? I own two AMD GPUs so I can at least generate for RX 6900XT and RX 6800 if it's the latter.

Run necessary configs on the development system. For example, you can run the neural network(s) that you would like to later run on the target system. During this process, MIOpen builds all the necessary kernels and stores them in the user kernel database (~/.cache/miopen//*.ukdb IIRC).

Given that, I'm thinking we need GPU hardware on the development system. Something concerns me, however: by "running the neural networks," am I wrong in thinking that the generated database differs depending on which neural networks were run? Can I also safely assume that user kernel databases are safely interchangeable with (system) kernel databases? (I don't see why they wouldn't be, but best to make sure.)

For the .ukdb (specifically, gfx1030_40) generated in my user's ~/.cache/miopen, I notice it is quite a bit smaller than the .kdb files I see in https://repo.radeon.com/rocm/miopen-kernel/rel-5.0. Could the reason be that the MIOpen team generates/combines the kdb for a large assortment of neural networks?
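For size comparisons like this, it helps that the kernel databases are (to my understanding) ordinary SQLite files, so their contents can be inspected with stdlib tools. A sketch that makes no assumption about the exact schema:

```python
import sqlite3

def describe_kernel_db(path: str) -> dict[str, int]:
    """Return {table_name: row_count} for an MIOpen kernel database.

    Assumes only that the file is a SQLite database, which appears to be
    the on-disk format of both .kdb and .ukdb files; the table names and
    columns themselves are left for sqlite_master to reveal.
    """
    con = sqlite3.connect(path)
    try:
        tables = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        return {t: con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}
    finally:
        con.close()
```

Comparing row counts between a local .ukdb and a downloaded .kdb would show directly how much broader the official coverage is.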

atamazov commented 1 year ago

@Madouura

Does this mean having the actual GPU hardware, or just a setting? I own two AMD GPUs so I can at least generate for RX 6900XT and RX 6800 if it's the latter.

The former. It is also possible without hardware, but not easy AFAIK. Maybe @JehandadKhan can help you with it, when/if he has time.

am I wrong in thinking that implies that the database generated is different depending on the neural networks ran?

Different networks use different primitives with different configurations, so MIOpen generates different kernels to implement them.
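Instead of running whole networks, individual configurations can also be exercised with MIOpen's bundled test driver, MIOpenDriver. A sketch that only builds the command lines (the flag names below are from the driver's conv mode as I understand it; double-check them against `MIOpenDriver conv --help` on your installed version):

```python
def conv_driver_cmds(configs):
    """Build MIOpenDriver command lines for a list of convolution shapes.

    Each config is (batch, in_ch, height, width, out_ch, filter, pad, stride).
    Running these warms the user kernel cache one configuration at a time,
    without needing a full framework or model.
    """
    cmds = []
    for n, c, h, w, k, f, p, s in configs:
        cmds.append(
            f"MIOpenDriver conv -n {n} -c {c} -H {h} -W {w} "
            f"-k {k} -y {f} -x {f} -p {p} -q {p} -u {s} -v {s}"
        )
    return cmds

# e.g. a ResNet-style first layer: batch 32, 3->64 channels, 7x7, pad 3, stride 2
example = conv_driver_cmds([(32, 3, 224, 224, 64, 7, 3, 2)])
```

Sweeping a curated list of shapes this way is one path toward the "common configuration script" idea discussed below.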

Can I also safely assume that user kernel databases are safely interchangeable with (system) kernel databases? (I don't see why they wouldn't be, but best to make sure.)

Yes IIRC

For the .ukdb (specifically, gfx1030_40) generated on my user's ~/.cache/miopen, I notice it is quite a bit smaller than what I see for the .kdb in https://repo.radeon.com/rocm/miopen-kernel/rel-5.0. could the reason be that the miopen team generates/combines the kdb for a large assortment of neural networks?

Yes.
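If the two database kinds really are interchangeable, merging a user database into a system one could look like the sketch below. The table name `kern_db` is a guess about the schema, not something confirmed in this thread, so treat it as a placeholder and adjust it to whatever `sqlite_master` shows for your files:

```python
import sqlite3

def merge_kernel_db(src: str, dst: str, table: str = "kern_db") -> int:
    """Copy rows from src's kernel table into dst, skipping duplicates.

    Returns the number of rows actually added. Both files are assumed to
    be SQLite databases sharing the same table layout; "kern_db" is an
    assumed table name, not a documented one.
    """
    con = sqlite3.connect(dst)
    try:
        con.execute("ATTACH DATABASE ? AS src", (src,))
        before = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        con.execute(
            f'INSERT OR IGNORE INTO "{table}" SELECT * FROM src."{table}"')
        con.commit()
        after = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        return after - before
    finally:
        con.close()
```

This would let several per-GPU .ukdb files collected from different machines be folded into one distributable database per architecture.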

Madouura commented 1 year ago

It is also possible without hardware, but not easy AFAIK

Yes, if possible that would be ideal, since we could put MIOpen on a (hopefully Hydra or Hydra-adjacent) server, which will likely be GPU-less, and generate new KDBs for each consumer/pro GPU release as well as for each major MIOpen change. The other option is to write a common configuration script and crowdsource kernel databases from (probably trusted) users who have the hardware (are there any potential security vulnerabilities you know of that a MIOpen kernel database could cause?). We can hopefully work around what makes this "not easy" and avoid having to rely on trust in user submissions.

Different networks use different primitives with different configurations, so MIOpen generates different kernels to implement them.

could the reason be that the miopen team generates/combines the kdb for a large assortment of neural networks?

Yes.

Okay, so in that case, if it's not proprietary and you and/or the team can share it, would you mind sharing the list (or a subset of it, depending on what is proprietary) of neural networks the MIOpen team uses to generate the kernel databases? Considering that "different networks use different primitives with different configurations," I believe it may also be possible to write a program (probably using rocPRIM) to generate all possible configurations. However, I strongly suspect we would reach ridiculous file sizes, and possibly ridiculous generation times, if we attempted that. I'm unsure how realistic this approach would be. It could be titrated by generating only a subset of common (?) configurations, I suppose.

Madouura commented 1 year ago

One more thing to make sure of. For MIOpen, I suspect user kernel databases are generated on first neural network use and then updated on each subsequent use. That is: I start up stable-diffusion; there is no .ukdb until I generate an image (I haven't verified this; it may be generated at startup); the .ukdb is created; and then for each subsequent image generation the .ukdb is updated with anything that differs from the generations before. Would this be correct, or is everything generated on startup/first actual use (i.e., image generation)?

Madouura commented 1 year ago

Are kernel databases just a collection of compute shaders?

atamazov commented 1 year ago

Yes.

atamazov commented 1 year ago

@Madouura Please provide a list of GPUs you need the pre-built kernels for. Maybe it would be possible to extend the list of "officially" supported GPUs with these.

@JehandadKhan Is it possible to provide instructions that would allow end users to generate the pre-compiled kernel packages for their GPUs without the actual hardware?

/cc @junliume

Madouura commented 1 year ago

Sure. Looking at https://llvm.org/docs/AMDGPUUsage.html#processors, these should suffice.

Need

RDNA1 and up should be officially supported, IIRC, by most ROCm projects. I own the RX 6900 XT (80 CU) and the RX 6800 (60 CU). When used with MIOpen, the 6900 XT creates gfx1030_40 in the cache directory. I'm unsure why it's not gfx1030_80. I suspect the same would hold true if I used the 6800.

atamazov commented 1 year ago

When using with MIOpen, the 6900XT creates gfx1030_40 in the cache directory. I'm unsure as to why it's not gfx1030_80. I suspect the same might hold true if I used the 6800.

In MIOpen we use the number of hardware compute units, which is half of what rocminfo reports for gfx103x GPUs.
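As a concrete illustration (the `gfx1030` target and the CU counts come from this thread; the naming rule itself is inferred from the comment above, so treat it as a sketch):

```python
def cache_dir_name(gfx_target: str, rocminfo_cu_count: int) -> str:
    """Derive the per-GPU cache directory name MIOpen appears to use.

    For gfx103x GPUs, MIOpen counts hardware compute units, which is half
    the figure rocminfo reports; hence gfx1030_40 for an 80-CU RX 6900 XT.
    """
    return f"{gfx_target}_{rocminfo_cu_count // 2}"

# RX 6900 XT: rocminfo reports 80 CUs -> directory "gfx1030_40"
# RX 6800:    rocminfo reports 60 CUs -> directory "gfx1030_30"
```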

Madouura commented 1 year ago

In that case, it makes sense to move gfx1031_20 down to "want."