Is there a document on setting auxinfo_t for calling microkernels?

xrq-phys commented 4 years ago

I've recently been trying to use BLIS microkernels to implement some antisymmetric matrix operations. I read KernelsHowTo.md but still have no idea how to tag the packed memory and call ((dgemm_ukr_ft)bli_cntx_get_l3_nat_ukr_dt(...))(k, &alpha, ...) correctly.

If NULL is passed for the auxinfo_t argument, BLIS seems to give a correct answer, but bad memory accesses might occur depending on the type of kernel.

My test code

fgvanzee commented 4 years ago

@xrq-phys Sorry for the delay in response.

It sounds like you are using BLIS for a very advanced purpose. I applaud you for attempting to utilize the low-level building blocks in BLIS to do something entirely new. :)

Note that you asked about packed memory first, and then about proper use of auxinfo_t, which is a different issue. As for packing the memory, it may actually be useful to write your own code to do the packing since the code in BLIS is quite generalized and has many parameters. You can try to recycle some of the lower-level packing functions such as the packing kernels, which may be queried via the bli_cntx_get_packm_ker_dt( dt, ker_id, cntx ) function, which is shown in bli_packm_cxk.c.

You also didn't mention what subconfiguration you are using, which will affect which kernel is being executed. Some of the kernels don't actually use any of the fields in auxinfo_t, but some of them need the pointer to point to a valid struct.

xrq-phys commented 4 years ago

Thanks a lot for the reply!

I'm mainly trying to utilize the haswell, skx and armv8 kernels (specifically bli_dgemm_haswell_asm_6x8, bli_dgemm_skx_asm_16x14 and bli_dgemm_armv8a_asm_6x8).

In the armv8 case it worked smoothly and I confirmed in the assembly that auxinfo_t is not even referenced, but in the haswell and skx cases it fails very often (because my hand-crafted auxinfo_t is incorrect, I think).

I'm afraid I'm unable to write in x86_64 assembler at the moment (sorry), I'll try looking at the reference returned by bli_cntx_get_packm_ker_dt but I'd very much appreciate it if some general rules for setting auxinfo->ps, auxinfo->schema and auxinfo->next could be made available.

Both A and B are packed in the default way specified by KernelsHowTo.md, i.e. A in column-major and B in row-major micropanels for the operation C += AB.
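For reference, a minimal sketch of that layout in plain C, with no BLIS calls: each micropanel of A stores its MR-element columns contiguously, each micropanel of B stores its NR-element rows contiguously, and a reference microkernel computes C := beta*C + alpha*A*B on one MR x NR tile. The names pack_a_micropanel, pack_b_micropanel and ref_dgemm_ukr are hypothetical, the 6x8 block size is only borrowed from the kernels named above, and the real kernels take more arguments (including the auxinfo_t pointer) and treat special cases such as beta == 0 that this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

#define MR 6u /* rows of the C tile (illustrative, matches the 6x8 kernels) */
#define NR 8u /* columns of the C tile */

/* Pack an MR x k block of A (row stride rs_a, column stride cs_a) so that
   column p of the block occupies a_pack[p*MR .. p*MR + MR - 1]. */
static void pack_a_micropanel(size_t k, const double *a, size_t rs_a,
                              size_t cs_a, double *a_pack)
{
    assert(a && a_pack);
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            a_pack[p * MR + i] = a[i * rs_a + p * cs_a];
}

/* Pack a k x NR block of B (row stride rs_b, column stride cs_b) so that
   row p of the block occupies b_pack[p*NR .. p*NR + NR - 1]. */
static void pack_b_micropanel(size_t k, const double *b, size_t rs_b,
                              size_t cs_b, double *b_pack)
{
    assert(b && b_pack);
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            b_pack[p * NR + j] = b[p * rs_b + j * cs_b];
}

/* Reference microkernel: C := beta*C + alpha*(A*B) on one MR x NR tile,
   computed as k rank-1 updates over the packed micropanels. */
static void ref_dgemm_ukr(size_t k, double alpha, const double *a_pack,
                          const double *b_pack, double beta, double *c,
                          size_t rs_c, size_t cs_c)
{
    double ab[MR * NR] = {0}; /* accumulator, stored column-major */

    assert(a_pack && b_pack && c);
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                ab[i + j * MR] += a_pack[p * MR + i] * b_pack[p * NR + j];

    for (size_t j = 0; j < NR; ++j)
        for (size_t i = 0; i < MR; ++i)
            c[i * rs_c + j * cs_c] =
                beta * c[i * rs_c + j * cs_c] + alpha * ab[i + j * MR];
}
```

Comparing the output of an optimized microkernel against a reference like this on the same packed buffers is one way to separate packing mistakes from calling-convention mistakes.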

fgvanzee commented 4 years ago

@xrq-phys Once again, sorry for the delay. We've been struggling with some high-priority items lately.

In the armv8 case it worked smoothly and I confirmed in the assembly that auxinfo_t is not even referenced, but in the haswell and skx cases it fails very often (because my hand-crafted auxinfo_t is incorrect, I think).

I'm still not convinced that your auxinfo_t is the problem. The dgemm kernels in kernels/haswell/3/bli_gemm_haswell_asm_d6x8.c and kernels/skx/3/bli_dgemm_skx_asm_16x14.c do not dereference the auxinfo_t pointer that is passed in. Even though the idea is for the calling code to always pass auxiliary information into the microkernel, those particular microkernels are choosing to not make use of that information. So I don't think that passing NULL into the microkernel function (at least for those two kernels) can be the cause of any errors. I confirmed this by hard-coding NULL to be passed into the microkernel in frame/3/gemm/bli_gemm_ker_var2.c, configuring for haswell, and running make; make check. All tests passed.

I'm afraid I'm unable to write in x86_64 assembler at the moment (sorry), I'll try looking at the reference returned by bli_cntx_get_packm_ker_dt but I'd very much appreciate it if some general rules for setting auxinfo->ps, auxinfo->schema and auxinfo->next could be made available.

We have APIs for setting/querying the information in an auxinfo_t. You may find these APIs in frame/base/bli_auxinfo.h. (Note that they are static functions instead of true functions, so they exist only in the header file.) Here is a quick rundown:

_schema_a(), _schema_b(). These functions relate to the packing schemas used for the left-hand matrix operand (A) and right-hand matrix operand (B). Typically, these are not needed when using "native" assembly kernels. It sounds like you will not need to set them.
_next_a(), _next_b(). These functions relate to the pointer/address of the next micropanels of A and B. But the nuance of what "next" means here is important! We don't necessarily mean the next micropanel in memory, but rather the next micropanel that will be used! Oftentimes, it is the next micropanel in memory, except when you get to the end of the packed matrix, at which time you must wrap back around to the first micropanel. So that's the semantic meaning of "next" in this case. The purpose of these values in the auxinfo_t is to let the microkernel (if it wishes) prefetch elements of the next micropanel of A and/or B that will be computed upon.
_is_a(), _is_b(). These functions relate to the so-called imaginary stride of A and B. They represent the distance, in units of fundamental elements (if your datatype is dcomplex, its fundamental elements are of type double), from the real part of element i to the imaginary part of element i. So, for regular "interleaved" storage of complex values, the imaginary stride is 1. Note that these strides are: (a) experimental, (b) used only in select situations, and (c) only relevant to certain complex domain microkernels. You may safely ignore them since it appears you are interested in dgemm.
_ps_a(), _ps_b(). These functions relate to the so-called panel stride of A and B. The panel stride is the distance in memory (in units of elements) from the beginning of micropanel i to the beginning of micropanel i+1. Usually, ps_a is equal to MR * KC and ps_b is equal to KC * NR. Technically, they are PACKMR * KC and KC * PACKNR, since the micropanels may have "leading dimensions" that are slightly larger than MR or NR, but in almost all cases, MR = PACKMR and NR = PACKNR, so usually the two methods of computing the panel stride are equivalent. Note: Most microkernels will never need to use these values.
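To make the panel-stride and "next" semantics concrete, here is a minimal sketch in plain C. It deliberately does not use the real auxinfo_t or the accessors in frame/base/bli_auxinfo.h: mock_auxinfo, panel_stride and fill_mock_auxinfo are hypothetical names made up for illustration. It also assumes that the inner loop walks the micropanels of A while the micropanel of B stays fixed, so "next B" only advances when A wraps around; treat that loop order as an assumption rather than a statement about any particular macrokernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields discussed above.
   This is NOT the real auxinfo_t definition. */
typedef struct {
    const double *next_a; /* micropanel of A that will be used next */
    const double *next_b; /* micropanel of B that will be used next */
    size_t ps_a;          /* panel stride of A, in elements */
    size_t ps_b;          /* panel stride of B, in elements */
} mock_auxinfo;

/* Panel stride in elements: PACKMR * KC for A, KC * PACKNR for B. */
static size_t panel_stride(size_t pack_dim, size_t kc)
{
    return pack_dim * kc;
}

/* Fill the "next" pointers for the microkernel call that uses micropanel i
   of A (out of m_panels) and micropanel j of B (out of n_panels).
   Assumed loop order: j is the outer loop, i is the inner loop.
   aux->ps_a and aux->ps_b must already be set. */
static void fill_mock_auxinfo(mock_auxinfo *aux,
                              const double *a_pack, size_t i, size_t m_panels,
                              const double *b_pack, size_t j, size_t n_panels)
{
    assert(aux && i < m_panels && j < n_panels);

    int last_i = (i + 1 == m_panels);
    int last_j = (j + 1 == n_panels);

    /* A: next panel in memory, or wrap around to the first panel. */
    size_t i_next = last_i ? 0 : i + 1;

    /* B: unchanged while A is still being traversed; advances (or wraps)
       only once the last micropanel of A has been reached. */
    size_t j_next = last_i ? (last_j ? 0 : j + 1) : j;

    aux->next_a = a_pack + i_next * aux->ps_a;
    aux->next_b = b_pack + j_next * aux->ps_b;
}
```

With PACKMR = 6, PACKNR = 8 and KC = 4, the panel strides come out as 24 and 32 elements, and on the last micropanel of A the next-A pointer wraps back to the start of the packed buffer.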

xrq-phys commented 4 years ago

@fgvanzee Thanks a lot for your reply!

I'm quite ashamed to say that the reason for the segmentation fault turned out to be my mishandling of C function pointers in C++. The test passed on all 3 platforms after wrapping the calls to the gemm kernels in extern "C" blocks.

Here is a quick rundown: ...

Thank you very much for the information! I'm in fact working with both double and complex<double> (i.e. double complex), so I'll take a look at the is parameters as well.

fgvanzee commented 4 years ago

@xrq-phys Glad you were able to figure out your problem!

flame / blis

Is there a document on setting auxinfo_t for calling microkernels? #378