flame / blis

BLAS-like Library Instantiation Software Framework

Runtime config selection is executed many times #175

Closed devinamatthews closed 6 years ago

devinamatthews commented 6 years ago

IIRC the runtime config selection is supposed to just run once and be "remembered", but in debugging arm32 runtime selection with printfs, I get:

% 
% level-3 implementations        s       d       c       z
bli_cpuid_query_id
bli_cpuid_query
n1, n2, n3 = 42 20 103
proc_str: model name  : ARMv7 Processor rev 4 (v7l)
ptno_str: CPU part    : 0xd03
feat_str: Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
features var: 1
model: 0
part#: d03
bli_cpuid_query_id
bli_cpuid_query
n1, n2, n3 = 42 20 103
proc_str: model name  : ARMv7 Processor rev 4 (v7l)
ptno_str: CPU part    : 0xd03
feat_str: Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
features var: 1
model: 0
part#: d03
bli_cpuid_query_id
bli_cpuid_query
n1, n2, n3 = 42 20 103
proc_str: model name  : ARMv7 Processor rev 4 (v7l)
ptno_str: CPU part    : 0xd03
feat_str: Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
features var: 1
model: 0
part#: d03
bli_cpuid_query_id
bli_cpuid_query
n1, n2, n3 = 42 20 103
proc_str: model name  : ARMv7 Processor rev 4 (v7l)
ptno_str: CPU part    : 0xd03
feat_str: Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
features var: 1
model: 0
part#: d03
fgvanzee commented 6 years ago

Glad to see my leaving those printf() calls in there was useful. :)

Ideally, yes, the value would be cached somewhere. (The current design calls CPUID or equivalent each time bli_cpuid_query_id() is called. But this only happens for multi-configuration builds. If you configure via configure haswell or configure auto, only one configuration is selected and that selection is hard-coded into the definition of bli_arch_query_id().)
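
A minimal sketch of that once-and-remember idea, for reference -- this is not BLIS's actual code; the arch_t stand-in and the static flag are illustrative, and a real version would want to be thread-safe (e.g., via pthread_once()):

```c
#include <stdbool.h>

typedef int arch_t;                        /* stand-in for BLIS's arch_t type */
extern arch_t bli_cpuid_query_id( void );  /* the expensive hardware probe    */

arch_t bli_arch_query_id( void )
{
    static bool   cached = false;
    static arch_t id;

    if ( !cached )
    {
        id     = bli_cpuid_query_id();  /* run the hardware query once... */
        cached = true;                  /* ...and remember the answer     */
    }

    return id;
}
```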

loveshack commented 6 years ago

I hadn't noticed this before. I don't know how much effect caching would have, but there's a reason to make it configurable -- checkpoint/restart in the common case of a heterogeneous cluster.

jeffhammond commented 6 years ago

You mean like VM migration between incompatible hardware? I think that’s a rather exceptional corner case. Checkpoint-restart in HPC means restarting the program and reading in state. That will initialize shared libraries from scratch and doesn’t break with caching the config.

loveshack commented 6 years ago

No, I mean like using BLCR or, these days, probably DMTCP [unless you use PSM :-(]. It's a known problem with things like OpenBLAS that you need to restart on a compatible micro-architecture, though I'd be happy to know how to avoid that if the scheduler doesn't support it directly -- support was never added to SGE, for instance. I thought dumping the memory of HPC applications was what things like SCR are for. Not a big deal, anyhow.

fgvanzee commented 6 years ago

@loveshack I'm afraid I'm unfamiliar with any of the acronyms you mention above. Could you elaborate?

I also was unaware of the need to "restart" OpenBLAS in certain situations. Could you say a few words about this as well?

jeffhammond commented 6 years ago

@fgvanzee Did you google these first? In any case:

fgvanzee commented 6 years ago

@jeffhammond I made no attempt to google any of them out of concern for acronym collisions. :)

The links were marginally helpful, but I still don't see the application to the topic at hand. I don't understand why this isn't an issue of simply storing the result of a "deep" hardware query and using it to define future "shallow" queries.

loveshack commented 6 years ago

Sorry for causing confusion talking to Jeff, but as I said it's not a big deal, and definitely not worth supporting if there's significant overhead.

This is about dumping the entire state of a running program so it can be restarted from that point after an error or after running out of time for the job (as opposed to the application writing out state itself and picking that up if re-run from scratch; Jeff will know of chemistry-ish programs that can do that, others that can't, and some that can only in some of their modes). Clearly if you dump the executing code after it has made a dynamic SIMD selection, say avx512, and try to restart it on a system that has only, say, avx2, it fails. That may not happen on the sort of systems Jeff's thinking of, either because they're homogeneous or require selecting a specific node type; however it can on systems I've used and managed.

BLCR and DMTCP are just two systems for doing that dump-running-state checkpointing (entirely in user space, in DMTCP's case). If you do a cpuid-type test on each call, that would solve the problem unless the checkpoint happened in the middle of such a call, which it probably wouldn't for an MPI application, for instance. SCR is a fancy user-space filesystem for supporting such checkpoint data at large scale. OpenBLAS is just an example of an existing dynamic SIMD-selection library that can cause the sort of problem I'm talking about, like MKL, FFTW, and some others -- nothing directly related to it.

fgvanzee commented 6 years ago

@loveshack Thanks for that explanation. I now understand how we got from CPUID selection to checkpointing. (It was not at all clear, even if it should have been. Sorry.)

Anyhow, my response to this is that people who do this kind of checkpointing will simply need to make sure they configure BLIS appropriately -- that is, enable the more frequent CPUID method rather than the once-and-cache method (see the sketch below). My read of the situation is that this would be sufficient for their purposes. Please correct me if I'm mistaken.
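
If it helps, such a configure-time escape hatch could be as simple as compiling the cache out. A hedged sketch -- the macro name BLIS_DISABLE_ARCH_CACHE is invented here for illustration and is not an actual BLIS configure option:

```c
#include <stdbool.h>

typedef int arch_t;                        /* stand-in for BLIS's arch_t type */
extern arch_t bli_cpuid_query_id( void );  /* the expensive hardware probe    */

arch_t bli_arch_query_id( void )
{
#ifdef BLIS_DISABLE_ARCH_CACHE
    /* Checkpoint/restart-friendly mode: re-probe the hardware on every
       query, so a restart on a different micro-architecture is noticed. */
    return bli_cpuid_query_id();
#else
    /* Default mode: probe once and remember the answer. */
    static bool   cached = false;
    static arch_t id;

    if ( !cached ) { id = bli_cpuid_query_id(); cached = true; }

    return id;
#endif
}
```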

rvdg commented 6 years ago

Talking about algorithmic fault tolerance: there is this delightful paper (https://experts.illinois.edu/en/publications/fault-tolerant-high-performance-matrix-multiplication-theory-and-) that was revisited with BLIS a few years ago, and then promptly rejected by IPDPS (or was it SCXX?).


fgvanzee commented 6 years ago

@devinamatthews This should be addressed now in 10c9e8f.

tonyskjellum commented 6 years ago

Field, do you have a specialized kernel family for specific Xeons, such as the embedded Xeons? Just curious. Thanks, Tony

fgvanzee commented 6 years ago

@tonyskjellum We target by instruction sets; product names and models are mostly meaningless to us. For example: Sandy Bridge and Ivy Bridge both used AVX, so they use the same level-3 microkernel. Haswell and Broadwell added FMA instructions, and so they get a different microkernel.
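
To illustrate the instruction-set-driven selection -- hypothetical kernel names only, with GCC's __builtin_cpu_supports() standing in for BLIS's actual configuration registry (x86-only, GCC/Clang):

```c
#include <stdio.h>

/* Illustrative dispatch: pick a microkernel family by ISA feature,
   not by product name or model number. */
static const char* pick_gemm_ukernel( void )
{
    if ( __builtin_cpu_supports( "avx512f" ) ) return "skx (AVX-512)";
    if ( __builtin_cpu_supports( "fma" ) )     return "haswell (AVX2+FMA)";
    if ( __builtin_cpu_supports( "avx" ) )     return "sandybridge (AVX)";
    return "generic (portable C)";
}

int main( void )
{
    /* Sandy Bridge and Ivy Bridge both report AVX but not FMA, so they
       fall through to the same AVX kernel; Haswell/Broadwell report FMA
       and get the FMA-based kernel instead. */
    printf( "selected: %s\n", pick_gemm_ukernel() );
    return 0;
}
```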

devinamatthews commented 6 years ago

@tonyskjellum are there any differences w.r.t. server Xeons in the pipelines or the L1 and L2 caches? We probably should do something at runtime to detect how much L3 cache is present, but that has a fairly weak effect on performance anyway.
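
For what it's worth, a runtime L3 probe on Linux could be as simple as reading sysfs. A hedged sketch -- the index3 path is a Linux convention for the unified last-level cache, not anything BLIS currently does:

```c
#include <stdio.h>

/* Returns the L3 size in KiB, or -1 if it cannot be determined.
   The sysfs file contains a string such as "25344K". */
static long l3_cache_size_kb( void )
{
    FILE* f  = fopen( "/sys/devices/system/cpu/cpu0/cache/index3/size", "r" );
    long  kb = -1;

    if ( f != NULL )
    {
        if ( fscanf( f, "%ld", &kb ) != 1 ) kb = -1;
        fclose( f );
    }

    return kb;  /* a caller would fall back to a sane default on -1 */
}

int main( void )
{
    printf( "L3 cache: %ld KiB\n", l3_cache_size_kb() );
    return 0;
}
```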

tonyskjellum commented 6 years ago

Hi, got it; will look at those variants. Thank you, Tony
