Closed devinamatthews closed 6 years ago
Glad to see my leaving those printf() calls in there was useful. :)
Ideally, yes, the value would be cached somewhere. (The current design calls CPUID or equivalent each time `bli_cpuid_query_id()` is called. But this only happens for multi-configuration builds. If you configure via `configure haswell` or `configure auto`, only one configuration is selected and that selection is hard-coded into the definition of `bli_arch_query_id()`.)
I hadn't noticed this before. I don't know how much effect caching would have, but there's a reason to make it configurable: checkpoint/restart in the common case of a heterogeneous cluster.
You mean like VM migration between incompatible hardware? I think that’s a rather exceptional corner case. Checkpoint-restart in HPC means restarting the program and reading in state. That will initialize shared libraries from scratch and doesn’t break with caching the config.
No, I mean like using BLCR or, these days, probably DMTCP [unless you use PSM :-(]. It's a known problem with things like openblas that you need to restart on a compatible micro-architecture, though I'd be happy to know how to avoid that if the scheduler doesn't support it directly -- never added to SGE, for instance. I thought dumping the memory of HPC applications was what things like SCR are for. Not a big deal, anyhow.
@loveshack I'm afraid I'm unfamiliar with any of the acronyms you mention above. Could you elaborate?
I also was unaware of the need to "restart" OpenBLAS in certain situations. Could you say a few words about this as well?
@jeffhammond I made no attempt to google any of them, for fear of acronym collisions. :)
The links were marginally helpful, but I still don't see the application to the topic at hand. I don't understand why this isn't an issue of simply storing the result of a "deep" hardware query and using it to define future "shallow" queries.
Sorry for causing confusion talking to Jeff, but as I said it's not a big deal, and definitely not worth supporting if there's significant overhead.

This is about dumping the entire state of a running program so it can be restarted from that point after an error or after running out of time in the job (as opposed to the application writing out its own state and reading it back in when re-run from scratch; Jeff will know of chemistry-ish programs that can do that, others that can't, and some that can only in some of their modes). Clearly, if you dump executing code that has made a dynamic SIMD selection, say AVX-512, and try to restart it on a system that is only, say, AVX2, it fails. That may not happen on the sort of systems Jeff is thinking of, either because they're homogeneous or because they require selecting a specific node type; however, it can on one I've been used to using and managing.

BLCR and DMTCP are just two systems for doing this dump-running-state checkpointing (entirely in user space, in DMTCP's case). If you do a CPUID-type test on each call, that solves the problem unless the checkpoint happens in the middle of such a call, which it probably wouldn't for an MPI application, for instance. SCR is a fancy user-space filesystem for supporting such checkpoint data at large scale. OpenBLAS is just an example of an existing dynamic-SIMD-selection library that can cause the sort of problem I'm talking about, like MKL, FFTW, and some others; nothing directly related to it.
@loveshack Thanks for that explanation. I now understand how we got from CPUID selection to checkpointing. (It was not at all clear, even if it should have been. Sorry.)
Anyhow, my response to this is that people who do this kind of checkpointing will simply need to make sure they configure BLIS in the appropriate way--that is, that they enable the more frequent CPUID method, rather than once-and-cache method. My read of the situation is that this would be sufficient for their purposes. Please correct me if I'm mistaken.
Talking about algorithmic fault tolerance: there is this delightful paper https://experts.illinois.edu/en/publications/fault-tolerant-high-performance-matrix-multiplication-theory-and- that was revisited with BLIS a few years ago, and then promptly rejected by IPDPS (or was it SCXX?).
@devinamatthews This should be addressed now in 10c9e8f.
Field, do you have a specialized kernel family for specific Xeons, such as the embedded Xeons? Just curious. Thanks, Tony
Anthony Skjellum, PhD 205-807-4968
@tonyskjellum We target by instruction sets; product names and models are mostly meaningless to us. For example: Sandy Bridge and Ivy Bridge both used AVX, so they use the same level-3 microkernel. Haswell and Broadwell added FMA instructions, and so they get a different microkernel.
@tonyskjellum are there any differences w.r.t. server Xeon in pipelines or the L1 and L2 caches? We probably should do something at runtime to detect how much L3 cache is present, but that has a fairly weak effect on performance anyway.
Hi, got it; will look at those variants. Thank you, Tony
IIRC the runtime config selection is supposed to just run once and be "remembered", but in debugging arm32 runtime selection with `printf`s, I get: