digego / extempore

A cyber-physical programming environment

ubuntu LLVM errors on knight's landing (maybe AVX512?) #378

Closed benswift closed 4 years ago

benswift commented 4 years ago

The new cross-platform automated testing is super-cool.

However, I noticed that ubuntu-latest was giving intermittent timeouts. Looking at the logs, it was hanging while trying to load a pre-compiled libs/core/instruments.xtm, the same place I detailed in the other issue.

That's weird, I thought. Then I looked at the logs more closely: because the VMs are just drawn from a pool of runners, they're not identical hardware-wise.

Here's the Extempore startup banner from a test run which succeeded:

-------------- Extempore -------------- 
Andrew Sorensen (c) 2010-2020
andrew@moso.com.au, @digego

ARCH           : x86_64-unknown-linux-gnu
CPU            : broadwell
ATTRS          : -sse4a,-avx512bw,+cx16,-tbm,+xsave,-fma4,-avx512vl,+prfchw,+bmi2,+adx,-xsavec,+fsgsbase,+avx,-avx512cd,-avx512pf,+rtm,+popcnt,+fma,+bmi,+aes,+rdrnd,-xsaves,+sse4.1,+sse4.2,+avx2,-avx512er,+sse,+lzcnt,+pclmul,-avx512f,+f16c,+ssse3,+mmx,-pku,+cmov,-xop,+rdseed,+movbe,+hle,+xsaveopt,-sha,+sse2,+sse3,-avx512dq
LLVM           : 3.8.0 MCJIT
Primary        : thread 0
---------------------------------------

and here's the startup banner from one which failed:

------------- Extempore -------------- 
Andrew Sorensen (c) 2010-2020
andrew@moso.com.au, @digego

ARCH           : x86_64-unknown-linux-gnu
CPU            : knl
ATTRS          : -sse4a,+avx512bw,+cx16,-tbm,+xsave,-fma4,+avx512vl,+prfchw,+bmi2,+adx,+xsavec,+fsgsbase,+avx,+avx512cd,-avx512pf,+rtm,+popcnt,+fma,+bmi,+aes,+rdrnd,+xsaves,+sse4.1,+sse4.2,+avx2,-avx512er,+sse,+lzcnt,+pclmul,+avx512f,+f16c,+ssse3,+mmx,-pku,+cmov,-xop,+rdseed,+movbe,+hle,+xsaveopt,-sha,+sse2,+sse3,+avx512dq
LLVM           : 3.8.0 MCJIT
Primary        : thread 0
---------------------------------------

Notice that the success is on a Broadwell (CPU: broadwell) and the failure is on a Knights Landing (CPU: knl). The box I was having trouble on the other day is also CPU: knl. Interestingly, that same box dual-boots Windows, and it works fine there.

It could be coincidence, and I really need to go back and look at the LLVM debugging output listed in that other issue. However, the AVX512 attrs are certainly suspicious: avx512f, avx512cd, avx512bw, avx512dq and avx512vl are all turned on for knl and off for broadwell.
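To make the comparison less eyeball-dependent, here is a minimal Python sketch that diffs the two ATTRS lines from the banners above. The function names (`parse_attrs`, `diff_attrs`) are made up for illustration and are not part of Extempore or LLVM; the strings are copied verbatim from the two banners.

```python
# ATTRS lines copied from the two startup banners in this issue.
BROADWELL = (
    "-sse4a,-avx512bw,+cx16,-tbm,+xsave,-fma4,-avx512vl,+prfchw,+bmi2,+adx,"
    "-xsavec,+fsgsbase,+avx,-avx512cd,-avx512pf,+rtm,+popcnt,+fma,+bmi,+aes,"
    "+rdrnd,-xsaves,+sse4.1,+sse4.2,+avx2,-avx512er,+sse,+lzcnt,+pclmul,"
    "-avx512f,+f16c,+ssse3,+mmx,-pku,+cmov,-xop,+rdseed,+movbe,+hle,"
    "+xsaveopt,-sha,+sse2,+sse3,-avx512dq"
)
KNL = (
    "-sse4a,+avx512bw,+cx16,-tbm,+xsave,-fma4,+avx512vl,+prfchw,+bmi2,+adx,"
    "+xsavec,+fsgsbase,+avx,+avx512cd,-avx512pf,+rtm,+popcnt,+fma,+bmi,+aes,"
    "+rdrnd,+xsaves,+sse4.1,+sse4.2,+avx2,-avx512er,+sse,+lzcnt,+pclmul,"
    "+avx512f,+f16c,+ssse3,+mmx,-pku,+cmov,-xop,+rdseed,+movbe,+hle,"
    "+xsaveopt,-sha,+sse2,+sse3,+avx512dq"
)


def parse_attrs(attrs):
    """Map each feature name to True ('+') or False ('-')."""
    return {item[1:]: item[0] == "+" for item in attrs.split(",")}


def diff_attrs(a, b):
    """Return the sorted names of features whose on/off state differs."""
    pa, pb = parse_attrs(a), parse_attrs(b)
    return sorted(n for n in pa.keys() | pb.keys() if pa.get(n) != pb.get(n))


if __name__ == "__main__":
    for name in diff_attrs(BROADWELL, KNL):
        print(name)
```

Run on these two banners it lists seven features: the five AVX512 ones (bw, cd, dq, f, vl) plus xsavec and xsaves. avx512pf and avx512er are off on both machines, so they drop out of the suspect list.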

This is a bummer, firstly because it's broken, and secondly because it means that our tests will randomly fail depending on the hardware they're assigned to (which we have no control over).

Bummer.

benswift commented 4 years ago

@digego when you built it on your VM the other day what was the (micro)arch? did it have AVX512?

digego commented 4 years ago

sorry mate, deleted the vm after use

benswift commented 4 years ago

all g, it was just to double-check anyway

benswift commented 4 years ago

Ok, so it does look like one of the attrs is the issue [glances suspiciously at AVX512].

Using the same test file as in the other issue:

;; these all compile ok
(bind-func works_1
  (lambda (inverted:i1)
    (let ((rising (if inverted 1 0)))
      (lambda ()
        rising))))

(bind-func works_2
  (lambda (inverted:i1)
    (let ((rising (if inverted #t #f)))
      rising)))

(bind-func works_3
  (lambda (inverted:i1)
    (lambda ()
      (if inverted #t #f))))

;; this is broken
(bind-func broken
  (lambda (inverted:i1)
    (let ((rising (if inverted #t #f)))
      (lambda ()
        rising))))

Now, on my beefy xeon-y box, here are the results:

I think that I can try toggling individual attrs using that CLI thing, so that might be the next step. Anyway, updating LLVM will solve all our problems and give us all ponies.
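The "toggle individual attrs" idea could be sketched roughly like this in Python: starting from the failing knl ATTRS line, generate one candidate attribute string per enabled AVX512 feature, each with just that feature switched off. The `candidates` helper is hypothetical, and how the resulting strings get handed to Extempore is deliberately left out, since this thread doesn't spell out the command-line option.

```python
# ATTRS line copied from the failing (knl) startup banner in this issue.
KNL = (
    "-sse4a,+avx512bw,+cx16,-tbm,+xsave,-fma4,+avx512vl,+prfchw,+bmi2,+adx,"
    "+xsavec,+fsgsbase,+avx,+avx512cd,-avx512pf,+rtm,+popcnt,+fma,+bmi,+aes,"
    "+rdrnd,+xsaves,+sse4.1,+sse4.2,+avx2,-avx512er,+sse,+lzcnt,+pclmul,"
    "+avx512f,+f16c,+ssse3,+mmx,-pku,+cmov,-xop,+rdseed,+movbe,+hle,"
    "+xsaveopt,-sha,+sse2,+sse3,+avx512dq"
)


def candidates(attrs, prefix="avx512"):
    """Yield (feature, attr_string) pairs; each string has exactly one
    currently-enabled feature starting with `prefix` flipped from + to -."""
    items = attrs.split(",")
    for i, item in enumerate(items):
        if item.startswith("+" + prefix):
            flipped = items[:i] + ["-" + item[1:]] + items[i + 1:]
            yield item[1:], ",".join(flipped)


if __name__ == "__main__":
    for feature, attr_string in candidates(KNL):
        # Each attr_string is one configuration to try against the
        # `broken` test function above.
        print(feature, "->", attr_string)
```

If the failure disappears for exactly one of these strings, that feature is the likely trigger; if it only disappears when several are off together, the one-at-a-time approach would need widening to pairs or to the whole AVX512 group.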

benswift commented 4 years ago

It's looking more likely that AVX512 on older LLVM is the culprit.

Will put together a workaround.

benswift commented 4 years ago

Ok, well it looks like this is “fixed” (worked around) in 3599d8b484253f8b29eb8681e0cbb8e8b24a4181.

We’ll have a proper fix (and be able to use avx512) when we update LLVM.