LLVM native throws "SIGSEGV: invalid address" due to fused FP operation

jasigal commented 6 years ago

Hello, I am submitting a bug report for accelerate-llvm-native (based on what I saw issues for it were moved/resubmitted/double submitted here?). I've created a gist describing the issue and containing the relevant files and output in addition to information here.

I believe this bug may actually be with LLVM itself, and I plan to file a bug there as well, but I think there is a workaround that can be done in accelerate-llvm-native in the mean time.

Description

The following Accelerate code SIGSEGV's by trying to access an invalid address.


module Main where

import           Data.Array.Accelerate             as A
import           Data.Array.Accelerate.LLVM.Native

g :: Scalar Double -> Scalar Double
g = (runN (A.map f :: Acc (Scalar Double) -> Acc (Scalar Double)))
  where
    f x = let y = recip x
              b = (-y) * y
          in y + tanh b

main :: IO ()
main = do
  let r = g (fromList Z [1])
  print (indexArray r Z)

This issue first occurred while I was writing some simple numeric functions for automatic differentiation and the bug caused my code to unpredictably crash.

Expected behaviour

The expected behaviour is the program should output 0.23840584404423515 (= 1 + tanh(-1)).

Current behaviour

Running the program currently produces “stack exec -- accelerate-llvm-b…” terminated by signal SIGSEGV (Address boundary error).

Possible solution

See the gist for a detailed description, but a workaround would be to turn off (or give the ability to turn off) FP contract option (-fp-contract for llc and opt). Compiling the generated LLVM IR without this option by hand produced an object which, when placed in Accelerates cache, produces the expected behaviour.

Steps to reproduce

Take the files from the gist, (namely accelerate-llvm-bug.cabal, stack.yaml, and Main.hs) and place them in the same folder (e.g. download the ZIP and unzip it).
Run stack build.
Run stack exec accelerate-llvm-bug.

Your environment

Haskell package versions are all from Stack LTS 10.2.

Version used: accelerate-1.1.1.0
Backend(s) used and version: accelerate-llvm-1.1.0.0, accelerate-llvm-native-1.1.0.1

Operating system and version:

$ lsb_release -a
Distributor ID:   Ubuntu
Description:  Ubuntu 17.04
Release:  17.04
Codename: zesty

Link to your project: the gist

Misc. versions of other tools:

$ clang-5.0 --version
clang version 5.0.1-svn319952-1~exp1 (branches/release_50)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ llc-5.0 --version
LLVM (http://llvm.org/):
LLVM version 5.0.1

Optimized build.
Default target: x86_64-pc-linux-gnu
Host CPU: broadwell

Registered Targets: [...]

Thank you, and don't hesitate to ask for more info!

tmcdonell commented 6 years ago

Thanks for the really detailed bug report!

My mac (ivybridge) worked; it seems to lack the problematic vfnmsub213sb instruction, or LLVM knows to avoid it on my architecture (log). I could reproduce this on a linux box with broadwell cpu though (log).

jasigal commented 6 years ago

Of course, no problem!

Glad to see it is actually reproducible (on some architectures). Would you be open to a PR with my suggestion above? I don't have one at the moment, but if in your opinion it's worth it (since who knows how long it could take to get a fix in LLVM) I'd be up for trying. If so, what's the best way to add it if you know off the top of your head?

tmcdonell commented 6 years ago

I think this is a bug in the object code loader (at least for ELF).

The following works for me; if you could confirm that would be great:

main.c

#include <stdio.h>
#include <stdint.h>

void map_1bf1a791bb25369d(int64_t start, int64_t end, double* out, const double* in);

int main(int argc, char** argv)
{
  double input = 1.0;
  double result;

  map_1bf1a791bb25369d(0,1, &result, &input);

  printf("result = %.16lf\n", result);

  return 0;
}

Save the generated .ll or .s from Accelerate and compile with:

$ clang-5.0 -O3 -lm -Wall main.c map_1bf1a791bb25369d.s
$ ./a.out
result = 0.2384058440442351

objdump -d should still show the offending instruction (here).

jasigal commented 6 years ago

I can confirm that the code above works. Guess I should stop being so certain about compiler bugs when I don't know what I'm talking about 🤔 Does this mean the problem is most likely on the Accelerate side?

NB: One thing I noticed when working with this issue is that changing the -code-model option to llc had an effect on whether or not addressing relative to %rip was used. From my understanding that's an "obvious" statement and I don't know if that's useful.

tmcdonell commented 6 years ago

haha. I mean technically you were correct, just about the wrong compiler 😅

I am guessing that the bug is somewhere in (or very closely related to) this file.

I've never experimented with the different code-model options actually; I don't know if that would be useful either.

jasigal commented 6 years ago

How important of an issue does this appear to be? I could attempt to poke around to see if I can figure out what's going wrong in the ELF object loader, but I really don't have any experience in the area. If this issue rates high enough for you or another developer/maintainer to tackle, I doubt I'd be useful help. However, if it doesn't rate high enough or everyone has other time commitments, I'm happy to attempt a fix! One of the core specs of the project I'm working on is CPU support, so it's not an unproductive use of my time.

tmcdonell commented 6 years ago

If you want to take a look that would be great! If you can't make heads or tails of it don't feel bad about giving up though, it could be a bit hairy.

Looking at the -ddump-ld output, I'm curious why the second line there doesn't have the symbol name like the first line does. Figuring that out might be a good place to start looking.

[   0.117] ld: .LCPI0_0: local symbol in section 4 at 0x00
[   0.117] ld: : local symbol in section 4 at 0x00
[   0.117] ld: map_1bf1a791bb25369d: local symbol in section 2 at 0x00
[   0.117] ld: tanh: external symbol found at 0x00007fd29327a970
[   0.117] ld: section 2: Mem: 0x000000000-0x000000064         .text
[   0.117] ld: section 4: Mem: 0x000000068-0x000000078         .rodata.cst8
[   0.117] ld: relocation: 0x0024 to symbol 1 in section 2, type=R_X86_64_PC32, value=.LCPI0_0-4
[   0.118] ld: relocation: 0x0038 to symbol 2 in section 2, type=R_X86_64_32S, value=+8
[   0.118] ld: relocation: 0x003d to symbol 4 in section 2, type=R_X86_64_PC32, value=tanh-4

jasigal commented 6 years ago

Yeah, I'd noticed that as well. When I compiled the LLVM IR manually, the symbols showed up with nm and don't show (except as debug) when autogenerated.

jasigal commented 6 years ago

I'm currently looking at the generated assembly for the following combination of choices to llc-5.0:

-relocation-model: pic, static (default)
-code-model: small (default), medium, large, kernel

It appears that adding the PIC option (or kernel and PIC) emits promising assembly, and additionally the large code model seems to produce code which looks correct (although reading it seems this has a performance cost).

My plan at the moment is to:

Find promising looking compiler options (probably just the two above, but maybe more -relocation-model options).
Determine how to pass these in Accelerate (will involve hacking on Accelerate) to figure out how to set these on the LLVM C++ library (since I don't recall seeing them already accessible).
Run the Accelerate test suite with my guesses for the options.
Propose the changes that let all tests pass (and if more than one combination works, the combination which produces minimal changes to assembly).

My reason for this approach is because I feel it will be quicker than understanding all the assumptions in the ELF loading code (unless you have that written up somewhere!). Would a fix produced above be accepted by you, or would you consider this too much of an ad-hoc fix?

jasigal commented 6 years ago

Possibly scratch this approach, it seems the PIC options involve R_X86_64_PLT32 relocation types, which aren't handled.

tmcdonell commented 6 years ago

Does changing the code/relocation model really affect the generated assembly that much? (I am concerned that that wouldn't address the underlying issue, but it is possible there are better options for what we are trying to do with the result)

I'll have a look at the ELF loader and see what is wrong there. It isn't really documented, I guess, but it doesn't really do anything special either.

jasigal commented 6 years ago

When PIC is not enabled:

generated:no-PIC:kernel.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000025 R_X86_64_32S      .rodata.cst8
0000000000000039 R_X86_64_32S      .rodata.cst8+0x0000000000000008
000000000000003e R_X86_64_PC32     tanh-0x0000000000000004

generated:no-PIC:large.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000020 R_X86_64_64       .rodata.cst8
0000000000000034 R_X86_64_64       tanh

generated:no-PIC:medium.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
000000000000001e R_X86_64_64       .rodata.cst8
0000000000000049 R_X86_64_PC32     tanh-0x0000000000000004

generated:no-PIC:small.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000024 R_X86_64_PC32     .LCPI0_0-0x0000000000000004
0000000000000038 R_X86_64_32S      .rodata.cst8+0x0000000000000008
000000000000003d R_X86_64_PC32     tanh-0x0000000000000004

When PIC is enabled:

generated:PIC:kernel.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000024 R_X86_64_PC32     .LCPI0_0-0x0000000000000004
0000000000000037 R_X86_64_PC32     .LCPI0_1-0x0000000000000004
000000000000003c R_X86_64_PLT32    tanh-0x0000000000000004

generated:PIC:large.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000020 R_X86_64_64       .rodata.cst8
0000000000000034 R_X86_64_64       tanh

generated:PIC:medium.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
000000000000001e R_X86_64_64       .rodata.cst8
0000000000000049 R_X86_64_PLT32    tanh-0x0000000000000004

generated:PIC:small.o:     file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000024 R_X86_64_PC32     .LCPI0_0-0x0000000000000004
0000000000000037 R_X86_64_PC32     .LCPI0_1-0x0000000000000004
000000000000003c R_X86_64_PLT32    tanh-0x0000000000000004

And there are some small changes (but still involving the unhandled relocation types) when -relocation-model is rwpi or ropi.

jasigal commented 6 years ago

I think I've figured out some of the issues. I don't have time at the moment to write everything down (and I don't have it completely figured out), but I should have time tomorrow.

One issue is this function which overwrites the old symbol value with the new one. It assumes that all addresses are 32 bits by casting the PC pointer to a Word32 pointer. This causes, for example, R_X86_64_64 relocations to be done incorrectly if the upper 32 bits are needed.

Another issue is that the R_X86_64_32S relocations are done incorrectly, and I believe can't be done correctly with the current object loading method. This source describes the assumptions of the small (default) code model as:

the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.

(although for some reason the pointers for vfnmsub213sb seem 32 bit, albeit the source references gcc). However, when I run my program the segment created by the backend is Segment 4096 0x00000042001db000, which is well above 32 bit addresses. Hence, even the jump table is out of range. From what I see there isn't a way to ask Haskell to allocate a pinned byte array (generated through mallocPlainForeignPtrAlignedBytes) in a specific memory range, so I don't know how to solve the absolute 32 bit relocation issues.

ohad commented 6 years ago

@tmcdonnel , are you going to be at POPL next week?

ohad commented 6 years ago

@tmcdonell even (sorry, typing from my phone)

tmcdonell commented 6 years ago

@ohad I will not be at POPL this year ):

tmcdonell commented 6 years ago

@jasigal okay, I should (finally) have a chance to look into this more this afternoon. sorry for the delay 😓

I still think small is fine here; we just need the IP-relative jumps to be within ±2GB. Anything outside that is going to be a function call for us (e.g. to shared library functions like tanh, memset ...), which (if necessary) will go via the jump tables. Anyway, I'll investigate further...

ohad commented 6 years ago

@tmcdonell no worries --- I won't be there either this year (visa refused :( ) --- but @adscib will be there, so if you two think we should lobby the Haskell architects for some changes, we can try to do that ;)

tmcdonell commented 6 years ago

@jasigal could you see if this fix works for you? https://github.com/tmcdonell/accelerate-llvm/commit/77078de7241ddd2e5fd6620fbcea2cfdc02cc2a6

tmcdonell commented 6 years ago

eer, wait, never mind. that worked for me for a while then stopped \:

jasigal commented 6 years ago

Yeah, I can confirm this doesn't work for me ☹️

jasigal commented 6 years ago

@tmcdonell I agree that IP-relative jumps should be fine. Those ones, like the one for tanh, seem to work fine as is, it's the absolute relocations which want to reference memory addresses outside unsigned 32 bit range which I think are the issue.

jasigal commented 6 years ago

Using mmap to allocate memory explicitly in the lower 2GB of memory allows my program to run, see this commit jasigal/accelerate-llvm@ae5bf3e176383e9d4de0c98c48adac02d222a568 and the one before it on the fix/#409 branch. I'm going to run tests now to check if the changes to the relocation function don't mess anything up.

jasigal commented 6 years ago

Well, all the tests passed! (How long are they supposed to take anyway, it's ~20 minutes for the accelerate-llvm-native tests on my computer which has an 8 core i7)

tmcdonell commented 6 years ago

Yeah, the tests take ages, unfortunately. Last I checked I think it was spending all its time in either (a) generating the random data; or (b) running the reference implementations (which mostly just use lists). I would like to speed it up though...

I fixed this by setting relocation-model=dynamic-no-pic, which I think is more appropriate to how we are using the result. I pulled in your mmap patch too though (it might not be necessary to have both fixes, but if nothing else I think it's fair you get your name on a patch (: )

Thanks for your help on this!

jasigal commented 6 years ago

No worries, and thanks for the fix, everything seems to be working now! 😄

AccelerateHS / accelerate