bespoke-silicon-group / bsg_manycore

Tile based architecture designed for computing efficiency, scalability and generality
Other
229 stars 58 forks source link

Incorrect execution based on the presence of a global variable, possibly due to linking or alignment #111

Open sampsyo opened 5 years ago

sampsyo commented 5 years ago

Hi, folks—this is a weird one. We (@avanhatt and I) ran into this using our F1 instance for v3.0.6, which is i-038b396c43b1e06e1.

The symptom is that a pretty small function executes incorrectly if we add a (totally unused!) 64-bit global. It executes normally if the global is missing, or if it's a 32-bit number. Here's a self-contained test program, including a Makefile, a host driver, and the complete device code. The problem function is this, which is a vastly C-Reduced extraction from a function that feeds into bsg_printf: https://github.com/cucapra/hb-examples/blob/476543052b62aeef6b6e0eb8c8cdc2f319d4931e/link-bug/problem.c

That function returns different values depending on whether this global is declared as an int32_t or an int64_t: https://github.com/cucapra/hb-examples/blob/476543052b62aeef6b6e0eb8c8cdc2f319d4931e/link-bug/device.c#L14-L18

Because of the dependency on global declarations, we suspected a linker problem. But investigating it was inconclusive. Here's the code for the problem function in the final, linked executable when we include a 32-bit constant (i.e., when everything works fine):

   0x0000035c <+0>: 37 17 00 80 lui a4,0x80001
   0x00000360 <+4>: 13 07 07 9c addi    a4,a4,-1600 # 0x800009c0
   0x00000364 <+8>: 13 05 00 00 li  a0,0
   0x00000368 <+12>:    13 06 70 01 li  a2,23
   0x0000036c <+16>:    b7 15 00 00 lui a1,0x1
   0x00000370 <+20>:    93 85 85 05 addi    a1,a1,88 # 0x1058
   0x00000374 <+24>:    6f 00 80 01 j   0x38c <ee_vsprintf+48>
   0x00000378 <+28>:    33 05 a7 00 add a0,a4,a0
   0x0000037c <+32>:    13 75 f5 0f andi    a0,a0,255
   0x00000380 <+36>:    13 07 17 00 addi    a4,a4,1
   0x00000384 <+40>:    83 47 07 00 lbu a5,0(a4)
   0x00000388 <+44>:    63 82 07 02 beqz    a5,0x3ac <ee_vsprintf+80>
   0x0000038c <+48>:    83 47 07 00 lbu a5,0(a4)
   0x00000390 <+52>:    93 87 f7 f9 addi    a5,a5,-97
   0x00000394 <+56>:    93 f6 f7 0f andi    a3,a5,255
   0x00000398 <+60>:    e3 60 d6 fe bltu    a2,a3,0x378 <ee_vsprintf+28>
   0x0000039c <+64>:    93 97 26 00 slli    a5,a3,0x2
   0x000003a0 <+68>:    b3 87 b7 00 add a5,a5,a1
   0x000003a4 <+72>:    83 a7 07 00 lw  a5,0(a5)
   0x000003a8 <+76>:    67 80 07 00 jr  a5
   0x000003ac <+80>:    67 80 00 00 ret

And here it is when we include a 64-bit constant (i.e., when the function returns the wrong value):

   0x0000035c <+0>: 37 17 00 80 lui a4,0x80001
   0x00000360 <+4>: 13 07 07 9c addi    a4,a4,-1600 # 0x800009c0
   0x00000364 <+8>: 13 05 00 00 li  a0,0
   0x00000368 <+12>:    13 06 70 01 li  a2,23
   0x0000036c <+16>:    b7 15 00 00 lui a1,0x1
   0x00000370 <+20>:    93 85 05 06 addi    a1,a1,96 # 0x1060
   0x00000374 <+24>:    6f 00 80 01 j   0x38c <ee_vsprintf+48>
   0x00000378 <+28>:    33 05 a7 00 add a0,a4,a0
   0x0000037c <+32>:    13 75 f5 0f andi    a0,a0,255
   0x00000380 <+36>:    13 07 17 00 addi    a4,a4,1
   0x00000384 <+40>:    83 47 07 00 lbu a5,0(a4)
   0x00000388 <+44>:    63 82 07 02 beqz    a5,0x3ac <ee_vsprintf+80>
   0x0000038c <+48>:    83 47 07 00 lbu a5,0(a4)
   0x00000390 <+52>:    93 87 f7 f9 addi    a5,a5,-97
   0x00000394 <+56>:    93 f6 f7 0f andi    a3,a5,255
   0x00000398 <+60>:    e3 60 d6 fe bltu    a2,a3,0x378 <ee_vsprintf+28>
   0x0000039c <+64>:    93 97 26 00 slli    a5,a3,0x2
   0x000003a0 <+68>:    b3 87 b7 00 add a5,a5,a1
   0x000003a4 <+72>:    83 a7 07 00 lw  a5,0(a5)
   0x000003a8 <+76>:    67 80 07 00 jr  a5
   0x000003ac <+80>:    67 80 00 00 ret

The two executable versions are identical, except for this one addi instruction that gets filled in by the linker:

7c7
<    0x00000370 <+20>:  93 85 85 05 addi    a1,a1,88 # 0x1058
---
>    0x00000370 <+20>:  93 85 05 06 addi    a1,a1,96 # 0x1060

But that seems right because 0x1058 and 0x1060 are the start address for the .rodata.dmem section in each executable. With a 32-bit constant:

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .data.dmem    00000018  00001000  00001000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  1 .sdata.dmem   00000024  00001018  00001018  00001018  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .sbss.dmem    0000001c  0000103c  0000103c  0000103c  2**2
                  ALLOC
  3 .rodata.dmem  000000e4  00001058  00001058  00001058  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
...

And with a 64-bit constant:

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .data.dmem    00000018  00001000  00001000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  1 .sdata.dmem   00000024  00001018  00001018  00001018  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .sbss.dmem    00000020  00001040  0000103c  00001040  2**3
                  ALLOC
  3 .rodata.dmem  000000e4  00001060  0000105c  00001060  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
...

That is, the .sbss.dmem section is 4 bytes bigger in the second case, which is what you'd expect with a constant that's twice the size.

One thing that seems odd to me, however, is the alignment specification for that .sbss.dmem section. It has 4-byte alignment for the first case and 8-byte alignment in the second case. I don't know why that would matter, but it probably explains why the start address moves from 0x103c (which is not 8-byte-aligned) to 0x1040 (which is).

It is probably also worth noting that, in the "OK" version, virtual memory address (VMA) == load memory address (LMA) == file offset. In the "bad" version, the LMA is different for the third and fourth sections, which seems bad.

I'm worried that the linker thing may be a red herring, and we may be looking at a real hardware bug, perhaps to do with alignment. But I thought I'd ask if anything jumped out at y'all.

avanhatt commented 5 years ago

Just in case this helps relate this to other things you've seen, we originally noticed this because depending on the global declarations in a more complicated example, we were seeing unexpected behavior frombsg_prinf.

In particular, calls to bsg_printf would print the print format specifier rather than the value itself depending on the (unused) globals declared. For example, DESTINATION_ID: %d vs. DESTINATION_ID: 1 pending declaring two unused global structs in otherwise identical code.

mrutt92 commented 5 years ago

As @sampsyo noted, it's interesting that in the non-working executable the LMA and VMA are different for those dmem sections. If I remember correctly that's only supposed to be the case for .text (because PCs are limited to 12(?) bits or so but we keep it in DRAM).

The LMA is the address where the loader actually puts the data for that section. The VMA is the address that the vcore will use to access that value. In this case they should absolutely be the same. I think this is very likely the problem you are running into.

This issue could displace where the string constant "xdy" is stored in memory causing a mismatch in where the linker places that string constant and where the vcore expects that string constant to be.

As to WHY this is happening; I'd need to get back to you. @vb000 maybe you could have a look too?

sampsyo commented 5 years ago

To fill in a little more color after further investigation:

@max-ruttenberg's hypothesis about accesses to string literals seemed sound, except for one thing: string literals in other contexts seem to work fine. But there's something else that lives in .rodata.dmem: jump tables. The switch statement in this example was getting compiled to some sort of inscrutable indirect jump through a table. (I verified this by changing the characters used in the cases; the code for the function itself doesn't change at all, but some memory right at the beginning of .rodata.dmem does. And changing seemingly-irrelevant cases can indeed change the behavior of the program.)

So I think the string literals are the red herring in this mystery. The problem is switch statements.

mrutt92 commented 5 years ago

@sampsyo who would think a bug would be so cool?

vb000 commented 5 years ago

I tried to reproduce the issue on standalone manycore (without the host) here. This program seems to be running fine with PROBLEM defined or undefined. May be I'm working on a different version to that of yours. Which version of manycore is this observed on?

vb000 commented 5 years ago

Also, there is an interesting linker optimization... uninitialized data (.bss) is not filled in the elf to reduce the the size of executable. Meaning, objdump doesn't dump .bss. bss is is expected to be initialized at load time or run time: https://stackoverflow.com/questions/610682/do-bss-section-zero-initialized-variables-occupy-space-in-elf-file

vb000 commented 5 years ago

Having different VMA and LMA can definitely be the problem because, our loader assumes the load address of dmem data it sees in objdump's output is equal to the virtual address!

sampsyo commented 5 years ago

Hi, @vb000—thanks for looking into this! I indicated the F1 AMI ID and Bladerunner version (v3.0.6) above. AFAIK, every Bladerunner version maps to a specific manycore git commit.

Maybe it would be worth checking whether the program you compiled has the same suspicious out-of-sync address phenomenon as ours?

mrutt92 commented 5 years ago

@vb000 I believe the loader in Cosim and F1 handle uninitialized data.

vb000 commented 5 years ago

Hi @sampsyo, that's right -- in my version string is placed in dram address space.

Hi @max-ruttenberg, you mean cosim/f1 loader loads data to correct EVAs even when LMAs are slightly off due to the optimization?

mrutt92 commented 5 years ago

@vb000 I don’t think that’s the optimization you are describing. Uninitialized data does not cause the LMA and VMA to differ. The optimization is done using the MEMSZ and FILESZ fields in the segments (not sections). The point of the optimization is to avoid storing a bunch of zero bytes in the object file.

vb000 commented 5 years ago

@max-ruttenberg Oh right, the issue I mentioned is a quirk of objdump, which doesn't apply to cosim/f1 loader.

vb000 commented 5 years ago

@max-ruttenberg Aside from the bss detail, I'd still like to understand the loader's behavior when VMA and LMA are different. Would the loader load the data to correct EVA?

mrutt92 commented 5 years ago

@vb000 Why is this closed? Is it fixed?

The loader ALWAYS loads a segment to its Load Memory Address (LMA). In general, it doesn't even look at the VMA because, in fact, it shouldn't need to.

AFAIK there is only one scenario for the manycore in which the VMA and LMA should not match; segments with executable code have an LMA that places the code in DRAM but have a VMA with only (12?) bits.

vb000 commented 5 years ago

Sorry, by mistake!

@max-ruttenberg So, the issue boils down to why VMA and LMA won't match, correct?

taylor-bsg commented 5 years ago

Not sure why LMA and VMA mismatch in this case (avoid storing non-initialized data?), but a few things noted:

Online postings talk about scenarios that differentiate between where stuff is loaded (LMA) versus where it is run (VMA); I.e. stuff could be loaded in a rom in a machine, and then at execution something copies it to the VMA. For text segment we have the 2^20 virtual text address space that is loaded into DRAM; the hardware effectively handles the copying dynamically via the icache. But for data, it sounds like our loader program is not handling the movement?

On Fri, Jul 26, 2019 at 10:02 AM max-ruttenberg notifications@github.com wrote:

@vb000 https://github.com/vb000 Why is this closed? Is it fixed?

The loader ALWAYS loads a segment to its Load Memory Address (LMA). In general, it doesn't even look at the VMA because, in fact, it shouldn't need to.

AFAIK there is only one scenario for the manycore in which the VMA and LMA should not match; segments with executable code have an LMA that places the code in DRAM but have a VMA with only (12?) bits.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bespoke-silicon-group/bsg_manycore/issues/111?email_source=notifications&email_token=AEFG5AGI2IEPTXNYKU5GTLDQBMU33A5CNFSM4IG3NA2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD25FL3A#issuecomment-515528172, or mute the thread https://github.com/notifications/unsubscribe-auth/AEFG5ABAK2PMCTEZ234YMP3QBMU33ANCNFSM4IG3NA2A .

mrutt92 commented 5 years ago

On most systems the LMA and the VMA are supposed to be the same. Our system has a twist in that the VMA for code is a different address space from data. However, as far as I can tell, that has nothing to do with this situation. (I could be wrong though).

mrutt92 commented 5 years ago

@sampsyo when I clone your posted code and try to compile I get /home/centos/bsg_bladerunner/bsg_manycore_86ef340/software/riscv-tools/riscv-install/lib/gcc/riscv32-unknown-elf/8.3.0/../../../../riscv32-unknown-elf/bin/ld: cannot find -l:crt.o How do you guys build and run? Do I need to be in a specific directory?

sampsyo commented 5 years ago

Arg; that's always annoying. That object file is at ~/bsg_bladerunner/bsg_manycore_*/software/spmd/common/crt.o, but it's not built by default. Try running this first (in the link-bug directory):

make /home/centos/bsg_bladerunner/bsg_manycore_86ef340/software/spmd/common/crt.o

And then go ahead with make and make run.

vb000 commented 5 years ago

As @sampsyo reckoned, there's something weird happening with jump tables; I confirmed this by comparing instruction traces of "ok" and "bad" versions of an even simpler test. Debugging this as a potential hw issue... will keep you posted.

taylor-bsg commented 5 years ago

Let’s add some jump table tests to the regression!

M

On Fri, Jul 26, 2019 at 3:23 PM Bandhav Veluri notifications@github.com wrote:

As @sampsyo https://github.com/sampsyo reckoned, there's something weird happening with jump tables; I confirmed this by comparing instruction traces of "ok" and "bad" versions of an even simpler test https://github.com/bespoke-silicon-group/bsg_manycore/tree/issue111/software/spmd/issue111. Debugging this as a potential hw issue... will keep you posted.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/bespoke-silicon-group/bsg_manycore/issues/111?email_source=notifications&email_token=AEFG5AHOANYDPYZZJHZYTJDQBN2ONA5CNFSM4IG3NA2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD253BRI#issuecomment-515616965, or mute the thread https://github.com/notifications/unsubscribe-auth/AEFG5AA6GHREV3OAP273TTDQBN2ONANCNFSM4IG3NA2A .

mrutt92 commented 5 years ago

OK; I've root caused the problem. Here's some code from our linker script:

  /* bss segment */
  .sbss.dmem : 
  AT(LOADADDR(.sdata.dmem) + SIZEOF(.data.dmem))
  {
    *(.sbss .sbss.* .gnu.linkonce.sb.*)
    *(.scommon)
  } >DMEM_VMA 

The issue is using SIZEOF the previous section to determine the new LMA. This causes a mismatch because if .sbss.dmem is aligned, the extra bytes are NOT included in the SIZEOF operator.

I replaced above with the code below and then the program ran as expected.

  /* bss segment */
  .sbss.dmem : 
  AT(LOADADDR(.sdata.dmem) + (ADDR(.sbss.dmem) - ADDR(.sdata.dmem)))
  {
    *(.sbss .sbss.* .gnu.linkonce.sb.*)
    *(.scommon)
  } >DMEM_VMA 
sampsyo commented 5 years ago

Perfect!!

vb000 commented 5 years ago

@max-ruttenberg Could you push the fix?

vb000 commented 5 years ago

On retrospection though, the disassembly of int64_t yyy version is skewed:

Disassembly of section .sbss.dmem:

00001028 <yyy>:
        ...

Disassembly of section .rodata.dmem:

00001030 <_bsg_data_end_addr-0x88>:
    1030:       0164                    addi    x9,x2,140
    1032:       0000                    unimp
    1034:       015c                    addi    x15,x2,132
    1036:       0000                    unimp
    1038:       015c                    addi    x15,x2,132
    103a:       0000                    unimp
    103c:       015c                    addi    x15,x2,132
    103e:       0000                    unimp
    1040:       015c                    addi    x15,x2,132

Regret having ignored this and paid the price with quite a bit of time :(

drichmond commented 5 years ago

@vb000 Do you understand the issue?

vb000 commented 5 years ago

Yes, as is visible in the snip I attached in my latest comment, .rodata.dmem section is encroaching yyy's space due to the fact that alignment is ignored in link script.

I'll push a fix shortly...

drichmond commented 5 years ago

Let's talk after this meeting?

drichmond commented 5 years ago

Can we make a regression test for this too?

drichmond commented 5 years ago

Fixed?