Open sampsyo opened 5 years ago
Just in case this helps relate this to other things you've seen, we originally noticed this because depending on the global declarations in a more complicated example, we were seeing unexpected behavior frombsg_prinf
.
In particular, calls to bsg_printf
would print the print format specifier rather than the value itself depending on the (unused) globals declared. For example, DESTINATION_ID: %d
vs. DESTINATION_ID: 1
pending declaring two unused global structs in otherwise identical code.
As @sampsyo noted, it's interesting that in the non-working executable the LMA and VMA are different for those dmem sections. If I remember correctly that's only supposed to be the case for .text (because PCs are limited to 12(?) bits or so but we keep it in DRAM).
The LMA is the address where the loader actually puts the data for that section. The VMA is the address that the vcore will use to access that value. In this case they should absolutely be the same. I think this is very likely the problem you are running into.
This issue could displace where the string constant "xdy" is stored in memory causing a mismatch in where the linker places that string constant and where the vcore expects that string constant to be.
As to WHY this is happening; I'd need to get back to you. @vb000 maybe you could have a look too?
To fill in a little more color after further investigation:
@max-ruttenberg's hypothesis about accesses to string literals seemed sound, except for one thing: string literals in other contexts seem to work fine. But there's something else that lives in .rodata.dmem
: jump tables. The switch
statement in this example was getting compiled to some sort of inscrutable indirect jump through a table. (I verified this by changing the characters used in the case
s; the code for the function itself doesn't change at all, but some memory right at the beginning of .rodata.dmem
does. And changing seemingly-irrelevant case
s can indeed change the behavior of the program.)
So I think the string literals are the red herring in this mystery. The problem is switch
statements.
@sampsyo who would think a bug would be so cool?
I tried to reproduce the issue on standalone manycore (without the host) here. This program seems to be running fine with PROBLEM
defined or undefined. May be I'm working on a different version to that of yours. Which version of manycore is this observed on?
Also, there is an interesting linker optimization... uninitialized data (.bss) is not filled in the elf to reduce the the size of executable. Meaning, objdump doesn't dump .bss. bss is is expected to be initialized at load time or run time: https://stackoverflow.com/questions/610682/do-bss-section-zero-initialized-variables-occupy-space-in-elf-file
Having different VMA and LMA can definitely be the problem because, our loader assumes the load address of dmem data it sees in objdump's output is equal to the virtual address!
Hi, @vb000—thanks for looking into this! I indicated the F1 AMI ID and Bladerunner version (v3.0.6) above. AFAIK, every Bladerunner version maps to a specific manycore git commit.
Maybe it would be worth checking whether the program you compiled has the same suspicious out-of-sync address phenomenon as ours?
@vb000 I believe the loader in Cosim and F1 handle uninitialized data.
Hi @sampsyo, that's right -- in my version string is placed in dram address space.
Hi @max-ruttenberg, you mean cosim/f1 loader loads data to correct EVAs even when LMAs are slightly off due to the optimization?
@vb000 I don’t think that’s the optimization you are describing. Uninitialized data does not cause the LMA and VMA to differ. The optimization is done using the MEMSZ and FILESZ fields in the segments (not sections). The point of the optimization is to avoid storing a bunch of zero bytes in the object file.
@max-ruttenberg Oh right, the issue I mentioned is a quirk of objdump, which doesn't apply to cosim/f1 loader.
@max-ruttenberg Aside from the bss detail, I'd still like to understand the loader's behavior when VMA and LMA are different. Would the loader load the data to correct EVA?
@vb000 Why is this closed? Is it fixed?
The loader ALWAYS loads a segment to its Load Memory Address (LMA). In general, it doesn't even look at the VMA because, in fact, it shouldn't need to.
AFAIK there is only one scenario for the manycore in which the VMA and LMA should not match; segments with executable code have an LMA that places the code in DRAM but have a VMA with only (12?) bits.
Sorry, by mistake!
@max-ruttenberg So, the issue boils down to why VMA and LMA won't match, correct?
Not sure why LMA and VMA mismatch in this case (avoid storing non-initialized data?), but a few things noted:
Online postings talk about scenarios that differentiate between where stuff is loaded (LMA) versus where it is run (VMA); I.e. stuff could be loaded in a rom in a machine, and then at execution something copies it to the VMA. For text segment we have the 2^20 virtual text address space that is loaded into DRAM; the hardware effectively handles the copying dynamically via the icache. But for data, it sounds like our loader program is not handling the movement?
On Fri, Jul 26, 2019 at 10:02 AM max-ruttenberg notifications@github.com wrote:
@vb000 https://github.com/vb000 Why is this closed? Is it fixed?
The loader ALWAYS loads a segment to its Load Memory Address (LMA). In general, it doesn't even look at the VMA because, in fact, it shouldn't need to.
AFAIK there is only one scenario for the manycore in which the VMA and LMA should not match; segments with executable code have an LMA that places the code in DRAM but have a VMA with only (12?) bits.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bespoke-silicon-group/bsg_manycore/issues/111?email_source=notifications&email_token=AEFG5AGI2IEPTXNYKU5GTLDQBMU33A5CNFSM4IG3NA2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD25FL3A#issuecomment-515528172, or mute the thread https://github.com/notifications/unsubscribe-auth/AEFG5ABAK2PMCTEZ234YMP3QBMU33ANCNFSM4IG3NA2A .
On most systems the LMA and the VMA are supposed to be the same. Our system has a twist in that the VMA for code is a different address space from data. However, as far as I can tell, that has nothing to do with this situation. (I could be wrong though).
@sampsyo when I clone your posted code and try to compile I get /home/centos/bsg_bladerunner/bsg_manycore_86ef340/software/riscv-tools/riscv-install/lib/gcc/riscv32-unknown-elf/8.3.0/../../../../riscv32-unknown-elf/bin/ld: cannot find -l:crt.o
How do you guys build and run? Do I need to be in a specific directory?
Arg; that's always annoying. That object file is at ~/bsg_bladerunner/bsg_manycore_*/software/spmd/common/crt.o
, but it's not built by default. Try running this first (in the link-bug
directory):
make /home/centos/bsg_bladerunner/bsg_manycore_86ef340/software/spmd/common/crt.o
And then go ahead with make
and make run
.
As @sampsyo reckoned, there's something weird happening with jump tables; I confirmed this by comparing instruction traces of "ok" and "bad" versions of an even simpler test. Debugging this as a potential hw issue... will keep you posted.
Let’s add some jump table tests to the regression!
M
On Fri, Jul 26, 2019 at 3:23 PM Bandhav Veluri notifications@github.com wrote:
As @sampsyo https://github.com/sampsyo reckoned, there's something weird happening with jump tables; I confirmed this by comparing instruction traces of "ok" and "bad" versions of an even simpler test https://github.com/bespoke-silicon-group/bsg_manycore/tree/issue111/software/spmd/issue111. Debugging this as a potential hw issue... will keep you posted.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/bespoke-silicon-group/bsg_manycore/issues/111?email_source=notifications&email_token=AEFG5AHOANYDPYZZJHZYTJDQBN2ONA5CNFSM4IG3NA2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD253BRI#issuecomment-515616965, or mute the thread https://github.com/notifications/unsubscribe-auth/AEFG5AA6GHREV3OAP273TTDQBN2ONANCNFSM4IG3NA2A .
OK; I've root caused the problem. Here's some code from our linker script:
/* bss segment */
.sbss.dmem :
AT(LOADADDR(.sdata.dmem) + SIZEOF(.data.dmem))
{
*(.sbss .sbss.* .gnu.linkonce.sb.*)
*(.scommon)
} >DMEM_VMA
The issue is using SIZEOF
the previous section to determine the new LMA. This causes a mismatch because if .sbss.dmem
is aligned, the extra bytes are NOT included in the SIZEOF operator.
I replaced above with the code below and then the program ran as expected.
/* bss segment */
.sbss.dmem :
AT(LOADADDR(.sdata.dmem) + (ADDR(.sbss.dmem) - ADDR(.sdata.dmem)))
{
*(.sbss .sbss.* .gnu.linkonce.sb.*)
*(.scommon)
} >DMEM_VMA
Perfect!!
@max-ruttenberg Could you push the fix?
On retrospection though, the disassembly of int64_t yyy
version is skewed:
Disassembly of section .sbss.dmem:
00001028 <yyy>:
...
Disassembly of section .rodata.dmem:
00001030 <_bsg_data_end_addr-0x88>:
1030: 0164 addi x9,x2,140
1032: 0000 unimp
1034: 015c addi x15,x2,132
1036: 0000 unimp
1038: 015c addi x15,x2,132
103a: 0000 unimp
103c: 015c addi x15,x2,132
103e: 0000 unimp
1040: 015c addi x15,x2,132
Regret having ignored this and paid the price with quite a bit of time :(
@vb000 Do you understand the issue?
Yes, as is visible in the snip I attached in my latest comment, .rodata.dmem
section is encroaching yyy
's space due to the fact that alignment is ignored in link script.
I'll push a fix shortly...
Let's talk after this meeting?
Can we make a regression test for this too?
Fixed?
Hi, folks—this is a weird one. We (@avanhatt and I) ran into this using our F1 instance for v3.0.6, which is
i-038b396c43b1e06e1
.The symptom is that a pretty small function executes incorrectly if we add a (totally unused!) 64-bit global. It executes normally if the global is missing, or if it's a 32-bit number. Here's a self-contained test program, including a Makefile, a host driver, and the complete device code. The problem function is this, which is a vastly C-Reduced extraction from a function that feeds into
bsg_printf
: https://github.com/cucapra/hb-examples/blob/476543052b62aeef6b6e0eb8c8cdc2f319d4931e/link-bug/problem.cThat function returns different values depending on whether this global is declared as an
int32_t
or anint64_t
: https://github.com/cucapra/hb-examples/blob/476543052b62aeef6b6e0eb8c8cdc2f319d4931e/link-bug/device.c#L14-L18Because of the dependency on global declarations, we suspected a linker problem. But investigating it was inconclusive. Here's the code for the problem function in the final, linked executable when we include a 32-bit constant (i.e., when everything works fine):
And here it is when we include a 64-bit constant (i.e., when the function returns the wrong value):
The two executable versions are identical, except for this one
addi
instruction that gets filled in by the linker:But that seems right because 0x1058 and 0x1060 are the start address for the
.rodata.dmem
section in each executable. With a 32-bit constant:And with a 64-bit constant:
That is, the
.sbss.dmem
section is 4 bytes bigger in the second case, which is what you'd expect with a constant that's twice the size.One thing that seems odd to me, however, is the alignment specification for that
.sbss.dmem
section. It has 4-byte alignment for the first case and 8-byte alignment in the second case. I don't know why that would matter, but it probably explains why the start address moves from 0x103c (which is not 8-byte-aligned) to 0x1040 (which is).It is probably also worth noting that, in the "OK" version, virtual memory address (VMA) == load memory address (LMA) == file offset. In the "bad" version, the LMA is different for the third and fourth sections, which seems bad.
I'm worried that the linker thing may be a red herring, and we may be looking at a real hardware bug, perhaps to do with alignment. But I thought I'd ask if anything jumped out at y'all.