[gcc] peephole2 to generate double load/stores not kicking in upstream ARC gcc - optimize LMBench bw_mem frd/fwr

vineetgarc commented 4 years ago

The LMBench memory bandwidth tests frd() and fwr() access 512 consecutive bytes (128 words) per loop iteration to compute memory subsystem bandwidth.

void fwr(iter_t iterations, void *cookie)       <-- mem write consecutive words [Report #4]
{
...
     register int *p = state->buf;

  p[0]= p[1]= p[2]= p[3]= p[4]= p[5]= p[6]=
  p[7]= p[8]= p[9]= p[10]= p[11]= p[12]=
  p[13]= p[14]= p[15]= p[16]= p[17]= p[18]=
  p[19]= p[20]= p[21]= p[22]= p[23]= p[24]=
...
  p[123]= p[124]= p[125]= p[126]= p[127]= 1;
  p += 128;
     }
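For experimenting outside LMBench, the access pattern can be reduced to a small self-contained sketch. This is not the LMBench source: the names `fwr_sketch`, `frd_sketch` and `CHUNK_WORDS` are hypothetical, and the unroll is cut to 8 words (instead of 128) purely to keep it short. The read side is only a guess at the shape of frd(), which is not quoted in this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of the bw_mem pattern: 8 words per chunk
   instead of the 128 words (512 bytes) used by the real benchmark. */
#define CHUNK_WORDS 8

/* Write side: store the constant 1 to consecutive words using a
   chained assignment, as fwr() does. */
void fwr_sketch(int *buf, size_t nwords)
{
    int *p = buf;
    int *end = buf + nwords;

    assert(nwords % CHUNK_WORDS == 0);   /* whole chunks only */
    while (p < end) {
        p[0] = p[1] = p[2] = p[3] = p[4] = p[5] = p[6] = p[7] = 1;
        p += CHUNK_WORDS;
    }
}

/* Read side (assumed shape): load consecutive words and sum them so
   the loads cannot be optimized away. */
int frd_sketch(const int *buf, size_t nwords)
{
    const int *p = buf;
    const int *end = buf + nwords;
    int sum = 0;

    assert(nwords % CHUNK_WORDS == 0);   /* whole chunks only */
    while (p < end) {
        sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7];
        p += CHUNK_WORDS;
    }
    return sum;
}
```

Compiling a file like this to assembly at -O2 and -Os should be enough to see whether adjacent word accesses are emitted as ST/LD or as STD/LDD, without building all of LMBench.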

At -O2, the code generated for fwr is the normal (boring) sequence of regular ST instructions (with both upstream gcc and the GNU 2020.03 release):

fwr:
...
.L83:
    st.as   1,[r2,127]
    st.as   1,[r2,126]
...
    st.as   1,[r2,64]
    st  1,[r2,252]
    st  1,[r2,248]
...
    st  1,[r2,4]
    st  1,[r2]

    add r2,r2,512   # p, p,
    cmp_s r3,r2    # lastone, p
    bhs @.L83

At -Os, gcc from the github fork enables store merging, coalescing two consecutive word stores (ST) into a single double-word store (STD):

.L53:
    brhi r0, r13, @.L52     #, p, lastone,

    mov_s   r2,1
    mov_s   r3,1
    std r2,[r0,8]
    std r2,[r0,16]
...
    std r2,[r0,248]
    std.as r2,[r0,64]
    std.as r2,[r0,66]   
...
    std.as r2,[r0,126]
    st  1,[r0,4]
    st  1,[r0]

    add r0,r0,512
    b_s @.L53

This improves memory write bandwidth by over 20%.
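What the merge means at the C level can be illustrated with a short sketch: two adjacent 32-bit stores of 1 leave the same bytes in memory as one 64-bit store of the combined value. The function names here are hypothetical, and `memcpy` stands in for the double-word store only to keep the example portable C; it says nothing about how gcc implements the transformation internally.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two separate 32-bit stores, like the pairs of ST instructions. */
void store_two_words(int32_t *p)
{
    p[0] = 1;
    p[1] = 1;
}

/* One 64-bit store covering both words. Both halves hold 1, so the
   result is the same on little-endian and big-endian targets. */
void store_one_double(int32_t *p)
{
    const uint64_t both = ((uint64_t)1 << 32) | 1u;
    memcpy(p, &both, sizeof both);
}

/* Returns 1 if both variants leave identical bytes in memory. */
int merged_store_matches(void)
{
    int32_t a[2] = {0, 0};
    int32_t b[2] = {0, 0};

    store_two_words(a);
    store_one_double(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

In the -Os listing above, r2 and r3 are both loaded with 1 before the std instructions, which appears to be the register pair holding that combined 64-bit value.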

Back in 2018 Claudiu pushed an ARC gcc patch which enabled peephole2 patterns for generating LDD/STD: [PATCH 4/6] [ARC] Add peephole rules to combine store/loads into double store/loads

However, it seems there is one more patch (in generic code), "[MAINLINE][HACK] Allow store merging using 64-bit std instructions.", which is not merged upstream, and without it the peephole doesn't kick in.

So, to summarize:

The LDD peephole doesn't work at all.
The LDD/STD peephole should be enabled for upstream gcc too.

claziss commented 4 years ago

Yeah, my speech about custom vs. mainline. The store merging is done with the help of the mod which I did to the upstream code, and it probably will not get in until I find another architecture which will benefit from it. LATE EDIT: I'll check if I can make it work without that hack ;)

claziss commented 1 year ago

The autovectorizer should take care of it.

foss-for-synopsys-dwc-arc-processors / toolchain, issue #309