Linux "data_race" macro triggers multiplication of memcpy() in vm_area_dup()

abrodkin commented 2 years ago

Consider the vm_area_dup() function in the Linux kernel (https://elixir.bootlin.com/linux/v5.16/source/kernel/fork.c#L354):

struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
    struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);

    if (new) {
        *new = data_race(*orig);
        INIT_LIST_HEAD(&new->anon_vma_chain);
        new->vm_next = new->vm_prev = NULL;
    }
    return new;
}

The real "meat" here is *new = *orig; which copies the contents of one vm_area structure into another with the help of memcpy(). That is exactly how it works if the data_race() macro (see how it's implemented here: https://elixir.bootlin.com/linux/v5.16/source/include/linux/compiler.h#L214) is removed. With the macro present, however, 3 extra memcpy() invocations appear for some reason. What's more, each of them acts on a different buffer of the same size, and those buffers are located one after another.

That's what I see in disassembly w/o the macro:

vm_area_dup:
 fb4: 1cfcb3c8                 push    %r15
 fb8: 1cfcb388                 push    %r14
 fbc: 1cfcb7c8                 push    %blink
 fc0: 4708                     mov_s   %r15,%r0
 fc2: 41c3 00000cc0            mov_s   %r1,0xcc0       ; 0xcc0 = mmdrop_async_fn+0xfc
 fc8: 16007000 00000000r       ld      %r0,[0]         ; vm_area_cachep
 fd0: 08020000r                bl      kmem_cache_alloc
 fd4: 260a9000                 mov.f   %r14,%r0
 fd8: f20d                     bz_s    0xff2 = vm_area_dup+0x3e
 fda: da5c                     mov_s   %r2,92
 fdc: 08020020r                bl.d    memcpy
 fe0: 41e1                     mov_s   %r1,%r15
 fe2: 1e031019                 st      0,[%r14,12]
 fe6: 26561200                 add3    %r0,%r14,8
 fea: 1e021019                 st      0,[%r14,8]
 fee: a610                     st_s    %r0,[%r14,64]
 ff0: a611                     st_s    %r0,[%r14,68]
 ff2: 1404341f                 pop     %blink
 ff6: 40c1                     mov_s   %r0,%r14
 ff8: 1404340e                 pop     %r14
 ffc: 7fe0                     j_s.d   [%blink]
 ffe: 1404340f                 pop     %r15
1002: 78e0                     nop_s

And that's with the macro:

vm_area_dup:
 bac: 1cfcb3c8                 push    %r15
 bb0: 1cfcb388                 push    %r14
 bb4: 1cfcb7c8                 push    %blink
 bb8: 24953efe                 add2    %sp,%sp,-69
 bbc: 4708                     mov_s   %r15,%r0
 bbe: 41c3 00000cc0            mov_s   %r1,0xcc0       ; 0xcc0 = dup_mm.isra.0+0x94
 bc4: 16007000 00000000r       ld      %r0,[0]         ; vm_area_cachep
 bcc: 08020000r                bl      kmem_cache_alloc
 bd0: 260a9000                 mov.f   %r14,%r0
 bd4: f221                     bz_s    0xc16 = vm_area_dup+0x6a
 bd6: da5c                     mov_s   %r2,92
 bd8: 41e1                     mov_s   %r1,%r15
 bda: 08020020r                bl.d    memcpy
 bde: 245535c0                 add2    %r0,%sp,23
 be2: da5c                     mov_s   %r2,92
 be4: 245535c1                 add2    %r1,%sp,23
 be8: 08020020r                bl.d    memcpy
 bec: 4083                     mov_s   %r0,%sp
 bee: da5c                     mov_s   %r2,92
 bf0: 4183                     mov_s   %r1,%sp
 bf2: 08020020r                bl.d    memcpy
 bf6: 245635c0                 add3    %r0,%sp,23
 bfa: da5c                     mov_s   %r2,92
 bfc: 245635c1                 add3    %r1,%sp,23
 c00: 08020020r                bl.d    memcpy
 c04: 40c1                     mov_s   %r0,%r14
 c06: 1e031019                 st      0,[%r14,12]
 c0a: 26561200                 add3    %r0,%r14,8
 c0e: 1e021019                 st      0,[%r14,8]
 c12: a610                     st_s    %r0,[%r14,64]
 c14: a611                     st_s    %r0,[%r14,68]
 c16: 24953141                 add2    %sp,%sp,69
 c1a: 40c1                     mov_s   %r0,%r14
 c1c: 1404341f                 pop     %blink
 c20: 1404340e                 pop     %r14
 c24: 7fe0                     j_s.d   [%blink]
 c26: 1404340f                 pop     %r15
 c2a: 78e0                     nop_s
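A note on reading the listing above: on ARC, add2 a,b,c computes b + (c << 2) and add3 a,b,c computes b + (c << 3), and bl.d executes the following instruction (the delay slot) before the call lands, so %r0 for each memcpy() is set in the line after the bl.d. The small C sketch below just redoes that arithmetic; the helper names are made up for illustration and are not part of any toolchain.

```c
#include <stdint.h>

/* Size passed to every memcpy() in the listing (mov_s %r2,92). */
enum { COPY_SIZE = 92 };

/* ARC scaled adds: add2 yields b + (c << 2), add3 yields b + (c << 3).
 * Multiplication is used so that negative operands stay well defined in C. */
int32_t arc_add2(int32_t b, int32_t c) { return b + c * 4; }
int32_t arc_add3(int32_t b, int32_t c) { return b + c * 8; }

/* Bytes reserved on the stack by "add2 %sp,%sp,-69". */
int32_t frame_bytes(void) { return -arc_add2(0, -69); }

/* Offset from %sp produced by "add2 %rN,%sp,23". */
int32_t add2_buf_offset(void) { return arc_add2(0, 23); }

/* Offset from %sp produced by "add3 %rN,%sp,23". */
int32_t add3_buf_offset(void) { return arc_add3(0, 23); }
```

Under that reading the frame is 276 bytes, i.e. exactly three 92-byte buffers at %sp+92, %sp+0 and %sp+184, and the four calls copy orig to %sp+92, %sp+92 to %sp, %sp to %sp+184, and %sp+184 to new.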

Any ideas on what's going on here? This is especially interesting given the comment that accompanies the macro:

This macro does not affect normal code generation, but is a hint to tooling that data races here are to be ignored.
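For experimenting outside the kernel, here is a minimal sketch of the pattern. It assumes that, with KCSAN disabled, data_race() boils down to nested GNU statement expressions around the expression; data_race_sketch() below is a simplified approximation (plain __typeof__ instead of the kernel's __unqual_scalar_typeof(), no __kcsan_*() calls), and struct area is a hypothetical 92-byte stand-in for struct vm_area_struct. See the compiler.h link above for the real definition. It needs GCC or Clang because of the statement expressions.

```c
#include <string.h>

/* Hypothetical stand-in for struct vm_area_struct: 92 bytes, the size
 * seen in the memcpy() calls of the disassembly. */
struct area {
    unsigned char bytes[92];
};

/* Simplified approximation of data_race() with KCSAN disabled.
 * Each ({ ... }) yields its last expression by value, so for a struct
 * operand the compiler is free to materialize a temporary per level. */
#define data_race_sketch(expr)                          \
({                                                      \
    __typeof__(({ expr; })) __v = ({ expr; });          \
    __v;                                                \
})

/* Direct structure assignment: a single copy from src to dst. */
void dup_plain(struct area *dst, struct area *src)
{
    *dst = *src;
}

/* The same assignment routed through the macro sketch. The result is
 * identical; only the number of intermediate copies may differ. */
void dup_macro(struct area *dst, struct area *src)
{
    *dst = data_race_sketch(*src);
}
```

Comparing the code generated for dup_plain() and dup_macro() at -O3 should show whether the statement-expression temporaries survive optimization on a given target. Whether this is really what produces the three stack buffers on ARC is a guess, not something verified here.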

To reproduce the problem outside the Linux source tree, use the attached fork.i and compile it with arc64-linux-gcc -mcpu=hs5x -c -O3 -o fork.o fork.i, then inspect the body of vm_area_dup().

fork.zip

abrodkin commented 2 years ago

For the record, the same is easily reproduced for ARCv2 without 64-bit loads/stores, i.e. without -mll64. (With 64-bit loads/stores enabled, the data is moved inline with ldd/std instructions instead of calls to memcpy().)

vm_area_dup:
 acc: c0f1                     push_s  %blink
 ace: c6e1                     push_s  %r14
 ad0: c5e1                     push_s  %r13
 ad2: 24823504                 sub     %sp,%sp,0x114
 ad6: 41c3 00000cc0            mov_s   %r1,0xcc0       ; 0xcc0 = dup_mm.isra.0+0x180
 adc: 4608                     mov_s   %r14,%r0
 ade: 16007000 00000000r       ld      %r0,[0]         ; vm_area_cachep
 ae6: 08020000r                bl      kmem_cache_alloc
 aea: 250a9000                 mov.f   %r13,%r0
 aee: f220                     bz_s    0xb2c = vm_area_dup+0x60
 af0: 41c1                     mov_s   %r1,%r14
 af2: da5c                     mov_s   %r2,92
 af4: 08020020r                bl.d    memcpy
 af8: c097                     add_s   %r0,%sp,92
 afa: c197                     add_s   %r1,%sp,92
 afc: da5c                     mov_s   %r2,92
 afe: 08020020r                bl.d    memcpy
 b02: 4083                     mov_s   %r0,%sp
 b04: 4183                     mov_s   %r1,%sp
 b06: da5c                     mov_s   %r2,92
 b08: 08020020r                bl.d    memcpy
 b0c: 245635c0                 add3    %r0,%sp,23
 b10: da5c                     mov_s   %r2,92
 b12: 245635c1                 add3    %r1,%sp,23
 b16: 08020020r                bl.d    memcpy
 b1a: 40a1                     mov_s   %r0,%r13
 b1c: 25561202                 add3    %r2,%r13,8
 b20: a550                     st_s    %r2,[%r13,64]
 b22: a551                     st_s    %r2,[%r13,68]
 b24: 1d0c1001                 st      0,[%r13,12]
 b28: 1d081001                 st      0,[%r13,8]
 b2c: 40a1                     mov_s   %r0,%r13
 b2e: 24803504                 add     %sp,%sp,0x114
 b32: 1408301f                 ld      %blink,[%sp,8]
 b36: c5c1                     pop_s   %r13
 b38: 7fe0                     j_s.d   [%blink]
 b3a: 1408340e                 ld.ab   %r14,[%sp,8]
 b3e: 78e0                     nop_s

Pre-built ARC GNU toolchain 2021.09:

arc-elf32-gcc --version
arc-elf32-gcc (ARCompact/ARCv2 ISA elf32 toolchain - build 965) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.

shahab-vahedi commented 2 years ago

An observation: each of the 3 extra memcpy() calls is using the same address as source and destination.

abrodkin commented 2 years ago

@shahab-vahedi well, looking at a real execution trace, this is what I can reconstruct:

memcpy(dest = 0x81057da4, src = 0x812f82e0, size = 0x5c = 92)
memcpy(dest = 0x81057d48, src = 0x81057da4, size = 0x5c = 92)
memcpy(dest = 0x81057e00, src = 0x81057d48, size = 0x5c = 92)
memcpy(dest = 0x812f8d4c, src = 0x81057e00, size = 0x5c = 92)

So it's quite an interesting arrangement ;)
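The trace can be checked mechanically. The sketch below (struct and function names are made up for illustration) encodes the four calls exactly as listed and tests two things: that every call reads the buffer the previous call wrote, and how the three intermediate buffers are spaced.

```c
#include <stddef.h>
#include <stdint.h>

/* One memcpy() call from the reconstructed trace. */
struct copy_call {
    uint32_t dest;
    uint32_t src;
    uint32_t size;
};

enum { TRACE_LEN = 4 };

/* Values copied verbatim from the trace above. */
const struct copy_call trace_calls[TRACE_LEN] = {
    { 0x81057da4u, 0x812f82e0u, 0x5c },
    { 0x81057d48u, 0x81057da4u, 0x5c },
    { 0x81057e00u, 0x81057d48u, 0x5c },
    { 0x812f8d4cu, 0x81057e00u, 0x5c },
};

/* Returns 1 when each call's source is the previous call's destination
 * and the size never changes, i.e. the data is relayed hop by hop. */
int is_chain(const struct copy_call *calls, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (calls[i].src != calls[i - 1].dest ||
            calls[i].size != calls[i - 1].size)
            return 0;
    }
    return 1;
}
```

The calls do form a chain, so source and destination differ in every call. The intermediate buffers sit at 0x81057d48 plus 0x5c, 0x0 and 0xb8, i.e. three back-to-back 92-byte slots, which is consistent with the 0x114-byte stack frame in the disassembly.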

claziss commented 1 year ago

I do not understand what the issue is here. I do not see any issue related to the compiler. vm_area_dup() duplicates a structure and thus makes use of the memcpy routine. Why the authors of vm_area_dup() do it this way, I wouldn't know; the best approach is to ask them. I'll close this. Please reopen it if you see any issue with the compiler.

foss-for-synopsys-dwc-arc-processors / toolchain, issue #468