damien-lemoal / riscv64-nommu-buildroot

Better memory efficiency with multiple busybox instances #3

Closed laanwj closed 3 years ago

laanwj commented 4 years ago

I think it would be desirable to not use -r (force load to RAM) on the elf2flt build to make it possible for multiple instances of a binary (say, busybox) to share a .text section and thus conserve memory. With -r each invocation copies the entire binary, text and data.
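
To put rough numbers on the saving, here is a minimal sketch of the arithmetic. The function name and any sizes passed to it are hypothetical; it only models the statement above (with -r every instance carries its own text and data, without it the text would be counted once) and ignores stack, heap and allocator overhead.

```c
#include <stddef.h>

/* Hypothetical helper: RAM used by n running instances of one flat binary.
 * shared_text == 0: load-to-RAM (-r), each instance copies text and data.
 * shared_text != 0: text is counted once, only data is per instance. */
size_t flat_ram_usage(size_t text, size_t data, size_t n, int shared_text)
{
    if (n == 0)
        return 0;
    return shared_text ? text + n * data : n * (text + data);
}
```

For example, with a made-up 400 KiB text and 100 KiB data, three instances would need 1500 KiB with -r and 700 KiB with a shared text.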

However, the -r option to elf2flt is currently required. Building without it, binaries crash at the beginning of __uClibc_main because the _GLOBAL_OFFSET_TABLE_ address is not correct.

I've tried looking into this but wasn't able to figure out how to make this work. The problem is that on RISC-V, the global offset table is assumed to be in a location relative to the code, part of the .text section itself:

0000000000012ba4 <__uClibc_main>:
   12ba4:       7169                    addi    sp,sp,-304
   12ba6:       00005717                auipc   a4,0x5
   12baa:       5aa73703                ld      a4,1450(a4) # 18150 <_GLOBAL_OFFSET_TABLE_+0x18>

As I see it, no amount of relocations can make this work without an MMU (and with an MMU we could just use ELF). I wonder how other architectures do this. I suppose something specific to the RISC-V ABI's handling of the GOT makes this impossible.
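
The address arithmetic of the auipc/ld pair in the disassembly above can be sketched as follows. This is only an illustration; the function name is hypothetical, and the immediates are the ones shown in the listing (auipc at 0x12ba6 with 0x5, ld with offset 1450).

```c
#include <stdint.h>

/* Address read by "auipc rd, hi20" located at pc, followed by
 * "ld rd, lo12(rd)". Both immediates are sign-extended as in RISC-V. */
uint64_t auipc_ld_target(uint64_t pc, uint32_t hi20, int32_t lo12)
{
    /* Sign-extend the 20-bit upper immediate, then scale by 4 KiB. */
    int64_t upper = (((int64_t)(hi20 & 0xfffffu) ^ 0x80000) - 0x80000) * 4096;
    /* Conversions to uint64_t wrap modulo 2^64, which matches the hardware. */
    return pc + (uint64_t)upper + (uint64_t)(int64_t)lo12;
}
```

With the values from the listing, auipc_ld_target(0x12ba6, 0x5, 1450) gives 0x18150, the GOT slot named in the disassembly comment. The result depends only on where the instruction itself sits, so if the text were left in one shared place while each process received its own copy of the data elsewhere, the load would still look for the GOT at a fixed distance from the code.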

damien-lemoal commented 4 years ago

For solving this (memory efficiency), I think we really need to go after the FDPIC format. This is on our to-do list, but it will take more time than the rather simple flat-bin support because we first need to define psABI specifications (there are none for FDPIC) and implement them in gcc (or LLVM). Several other projects have expressed interest in getting a RISC-V FDPIC format, so it is time to do it. Working on that.

laanwj commented 4 years ago

Yes, that would be even better. I've changed the issue title to encompass wider solutions.

damien-lemoal commented 4 years ago

OK. But solving this issue will not be done from this project, since FDPIC support will first need gcc modifications. Some additional kernel loader patching will also be needed, as FDPIC support exists only for 32-bit right now. Adding 64-bit support is trivial, but that is one more patch to add.

So I think it is better to close this issue for now. We can come back to this modifying the buildroot build configs and kernel config once we have the supporting pieces all ready. Thoughts ?

pdp7 commented 4 years ago

@laanwj @damien-lemoal thanks for your efforts!

Is the motivation here to be able to have enough memory to support networking through the ESP8285 on the MAiX GO?

laanwj commented 4 years ago

Yes, that was my motivation. Even without networking, any more extensive use of the console in Linux on the Maix Go seems to quickly fill up (or at least fragment) memory, so that another instance of busybox no longer fits in.

carlosedp commented 4 years ago

Two references on the K210 MMU: an explanation of the incompatibility can be found here: https://github.com/oscourse-tsinghua/rcore_plus/issues/34 and a sample implementation here:

#include "CacheEngine.h"

uint64_t volatile __attribute__((aligned(CE_PAGE_SIZE)))  ceLv1PageTable[(CE_PAGE_SIZE / 8) * (1)];
uint64_t volatile __attribute__((aligned(CE_PAGE_SIZE)))  ceLv2PageTables[(CE_PAGE_SIZE / 8) * (1)];
uint64_t volatile __attribute__((aligned(CE_PAGE_SIZE)))  ceLv3PageTables[(CE_PAGE_SIZE / 8) * (CE_LV3_PAGE_TABLE_COUNT)];
uint64_t  __attribute__((aligned(CE_PAGE_SIZE)))  ceCacheMemory[(CE_PAGE_SIZE / 8) * CE_CACHE_SIZE_IN_PAGES];

uint16_t ceCacheMemoryBlockAge[CE_CACHE_SIZE_IN_BLOCKS];
uint16_t ceCacheMemoryBlockToVBlockId[CE_CACHE_SIZE_IN_BLOCKS];

static const uint64_t ceVABase = 0x100000000ULL;
static int ceIsMapWritable;

#ifdef CE_USE_FATFS
static FIL* ceFatFsFp;
static DWORD ceFatFsFileClusterTable[1024];

void* ceMapFileFatFs(FIL* fp) {
    memset(ceFatFsFileClusterTable, 0, sizeof(ceFatFsFileClusterTable));
    ceFatFsFileClusterTable[0] = sizeof(ceFatFsFileClusterTable) / sizeof(DWORD);
    // Cache the cluster table to make random access efficient, FF_USE_FASTSEEK must be set in ffconf.h
    fp->cltbl = ceFatFsFileClusterTable;
    FRESULT ret = f_lseek(fp, CREATE_LINKMAP);
    if (ret != FR_OK) {
        CE_ERROR_PRINT("ceMapFileFatFs: init cluster table failed: %d (maybe the file is too fragmented?)\n", ret);
        return 0;
    }
    ceFatFsFp = fp;
    return (void*)ceVABase;
}

int ceFileReadCallback(uint32_t fileOffset, uint64_t* buf, uint32_t len) {

    CE_DEBUG_PRINT("ceFileReadCallback(fatfs): %d, %p, %d\n", fileOffset, buf, len);
    UINT bytesRead = 0;
    if (f_lseek(ceFatFsFp, fileOffset) != FR_OK) {
        CE_DEBUG_PRINT("ceFileReadCallback(fatfs) f_lseek failed: %d, %p, %d\n", fileOffset, buf, len);
        return -1;
    }
    if (f_read(ceFatFsFp, buf, len, &bytesRead) != FR_OK) {
        CE_DEBUG_PRINT("ceFileReadCallback(fatfs) f_read failed: %d, %p, %d\n", fileOffset, buf, len);
        return -1;
    }
    if (bytesRead != len) {
        CE_DEBUG_PRINT("ceFileReadCallback(fatfs) bytesRead != len: %d, %p, %d\n", fileOffset, buf, len);
        return -1;
    }
    return 0;
}
#endif

void ceResetCacheState() {
    memset(ceCacheMemoryBlockAge, 0, sizeof(ceCacheMemoryBlockAge));
    memset(ceLv3PageTables, 0, sizeof(ceLv3PageTables));
    ceIsMapWritable = 0;
    asm volatile ("sfence.vm");
}

uint64_t ceEncodePTE(uint32_t physAddr, uint32_t flags) {
    assert((physAddr % CE_PAGE_SIZE) == 0);
    return (((uint64_t)physAddr >> 12) << 10) | flags;
}

void ceSetupMMU() {
    const uint64_t stapModeSv39 = 9;

    CE_DEBUG_PRINT("setup mmu...\n");

    //0 - 0xFFFFFFFF -> mirror to phys
    for (uint32_t i = 0; i < 4; i++) {
        ceLv1PageTable[i] = ceEncodePTE((0x40000000U) * i,  PTE_V | PTE_R | PTE_W | PTE_X | PTE_G | PTE_U);
    }

    //0x100000000 (1GiB) -> lv2
    ceLv1PageTable[4] = ceEncodePTE((uint32_t)ceLv2PageTables, PTE_V | PTE_G | PTE_U  );

    //0x100000000 (2MiB * CE_LV3_PAGE_TABLE_COUNT) -> lv3
    for (uint32_t i = 0; i < CE_LV3_PAGE_TABLE_COUNT; i++) {
        ceLv2PageTables[i] = ceEncodePTE(((uint32_t)ceLv3PageTables) + i * CE_PAGE_SIZE,  PTE_V | PTE_G | PTE_U);
    }

    write_csr(sptbr, (uint64_t)ceLv1PageTable >> 12);

    uint64_t msValue = read_csr(mstatus);
    msValue |= MSTATUS_MPRV | ((uint64_t)VM_SV39 << 24);
    write_csr(mstatus, msValue);

    ceResetCacheState();
}

static inline uint32_t ceVAddrToVBlockId(uintptr_t vaddr) {
    return (vaddr - ceVABase) / (CE_BLOCK_SIZE);
}

static void ceMapVBlockToPhysAddr(uint32_t vBlockId, uint32_t physAddr) {
    uint32_t basePageId = vBlockId * CE_BLOCK_SIZE_IN_PAGES;
    uintptr_t vaddr = 0;
    for (uint32_t i = 0 ; i < CE_BLOCK_SIZE_IN_PAGES; i++) {
        ceLv3PageTables[basePageId + i] = physAddr ? ceEncodePTE(physAddr + i * CE_PAGE_SIZE, PTE_V | PTE_R | PTE_X | PTE_G | PTE_U) : 0;
        vaddr = ceVABase + ((uintptr_t)(basePageId + i) * CE_PAGE_SIZE);
        // vaddr is an input operand: flush the TLB entry for this page
        asm volatile("sfence.vm %0" : : "r"(vaddr) : "memory");
    }
}

static inline int ceCheckAndSetVBlockAccessFlag(uint32_t vBlockId) {
    int hasAccessed = 0;

    uint32_t basePageId = vBlockId * CE_BLOCK_SIZE_IN_PAGES;
    for (uint32_t i = 0; i < CE_BLOCK_SIZE_IN_PAGES; i++) {
        uint64_t pte = ceLv3PageTables[basePageId + i];
        assert(pte & PTE_V);
        if (pte & PTE_A) {
            // TODO: ensure this operation is atomic
            ceLv3PageTables[basePageId + i] &= (~((uint64_t)PTE_A));
            hasAccessed = 1;
        }
    }
    return hasAccessed;
}

static uint32_t ceFindBlockToRetire() {

    uint16_t maxAge = 0;
    uint32_t maxAgeAt = 0;

    for (uint32_t i = 0; i < CE_CACHE_SIZE_IN_BLOCKS; i++) {
        uint16_t age = ceCacheMemoryBlockAge[i];
        if (age == 0) {
            // an empty block!
            return i;
        }
        if (age >= maxAge) {
            maxAge = age;
            maxAgeAt = i;
        }
    }
    return maxAgeAt;
}

int ceHandlePageFault(uintptr_t vaddr, int isWrite) {
    if (isWrite) {
        if (!ceIsMapWritable) {
            return -1;
        }
    }

    uint32_t cacheBlockId = ceFindBlockToRetire();
    CE_DEBUG_PRINT("ceHandlePageFault: %p, %d\n", (void*)vaddr, cacheBlockId);
    if (ceCacheMemoryBlockAge[cacheBlockId]) {
        // a used block, free it
        ceMapVBlockToPhysAddr(ceCacheMemoryBlockToVBlockId[cacheBlockId], 0);
    }
    ceCacheMemoryBlockAge[cacheBlockId] = 0;

    uint32_t vBlockId = ceVAddrToVBlockId(vaddr);
    uint32_t physAddr = ((uint32_t) ceCacheMemory) + (CE_BLOCK_SIZE * cacheBlockId);

    int ret = ceFileReadCallback(vBlockId * CE_BLOCK_SIZE, (uint64_t*)physAddr, CE_BLOCK_SIZE);
    if (ret != 0) {
        CE_ERROR_PRINT("ceHandlePageFault: file read failed, %p, %d\n", (void*)vaddr, ret);
        return -1;
    }

    ceCacheMemoryBlockAge[cacheBlockId] = 1;
    ceCacheMemoryBlockToVBlockId[cacheBlockId] = (uint16_t) vBlockId;
    ceMapVBlockToPhysAddr(vBlockId, physAddr);
    return 0;
}

void ceUpdateBlockAge() {
    for (uint32_t i = 0; i < CE_CACHE_SIZE_IN_BLOCKS; i++) {
        uint16_t age = ceCacheMemoryBlockAge[i];
        if (age == 0) {
            // an empty block!
            continue;
        }
        int hasAccessed = ceCheckAndSetVBlockAccessFlag(ceCacheMemoryBlockToVBlockId[i]);
        if (!hasAccessed) {
            if (age < UINT16_MAX) {
                age ++;
                ceCacheMemoryBlockAge[i] = age;
            }
        } else {
            age = 1;
            ceCacheMemoryBlockAge[i] = age;
        }
        //CE_DEBUG_PRINT("ceUpdateBlockAge: %d, %d\n", i, age);
    }
}

uintptr_t handle_fault_load(uintptr_t cause, uintptr_t epc, uintptr_t regs[32], uintptr_t fregs[32]) {

    uintptr_t badAddr = read_csr(mbadaddr);
    if ((badAddr >= ceVABase) && (badAddr < (ceVABase + CE_CACHE_VA_SPACE_SIZE))) {
        if (ceHandlePageFault(badAddr, 0) == 0) {
            return epc;
        }
    }
    CE_ERROR_PRINT("fault load could not be handled, badAddr: %p, epc: %p\n", (void*) badAddr, (void*) epc);
    sys_exit(1337);
    return epc;
}

Ref: https://gist.github.com/44670/0d8c152df7c5b59d17d469aba4dda0e5
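
To make the page-table indices in the gist easier to follow, here is a small sketch of how an Sv39 virtual address splits into table indices and how a PTE is packed. It assumes the usual Sv39 layout (4 KiB pages, three levels of 9-bit indices); the names sv39_vpn and encode_pte are hypothetical, and encode_pte simply repeats the arithmetic of ceEncodePTE above.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u

/* 9-bit page-table index of vaddr at the given level.
 * Level 2 is the root table (1 GiB per entry), level 1 covers 2 MiB
 * per entry, level 0 covers one 4 KiB page per entry. */
unsigned sv39_vpn(uint64_t vaddr, unsigned level)
{
    return (unsigned)((vaddr >> (PAGE_SHIFT + 9u * level)) & 0x1ffu);
}

/* Same packing as ceEncodePTE: the physical page number starts at
 * bit 10, the flag bits occupy the low bits. */
uint64_t encode_pte(uint64_t phys_addr, uint64_t flags)
{
    return ((phys_addr >> PAGE_SHIFT) << 10) | flags;
}
```

For ceVABase = 0x100000000, sv39_vpn(ceVABase, 2) is 4, which is why the gist maps root entries 0 to 3 directly onto physical memory and points ceLv1PageTable[4] at the second-level table for the cached window.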

damien-lemoal commented 4 years ago

@carlosedp: The K210 does indeed have an MMU, but it is following old unstable specifications (v1.9). So we will never get support for it in the kernel. NOMMU remains the best option for supporting this board.

@laanwj @pdp7: The ongoing FDPIC support effort could indeed help with improving memory usage. E.g. the libc text can be shared by all running processes. But it will not help with things like memory fragmentation. In any case, this will take time since we need to sort out ABI definitions for it first.

And I agree that we should aim at adding support for peripherals such as wifi as networking would really make the board useful for various applications (sensors, remote control etc). But before going there, I would suggest first to get the SD card working. This step done first would allow moving the initrd image (Busybox image) onto the SD card and avoid having to load everything in RAM. A bigger busybox image with more tools could then be built, a bigger FS image created on the SD card and as a result, have a much smaller kernel+initrd for boot freeing RAM for added functions such as networking.

carlosdp commented 4 years ago

Oh weird, that's never happened to me before, I think you meant to tag @carlosedp, @damien-lemoal =P

damien-lemoal commented 4 years ago

> Oh weird, that's never happened to me before, I think you meant to tag @carlosedp, @damien-lemoal =P

Ooops. Yes. My bad !

carlosedp commented 4 years ago

Yes, I agree that this should not be taken into the Kernel as it's an unsupported spec.

pdp7 commented 4 years ago

@damien-lemoal thanks for the insights

> @laanwj @pdp7: The ongoing FDPIC support effort could indeed help with improving memory usage. E.g. the libc text can be shared by all running processes. But it will not help with things like memory fragmentation. In any case, this will take time since we need to sort out ABI definitions for it first.

My understanding is that ARM (the company) created the ARM FDPIC ABI specification. For RISC-V, is this something that needs to be done by a working group in the RISC-V Foundation?

> And I agree that we should aim at adding support for peripherals such as wifi as networking would really make the board useful for various applications (sensors, remote control etc). But before going there, I would suggest first to get the SD card working. This step done first would allow moving the initrd image (Busybox image) onto the SD card and avoid having to load everything in RAM. A bigger busybox image with more tools could then be built, a bigger FS image created on the SD card and as a result, have a much smaller kernel+initrd for boot freeing RAM for added functions such as networking.

I agree, SD card support is a very good idea. Is there discussion on IRC about implementation ideas and experiments? Or is it all just on linux-riscv@lists.infradead.org ?

pdp7 commented 4 years ago

fyi - from Twitter, Sipeed shared this work that a student is doing with the K210 MMU: https://github.com/lizhirui/K210-Linux0.11

"Linux0.11 with MMU for K210 (This is ported from Linus's December 1991 Linux 0.11 code)"

This issue is probably not the right place to discuss it, but I thought I would bring it to everyone's attention.

damien-lemoal commented 4 years ago

> My understanding is that ARM (the company) created the ARM FDPIC ABI specification. For RISC-V, Is this something that needs to be done by a working group in the RISC-V Foundation?

This is being discussed in the RISC-V foundation sw-dev list. See https://groups.google.com/a/groups.riscv.org/forum/#!topic/sw-dev/ZjYUJswknQ4

damien-lemoal commented 4 years ago

> I agree, SD card support is a very good idea. Is there discussion that happens IRC about implementation ideas and experiments? Or is it all just on linux-riscv@lists.infradead.org ?

@pdp7 The kernel list can certainly be used to discuss SD card support. Sean has been doing a lot of work on U-Boot too to support this board. At least clock management there is more complete than what is going into the kernel, and that will be needed for the SD card. See: https://github.com/Forty-Bot/u-boot/commits/maix_v8