Open parisseb opened 2 years ago
Sorry that I can't read your post, I wonder if the following text is what you originally intended? @parisseb
The code in vmMgr_init() should be
    mapList_AddPartitionMap(MAP_PART_RAWFLASH, PERM_R, VM_ROM_BASE, FLASH_SYSTEM_BLOCK * 64, VM_ROM_SIZE);
    mapList_AddPartitionMap(MAP_PART_FTL, PERM_R | PERM_W, VM_RAM_BASE, 0, VM_RAM_SIZE);
    #if (VMRAM_USE_FTL == 0)
    /* Map the first VM_RAM_SIZE_NONE_FTL bytes of VM RAM onto vm_ram_none_ftl,
       page by page, so that this region is never swapped. */
    uint32_t paddr = (uint32_t)vm_ram_none_ftl;
    for (uint32_t vaddr = VM_RAM_BASE; vaddr < (VM_RAM_BASE + VM_RAM_SIZE_NONE_FTL); vaddr += PAGE_SIZE) {
        mmu_map_page(vaddr, paddr, AP_SYSRW_USRRW, VM_CACHE_ENABLE, VM_BUFFER_ENABLE);
        paddr += PAGE_SIZE;
    }
    mmu_invalidate_tlb();
    #endif
This way we can use two kinds of RAM at the same time: swapped and not swapped. I think that the 1 bit per pixel screen buffer (4K) should never be swapped, leaving 264K for cache. SystemConfig.h would look like
    #define USE_TINY_PAGE (1)
    #define VMRAM_USE_FTL (0)
    #define SEG_SIZE 1048576
    #if VMRAM_USE_FTL
        #if USE_TINY_PAGE
            #ifdef ENABLE_AUIDIOOUT
                #define NUM_CACHEPAGE ( 200 ) // 200 * 1 = 200 KB
            #else
                #define NUM_CACHEPAGE ( 268 ) // 268 * 1 = 268 KB
            #endif
        #else
            #define NUM_CACHEPAGE ( 79 ) // 79 * 4 = 316 KB
        #endif
    #else
        #if USE_TINY_PAGE
            #define NUM_CACHEPAGE ( 264 )
            #define VM_RAM_SIZE_NONE_FTL ( 4 * 1024 )
        #else
            #define NUM_CACHEPAGE ( 32 )
            #define VM_RAM_SIZE_NONE_FTL ( 168 * 1024 )
        #endif
    #endif
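To double-check the numbers in the tiny-page, non-FTL branch, here is a small sketch of the RAM budget. It assumes that a tiny page is 1 KB (the "* 1" in the comments suggests so); the helper name vm_budget_bytes is made up for this illustration and is not part of the OS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the tiny-page, non-FTL branch of the config above. */
#define TINY_PAGE_BYTES       1024u          /* assumption: one tiny page = 1 KB */
#define NUM_CACHEPAGE         264u           /* swappable cache pages */
#define VM_RAM_SIZE_NONE_FTL  (4u * 1024u)   /* never-swapped region (1bpp screen) */

/* Hypothetical helper: total physical RAM used by cache plus unswapped region. */
static inline uint32_t vm_budget_bytes(void)
{
    return NUM_CACHEPAGE * TINY_PAGE_BYTES + VM_RAM_SIZE_NONE_FTL;
}
```

Under that assumption, 264 cache pages plus the 4K unswapped buffer add up to the same 268 KB that the FTL-only configuration spends on 268 cache pages.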
The loader script Scripts/sys_ld.script would be:
    MEMORY
    {
        vmRAM (rwx) : ORIGIN = 0x02040000, LENGTH = 2M
        vmROM (rx ) : ORIGIN = 0x00100000, LENGTH = 6M
    }
Changes in kcasporing_gl.c:
    char *screen_1bpp = (char *)0x02000000;
and avoid initializing virtual_screen in 1 bit per pixel mode:
    if (!khicas_1bpp)
        memset(virtual_screen, COLOR_WHITE, VIR_LCD_PIX_H * VIR_LCD_PIX_W);
Now integrate(1/(x^4+1)) takes 0.93s (normal mode) or 0.54s (fast CPU). By comparison, on the Casio monochrome it's 0.34s. For plot(sin(x)) in fast CPU mode, it's 0.06s vs 0.15s on the Casio. If we can spare RAM in OSLoader's own use, it should be possible to improve the integrate benchmark!
Yes, I tried but could not make GitHub render the code correctly, I don't know why (I have added the vmMgr.c file to my giac39.tgz archive). Maybe we could discuss further optimizations on a phpBB forum like https://tiplanet.org/forum/viewforum.php?f=70 ?
I think we could spare a few K in OSLoader. For example, in msc_disk.c the variables uint8_t MSCRBuffer[2048] __aligned(4); and uint8_t MSCWRBuf[2048] __aligned(4); are not used at all, and that's 4K. Unfortunately, commenting out these 2 lines does not change the RAM available for cache, because the L1PTE variable that follows is 16K aligned. But it should be possible to reorder the object files at link time so that this aligned variable does not leave an unused area. [Update] For example, renaming vmMgr.c to 0vmMgr.c links its object file before mmu.c. There is currently a potential for releasing about 6K. With a careful study of the OSLoader code, perhaps some buffers could be optimized. I would really like to be able to run KhiCAS with almost no RAM page swapping. This would improve benchmarks as well as flash lifetime.
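To illustrate why removing 4K in front of a 16K-aligned variable may not free anything, here is a minimal sketch of the padding arithmetic. The addresses in the usage note are made up for the example and are not taken from the real map file; align_up and padding_before are illustrative helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static inline uint32_t align_up(uint32_t addr, uint32_t align)
{
    return (addr + align - 1u) & ~(align - 1u);
}

/* Bytes left unused when an object with the given alignment is placed
   at the first suitable address at or after addr. */
static inline uint32_t padding_before(uint32_t addr, uint32_t align)
{
    return align_up(addr, align) - addr;
}
```

If the data before the aligned variable ended at a hypothetical 0x22000, a 16K-aligned object would land at 0x24000. Removing 4K moves the end to 0x21000, but the object still lands at 0x24000: the freed 4K only turns into a bigger gap, unless something else is linked into that gap.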
Found another unused buffer: pcWriteBuffer in OSLoader/start.c, 5K.
Can we share a common buffer for page_save_wr_buf, page_save_rd_buf (VmMgr/vmMgr.c) and data_page_buffer (LowLevelAPI/llapi.c)? Potential saving 4K.
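One way this could look, as a rough sketch only: a union that overlays the three buffers on the same storage. The 2048-byte size, the type name shared_scratch_t and the accessor names are assumptions made for the illustration; the real sizes are in vmMgr.c and llapi.c. Sharing is only safe if the three users never need their buffer contents at the same time.

```c
#include <assert.h>
#include <stdint.h>

#define SCRATCH_BYTES 2048u   /* hypothetical size, not checked against the sources */

/* All three members start at the same address and share the same storage. */
typedef union {
    uint8_t page_save_wr_buf[SCRATCH_BYTES];
    uint8_t page_save_rd_buf[SCRATCH_BYTES];
    uint8_t data_page_buffer[SCRATCH_BYTES];
} shared_scratch_t;

static shared_scratch_t g_scratch __attribute__((aligned(4)));

/* Hypothetical accessors: each former buffer becomes a view of g_scratch. */
static inline uint8_t *scratch_page_wr(void) { return g_scratch.page_save_wr_buf; }
static inline uint8_t *scratch_page_rd(void) { return g_scratch.page_save_rd_buf; }
static inline uint8_t *scratch_llapi(void)   { return g_scratch.data_page_buffer; }
```

With three separate buffers of that size the cost would be 3 * 2048 bytes; the union costs 2048 bytes, which would match the 4K saving estimated above.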
Potential alignment optimizations?
Is L1PTE_NUM really #define L1PTE_NUM (2049)? Setting it to 2048 would save almost 4K between L1PTE and L2PTE.
The peripheral registers are at 0x80000000, so we need to set "PTE_LOC[0x800]" to map this segment into the virtual address space for the drivers; that is why we defined PTE_LOC[2049]. In fact, most of the area (PTE_LOC[49] to PTE_LOC[2048], about 8KB in total) is redundant and could probably be used.
MSCRBuffer and MSCWRBuf: I forgot to remove them. They were initially for the small sector USB transfers, but now I don't need them...
If you look at the loader output, the addresses of L1PTE and L2PTE differ by 12K because L2PTE is 4K aligned. In other words, the 2049th entry is responsible for 4K of additional RAM use. If it's unavoidable, maybe it's possible to use the remaining 4K-4 bytes for something else.
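The arithmetic behind these numbers can be sketched as follows, assuming 1 MB per first-level entry (consistent with SEG_SIZE 1048576) and 4 bytes per entry. The helper names are illustrative only, and footprint_4k assumes the table itself starts on a 4K boundary.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SHIFT 20u    /* one first-level entry covers 1 MB */
#define PTE_BYTES      4u    /* each first-level entry is a 32-bit word */
#define ALIGN_4K    4096u

/* Index of the first-level entry that covers vaddr. */
static inline uint32_t l1_index(uint32_t vaddr)
{
    return vaddr >> SECTION_SHIFT;
}

/* Size in bytes of a first-level table with the given number of entries. */
static inline uint32_t table_bytes(uint32_t entries)
{
    return entries * PTE_BYTES;
}

/* Offset of the next 4K-aligned object placed right after a table of
   `bytes` bytes that starts on a 4K boundary. */
static inline uint32_t footprint_4k(uint32_t bytes)
{
    return (bytes + ALIGN_4K - 1u) / ALIGN_4K * ALIGN_4K;
}
```

0x80000000 falls in entry 0x800 = 2048, hence 2049 entries and 8196 bytes (0x2004), which pushes the next 4K-aligned object to 12K. With 2048 entries the table would be exactly 8K and the next object could start right after it.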
I have found a way to move PageFaultQueue and mapList from bss to data and save 1K: just initialize them to {0} or 0. Then move the definition of L2PTE to the beginning of vmMgr.c, renamed 0vmMgr.c, and the loader orders the RAM much better:
    00021444 g     O .bss   00000004 faultAddress
    00022000 g     O .bss   00001000 vm_ram_none_ftl
    00024000 g     O .bss   0000d000 L2PTE
    00031000 g     O .bss   00042000 CachePage
    00074000 g     O .bss   00002004 L1PTE
    00076004 g     O .bss   00000004 vm_svc_stack_address
    00076394 g       .bss   00000000 __HEAP_START
I added 4K to the non-swappable usable RAM (now 8K) and 4 pages to the cache (now 268 pages), and the rom heap start address is lower than before.
I moved the L1 page table to a separate space supported by the chip (default first-level page table, DFLPT), which saves 8KB of memory. I also trimmed some useless buffers, and now we have about 300KB of physical memory.
Here is the new code: https://github.com/Repeerc/ExistOS-For-HP39GII or compare views: https://github.com/Repeerc/ExistOS-For-HP39GII/commit/0e13c692321925629edfa3ffcbc939c5921fc8c1?diff=split
I have tested turning memory swapping off and it looks like the UI written by LvGL is difficult to run (maybe we need a simple UI), but it is sufficient for KhiCAS to run, so I set it up to enter KhiCAS immediately after startup (https://github.com/Repeerc/ExistOS-For-HP39GII/blob/main/System/main.c#L1160-L1178).
Great! It's probably possible to spare 4K: MPTE_Table is 78 bytes and there is an empty gap of 4K-78 before L2PTE. Now we should trim VRAM usage: 168K out of 270 are already in use. There are 3 full screen buffers (disp_buf_1, vrambuf, full_screen_buf); that's 96K instead of 32K if we share them. These 64K could be reinjected in NUMPAGES for ROM. (Another option is to enable partial RAM swap like I did before your changes, with a few dedicated areas for the 1bpp screen and KhiCAS fast alloc.)
Unfortunately, my attempts to boot the calculator with this new configuration failed. With a few additional tricks, I have now 288K of RAM available, 32K for giac reserved areas that are never swapped and 256K for the ROM and RAM swap. The archive https://www-fourier.univ-grenoble-alpes.fr/~parisse/hp39/giac39.tgz has a README file explaining all these changes and a changes.tgz with the modified files.
One of the changes I made is F3 detection at boot time in OSLoader/start.c: if F3 is pressed, display No system. This way one can reflash a calculator even if System ends up in a System panic. I also had problems with USB MSC mode, which did not work until I exchanged the Views and mode string displays in the source code (start.c), and then it resumed working. No idea why, perhaps a problem with my calculator...
I'm now confident that the RAM swap is minimal inside KhiCAS, so the lifetime of the flash should not be affected by swapping. I will now stop looking at the OS and concentrate on KhiCAS itself.