EtchedPixels / FUZIX

FuzixOS: Because Small Is Beautiful

Possible Execute-In-Place scheme #691

Closed nickd4 closed 5 years ago

nickd4 commented 5 years ago

hi,

I did substantial work on a UZI port for a cash register some years (too many years) back. The machine had a Z180 CPU, and an interesting feature was that it had 256 kbytes of ordinary RAM and 768 kbytes of RAM disk, which acted essentially as flash does today (memory mapped and fast to read but slow to write).

So the scheme implemented was pretty sophisticated. I made sure that no function was larger than 4 kbytes, and I implemented a custom linker that packed all the code into 4 kbyte windows. All code executed in the F000-FFFF region of the address space, with the currently executing function's window paged in as part of the calling sequence; the region 1000-EFFF contained the process's data, and 0000-0FFF held the interrupt handlers and system stack.
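
To make the calling sequence concrete, here is a rough C sketch of how an inter-window call might work under such a scheme; the real thing was a few instructions of linker-generated code, and all the names here (far_ptr, set_cbr, map_window, RAMDISK_BASE) are illustrative assumptions, not the original implementation. On the Z180, a logical address in Common Area 1 maps to physical = logical + (CBR << 12), so pointing logical F000 at a 4 kbyte aligned physical window is just a matter of computing the right CBR value.

```c
#include <stdint.h>

#define RAMDISK_BASE 0x40000UL      /* assumed physical base of the RAM disk */

struct far_ptr {                    /* the 3-byte code pointer               */
    uint8_t  window;                /* which 4 kbyte window holds the code   */
    uint16_t offset;                /* its address within 0xF000-0xFFFF      */
};

static uint8_t current_window;      /* window currently mapped at 0xF000     */

extern void set_cbr(uint8_t cbr);   /* write the Z180 CBR I/O register       */

static void map_window(uint8_t window)
{
    /* CBR = (physical_window_base >> 12) - 0x0F maps it at logical 0xF000 */
    set_cbr((uint8_t)(((RAMDISK_BASE >> 12) + window) - 0x0F));
    current_window = window;
}

void far_call(struct far_ptr target)
{
    uint8_t saved = current_window;

    map_window(target.window);                      /* page the callee in    */
    ((void (*)(void))(uintptr_t)target.offset)();   /* runs at 0xF000-0xFFFF */
    map_window(saved);                              /* restore the caller    */
}
```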

This behaved pretty much like a PDP-11 with split I&D except that code wasn't limited to 64 kbytes: the kernel had about 80 kbytes of code, and some larger utilities such as the shell and the editor had large code too. Code pointers were 3 bytes and data pointers 2 bytes. Smaller utilities did not have to use the scheme and could simply use the tiny memory model with all 2-byte pointers.

The next interesting thing about the scheme was the execute-in-place ability, which I nicknamed "XIP". After downloading new code, e.g. a new kernel, you could run the command "align uzi.bin" as root, and it would arrange the 1 kbyte blocks of the file so that each 4 kbyte window was a contiguous region of up to 4 x 1 kbyte blocks in the filesystem, aligned to a 4 kbyte boundary in physical RAM and thus addressable via the CBR register to bring it in at logical F000-FFFF. Hence program code could be executed directly from the RAM disk, and the 256 kbytes of ordinary RAM could be used for kernel data plus 3 or more large processes with up to 52 kbytes of data each.

As part of this I modified the filesystem and utilities such as fsck so that instead of the block free list they use a bitmap. This should be a substantial performance improvement, at the cost of a few blocks lost to the bitmap when the filesystem is full.
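
For illustration, a free-block bitmap of the kind described might be managed roughly like this; it is only a sketch, the block count and names are assumptions, and the real code would of course read and write the bitmap through the buffer cache rather than keep it in an array.

```c
#include <stdint.h>

#define FS_BLOCKS 1024u                    /* data blocks covered by the bitmap */
static uint8_t free_map[FS_BLOCKS / 8];    /* one bit per block: set = free     */

int balloc(void)                           /* allocate a block, or -1 if full   */
{
    unsigned i, b;

    for (i = 0; i < sizeof free_map; i++) {
        if (free_map[i] == 0)
            continue;                      /* every block in this byte is used  */
        for (b = 0; b < 8; b++) {
            if (free_map[i] & (1u << b)) {
                free_map[i] &= (uint8_t)~(1u << b);
                return (int)(i * 8 + b);
            }
        }
    }
    return -1;
}

void bfree(unsigned blk)                   /* return a block to the free map    */
{
    free_map[blk / 8] |= (uint8_t)(1u << (blk % 8));
}
```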

I made some progress digging out this old code and putting it into a git repo; there is one commit per development release that I made at the time, so it's not a file-by-file history, but there are release notes at least.

If anyone wants to pick this up I am happy to MIT license it or similar and make it available. I also have loads of the hardware it ran on, from the decommissioned network of cash registers, which I am happy to give away for the cost of the postage.

cheers, Nick nick "at" ndcode "dot" org

EtchedPixels commented 5 years ago

On the 6502 side, btw, my RC2014 tree now has an emulator for the 6502 RC2014 configuration I am currently working on, and the RC2014-ROM directory holds the rather minimal flash ROM I am using for bring-up.

Still have to write the VIA6522 emulation bits I need - such an incredibly complicated bit of silicon.

nick-lifx commented 5 years ago

Yes, an interesting chip. I have some VIA 6522s (if I recall correctly) which were supposedly brand new from a guy in the USA, but are actually bad Chinese fakes; well, they might actually work, I don't know. Haha. I complained to him that the tops of the chips had been sanded down and repainted, but he acted all innocent and I could not be bothered sending them back at huge expense just to prove a point. Hmm.

nickd4 commented 5 years ago

I am making great progress on building my mental model of how the banking and swapping works. It was a bit unfortunate really that I started with z80pack, because that's quite complicated. It is best to start with v65, since all the code is relatively together and comprehensible, and it's more friendly than the fixed-bank scheme. Once you understand the v65 way, the z80pack way is just a small twist on it.

So an important misconception I had concerns the common area. I had been thinking of this in the z80pack way, as memory that's shared between all processes and the kernel, but that's not accurate: in FUZIX the term means memory that's shared between a process and the kernel. Once you understand that each process has its own "common" area, the udata management makes more sense!

So in v65 the important thing to understand is that when in kernel mode, the kernel's private stuff is at 0x2000 and above, whereas the first page of the process is mapped in below 0x2000. The udata and various stacks are kept there. So when a process goes to sleep in kernel mode and another process wakes up in kernel mode, it's quick to remap the first page and get the correct stack and udata for it.

The difference with z80pack is that when the kernel is mapped in, that consumes the one and only base register, so we do not get the choice of mapping in a particular page containing the right udata and stack; we are stuck with the kernel's one. So the only option is that when a process goes to sleep, we must copy its udata and stack somewhere else, and then copy back the udata and stack of the waking-up process.

As a matter of mechanics, in the z80pack case we make the following decisions: (i) the udata and stack of the running process are kept in z80pack common, which is above 0xf000 (you can change the boundary by writing an I/O port, but I believe we set it to 0xf000, giving the smallest possible common); (ii) the stashed udata and stack of a sleeping process are copied to/from the end of the process's bank.
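
Roughly, the switch then amounts to the following (a sketch only; bank_read, bank_write and the UAREA sizes are placeholder names, not the actual FUZIX interfaces):

```c
#include <stdint.h>

#define UAREA_BASE  0xF000u                  /* z80pack common starts here    */
#define UAREA_SIZE  512u                     /* udata + kernel stack          */
#define STASH_ADDR  (0x10000UL - UAREA_SIZE) /* top of each process's bank    */

/* copy bytes between common memory and an address in another bank */
extern void bank_write(uint8_t bank, uint16_t dst, uint16_t src, uint16_t len);
extern void bank_read(uint8_t bank, uint16_t src, uint16_t dst, uint16_t len);

void stash_switch(uint8_t sleeping_bank, uint8_t waking_bank)
{
    /* (ii) park the sleeping process's udata/stack at the end of its bank */
    bank_write(sleeping_bank, (uint16_t)STASH_ADDR, UAREA_BASE, UAREA_SIZE);
    /* (i) pull the waking process's copy back into common above 0xF000    */
    bank_read(waking_bank, (uint16_t)STASH_ADDR, UAREA_BASE, UAREA_SIZE);
}
```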

A downside to this scheme is that in both cases (v65 and z80pack) we lose the corresponding amount of logical address space in the process, even though the process has no interest in the udata or kernel stack and never accesses them. The space lost is 512 bytes. And this is kind of significant for the largest processes, one of them being the Bourne shell, and another would be the linker if self-hosting.

What I had imagined before I realized all of this last night (while trying to make room for my test program to run in kernel space), was that the kernel was somehow privately managing the udata and kernel stacks in a way that wasn't visible to the process. And I'd kind of like to see it do that, because it worries me that processes can scribble on data that is important to the kernel (at least, it can in the v65 case).

I realize we do not have formal memory protection, so in theory we cannot really protect against a bad process, and I also wouldn't advocate putting a lot of effort into things like illegal instruction trapping or stack checks on entry to the kernel. But still, if the MMU can be used to keep things separate then it's good to do so, as it would save a lot of head-scratching in the presence of tricky corruption problems.

When considering whether the kernel should try to keep the udata and kernel stack invisible to the process and allow the process the largest possible logical address space, the difficulty for v65 is that space at the start of a page is a limited resource and we need some of it for the kernel's zero page and CPU stack. The difficulty for z80pack is that the memory is not very granular: if we take a bank for the kernel's private use to stash udatas and stacks in, we have to take an entire bank or nothing.

So basically, if we let the process have all of its 64 kbyte address space, we have to find somewhere else to put the udatas and stacks. In the v65 case we want to be able to map them in quickly with the MMU, which implies that for each process we have to take an 8 kbyte page, use the first 512 bytes and waste the rest. That's not very palatable. On the other hand, it would simplify the handling of interrupts that occur in user mode, because the process's zero page and CPU stack could be left alone during the switch to kernel mode and any sleeping in kernel mode (which occupies the kernel stack) if pre-emption occurs.

This is an interesting issue, and I went digging in the Unix V7 source for quite a while to see what Ritchie had done about it. (And I saw what @EtchedPixels was referring to in regards to processes getting booted until enough contiguous core is available; also references to "likelihood" of core or swap running out, which says to me that Unix V7 did not have good guarantees of the kind that I consider essential).

It turns out that the magic elixir with Unix V7 is that the PDP-11, while it has only 8 kbyte granularity in dealing with logical memory, has a much finer 64 byte granularity in dealing with physical memory (it makes sense really, since in those days physical memory was precious and logical memory abundant). So what they're able to do is tack the udata and kernel stack onto the start of the process's core (or swap) image, taking only the 512 bytes or so, and map it into the kernel while keeping it invisible to the process. The only way we can do the same thing in v65 is to waste the remainder of an 8 kbyte page, as noted.

Thinking about all of this I came up with a bit of a radical plan that I think could optimize things a bit, though it may seem like a de-optimization in some ways. For me, probably the biggest priority is that processes have a full 64 kbytes to scribble on, and a secondary priority is that the kernel has the same. (Or, as a compromise, the kernel might have say 56 kbytes plus an 8 kbyte window for uget-type stuff.)

What I want to do is to allocate pages (in v65) or one or more banks (in z80pack) just to hold udatas and stacks, and possibly other memory-intensive stuff like disk or network buffers eventually. For each process we'd have a udata, a stack, and in the v65 case also a zero page and a CPU stack; the total would be 1 kbyte, so an 8 kbyte page could support 8 processes. To switch processes we would map in the page holding both slots, copy the outgoing process's zero page and CPU stack into its slot, copy the incoming process's back out of its slot, and have the kernel access the incoming udata through a pointer into that page.

This isn't nearly as bad as it sounds, because the cc65 compiler only uses about 20 bytes or so of the zero page, and only a similar amount of CPU stack would ever be used, as we'd only be nested a couple of C function calls deep when we go to sleep. The worst part is probably the udata access through a pointer; to mitigate that, I wonder if cc65 could be told that the pointer is permanently in the zero page?
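
For what it's worth, cc65 can at least be told that an external symbol lives in the zero page; a hypothetical arrangement for the udata pointer could look like the following, with the symbol itself reserved in the ZEROPAGE segment by the startup code or linker configuration (none of this is existing FUZIX code):

```c
struct u_data;                      /* kernel udata layout, defined elsewhere */

extern struct u_data *udata_ptr;
#pragma zpsym ("udata_ptr")         /* field accesses become (zp),y operands  */

#define udata (*udata_ptr)          /* keep the usual udata.u_xxx spellings   */
```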

For z80pack it really wouldn't be any worse than now, just a different choice of stashing location, though we'd have to take the hit of having one fewer bank available. We would still have a lot of wasted space in the reserved bank, but this gave me another idea which I think would be helpful on those systems that have a large fixed common (like 48 kbyte banks and 16 kbyte always-mapped common): we can let processes grow into common, and stash the overflow part somewhere else on task switch.

With these ideas, I think we would move closer to having a self-hosting system that can compile itself.

@EtchedPixels what do you think? It somewhat depends on what we see as the priorities (fast task switching, fast interrupt handling, more logical memory for processes, separation of private data, etc).

jcw commented 5 years ago

It seems to me that common is not just an exchange area between a process and the kernel, but in many cases the only way FUZIX can switch contexts. Like a trampoline to bounce between banks.

As I understand it, with the eZ80 (and prob Z{1,2}80), context switches could be done without common area (the switch into ADL mode can jump to any 24-bit address, while the stack switches to a separate 24-bit one). And since the kernel context can access any address using special "long" instructions, this also means that it can manage process data without having that process actively switched in.

So, to add to your mix of considerations and variants: with the eZ80, it may be possible to support a full 64K per process, no FUZIX kernel data at all. This means accidental scribbling is no longer a risk, for a more robust setup in case of crashes (even more so when combined with eZ80's watchdog).

EtchedPixels commented 5 years ago

Fuzix used the term 'common' the way CP/M 3 and MP/M did - the area of memory that is always mapped. It's not actually quite that simple any more.

Historically though the udata was in common because it held the interrupt stack, and whilst it no longer does that, it does still contain a pile of variables that are easier to manage if always mapped (syscall arguments etc). You can avoid it with some gymnastics (see z80-thunked).

Z180 doesn't have a single 'change bank and branch' instruction nor a way to do interrupt vectors into far space so it needs a small amount of 'common' space. Likewise you need a bit for disk I/O so you can do direct I/O into other banks.

Z280 has a real MMU, paging and supervisor mode so is a very different beast. Getting Bill Shen's Z280RC running Fuzix is on the longer end of the TODO list.

eZ80 you probably can do it because you've got "far" (in 16bit think anyway) addressing. 65C816 you can do most of it that way but you need some common code to handle interrupt switching, zero page, and I/O really.

@nickd4 On Z80 having udata via a pointer kills you. 6809 probably not so bad, 68000 does this - A5 is a register global holding the udata pointer. It's actually faster and cleaner because everything is xx(A5) and gcc knows how to do it right.
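
The 68000 arrangement falls out almost for free with gcc's global register variables, along these lines (illustrative only, not the actual FUZIX source):

```c
struct u_data;                                   /* kernel udata layout            */

register struct u_data *udata_ptr asm ("a5");    /* pinned in A5 for the whole kernel */

#define udata (*udata_ptr)                       /* every access becomes d16(A5)   */
```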

For 6502 I always wanted to put some of the udata stuff into zero page, and some key system variables, but never had time to play with that. I think you are right though that if you put a udata pointer in zero page it will generate code that's as compact and not much slower.

The real question though is about shared data. Small chunks of shared code are easy - you put a copy in each bank.

nick-lifx commented 5 years ago

@EtchedPixels you are right that Z80 indexing the udata would be weak; I guess if the compiler could be persuaded it could use IX or IY for this purpose (like the 68000), but it would still be bigger and slower than direct access. Also about the 6502, you are right that it wouldn't be hugely worse, but OTOH we'd lose the opportunity to put frequently accessed variables in ZP (I did not think of that aspect).

I have also been thinking about a tricky problem with the granularity of core vs swap: basically you have to round all swap allocation up to a multiple of the core block size or risk running out. That's a bit nasty if you have, say, a 140 kbyte floppy and 8 processes with 8 kbyte core blocks, as you could have up to 64 kbytes you can never use because it is reserved to maintain the guarantees. It's really frustrating, that.

Limiting the number of in-core processes to n would almost fix the problem, because the reserved space for rounding would be only n * 8 kbytes. But another problem then is that if those n processes happen to be quite small, the remaining ones could overflow swap.

Considering all of these issues I was thinking of having a "pre-swap" phase that does not go to disk but basically compacts a less recently run process from RAM to RAM: copy its ZP, udata and partial last 8 kbyte page to unaligned, non-contiguous RAM (I have all the mechanics to manage this already, so there would be significant code space sharing too).

I am wary of adding complexity as it wastes kernel code space. But in this case I think the saving in memory and time would compensate. The only question is whether it's worthwhile to jump through such hoops to make kernel data invisible to processes and increase process address space a bit. For my part, I do think that it is worthwhile.

nick-lifx commented 5 years ago

So I thought about this for a few days and came up with a scheme. It is rather complex, so perhaps more of a thought experiment at this stage, but I think it is certainly doable and on the right track.

In what follows, when I mention the udata I also mean the kernel stack and possibly the interrupt stack, and in the 6502 context I also include the zero page and the CPU stack (the kernel stack is the C stack).

Recap about udata stash

Firstly, a recap about stashing the udata in the process's address space: this is a good idea in general because it simplifies things; you don't have to deal with the udata specifically when doing the swapping.

If you stash the udata at 0 in the process's address space (for 6502), then it takes logical memory away from the process, interrupt handling is slower (we are sharing the zero page between process and kernel, which does not matter for system calls but means it must be saved for interrupts), and context switching is fast (the udata is already at the correct address, so it's an MMU-only operation).

If you stash the udata somewhere invisible to the process (for 6502), then the process has its full logical memory, and interrupt handling is faster (process's zero page and CPU stack can be left completely alone while we switch to kernel mode on a different zero page and CPU stack), but you need an extra page with the udata at the correct address, or you must copy it at context switching.

V7 Unix on the PDP-11 does it the latter way, and does not pay a price because it has very small physical pages (only 64 bytes) so there is no problem having udata aligned to the correct address.

Aligned udata cache

So my idea to get the full logical address space of the process without paying too high a price in time or memory, is to allocate some smallish fixed number of udata's, say 4 or 8, and have this many spare pages available, with a udata at the same address in each page. Then the 4 to 8 most recently run processes will already have their udata somewhere that it can be quickly mapped in by the MMU.

As well as that, each process has a stash area at the end of its in-core image, for longer term storage of its udata and as the place where the udata is put before swapping, so that it forms part of the process's swap image. (For simplicity we have this for all fully in-core processes, including those whose udata is also in the aligned udata cache; for those processes the contents of the udata stash are ignored.)

If the process's logical address space is a multiple of the core page size, then the stash area forces a new full page to be allocated, but usually there is some spare room in the last page for the stash area.

Also, the stash area size should divide the page size (e.g. a page size of 8192 bytes and a stash area of 1024 bytes, which is how it currently is for 6502 IIRC), and the stash area should be aligned to its size so that it does not straddle a page, which will simplify swapping. Also, the core image can be larger than the process's logical address space, so that if the address space is completely full the stash area can still be placed after it.
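
Put as data structures, the bookkeeping might look something like the sketch below; the sizes, field names and the use of the process table pointer are all assumptions made for illustration, not existing FUZIX code.

```c
#include <stdint.h>

#define UDATA_SIZE      1024u   /* ZP + CPU stack + udata + kernel C stack      */
#define PAGE_SIZE       8192u   /* v65 in-core page                             */
#define NUM_CACHE_PAGES 4       /* the 4..8 most recently run processes         */

struct udata_cache_slot {
    uint8_t       page;         /* physical page holding this aligned copy      */
    struct p_tab *owner;        /* process whose udata lives here, or NULL      */
};

struct proc_mem {
    uint16_t size;              /* logical size of the process                  */
    uint16_t stash_off;         /* offset of the UDATA_SIZE stash at the end of */
                                /* the in-core image, aligned to UDATA_SIZE     */
    int8_t   cache_slot;        /* index into udata_cache[], or -1 if the udata */
                                /* is currently only in the stash               */
};

static struct udata_cache_slot udata_cache[NUM_CACHE_PAGES];
```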

Running a process

We maintain a marker in the LRU list of processes, to divide processes whose udata is in the udata cache from processes whose udata is in the stash. So given an in-core process to run: if its udata is in the udata cache we just run it; otherwise we try to get one of the spare pages to unstash its udata into; if there are none, we take a victim from just before the marker, stash its udata, and move the marker first.

A possible optimization is that if the stash happened to fall at the start of a page (for 6502), i.e. the case where the stash area caused a whole extra page to be allocated, then that page can simply be mapped in without bothering to copy the udata to a cache page first. When reclaiming pages, we have to keep choosing a victim and moving the marker until we get a victim that does use a cache page.
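
Continuing with the hypothetical names from the sketch above, the decision when dispatching a process would go roughly like this:

```c
struct p_tab {                                /* hypothetical slice of the process table */
    struct proc_mem mem;                      /* from the sketch above                   */
    /* ... */
};

extern int find_spare_cache_slot(void);       /* scan udata_cache[] for owner == NULL    */
extern struct p_tab *lru_before_marker(void); /* least recent process still in the cache */
extern void lru_marker_advance(void);
extern void stash_udata(struct p_tab *p);                /* cache page -> its stash      */
extern void unstash_udata(struct p_tab *p, int slot);    /* its stash -> cache page      */

void prepare_to_run(struct p_tab *p)
{
    int slot;

    if (p->mem.cache_slot >= 0)
        return;                               /* udata already quick to map in           */

    slot = find_spare_cache_slot();
    if (slot < 0) {
        /* no spare page: evict the process just before the LRU marker */
        struct p_tab *victim = lru_before_marker();

        slot = victim->mem.cache_slot;        /* we will reuse its cache page            */
        stash_udata(victim);
        victim->mem.cache_slot = -1;
        lru_marker_advance();
    }
    unstash_udata(p, slot);
    p->mem.cache_slot = (int8_t)slot;
}
```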

Reclaiming partial pages

A significant problem for a system with large in-core pages, e.g. Virtual65 which provides the example of 8192 byte pages as discussed above, is the waste of space at the end of a page. Even more so if the udata stash pushed the process onto an extra page, wasting almost an entire extra page.

What I propose is to have a third LRU list for in-core processes, to hold processes that have been demoted from "fully in core with stashed udata" to "mostly in core with partial last page swapped out". So there would potentially be multiple processes that are in core and nearly ready to run, but whose last page is in swap. And of course, potentially multiple processes that are fully swapped out.

If we do this, there is a trade-off to make when we need more core. If we swap out the last page from an older (but not the oldest) process, we can get more core with potentially less disk I/O. If we swap out a full page from the oldest process, there is less likelihood that the process will shortly need to run.

Core vs swap block sizes

An annoying assumption in the current swapper code is that the swap block size is the same as the core block size. This is done solely to simplify the calculation of the system's available space, which is used to deny process creation or "brk" requests when space gets too low. The calculation must be done in advance to avoid problems occurring later, because we cannot return an error to the process while swapping.

If you have, say, a 128 kbyte floppy with 256 x 512 byte blocks and 128 kbytes of core with 16 x 8 kbyte blocks, the problem is that you'll never be able to pack the floppy with swapped-out processes: you will always have a gap in the last page, since you have rounded all process sizes up to the nearest 8 kbytes. Even if you do pack the swap images more closely, you just waste a lot at the end of the floppy.

That's where the third LRU list comes in: you can do all your calculations in 512 byte blocks instead of 8 kbyte blocks, and create enough processes or answer enough "brk" requests to completely fill up the swap floppy, and then if you find yourself running out of core because the in-core processes wasted big amounts of their last page, you run the routine that swaps out only the last page of each process.
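
A quick worked example of the difference in accounting, using the figures above (a 128 kbyte floppy of 256 x 512 byte blocks and 128 kbytes of core in 16 x 8 kbyte pages); the function names are mine, not the swapper's:

```c
#include <stdio.h>

#define SWAP_BLOCK 512u
#define CORE_PAGE  8192u

static unsigned swap_blocks_fine(unsigned bytes)    /* 512-byte accounting     */
{
    return (bytes + SWAP_BLOCK - 1) / SWAP_BLOCK;
}

static unsigned swap_blocks_coarse(unsigned bytes)  /* rounded to core pages   */
{
    unsigned pages = (bytes + CORE_PAGE - 1) / CORE_PAGE;
    return pages * (CORE_PAGE / SWAP_BLOCK);
}

int main(void)
{
    /* a 20 kbyte process needs 40 swap blocks, but coarse accounting reserves 48 */
    printf("%u vs %u\n", swap_blocks_fine(20480), swap_blocks_coarse(20480));

    /* six such processes: 240 vs 288 blocks, so coarse accounting already       */
    /* refuses to fit them on the 256-block floppy despite 16 blocks being spare */
    printf("%u vs %u\n", 6 * swap_blocks_fine(20480), 6 * swap_blocks_coarse(20480));
    return 0;
}
```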

Use of the disk block cache

Generally we do not want to do swapping through the disk block cache, because it causes thrashing: once you've sent a large process through it, all the other blocks that were there before have been flushed and you've filled the cache with something you'll only access again once. So we bypass it.

BUT, when swapping the last page it does make sense to use the cache. Because we are only dealing with small amounts of data, it hopefully will not thrash, and because the process is not the oldest, there's some likelihood we'll need it again soon and so it would be good to avoid disk I/O in that case.

Note that we cannot reserve an area of RAM just to hold the compacted last pages. This does not work because it destroys the system's guarantees unless there is a way to unreserve last-page RAM and send it to full-page RAM and vice versa, which implies a defragmentation routine. BUT, if we use the disk cache then we do it without guarantees but just opportunistically, and I think it would work well.

Further, the udata cache can with some gymnastics provide far memory for the disk and inode caches, because we only need to use a certain address range in each of its pages; the other addresses are free for other uses. I think it's a high priority to do this and get the disk and inode caches out of the kernel data segment.

Summary

A multi-stage demotion scheme from fully runnable processes gradually down to swapped ones, that allows efficient use of core and swap, fast context switching and interrupt handling, less I/O and so on.

EtchedPixels commented 5 years ago

What I'm doing for RC2014-6502 is to load the low 16K page with per-process stuff as well as the necessary zero page and stack page. At the moment I have various stuff stuck in 0x0000-1FFF because I am playing with ideas, but much of that could move (and if I can't figure out how to mod the I/O window on the board from C000 to FE00, it will get shoved up high).

Switching that low 16K page on the actual task switch really helps, and because the banking is flexible doesn't mess up swapping too much. It means the only ZP value exchange I have to do is on interrupts and I don't have to copy stack or other stuff. It's more expensive on memory than copying but the 6502 is not very good at block copies anyway.

The cache is also an odd one - it might help a bit on floppy, but floppy swap will be painfully slow, and on hard disk it's usually slower and adds latency to go via the cache, so the cache is generally only of real use for metadata. On hard disk it's actually as cheap to write the udata to disk when swapping as to copy it somewhere else.

nick-lifx commented 5 years ago

@EtchedPixels that's interesting, I would like to know more about what you're doing with the low 16K. Where do you put (i) kernel ZP, kernel CPU stack, kernel C stack (ii) process ZP, process CPU stack, process C stack (iii) interrupt stacks (iv) udata? Is process otherwise allowed to use the low 16K?

As for me, I made some minor changes in 6502 to the link script and discard segment; I was not going to post them yet, but I can do so if you think it would help. I implemented the discard segment, but frustratingly it seems we can't really use it because of the initialization order of devices: once we get into the kernel proper, it's slightly too late to do things like putting disk block buffers in the discard?

Regarding the cache, I think we need to make a lot more use of it, at least on spinning-rust type systems. Anyone who has used SMARTDRV can confirm that it makes an incredible difference if you have a large enough cache, but the current 6-ish disk buffers are useless. I think on systems like the Virtual65 or Z180, where you have up to 1 Mbyte, we need more like 64-128 kbytes of buffers, and a corresponding increase in the number of in-core inodes as well. For me this is a high priority. I'm considering two main schemes for how the far memory can be accessed here, and will write about them later.

EtchedPixels commented 5 years ago

$2000+ is available to the application to use directly, $0-$1FFF is a mix of stuff that is switched with the task in switchin() and stuff that is identical and copied into each low bank. See platform-rc2014-6502. It mostly works except for signal handling.

On spinning rust (unless you happen to be running an original TRS80 hard disk unit or similar) the drives have upwards of 1MB of cache and the interface runs at close to ldir speed. So whilst a cache is useful the drive has one. In fact big drives have 32MB or more cache so you don't really run Fuzix from the disk at all 8)

Different platforms do different things with discard. Some place it in memory that will be used by userspace; others try to place it straight after the buffers section so that they can indeed just expand buffers over it and the empty space in platform_discard(). The most general arrangement is probably that the kernel runs from low memory upwards, ending with buffers then discard, then there is a gap and then common. The platform discard code reclaims the space between buffers and common.

You can also put buffers in their own memory bank and the overhead for most ports when doing this is pretty small as only metadata goes wandering off into buffers anyway.

Another problem if you bump the numbers up is that we don't hash inodes and buffers nor do we have pointer chains for things like runnable, so it doesn't scale.

nickd4 commented 5 years ago

Yes, fair enough, I agree that there are some weird caveats when it comes to caching in our situation. So perhaps it is not the magic elixir that it was on early 386s... although, having said that, I DO have in the cupboard some ST506 drives which I plan to put onto an AT motherboard and run FUZIX on eventually... not an original TRS-80 but close :) The hashing and so on is trivial to do, although I agree it would not bring much benefit right at the moment. I was also reading a bit about segmented LRU and combined LRU/LFU algorithms the other day, and it seems they're not too difficult to implement; we probably should do something like that eventually, since at present any longer burst wipes the cache. I guess we are going direct to userspace for any full-block transfers though? If not, we probably should.

As for me, I got time to do a little more in the last 24 hours; I am back to working on the test script / swapping simulator. Given that I have had quite a few additional insights as described in earlier posts, much as I want to move away from the simulator and make an implementation in the kernel ASAP, I decided it's worth doing some reworking, and it's coming along quite well. I've been putting in support for an expand-down stack segment, among other things. Although I'm not keen on the memory-hole idea that was apparently used in early systems (Xenix?), I would like to have a separate stack segment on the hardware that supports it. Unfortunately that doesn't include the 8088 or 286 unless we move to large model.

There are some interesting issues around the stack segment. I mentioned earlier the idea of having a fixed-size stack declared in the executable's header and not necessarily at the top of memory; I know this could cause compatibility issues and so on, but nevertheless there are arguments in its favour.

Lately my thinking on this has centred on the guarantees. It's a bit of a problem that we have no way of informing the process that it has run out of stack besides killing it with a signal. That applies even if user space cooperates by detecting out-of-stack (perhaps by comparing SP with something in the function prologue) and goes into a brk()-like routine to ask for more stack -- what does it do if this fails?

The best I could suggest is that the program uses setjmp(), then calls a stack-intensive routine such as qsort(), ends up in a signal handler on an interrupt stack telling it that it's out of stack, then longjmp()s back to the main program to tell the user that the data was too large to be sorted. I suppose it works, but it's quite messy.
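
Sketched out, the pattern would look something like this; SIGSTKOV is a hypothetical "out of stack" signal number, and the whole thing only works if the kernel delivers it on a separate (interrupt or signal) stack:

```c
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#define SIGSTKOV 30                 /* hypothetical "out of stack" signal     */

static jmp_buf too_deep;
static int data[1000];

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static void stack_overflow(int sig)
{
    (void)sig;
    longjmp(too_deep, 1);           /* unwind back to the setjmp() in main()  */
}

int main(void)
{
    signal(SIGSTKOV, stack_overflow);   /* handler must run on its own stack  */
    if (setjmp(too_deep)) {
        fprintf(stderr, "data set too large to sort\n");
        return 1;
    }
    qsort(data, 1000, sizeof data[0], cmp);   /* the stack-hungry call        */
    return 0;
}
```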

So to my way of thinking, if there's a chance we could run out of stack during the execution of a process and (in the case we can expand stacks on the fly) not be able to expand for lack of core and/or swap, the right thing to do is deny the fork() or execve(). I realize that this sounds pedantic, but the alternative is basically to overcommit the memory and do what is effectively an OOM-killer, and I don't agree with that.

Because of this, my idea for the expand-down stack segment is essentially to reserve the needed space for the lifetime of the process, but to have the system be aware of the unused portion and avoid copying or swapping it. This would save significant effort compared with the current way, where the entire image of the process is swapped. I was thinking of putting the stack at the start of the process's core and swap images, which can then expand and contract in both directions (at the start for the stack, at the end for brk); on some hardware it might just appear there, on other hardware we can trickily remap it to the end.

EtchedPixels commented 5 years ago

Aligned full blocks go direct to user space unless a platform has specifically decided not to for that device. Even on fastish floppies it seems to be a win to cache only metadata (and maybe directories), not data pages. Interestingly, the authors of the ZSystem found similar results and went the same way, except that they added a hack so that if you have lots of buffers (and on a big box you could set up a lot of buffers if you wanted) you could mark some binaries to be cached on load and then subject to the normal expiry rules.

Having a fixed size stack is definitely a win, and on some architectures it's also relatively cheap to spot stack overflows in the function call entry code. I am not sure why you need the stack not to be at the top of the space. What kind of platform does that help ?

Likewise agreed on fork/execve failing on non paged systems.

Modern Unix has the notion of signal stacks for the out of stack and signal case. Nobody ever uses them 8)

nick-lifx commented 5 years ago

About the stack being at the start of the space, it is really more of an internal thing (internal to the kernel), and mostly for the platforms like 8088 or Z180 that use segment or MMU base registers.

Assuming you use a separate data and stack segment, you would not want to put the stack segment right after the data segment, because you would have to move it each time the process calls brk().

So the solution is either to treat the data and stack segments as separate from the viewpoint of memory management (thus a brk() call might move the stack segment or might move some other segment), or to put the stack segment before the data segment, so that brk() does not affect it; indeed, if the stack can also grow with a brk()-like call, it can grow in the other direction. You would still have to move other processes' segments if they get in the way of extending the stack or data.

If data and stack are treated as separate segments there is some small advantage, because the defragmentation might be able to do smaller block moves, but I don't think it would make much real difference: basically, to satisfy a brk() request you have to compact other segments, which means copying everything along to fill the gaps, so having them in smaller pieces would not help that much.

So what I have decided is to concatenate the two, with the stack first and the low stuff (zero page, text, data, bss, heap) after. That way, swapping does not have to care where the stack ends and the low stuff begins. It just keeps a pointer into the conjoined segment to know what is swapped and what isn't.
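
A hypothetical per-process record for that conjoined image might be as simple as this (names and field widths are assumptions, not existing code):

```c
#include <stdint.h>

struct mem_image {
    uint32_t base;        /* physical (or swap) start of the conjoined image   */
    uint16_t stack_size;  /* reserved stack occupies [0, stack_size)           */
    uint16_t low_size;    /* ZP/text/data/bss/heap follow the stack            */
    uint16_t in_core;     /* watermark: bytes of the image currently resident, */
                          /* counted from the start; the rest lives on swap    */
};
```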

That does not mean that the stack has to appear this way to the program. In the Z180 case, for instance, we can use the BBR to map in the low stuff and the CBR to map in the stack. On the 8088 it's a bit more tricky since we would have to go to the large memory model to accommodate this.

For the 8088, if we want to get the benefit of less copying (not swapping out the unused stack space) then we could put the stack low, and only swap the used portion (going from the SP register up to the brk level). If we decided it's more compatible to put a fixed size stack above bss and below heap, then we can do that, but unused stack would get swapped unless we added complicated extra swap code.

On z80pack the problem is slightly different and there is some argument in favour of putting the stack up high and then copying it in two goes (perhaps stack followed by low stuff) to the swap, but that's more complicated. Aside from the compatibility issue of programs assuming stack is above heap, I don't see a big advantage. It would be much easier simply to put the stack down low at all times.

Of course there is also the memory-hole idea: give each program a 64 kbyte segment with the low stuff, a gap, and then the stack, and try to interleave the processes so that the low stuff of the next one goes in the hole. But it would ruin absolutely everything: firstly it would make a dog's breakfast of the code, and secondly, once you start introducing weird alignment requirements it's not possible to guarantee the aligned size.

nickd4 commented 5 years ago

I am still working on the swapping simulator, and it is looking awesome.

Yesterday I revamped the swap-in and swap-out loops to greatly simplify the calculations and rationalize the per-process state used to manage them, and also to combine common code into a swap_read_write() routine, which saves significant code space by using a read/write flag.

Then today I did another major rewrite which makes it keep track of the internal fragmentation in a better way. It has meant changing the test script, the test script runner, and the allocate/reallocate code.

Say for example that you started with 4 blocks and then you added 0.25 of a block to each end. This wouldn't change the alignment relative to the backing store, so you'd be left with a region spanning -0.25 up to 4.25 relative to the original allocation.

Previously I stored this as the tuple (0.75, 4.5), meaning that in the now 6-block allocation there is 0.75 of a block wasted, then 4.5 blocks used, then another 0.75 of a block wasted. What I now do is store it as the tuple (-0.25, 4.25), which implicitly encodes an origin.

This makes it much simpler for the client to resize either end (by specifying that it wants to extend the -0.25 end down to -0.75, say, without having to consider the 4.25 end at all), and it also greatly simplifies the internal calculations for swapping and for allocating the backing store, since I can round the two numbers down and up respectively to (-1, 5) to get a description of the block-level backing storage.

Interestingly, the counter for how much is currently swapped has to be expressed in terms of the block-level storage, because of a problem with null blocks that had me scratching my head for quite a while yesterday. If, for example, the allocation is (0.5, 0.5), then instead of allocating no blocks as you might think, the system considers this to be half a block wasted, no data, and then half a block wasted. So the block-level description is (0, 1), by rounding 0.5 first down and then up. And when swapping, it needs to know whether this dummy block has been brought into core, so the in-core block count ranges from 0 to 1. If the in-core block pointer were restricted to the range of valid data, it would lose information.
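
In integer terms (which is how a kernel implementation would do it), the origin-relative tuple and its rounding to block-level backing storage look like this; the sub-block unit and the function names are assumptions for illustration:

```c
#define SUBS_PER_BLOCK 16   /* e.g. 512-byte swap blocks inside 8 kbyte core blocks */

/* floor/ceil division that behaves correctly for negative offsets */
static int floor_div(int a, int b) { return a >= 0 ? a / b : -((-a + b - 1) / b); }
static int ceil_div(int a, int b)  { return a >= 0 ? (a + b - 1) / b : -((-a) / b); }

/*
 * start/end are signed offsets in sub-blocks from the original origin, so the
 * (-0.25, 4.25) example above becomes (-4, 68); the backing store then covers
 * whole blocks floor(start) .. ceil(end), here (-1, 5), i.e. 6 blocks.
 */
void backing_blocks(int start, int end, int *first, int *last)
{
    *first = floor_div(start, SUBS_PER_BLOCK);
    *last  = ceil_div(end, SUBS_PER_BLOCK);
}

/*
 * The null-block case: an allocation of (0.5, 0.5) is (8, 8) in sub-blocks and
 * rounds to blocks (0, 1), one backing block that is half wasted at each end,
 * which is why the in-core counter has to range over the block-level storage.
 */
```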

Anyhow, all of this work has been aimed towards making it simple to take the plunge and give the core and swap different block sizes, I have rationalized everything to the best extent possible so that the rewrite needed to accomplish this will not be too involved. The holy grail is within grasp!

Recall from my posts last week that it needs an extra LRU list and the pre-swap phase to maintain the guarantees in that situation, because wasted space due to larger core pages isn't budgeted for and can exhaust both core and swap. To recover, we pre-swap some number of processes that have internal fragmentation. Last week I rewrote the LRU code, so that part will also not need much of a rewrite.

The code size needed to accomplish all this is now really tiny, because of all the clever rationalizations.

nickd4 commented 5 years ago

Apologies that I was absent for quite a while. This is because I meandered onto some other fascinating "spare time" projects. In order to stay sane I need a dose of novelty from time to time. However please do not close the issue as I remain committed to completing and merging the swapper. From what I recall there was a small amount still to do in the swapping simulator, to complete implementing the feature that swap blocks could be smaller than core blocks, while not having to round everything up and waste egregious amounts of space to allow for the worst case. I will have to reorient my brain to this problem and try to remember the solution :)

In the meantime I implemented a kind of P-machine for my personal Unix projects, but optimized to execute C rather than Pascal, sacrificing binary compatibility across platforms for source code compatibility; the assembler produces a kind of hybrid or customized P-code for a specific target (Z80 or 6502 currently) that can run very fast by significantly reducing the instruction decoding time in the interpreter. I think it would not be all that much slower than native Z80 or 6502 code, given that indexing the stack and doing double-byte arithmetic cost many cycles.

My P-code assembler is based on asxxxx and I must say I really like that platform. To reorient myself with the asxxxx internals I also undertook another project to port asxxxx to the PDP-11 and the a.out object format rather than its own object format; it needs a little work on encoding the PC-relative instructions to a.out but otherwise is close to usability.

Another large project that I undertook has been to create an integer math package for the P-code interpreter; I preserved an API separation so it can also be used standalone, and I think it would be a highly competitive replacement for the math package in ACK or SDCC Z80 (I have not done the 6502 or 8080 versions yet but have thoughts in those directions). It provides 16x16->32 bit and 32x32->64 bit multiplication and the inverse operations, 32/16 bit division with 16 bit remainder and 64/32 bit division with 32 bit remainder. It does these in a highly optimized and sophisticated way. It also has fast shifts and compares. Signed and unsigned versions are provided. No negation is done, so the worst-case time is the same for positive and negative arguments. Division is non-restoring.

More recently I moved on to doing a transcendental package, this was originally for something work related since I do not envisage doing a lot of scientific computing on Z80, but it has evolved in a way that will slot fairly neatly into the P-code math package when I get around to doing floating point support (I have some innovative ideas on that point, to speed up Z80 soft float).

So lots of things are happening that I hope will eventually merge into a very tight package to run FUZIX on Z80 from floppy, but in a fairly chaotic way with progress on many fronts, I suppose in a breadth first search if you take my meaning :) :)

feilipu commented 5 years ago

> More recently I moved on to doing a transcendental package, this was originally for something work related since I do not envisage doing a lot of scientific computing on Z80, but it has evolved in a way that will slot fairly neatly into the P-code math package when I get around to doing floating point support (I have some innovative ideas on that point, to speed up Z80 soft float).

If you're interested, I've recently finished a 32-bit IEEE 754 compatible floating library for z80, z180, and z80n (Spectrum Next) in the z88dk. The library is derived from the Digi Rabbit example with Cephes and Hi-tech C transcendental functions, but after three rewrites there's not that much of the Digi code remaining.

Further input of innovative ideas would be welcome.

nickd4 commented 5 years ago

Whoops, double posting, and it seems I closed the issue by mistake; well, it does not matter. Is there a better place to continue the discussion, such as a forum or mailing list? Anyhow, I will aim to PR the swapper at some point in the next month or two.

@feilipu very interested to see what techniques you used, do you have a link? Having said that, I must say that IEEE is not totally the best for soft float, due to all the infinity and NaN tests that have to be put everywhere and the weird cases to handle. The sticky bit and round-to-even I am sort of persuaded have value, but I am not too sure the overhead can be justified on all platforms.

feilipu commented 5 years ago

@nickd4 sure yes. Link in blue here. A 32-bit IEEE 754 compatible floating library for z80, z180, and z80n (Spectrum Next) in the z88dk. Most of the techniques are explained there. Generally, it is more like "SDCC 32-bit float", being mostly compatible with IEEE 754, than fully letter of law IEEE 754 compliant, because of the reasons you mention.

But, you might want to comment on the Issue on the library in z88dk, rather than polluting your thread here.

EtchedPixels commented 5 years ago

@nickd4 btw your ndcode domain appears a bit busted - expired certificate with wrong name for www and bad gateway for anything else ?

nick-lifx commented 5 years ago

OK, I noticed that a while ago, thanks for the reminder. I cannot work out why it is so troublesome; it never seems to work for more than a few months without attention. The gitweb is the most troublesome, it seems to overwrite itself with the stock configuration from time to time? The certificate renewal is also troublesome, even though it's configured and all that. Will fix it when I finish work.

jcw commented 5 years ago

FWIW, I can highly recommend Gitea as a code server. I use it behind a Caddy reverse proxy, but it can also handle (and auto-renew) Let's Encrypt certificates on its own. Updates are a matter of replacing a single executable and restarting the service. See my https://git.jeelabs.org/ setup as an example; it's easy to set up your own on any platform and play with it as admin.

I'm not trying to convince you in any way, just sharing a happy experience.

nickd4 commented 5 years ago

@jcw I will certainly look into gitea. For me, the use of gitweb was kind of temporary until I could look at doing some development in that direction. Because I was not happy with gitlab and some other heavyweight solutions that I looked at. Your gitea server has a very clean look, I'd be keen to know more. I fiddled the css slightly on gitweb to make it look more github / gitea-like but only took it so far.

About my server, the issue was not too drastic. I have got git up and running again, with some notes:

About the experimental webserver, it is written in node.js and I have uploaded most modules to npm, unfortunately the actual main module (which uses all the npm packages already uploaded) is not quite ready for release -- I decided to do some minor refactoring first, and I think it was going well, but I haven't looked at that in 9 months or so. See https://www.npmjs.com/~nick_d2 for the modules already uploaded, and in particular the https://www.npmjs.com/package/@ndcode/jst package.

About the FastCGI not running, inspecting the logs revealed that the OOM killer had been doing its awful work -- how I hate that nondeterministic piece of crap! I know that memory overcommit can be turned off in /proc and I am thinking I will do that, but hesitating slightly because I know that many popular components including probably some that I am running as part of iRedMail, rely on sparse allocation and will allocate huge blocks without the intention of using all of it. So it's a yuck situation.

For the time being I increased my server memory from 2GB to 4GB with corresponding increase in monthly fee. To compensate for this I shut down one of my servers I use for web projects which was running another highly experimental web service that was very much a WIP, to do with sandboxing and executing arbitrary user code (I wasn't very happy with how my approach panned out and will have to solve it differently when I return to it). So I now have just 2 webservers to maintain. Haha!

jcw commented 5 years ago

We're veering waaay off-topic here, but node.js (or rather: v8) can be a real memory hog. That's why I switched to Go a long time ... (wait for it) ... aGo ;)

Current memory use on my server: Gitea 161M, MySQL 487M, Caddy 30M. MySQL also handles Nextcloud (PHP) and Redmine (Rails), so that will not reflect its needs when used just for Gitea.