EtchedPixels / FUZIX

FuzixOS: Because Small Is Beautiful

Possible Execute-In-Place scheme #691

Closed. nickd4 closed this issue 5 years ago.

nickd4 commented 5 years ago

hi,

I did substantial work on a UZI port for a cash register some years (too many years) back. This machine had a Z180 CPU, and an interesting feature of the machine was that it had 256 kbyte of ordinary RAM and 768 kbyte of RAM disk, which acted essentially as Flash does today (memory mapped and fast to read but slow to write).

So the scheme I implemented was pretty sophisticated: I made sure that no function exceeded 4 kbytes, and I implemented a custom linker that packed all the code into 4 kbyte windows. All code executed in the F000-FFFF region of the address space, with the currently executing function's window paged in as part of the calling sequence; the region 1000-EFFF contained the process's data, and 0000-0FFF held the interrupt handlers and system stack.

This behaved pretty much like a PDP-11 with split I&D, except that code wasn't limited to 64 kbyte: the kernel had about 80 kbytes of code, and some larger utilities such as the shell and the editor had large code too. Code pointers were 3 bytes and data pointers 2 bytes. Smaller utilities did not have to use the scheme and could simply use the tiny memory model with all 2-byte pointers.

The next interesting thing about the scheme was the execute-in-place ability, which I nicknamed "XIP". After downloading new code, e.g. a new kernel, you could run the command "align uzi.bin" as root, and it would arrange the 1 kbyte blocks of the file such that each 4 kbyte window was a contiguous region of up to 4 x 1 kbyte blocks in the filesystem, aligned to a 4 kbyte boundary in physical RAM and thus addressable by the CBR register to bring it in at logical F000-FFFF. Hence program code could be executed directly from the RAM disk, and the 256 kbyte of ordinary RAM could be used for kernel data plus 3 or more large processes with up to 52 kbyte of data each.
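
For reference, a minimal sketch of the address arithmetic involved (my own illustration, not code from the original "align" utility): the Z180 MMU simply adds the relevant base register, shifted left 12 bits, to the logical address, so mapping a 4 kbyte-aligned physical block at logical F000-FFFF is just:

```c
/* Illustration only: the Z180 MMU forms physical = logical + (CBR << 12)
 * for the common area, so a 4 kbyte-aligned physical block can be brought
 * in at logical F000-FFFF by computing the CBR value for it. */
static unsigned char cbr_for_window(unsigned long phys)
{
    return (unsigned char)((phys >> 12) - 0x0F);  /* e.g. 0xA4000 -> CBR 0x95 */
}
```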

As part of this I modified the filesystem and utilities such as fsck to use a bitmap instead of the block free list. This should be a substantial performance improvement, at the cost of a few blocks lost to the bitmap when the filesystem is full.
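
For anyone unfamiliar with the idea, bitmap-based block allocation looks roughly like the following sketch; the names and sizes are illustrative, not the actual filesystem patch:

```c
#include <stdint.h>

#define NBLOCKS 65536UL                  /* illustrative filesystem size */

static uint8_t blockmap[NBLOCKS / 8];    /* one bit per block, 1 = in use */

/* Allocate the first free block, or return 0 (never a data block) if the
 * filesystem is full. */
static uint32_t blk_alloc(void)
{
    uint32_t b;
    for (b = 1; b < NBLOCKS; b++)
        if (!(blockmap[b >> 3] & (1 << (b & 7)))) {
            blockmap[b >> 3] |= (uint8_t)(1 << (b & 7));
            return b;
        }
    return 0;
}

static void blk_free(uint32_t b)
{
    blockmap[b >> 3] &= (uint8_t)~(1 << (b & 7));
}
```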

I made some progress on digging out this old code and putting it into a git repo; there is one commit per development release that I made at the time, so it's not a file-by-file history, but there are release notes at least.

If anyone wants to pick this up I am happy to MIT license it or similar and make it available. I also have loads of the hardware it ran on, from the decommissioned network of cash registers, which I am happy to give away for the cost of the postage.

cheers, Nick nick "at" ndcode "dot" org

EtchedPixels commented 5 years ago

That would be really interesting to see and possibly very useful as a concept because we have a bunch of Z80 (and 6502) systems with a single 16K pageable window and lots of memory.

The filesystem one is an odd one. On spinning rust I think you are right - a bitmap would be a huge win, and later System 5 actually loads the free lists into a bitmap at mount time. On compact flash and SD, which is what everyone uses now even with ancient machines, it's less clear.

The hardware would be interesting too and I suspect you can probably shift quite a few boards on retrobrewcomputers or vcfed.

nickd4 commented 5 years ago

I am glad to hear that there is positive interest in the idea.

Yes, I had been thinking the execute-in-place might have limited interest, whereas the bitmap idea that sprang out of the execute-in-place might be a useful and more self-contained change to merge back into mainline.

One thing that occurred to me since writing the original post is that this used a commercial compiler, the IAR Z180 embedded development system. The nice thing about that is that it supports the 3-byte function pointer, which is quite fundamental (I didn't use any of their bank switching routines, as I provided compatible replacements).

It might be possible to get this effect by hacking on SDCC, especially if it has a 32-bit int type, which I think it does. Or we could even provide a preprocessor that does a source-to-source translation, replacing function pointers with 32-bit ints. The calling sequence could be munged so that "myfunc(arg1, arg2, ...)" becomes "docall(myfunc, arg1, arg2, ...)". I have a separate project which is intended to do source-to-source translation like this; I had some limited success with it, but it does need a fair bit more work to complete.
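
To make that concrete, here is a rough sketch of what such a docall() might look like on the Z180. Everything in it - the 32-bit encoding, the cur_cbr variable and the set_cbr() helper - is hypothetical, purely to illustrate the idea, and the cast of the low 16 bits to a function pointer is of course target-specific:

```c
#include <stdint.h>

extern uint8_t cur_cbr;              /* CBR value currently mapped at F000-FFFF */
extern void set_cbr(uint8_t cbr);    /* tiny asm helper that writes the CBR register */

/* A "far" function pointer packed into 32 bits: the bank (CBR value) in
 * the top byte, the 16-bit address inside the F000-FFFF window below it. */
typedef uint32_t farfn_t;

int docall(farfn_t fn, int arg1, int arg2)
{
    uint8_t saved = cur_cbr;
    /* target-specific: the low 16 bits are the address inside the window */
    int (*near)(int, int) = (int (*)(int, int))(uint16_t)fn;
    int ret;

    cur_cbr = (uint8_t)(fn >> 24);
    set_cbr(cur_cbr);                /* page the callee's 4 kbyte window in */
    ret = near(arg1, arg2);
    set_cbr(saved);                  /* restore the caller's window */
    cur_cbr = saved;
    return ret;
}
```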

In the meantime a more practical solution might be to copy what 2.11BSD does and have the linker provide stubs for functions in far memory, so that 2-byte function pointers can be used. I have another project along these lines, which is a cross compiler for 2.11BSD so that the 2.11BSD toolchain can run on a modern system. It works quite well, though it is a bit complicated to set up the build directories and staging for a full build. In this context what we'd basically be doing is modifying the Z80 assembler used by SDCC to produce a 2.11BSD object file (routines can be lifted from the PDP-11 assembler, which I have translated to C) and then using the 2.11BSD linker on it more or less directly; only the stub generation would have to be changed to make Z80 stubs.

The GCC port of the 2.11BSD toolchain was done in a bit of a manual way, with quite a lot of conditional compilation to get around library and word size differences. Later I came up with an automatic scheme which I call the "Xifier"; it is a source-to-source transformation that puts "x" in front of all symbols and changes ints to shorts, etc. Using the "Xifier" I made GCC ports of the 4.3BSD C library and toolchain. Someday I might redo the 2.11BSD toolchain in this way, but for the moment it's usable.

I will definitely put all of this stuff in public repositories, I'm not sure I can commit further time at the moment but possibly in a few months I could.

cheers, Nick

EtchedPixels commented 5 years ago

I already have the SDCC compiler and linker taught to generate banked code. It's not smart enough to work out how to lay stuff out itself (and there are advantages in hand layout, because same-bank calls are faster).

Basically the sdcc in my git tree can be told to generate

    push af
    call foo
    pop af

and to expect an extra word offset on arguments.

The linker rewrites those between banks to

    call _bank%dto%d
    .dw foo

Anything with a function address generates a short stub in common space, so that the C rule about all functions having a unique address remains valid and the pointer fits in 16 bits.

So I have some of the pieces today it seems.

Alan

nickd4 commented 5 years ago

Cool, that is extremely promising. Well I decided that instead of further theorizing I should check out the tree and take a look. So I gather that you use a standard release SDCC with standard release sdasz80 and the customized sdld from /Kernel/tools/bankld? And that asking SDCC to generate the push af/pop af sequence is basically a workaround to get 2 spare bytes without having to customize SDCC?

One thing that is extremely promising is that the assembler and linker that I used with the IAR embedded C compiler are actually more or less compatible with yours, since all are derived from Alan R. Baldwin's code. (I did not use IAR's assembler and linker, but rather temporarily modified Alan R. Baldwin's assembler to accept IAR Z180 syntax; eventually I planned to replace the compiler as well.)

So I think it would not be a big project to drop in my window packing code, the main incompatibility is that my way does not generate stubs whereas your way does. By combining mine and your code I think we could do both things. I will look at this later. In the meantime could you explain more about the manual setup, i.e. how does the linker get instructed to put particular functions in particular banks at present? I vaguely recall that in 2.11BSD there's a linker switch that says "new bank" and is placed between .o files.

For the time being, I've made my code publicly available, I haven't done anything about licensing it yet.

See:
https://git.ndcode.org/public/uzi.git (my Z180 port of uzi, it also has a TCP/IP stack that needs work)
https://git.ndcode.org/public/211bsd.git (my cross compiler setup for 2.11BSD and partial ANSI C port)
https://git.ndcode.org/public/43bsd.git (my cross compiler setup for 4.3BSD and partial ANSI C port)
https://git.ndcode.org/public/ccom.git (a project to modernize Dennis Ritchie's PDP-11 C compiler)

If you would like to have a look at the linker window packing code that I propose to merge into sdld, see:
https://git.ndcode.org/public/gitweb.cgi?p=uzi.git;a=tree;f=src/mkutil/link-z80
It looks like the file "lkarea.c" is the important one; changes are marked by "#if 1 /* Nick */" and "#endif".

cheers, Nick

EtchedPixels commented 5 years ago

No it's a slightly customised SDCC (in my github tree). The actual changes for the banking are tiny - most of what it contains is the initial Z280 work.

The linker banking is a hack where it has rules for section names and banks. Since you can tell SDCC to use different section names, you can have _CODE1, _CODE2 etc. to match banks. Not pretty but it got me going 8)

nickd4 commented 5 years ago

OK thanks I have looked at that. I have done a fairly thorough survey of the different banked builds in order to try and determine the current constraints and the way forward, and I will explain my thoughts.

My work is useful for a fairly specific use case, that of (1) small code windows and (2) having a second pageable region for data. The Z180 fits these criteria; I use CBR to bring in the code and BBR to bring in the data. If assumption (1) is violated, there is not really a huge benefit to having the linker do the automatic packing into windows, since it is quite unlikely that clever packing will make the difference between say 2 x 16 kbyte code windows or 3 x 16 kbyte code windows. If assumption (2) is violated, then a much more pressing concern is to get the kernel data out of the way of the userspace data, and in my way of thinking, each kernel bank should contain associated code AND its data.

Another issue concerns kernel vs. userspace banked executables with respect to the total memory size. I think there is always a case for the kernel to be banked, since it helps keep the kernel out of the way of userspace. I think there is only a case for banked userspace executables on a system with large RAM. For instance, some ports can only support 2 or 3 userspace executables at once, and these are limited to 16 kbyte. So a monster userspace executable with 16 kbytes of data and 2 x 16 kbyte code windows would tend to take over the entire system, and would still be quite limited by the 16 kbytes of data.

If only the kernel is going to be banked, there is not much penalty in applying specific programming techniques such as telling SDCC the area to use manually and/or using complex build scripts. If the applications are going to be banked as well, then there is a much more compelling case for modifying the build system to allow programs to be written in a more compatible and natural way.

Tell me, with the current banked kernel, do the different banks contain both code and data? What I'm envisaging is that the bank which contains all process management stuff should contain also the process table, the bank which contains all file management stuff should contain also the open file table, and so on. Trying to access the process table or open file table from an unrelated function in another bank would cause a crash in such case. (Also, things wouldn't always have unique addresses for sleeping on, although I'm sure we could work around that). If the kernel data segment could be got down to only a few essential globals then perhaps it could go into common, leaving say 4000-BFFF for userspace data and C000-FFFF for banked userspace code? (With userspace using a more conventional memory model).

What do you think? Even on the Z180 case, having a kernel data segment from 1000-EFFF and kernel banked code from F000-FFFF does tend to run out of data room when TCP/IP and much file handling is involved (I know this from experience). Perhaps, moving the boundary down to say C000 or 8000 and putting specific kernel data tables in with the associated code, would be a more sustainable approach.

cheers, Nick

EtchedPixels commented 5 years ago

In the cases where the kernel code is banked the data is usually not. The only real exception is the disk buffers. With no TCP/IP and the disk buffers banked you need about 9K for kernel data and stacks. That could probably be trimmed to 8K with some careful tweaking.

TCP/IP adds a chunk but is not too horrible because the protocol end runs in user space using uIP and the kernel and user space use the buffer cache as a disk backed store for all the queues. The TCP/IP layer is almost entirely separated, and splitting it completely is on the TODO list so it can live in its own banks.

As with 2.11BSD the big and difficult things to squash down are the inode cache and process structures. Unlike 2.11BSD it's at least theoretically possible to write active inodes to and from disk when the inode cache is full.

nickd4 commented 5 years ago

Thanks for the feedback. I will do some experiments along the lines of the above. I have since looked closely at SDCC and I must say I am hugely impressed with the progress since I evaluated it >15 years ago and found it to be not ready at that time.

As to a platform for experiments, I started to put together a Z180 emulator using the CPU code from MAME and the device code from your 8085 emulator or from z80pack, or both. I will get it so that it can run the basically unmodified 8085 or z80pack distributions as a first step. The reason I want to do it like this is for ease of experimentation with the 16K, 32K or 60K windows which are needed by the various ports, and the use of 2 or 3 windows (I think only the Z180 supports 3 windows though).

Perhaps we could eventually define some executables and magic numbers for the different window sizes and let the Z180 version run whatever kind of executable you throw at it, for testing userspace distributions for the various different ports.

I saw what you did with relocatable userspace executables. I am not as anxious to do it this way and would be happy to have a separate userspace per port which is compiled with appropriate base addresses etc. However, this would need much more infrastructure and to my mind relocatable is handy until we do have that infrastructure.

nickd4 commented 5 years ago

I put something here: https://git.ndcode.org/public/fuzix_sim.git
It should be able to run z80pack images, but with a Z180 rather than a Z80. At the moment it boots the fuzix image distributed with z80pack (not sure how up to date this is), but doesn't seem to get past init.

EtchedPixels commented 5 years ago

Catching up a bit

The disk images from Z80pack are pretty ancient. The images on fuzix.org are somewhat newer.

In theory you should be able to run with just the Z180 inbuilt I/O. The serial ports provide console, the MMU provides the memory management, the CSIO drivers can support an SD card for the file system.

nickd4 commented 5 years ago

Done a bit of midnight hacking. The simulator now works well. Actually it was already working, I just hadn't booted from the correct drive. Yesterday I connected up the timer interrupts, and today I have enhanced the simulator to properly support the Z180's expanded I/O and address spaces, while retaining the ability to run z80pack binary images with some clever backward compatibility measures.

Yes, eventually I am planning to hook up the Z180 inbuilt I/O and mass storage in a realistic way, but this isn't really necessary since I am happy to use the fake z80pack devices that do not resemble any real hardware. This will save time, since the device drivers are very simple and also already written. At the moment I just want a platform for memory management experiments and testing userspace binaries.

Since the CPU comes from MAME and I have preserved the object-oriented interface, it is easy to add different MAME CPUs. I plan to make command line switches for other CPUs we support, such as 6809, without changing the z80pack devices any more than necessary (just make them memory mapped on devices with no I/O space, allow a low common area on devices where page 0 is special, etc).

I think I'm ready to compile FUZIX now. I will first just try to duplicate what I have, then I will see about optimizing the bank layout of the existing Z80 or Z180 kernels, and/or possibly modifying SDCC to improve bank handling, such as by extending the generic pointer stuff from 8051 to other architectures. I also have in mind various possible ideas for how we could implement banked userspace executables.

nickd4 commented 5 years ago

I'm definitely making progress now. I've got a dev system up and running and I have been making exploratory changes. I'm focusing on understanding the platform specific code as much as possible and how it relates to the generic C code such as process creation/fork/exec, signal delivery and so on.

Since this morning, I feel that I basically have the picture in regards to the different platforms' build processes, the bank.c files, the lib/z80.s, the callbacks from this into platform specific code, etc.

My investigation has involved test-removal of various platform files to see what breaks and then looking at the callsites that are trying to enter the platform specific code. The PORTING file is also helpful.

I want to start making changes to the bankfixed.c and swap.c code to implement more of a first-fit allocator for z80pack rather than whole banks (similar to what is in unbanked.c). One question that arises in the course of this: what do you think about requiring processes to declare their stack size upfront?

Obviously, in a system with a paged MMU like Linux, this is an unnecessary restriction. And also in the fixed bank case it is attractive to just let brk increase until it meets the stack pointer as we do now (on the other hand, I am not sure that there is any theoretical justification for the current margin of 512 bytes).

On plenty of other systems (68000 perhaps?) it's probably not really practical to just let processes have a huge stack region and use as much or as little as they want. So would it be okay if we moved towards a standardized system where all executables declare in their header the maximum stack they use?

For backward compatibility we could take the declaration as something big like say 0x1000 if it's not present. And then in the z80pack case, I want to use this to compute the brk limit instead of SP - 512.
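
As a sketch of the check being proposed (the field and macro names here are hypothetical; the 0x1000 default is the backward-compatibility value mentioned above):

```c
#include <stdint.h>

#define DEFAULT_STACK 0x1000U    /* assumed when the executable header has no value */

/* udata-style fields; the names are hypothetical for this sketch. */
extern uint16_t u_top;           /* top of the process's address space */
extern uint16_t u_stacksize;     /* stack size declared in the executable header */

/* Allow brk to grow only up to the base of the declared, fixed-size stack,
 * instead of the current "SP - 512" heuristic. */
int brk_allowed(uint16_t newbrk)
{
    uint16_t stackbase = u_top - (u_stacksize ? u_stacksize : DEFAULT_STACK);
    return newbrk <= stackbase;
}
```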

nickd4 commented 5 years ago

OK so in the process of trying to improve the bank-switching stuff I sidetracked slightly onto improving the swap, and thinking about it carefully I realized there is sort of a logic error in the current swap stuff.

If the PTABSIZE is at least the number of banks plus the number of swaps, it is possible to completely fill all banks and swaps, which makes swapping impossible as you can neither swap in nor out.

I verified this by writing a fork-bomb type program that forks as many times as it can, and then tries to exchange signals with the children. Eventually I got it to happen and it panics with a message "nopage!".

Admittedly, this does not happen in normal z80pack which has a huge number of swaps compared with PTABSIZE, but that won't be the case on smaller systems that are swapping to floppy disk or similar.

A simple fix would be to limit PTABSIZE to one less than the number of banks plus swaps; we could do this statically, since I believe at the moment we will always know the size of the swap at compile time.
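
The static limit could be as simple as a compile-time check along these lines (a sketch only, using the MAX_MAPS/MAX_SWAPS names that bankfixed.c and swap.c already use):

```c
/* Never allow enough process table slots to fill every bank and every
 * swap slot at the same time, so there is always room to swap. */
#if PTABSIZE > (MAX_MAPS + MAX_SWAPS - 1)
#error "PTABSIZE must be at most MAX_MAPS + MAX_SWAPS - 1"
#endif
```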

On the other hand if I make the swap usage more efficient, so that only the used portion of a sleeping process's address space occupies swap, then that will no longer suffice. If we know that processes can't be bigger than say 48 kbytes, then we can reserve the last 48 kbytes of swap, and not let it be used except to avoid the deadlock. But this might get unmanageable with larger processes that have several segments, once my work on the banking is complete. That's how I came to considering the issue.

With this in mind, I'd like to know what we see as the eventual goal for swap so I can work towards it.

Do we agree that fixed size processes in swap is wasteful? If so, then do we want to have a special format for swap with a malloc()-like allocator and possible fragmentation management? Or would we prefer it to behave more like a filesystem, with the ability to map blocks to a process in any order?

Another goal to keep in mind is simplicity, since if we have a large source file just for managing swap, it will compile to a relatively big object file and take up precious kernel room, I don't want that either.

EtchedPixels commented 5 years ago

At the moment you need enough swap that you can't run out. That breaks in several places and it's got a bug number for some of it, #686 for nready. "nopage!" means we tried to swap in something that had no memory allocated to swap it into, which means we tried to run something and couldn't swap anything out, so we exploded. It's not entirely easy to fix, because swapin occurs during switchin, so a doexit() would recurse into switchin, which would get a bit strange, and we need to swap something in to exit it. I guess therefore preallocating enough swap as we go would be necessary to fix it properly.

Making swap size things sensibly on Z80 and 8080 is one of the things I just haven't gotten around to. For 8bit it's mostly a detail, but for 68000 it matters, so it definitely needs fixing.

Do you want the stack size up front, or the total stack and space it might allocate up front? When the binary format gets fixed (0.4 I hope) then stack definitely needs to go in there, because for 4x16K and similar layouts you can do some interesting tricks that Tormod pointed out, where you initially point any spare or high banks at the top bank, and then as you allocate memory the map changes.

Initially, for example, your map might be 1 1 1 1 if your max stack plus code/data/bss fitted into 16K. When brk() grows it, you might end up with 1 1 1 2 and in time 1 1 3 2 and so on, but your stack would be copied in part so it would always appear in the right place, and your bss would grow like Unix expects.

We probably need both, and a value of '0' meaning "whatever you can give me".

For swap allocators it is in theory a win if they are linear chunks, at least with spinning rust. With modern disks or CF it doesn't matter so much. Z280 will also be a bit different here as it has a real virtual memory system. The rules get a bit odd with CF and a Z80 because memcpy() and disk read/write are the same speed.

nickd4 commented 5 years ago

OK, I didn't pay attention to the bug #686 until I read it more carefully and realized it is relevant to me.

I want the stack size upfront, and I propose to make it static, i.e. it never changes once the executable has been loaded. I realize we could do more with this. For instance, code in the function prologue could check the remaining stack space and make a brk-like call to increase the stack segment. But, this seems unacceptable in terms of run-time overhead, and the program couldn't adjust/recover if it fails either.

In regards to the Tormod stack suggestion, it is certainly a cool idea. I had to read it a few times to get the idea. And I can propose an extension, which is that the bank used for stack does not necessarily have to be one of the program's own banks. So essentially if lots of programs are running, we have a pool of banks with varying amounts of space at the end of them, that can be used for stacks generally.

On the other hand, I'm trying to strictly simplify things to come up with essentially a HAL that does not change too much for the different types of hardware. I realize that significant flexibility is needed given the large number of different banked memory schemes in use, nevertheless it is a worthwhile goal to try to analyze the common points and provide the simplest possible abstraction (but no simpler...).

And in this goal, I see a worthwhile simplification in requiring all application programs to declare a stack size upfront, that never changes. I also don't want to have an exceptional value meaning "give me all". So if we did it this way, stack can be placed just after bss, and then the brk-allocated memory goes after that (I don't propose any upfront declaration of brk-memory). This wouldn't need the Tormod scheme.

With my proposal, the process address space is by default completely linear, which simplifies allocation and swapping, although there is no reason we can't choose to break it into segments for added flexibility.

I see three main categories of memory, with analogous cases for swap:

(1) Huge banks that are either there or not there. For memory, this corresponds to the z80pack port with at most one application running in each bank and wasting the rest of the bank. For swap, this corresponds to the z80pack port with a fixed 56kbyte or similar slot in swap for each possible process.

(2) Memory or swap that requires contiguous allocation. For memory, this corresponds to hardware that has a base register, for instance the 8088 or Z180 ports. So you can allocate varying size segments up to 64 kbytes, and transparently place them anywhere in memory, but they must be contiguous. For swap, this corresponds to spinning rust where the seek delay is significant, so we want contiguous storage.

(3) Memory or swap that allows block-by-block allocation. For memory, this corresponds to hardware that has a page table, for instance the PDP-11 or https://github.com/EtchedPixels/Virtual65 or the 4x16k layout that would allow the Tormod tricks. So you can map each 8 kbyte or similar region of the logical address space to any physical memory independently. For swap, this is similar to a filesystem in the sense that there is a dynamic indirection table between the logical addresses and the physical block numbers. It allows all swap to be used without defragmenting, but has overhead and seek delays.

So here is my proposal for each of these:

(1) No change. Looking at bankfixed.c and the "static unsigned char pfree[MAX_MAPS];" scheme, or analogously swap.c and the "static uint8_t swapmap[MAX_SWAPS];", the beautiful thing about this is that it's so simple and requires hardly any code to implement, a big consideration for some ports. Having said that, an improvement might be possible if you were willing to give up swapping. Then with each executable having a relocation table, we can load any executable anywhere in any bank. But it must stay there for its entire lifetime. So it does not really make sense to swap it out and then swap it back in at the same logical address as before. It WOULD be possible (treating the z80pack memory as sort of like a 7-way set-associative cache so that a swapped process can go at the same location in any of the 7 banks), but I really don't favour this scheme. I think it DOES make sense if we are okay to have no swap.

(2) We need a contiguous allocator that supports alloc/realloc/free, but it also needs to be able to transparently defragment the memory by physical copying (and in the memory case, as opposed to swap, the base registers of processes' logical address spaces must be adjusted transparently as memory is moved). Without the defragmenting, we'll only be able to use 30-50% of memory, and we are also vulnerable to deadlocks in swapping when we cannot swap in to free enough swap for a swap out.

(3) Each process can have a page table, or in the case of swap, a "swap table". There can be a free list of blocks or similar. However, as we implement processes that can be much larger than currently (but don't have to be), this will either involve wasteful amounts of static page table arrays in the process structure, or will require dynamic allocation of page tables in the kernel, which makes me a little bit grumpy. So what I'm proposing is that we have ONE page table, or in the case of swap, ONE "swap table". For instance, if process 0 requires 3 x 16k banks which are 7, 4 and 3, whereas process 1 requires 2 x 16k banks which are 5 and 2, then the page table could contain [-1, -1, 7, 4, 3, -1, -1, 5, 2, -1, -1, -1, -1, -1, -1, -1], which defines logical pages 0..15; process 0's process table entry says that it uses 3 logical pages starting at logical page 2, whereas process 1's says that it uses 2 logical pages starting at logical page 7 (see the sketch below).
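
Written out as data structures, that example looks something like this (the names are illustrative only; the values are the ones from the worked example):

```c
#include <stdint.h>

#define NLOGICAL 16                  /* size of the single shared table */

/* The one shared page table; -1 marks an unused entry.  Initialized here
 * to the worked example: process 0 owns banks 7, 4, 3 and process 1 owns
 * banks 5, 2. */
static int8_t page_table[NLOGICAL] = {
    -1, -1, 7, 4, 3, -1, -1, 5, 2, -1, -1, -1, -1, -1, -1, -1
};

/* Per-process bookkeeping kept in the process table entry. */
struct pt_range {
    uint8_t first;                   /* first index owned in page_table */
    uint8_t count;                   /* number of banks owned */
};

static struct pt_range proc0 = { 2, 3 };
static struct pt_range proc1 = { 7, 2 };

/* Map a process-relative bank number to a physical bank. */
static int8_t phys_bank(const struct pt_range *p, uint8_t n)
{
    return page_table[p->first + n];
}
```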

Further comments on the page table or "swap table" for devices that allow block-by-block mapping:

(a) For memory, the page table can be statically allocated in the kernel's data segment, depending on how much memory we expect to have. (For the Z180, maybe a 256-entry table corresponding to 4 kbyte logical pages in a 1 Mbyte address space; indeed I implemented exactly this in earlier Z180 work, though that was based on a model where the base register selects a single page, not a contiguous region). For swap, the "swap table" would similarly be statically allocated in the first "n" sectors of the swap partition.

(b) Some interesting tricks are possible with this table. For instance, if we have the moveable allocator, then the table can be the same size as physical memory, and defragmentation is a very cheap operation since the page table entries are only a few bytes, so quick to move. (In the swap case, defragmentation might require reading and writing a few blocks of the first "n" sectors, but is still relatively cheap). Or, we could simplify by using the simpler first-fit non-moveable allocator (similar to malloc), and just make the page table or "swap table" somewhat larger, a factor of 4 wouldn't be too costly as entries are small.

So, having settled on what is possibly quite a grand vision, but which makes sense to me, I went ahead and implemented a defragmenting allocator. It would link into the current kernel more or less directly.

The kernel must provide routines that can move a process's image up and down in physical memory, and/or routines that can move a process's swap image up and down in swap space. The allocator does not care whether this is the actual memory, or the entries of the indirection table mentioned above, or whether they reside in memory or disk. The kernel must also provide an 8-byte structure in the process table for each of memory and swap, which are privately used by the allocator to track the allocations. The kernel would also refer to these in order to set up the process's base register or access swap data.
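
Roughly, the kernel-supplied hooks look like the following declarations; these prototypes are a paraphrase of the description above, not necessarily what moveable_pool.git actually uses:

```c
#include <stdint.h>

/* Opaque 8-byte bookkeeping the allocator keeps per process, one instance
 * for core and one for swap, stored in the process table entry. */
struct pool_entry {
    uint8_t priv[8];
};

/* Kernel-supplied: move 'size' bytes of a process image from 'from' to
 * 'to'.  For core this is a physical memory move (the process's base
 * register is then refreshed from the pool_entry); for swap it is a block
 * copy, or merely an update of the indirection table if one is in use. */
void core_move(uint32_t to, uint32_t from, uint32_t size);
void swap_move(uint32_t to, uint32_t from, uint32_t size);
```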

I put the initial commit of the allocator here: https://git.ndcode.org/public/moveable_pool.git

To use it, build it by running "make", and then run "./n.sh". This will generate a test script containing a list of alloc, realloc and free operations to do. It runs them firstly in non-moveable mode, so that some of the alloc and realloc operations fail. Then it runs them in moveable mode, all succeed and you can see the diagnostic messages from the callback that moves stuff around. Dummy memory contents are provided when blocks are allocated, and verified before freeing or making the blocks smaller. It works really well.

There's a bit of duplication in the allocator code, that can be got down by subroutinizing etc later on.

The next thing I will do is build a model that includes two pools (memory and swap), and generate a similar test script containing process wakeups and sleeps, process creation and brk requests, etc. This will verify that we can move processes smoothly between memory and swap, and never deadlock.

EtchedPixels commented 5 years ago

For Tormod's trick the stack does want to be part of the same banks as the code/data because in many cases that saves you space (18K of program and 6K of stack is 2x16K not 3x for example). Unix and traditional unix always had data/bss/brk/hole/stack, and some apps break if you change it.

V7 and the like on PC were of course even crazier because of the segments. They set SS = DS and placed the data and bss at the bottom and stack at the top. To avoid wasting 64K chunks they interleaved programs into the gaps in each other !

First bit

  1. Yes although on some platforms the I/O is slow enough it would be worth writing out / reading back only the used chunks. The maps don't change however.

  2. The other case of this is systems with a single flat 32bit space, but right now I've not done swap on them because the memory management and copying back and forth for forked and overlapped processes is horrible. MAPUX did it on the Amiga with UZI but the complexity is eww...

  3. Agreed, and if you have an in memory map it's not that expensive as well as being possible to try and find groups when doing swapping. Paging is different altogether when you have real VM (eg Z280)

  4. Agreed - and there are some machines with lots of banked memory and crap I/O where swap makes no real sense.

  5. For swap yes, and for Z180/Rabbit style definitely

  6. Makes sense I think

I am not sure you want to be able to move things up and down. The classic mainframe systems with base/limit pairs always used a first fit algorithm until there was no room and then used a first hole algorithm until they'd done one pass through memory. In other words once they had something that should fit but didn't it started booting stuff out to make space gradually walking up the memory.

I believe the theory is that most stuff is long lived and stable so as well as avoiding thrashing the sweep tended to push all the stable stuff down one end and the rest remained the transient pool under memory pressure. I am not sure it translates - another reason of course was such machines had to use the CPU to memcpy but the disk interfaces were point and fire so a compaction didn't burn CPU.

The memory compaction is trivial enough (memmove), the disk one is a bit hairier but certainly doable. I guess a swap table is preferable to compacting disk swap.

nickd4 commented 5 years ago

I see what you mean about the mainframe style algorithms. One difference that we have to consider in our case is that we might have quite a small swap partition. I'd think that most mainframes would have been swapping to a hard disk measured in Megabytes, especially multiuser systems.

If we had this luxury then I would consider preallocating all swap. So for instance with 1 Mbyte memory and 5 Mbyte swap, you could run only 5 MBytes worth of processes, and process creation would return an error if space in the swap could not be allocated, even for short lived processes that never actually need to swap.

This would be great because (1) you would never have swap deadlocks since you'd always be swapping out to a different swap location than swapping in, and (2) you could do a lovely suspend operation where all processes get swapped and the system shuts down, then reverses this next boot.

On the other hand, with say a 128 kbyte Apple IIe having a 140 kbyte swap floppy, you'd expect and want to be able to run about 256 kbytes of processes (allowing 12 kbyte for overhead). Similarly, if a Linux user mounts an 8 Gbyte swap partition on a machine with 4 Gbytes of RAM (as I've often done), he/she expects 12 Gbytes not 8.

With these use cases I don't think it is feasible to defragment memory via the swap. You could do it in 12 kbyte pieces in the Apple IIe case above (since I would be reserving about that much space from the combined memory plus swap to avoid swap deadlocks), but there'd be no real advantage as compared with just defragmenting the memory directly.

Admittedly the Apple IIe might not be the best example as you'd likely be using the bankfixed.c-like scheme, however I think you can see what I am getting at.

Anyhow, I will proceed to do a prototype for the Z180 case and we can then discuss the merits of extending the ideas to other platforms, I think they do generalize mostly.

EtchedPixels commented 5 years ago

Sounds good.

nickd4 commented 5 years ago

I did a further simulation where there are two pools, one representing core and the other swap. It runs a test script containing the following commands:

alloc - create a process (like fork), victimizing others as needed
realloc - resize a process which is fully in core (like brk), victimizing others as needed
run - bring a process fully into core, victimizing others as needed
free - delete a process (like exit)

To execute, build and then run ./o.sh, it will run a simulation as follows:

The probabilities of each action are set so that the memory and swap are under heavy pressure. In that case, swapping a large process out and another large process in is done in 16-block chunks.

In the running state there can also be a partially swapped process (the least recently used process). This works because swap is arranged like a stack that accepts core blocks in LIFO arrangement.

In the simulation, dummy process data is provided and checked at process exit, by which time it has moved through swap possibly multiple times. This works, so it seems likely my logic is correct.

To use the system the user has to provide the routines in core.c and swap.c, which are as follows:

I will do further work on this so that swap-to-swap copies are done via memory, reducing the number of routines you have to write (via the 4 blocks of reserved core, which would become mandatory).

I will then make it so that the swap pool is not moveable but rather it uses indirection like a filesystem, this will avoid swap-to-swap copies altogether. (I want to evaluate with and without indirection).

See the repo at: https://git.ndcode.org/public/moveable_pool.git

nickd4 commented 5 years ago

I made a lot of improvements to the simulator, so that you can choose various compile time options like MOVEABLE_CORE, MOVEABLE_SWAP, PREALLOCATE_CORE, PREALLOCATE_SWAP, and just now, INDIRECT_SWAP. These correspond in a complicated way to different kinds of hardware.

So, for instance, the Z180 would use MOVEABLE_CORE, because the Z180's MMU works with contiguous regions, so we need to be able to get things contiguous without risking fragmentation which would destroy the system's guarantees. If swapping to CompactFlash it would also use INDIRECT_SWAP. That means that block allocation is random and no compacting is required.

When using INDIRECT_SWAP, the MOVEABLE_SWAP and PREALLOCATE_SWAP options refer to the indirection table, not the backing store. We would like to turn MOVEABLE_SWAP off when using INDIRECT_SWAP, because there is no real point in being able to defragment the indirection table. We can instead just make it a bit larger than needed. The risk with turning MOVEABLE_SWAP off, is that guarantees could be destroyed -- suppose we are trying to kick a process out, and there is no way to get a contiguous region of the indirection table because of fragmentation. Thus, I made it that when MOVEABLE_SWAP is turned off, PREALLOCATE_SWAP must be turned on -- meaning that any problems due to fragmentation (of memory or indirection table) are detected and rejected to the application by an error return from a fork or brk system call, instead of a time-bomb further along.
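
Expressed as a compile-time check, that combination rule comes down to something like this (a sketch; the actual moveable_pool.git code may enforce it differently):

```c
/* Turning MOVEABLE_SWAP off is only safe if all space is preallocated up
 * front, so that fragmentation shows up as a fork()/brk() error instead of
 * a time-bomb later on. */
#if !defined(MOVEABLE_SWAP) && !defined(PREALLOCATE_SWAP)
#error "MOVEABLE_SWAP off requires PREALLOCATE_SWAP on"
#endif
```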

If PREALLOCATE_SWAP is used without indirection then it means that all processes must have their backing store allocated in the swap in order to be able to execute at all. This could be handy for implementing a suspend feature. It also compiles to less code, since swap allocation is simply a malloc-like operation and all guarantees are preserved without ever having to defragment the swap.

Hmm! The code is pretty complicated, but will compile down quite small. One problem is that all the conditional compiles make it hard to understand the code. So what I could do is, for each of the common cases, pre-run the C preprocessor to get a much smaller module that can be put into a port.

I think I might try compiling this as a userspace program and see what the cost will be in code space.

nickd4 commented 5 years ago

I haven't had time to work on this lately. Finally got a chance to do something this evening.

So I've decided that the moveable swap is not a good idea. The deciding factor was basically the difficulty of copying swap from one place to another, when core is full (which it is when you're trying to swap out, and that is when you need to defragment the swap). Without some spare core you can't do it, especially as that use case was supposed to be for mechanical disks so you want a big batch size.

For the indirect swap I had planned to implement an indirection table on disk, taking the first "n" sectors of the swap partition, and then use contiguous parts of this indirection table per process to locate the blocks of the process. This would have been cute if the indirection was conditionally compiled, but since the indirection will always be there, I decided to do it another way. I will just use the filesystem code.

So basically I see two ways of organizing the swap. You could swap to a mounted filesystem, in which case the swap works a bit like a pipe -- each swapped process has an invisible file that eats some disk space but can't be accessed through the directory structure. Alternatively you could have a dedicated swap partition which is created with mkswap, that will be identical to mkfs but won't create a root inode.

I think the versatility of being able to use a mounted filesystem for swap is helpful for very constrained systems (for example, running from 2 x 1 Megabyte floppies), and also I think that the reuse of the filesystem code for the block allocation and indirection stuff will save significant kernel code space.

A slight complicating factor of doing it this way, is the accounting for the allowable total process size. As I have mentioned previously, I am fine to reject process-creation or brk-setting if it exceeds the total amount of core and swap available, I am not fine to reject a swap attempt because we find out that we do not have enough core or swap at some arbitrary time later (and OOM-killing is out of the question).

So I have to account for the maximum total size of a process, including any indirect blocks, rounded up to an appropriate block size to account for internal fragmentation in core or swap (4 kbytes for the Z180). Then I have to make sure this total doesn't exceed what is available (with dedicated swap) or reserve the swap part of the total away from normal filesystem activities (when swapping to a mounted filesystem).

To test the concept I am working on changing the test script to use the FUZIX filesystem for the swapping. I have taken the filesystem code from the UCP utility, although I since found out it's a bit out of date and buggy. I may be able to integrate the kernel and UCP filesystem code at some point, for the moment it is sufficient for my purpose. I started by just writing a separate inode test script, which works.

To run the inode test script you run ./p.sh; it will create an 8 Mbyte filesystem in "fs.bin" and then randomly create and destroy up to 64 inodes simultaneously, each of which can have up to 2 Mbytes of dummy data stored in it. I set it for a pretty long test (16384 events), where each event is either creating an inode with some dummy data, resizing an existing inode with dummy data, or freeing an inode.

The test script runs perfectly, according to the predicted total sizes of the inodes and the number of blocks stored concurrently, and the dummy data is preserved as it should be. So seems from this that I can correctly store the dummy data in the inodes and account for the sizes including the indirect blocks. The next step will be to integrate this code into the process test script, where it will store the swap data.

Note that to get this to work I extended the f_trunc() function to allow the size to be set arbitrarily, instead of only being allowed to destroy a file with it. It recursively frees all blocks (direct, indirect or double indirect) beyond the given threshold, after rounding the threshold up to the nearest block. I will integrate this code into the kernel when I'm ready to integrate the rest of it, and add truncate()/ftruncate() APIs.
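
As a rough idea of the shape of that change, here is a heavily simplified sketch: it only shows the direct blocks, whereas the real code also walks the single and double indirect blocks, and the field names merely follow the classic UZI inode layout, so they may not match fuzix_fs.c exactly.

```c
#include <stdint.h>

#define BLKSIZE 512
#define NDIRECT 18                    /* direct slots in the classic UZI dinode */

typedef uint16_t blkno_t;

struct dinode_like {                  /* reduced stand-in for the real dinode */
    uint32_t i_size;
    blkno_t  i_addr[20];              /* 18 direct, 1 indirect, 1 double indirect */
};

extern void blk_free(blkno_t b);      /* assumed block-free primitive */

/* Truncate to an arbitrary size: free every direct block past the new
 * size, which is first rounded up to a whole block. */
void f_trunc_to(struct dinode_like *ino, uint32_t newsize)
{
    uint32_t b, firstfree = (newsize + BLKSIZE - 1) / BLKSIZE;

    for (b = firstfree; b < NDIRECT; b++) {
        if (ino->i_addr[b]) {
            blk_free(ino->i_addr[b]);
            ino->i_addr[b] = 0;
        }
    }
    /* ...the real change also handles i_addr[18] and i_addr[19] here, with
     * freeblk_partial1()/freeblk_partial2() for the indirect block that
     * straddles the truncation point. */
    ino->i_size = newsize;
}
```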

See the repository here: https://git.ndcode.org/public/moveable_pool.git

EtchedPixels commented 5 years ago

> I am fine to reject process-creation or brk-setting if it exceeds the total amount of core and swap available, I am not fine to reject a swap attempt because we find out that we do not have enough core or swap at some arbitrary time later (and OOM-killing is out of the question).

I would agree - it makes sense for a paging system to overcommit but not really a swapping one

> Alternatively you could have a dedicated swap partition which is created with mkswap, that will be identical to mkfs but won't create a root inode.

Linux does something a little different for fs-based swap. It's a lot more complicated now in implementation, but the original idea looks like it would fit in. In the original Linux swapping you either swapped to a block device, or to a single swap file (that acts like a swap device). The only real difference is that your swap table of indirections also does a bmap() on the inode unless the file was created sequentially. That short-circuits all the expensive readi/writei paths.

The ftruncate is definitely useful if it can be made small enough.

nickd4 commented 5 years ago

Thanks a lot for the feedback. I'm not sure I understand fully about the original Linux swap implementation, from what I gather you're saying that it partitions the swap into a dedicated set of indirect blocks plus a dedicated set of data blocks, and it may or may not use the indirect blocks depending on whether a contiguous region is available? So quite similar to my original design?

Anyhow, I made some progress. What I decided to do is to reverse the order of how processes are swapped in and out, so that to swap out, the data is essentially popped off the start of the process's core image and appended to the process's swap image, whereas to swap in, the data is "dis-appended" from the process's swap image by reading the tail of the file and truncating it, then pushed onto the start of the process's core image. This means the core image has to behave sort of like a deque, in the sense that the application itself by brk-setting can push/pop the tail whereas swapping can push/pop the head.
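
A rough sketch of the bookkeeping just described, with illustrative field names rather than the actual moveable_pool.git code:

```c
#include <stdint.h>

/* The process image treated as a deque of blocks: at any moment the first
 * 'in_swap' blocks of the address space sit, in order, in the swap file,
 * and the remaining 'in_core' blocks sit contiguously in core.  brk() only
 * ever grows or shrinks the tail of the core part. */
struct image {
    uint32_t in_swap;    /* blocks at the head of the image, held in swap */
    uint32_t in_core;    /* blocks at the tail of the image, held in core */
};

/* Swap out: the caller has written the first n in-core blocks (the current
 * head of the core part) to the end of the swap file. */
static void note_swap_out(struct image *im, uint32_t n)
{
    im->in_swap += n;
    im->in_core -= n;
}

/* Swap in: the caller has read the last n blocks of the swap file back in
 * front of the core part and truncated the file. */
static void note_swap_in(struct image *im, uint32_t n)
{
    im->in_swap -= n;
    im->in_core += n;
}
```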

How I had it before was more like Towers of Hanoi, it would pop off the tail of the core image and then push onto the tail of the swap image, making the swap image end up in reverse order. Then I had some rather tricky code to complement all addressing to re-reverse the swap image and thus enable multiple blocks to be written linearly from core to swap. It was kind of confusing, although efficient in code space. By reversing things this way, I had it so that the realloc() operation was actually creating space at the START of the swap image rather than the end, although it didn't know or care about the physical layout.

So with the new deque approach, it compiles to more code space, but I don't think that will matter in the Z180 (and PC/XT) case that will use it -- because the Z180 (and PC/XT) have much better memory capacity and management ability than the other, more constrained, platforms that we support, and I'm aiming for a >64 kbyte kernel with far calls, the way I had originally implemented for the cash register.

Also, the deque is good because in the ideal case you'll have processes A followed by B in memory, and each chunk swapped out from B will create room to extend A without moving stuff around. The Towers of Hanoi approach couldn't do this. And in similar cases where say there's process A, B then C and you're swapping out C to swap in A, it will move B out of the way if possible, otherwise only move process B. Oops no that doesn't compute. Possibly could implement extra code to make that happen.

So the deque is working and hence the current test script is using swap in a file-like manner. I've started to incorporate the inodes code, so that it will use the filesystem for swap rather than my pool allocator.

About the f_trunc() code size, have a look at lines 1359 to 1455 inclusive at this link:
https://git.ndcode.org/public/gitweb.cgi?p=moveable_pool.git;a=blob;f=fuzix_fs.c;h=b198505e21e04b811165d53b9bbdf77fbfcf27d7;hb=af94a9859755031ef4ad37283e7ebfe07196c24d
See the new routines freeblk_partial2() and freeblk_partial1(), which are called for the case of the double or single indirect block respectively that straddles the truncation point. This code took an evening of head scratching and several failed attempts to get right; I am fairly happy with how it turned out.

nickd4 commented 5 years ago

Making progress -- I'm ready to start integrating it into the kernel.

I improved the code a bit and the separation between generic and test-code. The generic code is in process.c, pool.c, core.c and optionally swap.c (depending on whether we are going to handle space allocation in the swap partition or whether we will leave this up to the filesystem code).

The generic code defines the following abstract routines:

The core and swap addresses and sizes are implemented as long, which gives 32-bit addressing; this is necessary for the Z180 and/or PC/XT, which allow up to 1 Mbyte physical. I believe the current kernel readi() and writei() can already specify far addresses, but I will have to look into that as part of the integration process, since at the moment I'm using the fuzix_fs.c from the UCP utility, which is much simpler.

The test-code is in process_test_run.c which provides test-versions of the abstract routines based on memcpy() or equivalent, and/or fuzix_fs.c, and runs a test script as described earlier in this thread.

Because the code was nearly impossible to understand (simple in principle, complex in practice), and also the options for conditional compilation are rather confusing (certain options require other options, certain combinations not allowed), I pre-ran the C preprocessor to generate 6 nice clean versions of each file, and suddenly I can see my logic clearly. Essential since in 6 months time I probably won't understand the original :) But the original source is now just a private note, and won't go into FUZIX.

To see the preprocessed code, you check out the repo and run ./gen.sh, this populates 6 directories:

Indirect core: for platforms like PDP-11 or the emulated 8085 which allow each page of logical address space to be mapped independently, and thus do not require a process's core image to be contiguous:
indirect_core_indirect_swap
indirect_core_inode_swap
indirect_core_preallocate_swap

Moveable core: for platforms like Z180 or PC/XT, which have MMU base registers or segment registers allowing code or data to be accessed from anywhere in physical memory as long as it's contiguous:
moveable_core_indirect_swap
moveable_core_inode_swap
moveable_core_preallocate_swap

If we are going to manage space allocation in the swap partition ourselves, there are two ways of doing it: If swap is not plentiful then we use xxx_core_indirect_swap which keeps a free bitmap and a block indirection table in kernel data space. If swap is plentiful then we use xxx_core_preallocate_swap which restricts the total amount of processes to be run to equal the swap space ignoring the core space. The preallocated way does not require an indirection table as everything is contiguous in swap. And note that both of these options do not have good guarantees, since they omit the moveable pool code and will simply reject process creation or brk-setting if failed due to fragmentation. For the indirect case, swap fragmentation is less of a problem since the indirection memory can simply be made bigger by say 3x.

If the swap is going to be a filesystem with inodes and indirect blocks, we use xxx_core_inode_swap. This is the recommended way, because fragmentation is never an issue, since the indirect blocks do not need to be contiguous. At present it would have to be a dedicated partition, as I have not implemented reservations to allow it to share with ordinary filesystem use and still maintain the system's guarantees.

Another advantage of inode swap over own-managed swap is the saving of kernel data space, since the indirection information is stored in the filesystem and not in precious kernel data space. I was originally going to make the own-managed swap system do this too, which is why I separated out the indirection stuff into core.c and swap.c, so that swap.c could get a more complicated implementation that uses the buffer cache to manage the indirection table. But I haven't done this, as I think the inode way is much superior and the own-managed way shouldn't be supported long term. I only generated these versions because it was easy to do, so we may as well evaluate them. The xxx_core_preallocate_swap does have some advantage of extreme simplicity and small code/data space, if you really have plenty of swap.

Anyway, back to the usage instructions: you run ./gen.sh to generate the 6 directories as above, and then you go into one of these directories and run make and then ./o.sh to generate and run the test script. You also have to have done make in the top level to get the test script generator and mkfs.

If anybody is interested, follow the above steps then inspect moveable_core_inode_swap/process.c to see how I am going to manage the core and swap for the Z180 implementation. It is just 320 lines of code and very clear and easy to understand. The routines in this file are as follows:

estimate_size() -- calculate size of the core image (rounded up to 4 kbytes) and the swap image (rounded up to 0.5 kbyte block and indirect blocks added). Take the maximum -- this is necessary for the system's guarantees so that process creation or brk-setting can be rejected if problems will occur later.

process_init() -- call this once in the beginning, to tell it how many processes to use and how much spare space to reserve for the transfers between core and swap (under heavy pressure we must swap out small pieces to make room to swap in each small piece until a process to run is fully swapped in).

do_swap_out() -- an internal routine which frees enough core to swap in a specified amount of the process we want to run, it has its own routine because it may need to loop through several victims.

process_alloc() -- supports the fork() operation or possibly in future an execve() after a vfork(). Possibly swaps out and/or moves core to find a contiguous core region of the requested size, or fails.

process_realloc() -- supports the brk() operation. Possibly swaps out and/or moves core to extend the current contiguous core region (which must be fully swapped in; the process must be executing) or fails.

process_run() -- called from the scheduler to bring the process fully into core if it isn't already. Repeatedly calls do_swap_out() and then does the swapping in part itself. Note that this can't fail.

process_free() -- supports the exit() or kill() operation. Process can be partially in core or swap.

The other files pool.c and core.c are basically support code and should be mostly self-explanatory.
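
For reference, the interface boils down to something like the following declarations; this is a paraphrase of the routine descriptions above, so the exact prototypes and return conventions in the repo may differ:

```c
#include <stdint.h>

/* Worst-case size of a process: core image rounded up to 4 kbytes, or swap
 * image rounded up to 512-byte blocks plus indirect blocks, whichever is
 * larger.  Used to reject fork()/brk() that could break the guarantees. */
uint32_t estimate_size(uint32_t bytes);

/* One-time setup: number of process slots and how much spare core to keep
 * for staging transfers between core and swap. */
void process_init(uint8_t nproc, uint32_t reserve);

/* fork()-style allocation: may swap out and/or move core to find a
 * contiguous region of the requested size; nonzero return means failure. */
int process_alloc(uint8_t proc, uint32_t size);

/* brk()-style resize of a fully resident, executing process; may swap out
 * and/or move core; nonzero return means failure. */
int process_realloc(uint8_t proc, uint32_t size);

/* Called from the scheduler: bring the process fully into core.  Cannot
 * fail, because the admission checks above preserve that guarantee. */
void process_run(uint8_t proc);

/* exit()/kill(): release a process that may be partly in core and partly
 * in swap. */
void process_free(uint8_t proc);
```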

I'm having a problem with SDCC in that it hangs when compiling the test-system; I'm investigating. Well, I don't really need the test-system for kernel integration, but I want it, so as to do things step by step.

nickd4 commented 5 years ago

I got a bit disappointed with SDCC, it seems to be over-complex, slow and bug-prone. A significant amount of work has gone into it, and kudos to the devs since it has some very nice features that I was initially quite excited about using (the global pointers and so forth), but on balance I feel that a ground-up redesign and new approach would be needed to get it to something I'd consider robust and scalable.

So I started to look seriously at ACK again, I have looked at it several times over the years and been put off by its design which is also a bit over-complicated in different ways. But after hacking on it for 1-2 weeks and making some experimental changes to the EM machine assembler-link-editor and simulator, I'm starting to come around to the ACK approach. Also, the code it generates seems surprisingly good.

So for my Z180 port I'm quite keen to resurrect the ACK Z80 target. I have got it basically working and I am ready to try to compile FUZIX with it. As a preliminary step I'm trying to compile the v85 and v8080 target, and I do not understand one thing: where does the ack -mfuzix come from? I expected it to be ack -mi80 or similar. I cannot see any reference to a platform called fuzix in David Given's ack?

Any tips would be helpful. I would also like to know whether we use ACK-style executables in those ports or whether we convert them to put a standard FUZIX a.out-ish header on them. Or did we hack on the platform's linker to produce the standard FUZIX executables, hence the -mfuzix in the compilation step? I saw the ack2kernel tool, although I haven't looked at it all that closely at this stage.

EtchedPixels commented 5 years ago

The fuzix file is in the Build/ directory. Drop it into your ACK environment. I know @davidgiven was also looking at the Z80 state with what we learned from 8080.

ack2kernel and friends just turn an ack image into a Fuzix one. It doesn't currently know how to build relocatable binaries, but that will be much the same as with SDCC (build it twice, binary diff). I did poke at ack a bit for Z280 but its obsession with a frame pointer is a killer (as with native 8085 rather than 8080), so I didn't get too far because I couldn't figure out how to make it generate all the offsets as SP-relative. I also looked at ANSI pcc a bit but Z80/Z180 was just too weird to encode the tables easily.

I am not sure how well it will work in general - the ack linker is dumb as molasses and can't cope with the complex layouts needed for many systems. OTOH I am very interested to see how small an ack for Z80 targeted at size could get. SDCC is good at fast; it's not so good at small.

ack2kernel should be usable as-is for Z80, and the same goes for the user tools, providing you build them at a valid load address for your system. The bigger problem will be signal handling. ACK on 8080 at least does not generate re-entrant code. SDCC on Z80 does, and our signal paths currently assume that on Z80 (it's also why Z80 really can't run 8080 binaries right now).

Alan

nickd4 commented 5 years ago

OK, thanks for the tips and the detailed run-down of current ACK problems. I think those things could be fixable given the will to really get down and dirty in the ACK code and make structural modifications. I have not looked closely at the target-specific backends yet; if they are similar to what I see for the EM machine it should be OK.

I also looked at pcc a bit, and although Steve Johnson is a genius (and is still active on early Unix mailing lists) I am not sure that pcc is his finest work; the table-matching algorithm seems much too heuristic to me. Perhaps there were good reasons at the time. But a detailed study of the AT&T documents about the internal design of the Ritchie vs. Johnson compilers convinces me that Ritchie had the better algorithm. I made some efforts to pick up Ritchie's code as the basis for a multi-platform compiler; it is highly intricate work and I eventually decided that replacement is easier.

Yes, I am highly focused on the size of ACK vs SDCC and on the self-hosting potential. It is tough to choose between these compilers as both have good points and bad points; for me, the potential that FUZIX could some day compile itself on Z80 (like 2.11BSD can on the PDP-11) is a deciding factor for ACK.

feilipu commented 5 years ago

@Nickd4 have you looked at sccz80 in the z88dk?

It is under active improvement, is much faster than sdcc, and supports both classic and new libraries.

Most of the older z80 hardware is supported via sccz80 and the classic library.

davidgiven commented 5 years ago

I think the ACK's unlikely to self-host on the Z80 --- the smallest architecture I know of that it runs on is the i86, which both has much better code density than the Z80 and also cheats hugely by using split I/D address spaces, effectively doubling the available space. The ncg ACK binary (the core code generator) is 65kB, and the code segment is essentially full.

The ACK's table-based code generation algorithm is interesting, however, because it's a single-pass non-basic-block non-AST architecture, which makes it suitable for streaming from/to disk. It doesn't keep much state in RAM. With simplified bytecode it may be possible to write a new, smaller code generator that would fit on these systems.

(Do you have any links to the papers you described on Ritchie and Johnson compilers?)

nickd4 commented 5 years ago

@feilipu I must admit I have not looked closely at z88dk; that's because I looked at it about 15 years ago when undertaking the initial UZI work and found that it did not support a sufficient subset of C (I'm enthusiastic about Small C, but more from a teaching / applications viewpoint; system-level work needs struct etc). I understand that it's improved a lot and is now mostly a proper C compiler, so I should take another look.

@davidgiven I probably should have said Z180 not Z80, recalling that using the Z180 I can create something similar to split I/D as described in the first post(s) of this thread. On the other hand, I will not give up on making the Z80 self hosting :) That is a challenge to tackle when I get to it though :) I was reading the paper about the ACK "fast C compiler" today and I think that approach might be fruitful here.

About the Ritchie and Johnson papers, see here: https://s3.amazonaws.com/plan9-bell-labs/7thEdMan/bswv7.html The two papers are in the file v7vol2b.pdf starting at page 179 of 250. It was a real find when I came upon this, despite it being a bit late in the day, as I'd already figured out the Ritchie compiler mostly. It's absolutely essential to read this if looking at the Johnson compiler, due to the less-obvious structure.

I spent the day building and debugging platform-v8080; I fixed a couple of minor things but mostly just dealt with my own mistakes, like the fact that if I modify Kernel/platform-v8080/config.h I need to do a full build, or at least touch Kernel/include/kdata.h to rebuild the most important modules. We should really add automatic dependency tracking where possible. Anyway, I learned a lot about how it all works.

davidgiven commented 5 years ago

Tangent about compilers: I actually implemented a brand new mcg backend based on iburg. It works fairly nicely, although the register allocator I implemented (puzzle-based register allocation) turns out to be a complete disaster (although possibly not as bad as ncg's register allocator). I've been slowly adapting it to use proper graph colouring but it's impossibly slow; I have an O(n^2) algorithm in there somewhere. So I'm interested in alternatives --- thanks. This approach is completely unsuitable for small systems, of course.

By the 'fast compiler' do you mean the code expander, ceg? I haven't actually spent any time looking at that, but a cursory glance shows that it's simple enough to work on these systems. At the expense of nauseatingly bad code, of course. I'm not sure this will help, though. You still need the full EM C compiler, which for the i86 is 80kB.

Incidentally, re Z180s: old Brother typewriter/word processors are based on these, are really cheap off ebay, and I have a design for a cheap USB floppy controller which can read and write Brother floppies. I'm waiting for a disk to arrive by post and then, hopefully, I should be able to reverse engineer the executable file format. I don't think they have enough RAM for Fuzix but they should certainly run Minix.

EtchedPixels commented 5 years ago

@davidgiven Code size is actually not the killer on Z180 at least because you can generate banked rolling window code which some Z180 toolchains can already do. The Rabbit CPU took this and actually added the instruction level helpers for it (but not in a way that lets 'fork' work... )

For the typewriter I assume you mean UZI rather than Minix ?

One thing that would be interesting to me on the Z80 side is how compact a code generator you can get out of ACK Z80 if you are willing to take a performance hit. SDCC is performance focussed so whilst I've gotten a few size improvements from Philipp it's not his focus.

Certainly in Z80 asm you could fit Fuzix into 32K/32K; the challenge is getting the compiler to manage it.

nickd4 commented 5 years ago

@davidgiven Ah, that's cool, I had a quick search and found this: https://typewriterdatabase.com/1997-brother-wp5600-mds.3510.typewriter A problem with the Z180 DIP package is that A19 is not brought out, so it can only address up to 512 kbytes of memory. I know this because we used to make a printing scale containing this chip. (The cash register models all had 1 Mbyte of RAM and used the PLCC version of the chip, which has A19.) The pictures in this article aren't good enough for me to see the RAM capacity; perhaps you know more?

I also looked at iburg and found some articles; extremely interesting stuff. Sort of like yacc for backends, I think. It was a bit mysterious and will deserve another, closer reading. I remember having just this feeling about yacc when I first encountered it (I have now reimplemented yacc many times and am planning to make a Python pip repository release of my Python yacc workalike when I get to it).

About the code expander CEG, I think I do mean that; I just had a look in fcc/cemcom/proto.main and it mentions libraries called CEopt, ce and back, and the back library is in a sun3 or vax4 directory. So if that refers to the same thing as CEG then yes. Although the code might be nauseating, according to the paper the peephole optimizer improves it enormously. An issue for me with the approach might be that the ACK fast compilers generate object code directly rather than source code. This was cool when compilers like Turbo C did it, but it would cause significant problems for porting Unix-style Makefiles. So although ACK invocation can be a bit weird, the normal compilers are still basically compatible.

davidgiven commented 5 years ago

Er, sorry, s/Minix/CP\/M/. I haven't identified the RAM in the thing yet; there's a weird double-decker chip which I suspect is it, but every time I take the lid off I forget to photograph the board. I suspect there's not a lot, maybe 64kB. I made a rather rambly video about it which shows the inside at quite low resolution. https://youtu.be/YFQprySL82Y?t=1393 (The floppy disk, by the way, is a wonderful 40 track 3.5" thing running at probably 360rpm with a custom GCR encoding which PCs won't touch.)

Re iburg: here's a MIPS code generator table I did with it. https://github.com/davidgiven/ack/blob/default/mach/mips/mcg/table But it definitely needs a better register allocator, and register moves are a bit of a disaster --- see the accompanying C file; and, of course, it's intrinsically tied to the ACK architecture, which means it has to have a physical frame pointer, and I doubt it'll work at all for the Z80.

nickd4 commented 5 years ago

Getting back to the discussion about swapping, making efficient use of swap, and my swapper code:

I made fairly significant progress in porting the swapper over to FUZIX and getting the test script to run (at least in user space); however, I had to re-evaluate things a bit, for several reasons:

Because of these problems I've come up with a simpler way of categorizing the cases to handle:

For example, if you had say 48 kbytes of core and 140 kbytes of swap on a floppy (the Apple IIe idea), you'd be allowed the full 48 + 140 kbytes of processes, less a transfer margin of say 16 kbytes. On the other hand, if you had say 256 kbytes of core and 4 Mbytes of swap on a CompactFlash (the Z180 idea), you'd only be allowed 4 Mbytes of processes, with no transfer margin needed, but less than 4 Mbytes when fragmented.
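
As a rough sketch of those two sizing policies (the helper name and the policy flag are hypothetical; only the numbers in the comments come from the examples above):

#include <stdint.h>

/* Policy 1 (small swap, e.g. 48K core + 140K floppy swap):
 *   allowed = core + swap - transfer margin  ->  48 + 140 - 16 = 172 kbytes.
 * Policy 2 (plentiful preallocated swap, e.g. 256K core + 4MB CF):
 *   allowed = swap only (less in practice when fragmented). */
static uint32_t allowed_process_bytes(uint32_t core, uint32_t swap,
                                      uint32_t transfer_margin,
                                      int swap_backs_everything)
{
    if (swap_backs_everything)
        return swap;
    return core + swap - transfer_margin;
}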

The new way of doing it is expected to make heavier use of the block free bitmap and indirection stuff, and also to make heavier use of the non-moveable allocator, so I'll do some rework to reduce code size (if core is indirect, its bitmap/indirection code becomes common with swap; otherwise core is moveable and its allocator becomes common with the non-moveable allocator, with the moveable behaviour controlled by a flag).

nickd4 commented 5 years ago

I haven't had time to work on this lately; I made a few attempts, but I was also a bit stuck on rewriting some gory stuff in the allocator that needed an uninterrupted block of time to figure out.

So now I have rewritten the allocator to combine a number of different routines into one, whose behaviour is controlled by a mode argument. I hope that this will cut down on the code space.

Previously there was a routine pool_realloc() which worked more-or-less like C's realloc() in that it tried to resize in place and if that failed it did a first-fit and a copy.

There was also a routine pool_realloc_base(), which was the symmetrical version: it tried to resize in place to add or remove at the base of the block, and if that failed it did a last-fit and a copy. (I need this because as stuff gets appended to the swap file it gets "popped" from the core image, and vice versa.)

Now the routines are unified: you provide an offset argument, which tells where the new space will be added or removed (at the start of the block, at the end of the block, or some combination). For example, if you truncate at the start while appending to the end, the block will behave like a pipe.
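
To give an idea of the shape of the unified call, here is a hypothetical sketch; the names, types and flag values are my assumptions, and only the general interface (one routine, an offset, mode/direction flags and move callbacks) comes from the description:

#include <stdint.h>
#include <stdbool.h>

#define POOL_FIRST_FIT 0x01  /* search direction when a copy is needed */
#define POOL_LAST_FIT  0x02
#define POOL_COMPACT   0x04  /* allowed to move other moveable blocks */

/* Callback used when compaction relocates a live block. */
typedef void (*pool_move_cb)(uint32_t from, uint32_t to, uint32_t len);

/* Resize a block.  'offset' says where space is added or removed
 * relative to the existing contents: at the start, at the end, or a
 * combination (e.g. truncate the start while appending to the end, so
 * the block behaves like a pipe).  'base' is updated in place if the
 * block has to move.  Returns false if no layout works. */
bool pool_resize(uint32_t *base, uint32_t old_size, uint32_t new_size,
                 int32_t offset, uint8_t mode, pool_move_cb move);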

As well as this rationalization, there are some further big wins by combining the pool_alloc() routine with pool_realloc() so that the first-fit code doesn't have to be expressed twice in an almost identical way, and by combining the first-fit and last-fit code so that it is controlled by a direction variable.

In the new code you also get precise control over whether first-fit or last-fit shall be used and whether compaction shall be done, and if you only use certain modes, you only provide certain move-callbacks. This is kind of important for when core and swap are both managed, but with different move-abilities. I also plan more aggressive use of #ifdef to remove modes that are not accessed in any given setup.

I also have something in the code which I call the "blocker test": while doing a compacting realloc, in the case that resizing in place is not possible, it determines a "blocker", being an adjacent block that prevents us from resizing in place, and opportunistically tries to move the blocker out of the way as well as trying to find a new spot for the block being resized. It can now handle a pre-blocker, a post-blocker or both (previously pool_realloc() could only handle a post-blocker and pool_realloc_base() could only handle a pre-blocker). It also now applies a size criterion, so that it only aggressively tries to move the smallest of the pre-blocker, the post-blocker and the block being moved. It still remains opportunistic, in the sense that after a single pass over all blocks it always has a solution.
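
The size-criterion part of that could look something like this; the structure and names are purely illustrative, not the actual pool.c code:

#include <stdint.h>

struct blk { uint32_t base, size; };

/* Of the pre-blocker, the post-blocker and the block being resized,
 * pick the smallest as the one to try to relocate; either pre or post
 * may be absent (passed as a null pointer). */
static struct blk *smallest_to_move(struct blk *pre, struct blk *post,
                                    struct blk *self)
{
    struct blk *best = self;
    if (pre && pre->size < best->size)
        best = pre;
    if (post && post->size < best->size)
        best = post;
    return best;
}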

Another thing I've been considering is a much simpler mode, as suggested by @EtchedPixels, to use when there isn't code space for the compacting allocator: just boot processes out until core is sufficiently defragmented. This will work when swap is preallocated (which will require plenty of swap).
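
A sketch of that simpler policy, with all helper names hypothetical:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers. */
uint32_t core_largest_hole(void);
int      pick_swappable_victim(void);
void     swap_out_whole(int proc);

/* Keep booting whole processes out to their preallocated swap until a
 * contiguous hole of at least 'need' bytes exists. */
bool make_room_simple(uint32_t need)
{
    while (core_largest_hole() < need) {
        int victim = pick_swappable_victim();
        if (victim < 0)
            return false;        /* nothing left to evict */
        swap_out_whole(victim);  /* swap is preallocated, so this cannot fail */
    }
    return true;
}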

I have for a long time had a kind of vision of a radically different memory management scheme for a kernel; the moveable pool is only a small part of it, and I am also not sure that my vision should be implemented in FUZIX, given there are many other pressures. But I'm excited about the new code, and in my private research I'm keen to try applying it to things like the network stack and the disk system.

jcw commented 5 years ago

(still catching up, but this caught my eye - by nickd4 on 13 Mar)

you could do a lovely suspend operation where all processes get swapped and the system shuts down, then reverses this next boot

That would be nice. Instant-on. Also, with a simple enough swap mapping, even system boot could be done that way: place a fresh FUZIX kernel image in swap space, and set up things so that it gets loaded and run - i.e. the boot loader is really just a resume operation, even for fresh installs.

nickd4 commented 5 years ago

@jcw Yes that would be cool. It is kind of part of my master plan, but that is a story for another day.

As for today: I merged the latest moveable pool and related changes into the current FUZIX userspace build for platform-v8080. And it runs. In the current configuration, the backing store for core is a 1 kbyte block allocated using malloc(), divided into 64 blocks. Blocks are nominally 4 kbytes, but for FUZIX userspace simulation purposes they are only 16 bytes, so it simulates a 256 kbyte system with 1 kbyte.
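
In other words, the scaled-down test configuration looks roughly like this (names hypothetical, proportions from the description above):

#include <stdlib.h>

#define SIM_BLOCKS     64
#define SIM_BLOCK_SIZE 16   /* stands in for the nominal 4096 bytes */

/* 64 x 16 bytes = 1 kbyte of backing store, modelling a 256 kbyte system. */
static unsigned char *core_backing;

static int sim_core_init(void)
{
    core_backing = malloc(SIM_BLOCKS * SIM_BLOCK_SIZE);
    return core_backing != NULL;
}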

For the time being I'm using the filesystem code from ucp as the backing store, swapping to an inode. As mentioned earlier in this thread, I'm moving away from that system because it's a bit too expensive in code space. With the latest inode stuff (better calculations of available space, allowing for indirect blocks and such), process.c compiles to 3712 bytes instead of the previous 3437 bytes (a waste).

Anyhow, the goal was to reduce the pool.c code space usage, and it is now down from 3733 bytes (with separate routines for the different modes) to 2182 bytes (with the combined routine). That's acceptable. It's also overly generic, since some of the mode combinations that can be specified aren't actually used right now, so I believe there is further space saving to be had. Overall, I think the outlook is encouraging.

I have uploaded the various pieces of this test and made them basically usable if anybody wants them: https://git.ndcode.org/public/fuzix_sim.git and https://git.ndcode.org/public/FUZIX.git (on the branch pool_userspace). These have to be checked out into adjacent directories, in my case ~/src/fuzix_sim and ~/src/FUZIX. There also needs to be an ACK installation; I cannot remember exactly how I set this up, but I think it is just the latest development build from @davidgiven, and it should not be critical in any case.

In the fuzix_sim and FUZIX directories you run make. Then, in the fuzix_sim directory, you run ./n.sh, which creates the disks directory under fuzix_sim and populates it using the kernel and other files from ../FUZIX. At this stage you can run ./fuzix_sim to boot the system; I do not have a good way of breaking out of the simulator at this stage, so what I do is press Ctrl-\ (Unix abort).

Then, when you have a good system, you go into the FUZIX/moveable_core_inode_swap directory and run make (this is not part of the recursive top-level make build, it must be done separately). Then you run make install, which copies the files needed for running the test into the disk images at ../fuzix_sim/disks. Finally you boot the system and from /root you run ./o.sh to run the test.

It is all a little bit ad-hoc but highly automated, so I think it provides a fairly good model for somebody who wants to do FUZIX userspace cross-development without necessarily wanting to hook into the FUZIX build process. I put the moveable_core_inode_swap directory under FUZIX so that it would have easy access to the FUZIX include files and libraries, but it could just as easily be separate.

Note that the fuzix_sim is in a pretty good state now. It's the Z180 simulator with 1 Mbyte of RAM, and a switchable boot ROM so that you can use the low common provided by the Z180's MMU after booting. It is also perfectly fine as an 8080 or Z80 simulator, and is z80pack compatible (I believe you could use z80pack for running the test as described if you prefer). I plan to extend it with more CPUs.

I had a few snafus getting platform-v8080 to build and run properly; mostly these were due to my mistakes, but I did put in the following patch, which is necessary for the console to work in z80pack (the added mvi d,0 clears the high byte of the register pair before it is pushed as an argument):

diff --git a/Kernel/platform-v8080/v8080.s b/Kernel/platform-v8080/v8080.s
index f5627c47..0054c4a8 100644
--- a/Kernel/platform-v8080/v8080.s
+++ b/Kernel/platform-v8080/v8080.s
@@ -204,6 +204,7 @@ _tty_pollirq:
        jnc poll2
        in 1
        mov e,a
+       mvi d,0
        push d
        mvi e,1
        push d
@@ -216,6 +217,7 @@ poll2:
        rnc
        in 41
        mov e,a
+       mvi d,0
        push d
        mvi e,2
        push d

Anyhow, in the process of debugging my mistakes I integrated the Z180 disassembler from MAME (the Z180 CPU core and some other parts are also from MAME), and you can get the simulator to dump the instruction trace as it runs. I haven't done a proper debugger with breakpoints and such, but you can get more or less the same effect by modifying certain files in the simulator if you're desperate (I was).

The next step is probably to make the indirect core and swap code (the free bitmap, the indirection table and the allocation/freeing of random blocks) reusable, so you can have one instance for core and another for swap. At the moment it uses global variables, so you need two copies of the code if both core and swap are indirect (I was going to make the swap one different but changed my mind).
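
One obvious way to do that (a sketch only, with hypothetical names) is to move the globals into a context structure and pass it to the bitmap/indirection routines, so core and swap can share the same code:

#include <stdint.h>

struct indirect_store {
    uint8_t  *free_bitmap;     /* one bit per block */
    uint16_t *indirect_table;  /* logical-to-physical block mapping */
    uint16_t  nblocks;
};

static struct indirect_store core_store, swap_store;

/* Allocate one free block from 'st', or return 0xFFFF on failure. */
static uint16_t store_alloc_block(struct indirect_store *st)
{
    uint16_t b;
    for (b = 0; b < st->nblocks; b++) {
        if (!(st->free_bitmap[b >> 3] & (1 << (b & 7)))) {
            st->free_bitmap[b >> 3] |= (uint8_t)(1 << (b & 7));
            return b;
        }
    }
    return 0xFFFF;
}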

Then I'll do another userspace test where the swap is under pool.c management rather than inodes; I think that should save significant code space. After that, I should be ready to start integrating into FUZIX. At some stage I will also fork platform-v8080 into an ACK-based Z80 platform; I already have the Z80 ACK compiler working (both in @davidgiven's version and in a plain ACK release 5 version).

jcw commented 5 years ago

FYI, while trying to build fuzix_sim on macOS, I had to change the flags in the Makefile to:

CXXFLAGS=-g -std=c++11

And then it leaves me with these errors:

$ make
c++ -g -std=c++11 -DCONFDIR=\"conf\" -DDISKSDIR=\"disks\" -DROMSDIR=\"roms\" -I.  -c -o z180/z180.o z180/z180.cpp
z180/z180.cpp:1944:19: error: no member named 'make_unique' in namespace 'std'
        SZHVC_add = std::make_unique<uint8_t[]>(2*256*256);
                    ~~~~~^
z180/z180.cpp:1944:31: error: unexpected type name 'uint8_t': expected expression
        SZHVC_add = std::make_unique<uint8_t[]>(2*256*256);
                                     ^
z180/z180.cpp:1944:39: error: expected expression
        SZHVC_add = std::make_unique<uint8_t[]>(2*256*256);
                                             ^
z180/z180.cpp:1945:19: error: no member named 'make_unique' in namespace 'std'
        SZHVC_sub = std::make_unique<uint8_t[]>(2*256*256);
                    ~~~~~^
z180/z180.cpp:1945:31: error: unexpected type name 'uint8_t': expected expression
        SZHVC_sub = std::make_unique<uint8_t[]>(2*256*256);
                                     ^
z180/z180.cpp:1945:39: error: expected expression
        SZHVC_sub = std::make_unique<uint8_t[]>(2*256*256);
                                             ^
6 errors generated

The compiler on my system is:

$ c++ -v
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Is there a quick fix for this, perhaps?

nickd4 commented 5 years ago

@jcw It's awesome that you checked out the code.

Yes there's a quick fix, I patched it and pushed to the repo as follows:

diff --git a/z180/z180.cpp b/z180/z180.cpp
index 5b05517..9dbf644 100644
--- a/z180/z180.cpp
+++ b/z180/z180.cpp
@@ -763,8 +763,13 @@ static uint8_t SZP[256];      /* zero, sign and parity flags */
 static uint8_t SZHV_inc[256]; /* zero, sign, half carry and overflow flags INC r8 */
 static uint8_t SZHV_dec[256]; /* zero, sign, half carry and overflow flags DEC r8 */

+#if 1 // std=c++11 compatibility
+static uint8_t *SZHVC_add;
+static uint8_t *SZHVC_sub;
+#else
 static std::unique_ptr<uint8_t[]> SZHVC_add;
 static std::unique_ptr<uint8_t[]> SZHVC_sub;
+#endif

 #include "z180ops.h"
 #include "z180tbl.h"
@@ -1941,8 +1946,13 @@ void z180_device::device_start()
    uint8_t *padd, *padc, *psub, *psbc;

    /* allocate big flag arrays once */
+#if 1 // std=c++11 compatibility
+   SZHVC_add = new uint8_t[2*256*256];
+   SZHVC_sub = new uint8_t[2*256*256];
+#else
    SZHVC_add = std::make_unique<uint8_t[]>(2*256*256);
    SZHVC_sub = std::make_unique<uint8_t[]>(2*256*256);
+#endif

    padd = &SZHVC_add[  0*256];
    padc = &SZHVC_add[256*256];

I have never been able to figure out why C++ programmers like making life so difficult for themselves; smart pointers are a waste of time! Haha. (And yes, I do understand the RAII paradigm.)

How come you have to use C++11 on macOS? The make_unique() function is apparently C++14. Anyhow, it's a hangover from MAME, and I've been removing all that sort of stuff when I find it.

I did not put the -std=c++11 in the Makefile at this stage; I probably could if there is a strong case for it. But hopefully you can just add this when you're building on macOS, or perhaps GNU make could detect macOS?

jcw commented 5 years ago

Aha, bingo - adding -std=c++14 instead of my tweak solved it. Still some warnings, but now it builds.

FYI, the remaining warnings I get are all of the type: warning: 'register' storage class specifier is deprecated and incompatible with C++17 [-Wdeprecated-register].

Yes, make could use a conditional. The error was caused by the default being lower than c++11 - and my mistake was assuming that c++11 would be enough, as that's what I always add with such errors.

Thx.

jcw commented 5 years ago

Eek, github is confused - Nick, you've somehow posted in the future... oh well, my reply above:

[Screenshot: 2019-05-08 at 17:09:17]

nickd4 commented 5 years ago

@jcw That's interesting; could it be basing these notifications on your local clock? Or perhaps it is attaching a bogus timezone to my posts; I could have forgotten to set it.

Thanks for the heads-up on the lower-than-c++11 default; I didn't realize that could happen. I tried compiling with -std=c++03 and got loads of bogus errors (can't use the override specifier, etc), so I have added -std=c++11 (note: this works because I removed the offending code and pushed it to the repo; you might not have noticed that). I've also fixed the warnings you mentioned and added -Wall to catch and fix some more. To be honest, the code is pretty grotty, since I just transplanted the pieces with only minimal integration. I renamed the .c files from z80pack to .cpp because I can't be bothered to extern "C" them. And I must say, it's quite irritating how C++, which is supposed to be a superset of C, is anything but!

nick-lifx commented 5 years ago

Making progress. I'm building the swapper into the kernel. But I backtracked a bit, because the platform-z80pack style ports use the "banked" architecture, which is not very flexible (core is neither paged nor has a base register); I'll support that, but later. I remembered seeing a paged one somewhere and found it in https://github.com/EtchedPixels/Virtual65 (logical space divided into 8 x 8 kbyte pages, similar to a PDP-11, backed by 128 x 8 kbyte pages, or 1 Mbyte). I notice that in FUZIX we actually use it the same way as the "banked" architecture right now, but this is clearly very ripe and overdue for improvement :)
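
For reference, that paged arrangement boils down to something like this (a sketch with hypothetical names; the 8 x 8 kbyte logical / 128 x 8 kbyte physical figures are from the Virtual65 description above):

#include <stdint.h>

/* Per-process page table: 8 logical 8 kbyte pages, each mapped to one
 * of 128 physical 8 kbyte pages (1 Mbyte total). */
static uint8_t page_map[8];

static uint32_t phys_addr(uint16_t logical)
{
    uint8_t page = logical >> 13;   /* which 8 kbyte logical page */
    return ((uint32_t)page_map[page] << 13) | (logical & 0x1FFF);
}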

So I spent the day getting platform-v65 to work again; it looks like the 6502 and 65c816 ports haven't been compiled in a while, and I fixed a bit of code rot here and there. I'm currently dealing with a problem where it crashes in userspace; that may be something to do with the shell, since adding diagnostics to the shell makes the problem come and go, but interestingly I found that turning on syscall tracing in the kernel fixes it, or more correctly, allocating space in the kernel for the syscall names makes things run reliably.

So it could be something like the process touching memory that it shouldn't, with the symptoms changing when stuff in the kernel moves around. Anyway, I will fix it. The weekend is too short; I am just making progress now when I have to put it aside, haha. I have a fairly good feeling about cc65; the order of passing parameters is annoying (because it was originally a one-pass compiler and pushed expressions as they were compiled, so it's stuck with left-to-right for compatibility now), but otherwise it's solid.

On the hardware side (which sadly gets neglected, since I am more of a software guy and I keep finding interesting software problems to look at, even though I have a large collection of interesting and ancient machines waiting for my attention)... I was fortunate enough to be given a pair of S100-bus machines last week; they are totally awesome, as they're multiprocessor Z80B servers, and I will post some links and pictures later. I also collected, some years back, the parts for a souped-up Apple IIe with RAMworks and an Apple SCSI hard disk among other things, so I'm keen to get FUZIX onto all of these soonish.

I've also got a slightly new plan for the fuzix_sim project, which aims to unify the various emulators we used during development, noting that it's not meant to be a real machine exactly, but more a kind of playground for experimenting with the logic at the C and resource-management level, with basic drivers, knowing that the precise low-level driver stuff will depend on the eventual hardware platform.

Faced with the problem of wanting the paged memory hardware when I already have the banked memory hardware, and also that the v85 I/O ports will conflict with the z80pack I/O ports and so on, I decided the right thing to do is a simple text file where you can specify the wanted CPU and memory configuration, plus any hardware and its address. I need to keep it simple, though; I know that MAME can do this sort of thing, but there is so much obfuscation and so many layers of interfacing that it's not worth learning.

Also, I think the current fuzix_sim code is too messy for a number of reasons, so what I may do is strive to unify Virtual65 with v85 by means of the text file, then re-add Z180 and z80pack support more cleanly (I will use the MAME code, but converted to C rather than C++, and re-implement z80pack).

EtchedPixels commented 5 years ago

6502 indeed kind of died off because the original project it was for (tgl65) also died. I'm actually currently debugging some 8085 and 6502 boards in an RC2014 system so I can work on them a bit more usefully.

65C816 should be OK although not that well looked after; it gets built in the autobuild and used a bit.

Things that crash and move around have, in my experience so far, tended to be variables that should have been in common code but were not and then got touched with user space mapped. Not always, but it's become my usual suspect.

nick-lifx commented 5 years ago

Yup, that is almost exactly the problem: it was touching _inint (which is in kdata.c and thus in kernel space) just after calling map_restore. The code looks like it was moved to the correct spot at some point, but the original code accidentally wasn't deleted afterwards. The following patch has fixed it:

diff --git a/Kernel/lowlevel-6502.s b/Kernel/lowlevel-6502.s
index f5d4c462..7e85fd9f 100644
--- a/Kernel/lowlevel-6502.s
+++ b/Kernel/lowlevel-6502.s
@@ -102,14 +102,9 @@ interrupt_handler:
        sta _inint
        lda _kernel_flag
        bne interrupt_k
-       jsr map_process_always          ; may have switched task
-       jmp int_switch
+       jmp map_process_always          ; may have switched task
 interrupt_k:
-       jsr map_restore
-int_switch:
-       lda #0
-       sta _inint
-       rts
+       jmp map_restore

 ;
 ;      The following is taken from the debugger example as referenced in

In the process I also added some nice debugging facilities to the Virtual65 emulator; it now has a 6502 disassembler and the ability to print an execution trace on stderr. I tackled the problem by creating good and bad execution traces and comparing them for the first divergence (with some caveats). When I discovered this occurred during a timer interrupt, I then tried commenting out parts of the interrupt handling code, and bisected in from there to discover which part of the interrupt handling caused it.

I am not sure if it's worth PRing these fixes for the moment; I can do so if there is a pressing reason to. Similarly for the platform-v8080 changes. Anyway, the 6502 and 8080 ports seem to have already been in a pretty good state, so it only required a tiny bit of tinkering to get things bootable and seemingly robust.

nick-lifx commented 5 years ago

About the 65c816, I have flagged the following as possibly incorrect; shouldn't the order of arguments be swapped?

diff --git a/Kernel/lib/65c816.s b/Kernel/lib/65c816.s
index 71cce952..9b01d84b 100644
--- a/Kernel/lib/65c816.s
+++ b/Kernel/lib/65c816.s
@@ -279,12 +279,14 @@ fork_patch_3:
        sep     #$30            ; back to 8bit mode for C
        .a8
        .i8
+       ; NICK: IS THIS CORRECT? SHOULDN'T THE ORDER OF ARGUMENTS BE SWAPPED??
        lda     #<_udata
        ldx     #>_udata
        jsr     pushax
        lda     ptr1
        ldx     ptr1+1
        jsr     _makeproc
+       ; NICK: END IS THIS CORRECT?
        ; We are now being the child properly
        lda     #0
        sta     _runticks

That's the only reason I thought the 65c816 might not be in current use; if it is, then we should check it.

Also, I was wondering if there is an easy way to remove specific files from the packager for specific platforms, since a few things were too big to fit or wouldn't compile on 6502? Obviously I can get around it for now, but it's something I would have to tidy up before I could PR the changes officially. (Or the problem goes away when/if I implement the larger address space, which is envisaged in comments.)

EtchedPixels commented 5 years ago

Possibly... I will check that. I have not tested 65C816 since I did the big makeproc change; I've just fixed banked memory Z80 with banked kernel for that same bug!

EtchedPixels commented 5 years ago

I've added the v8080, 6502 and 65c816 fixes you mentioned here. Any other 6502 bits would be good to get into the tree as I am currently bringing up a 6502 system with 512K of banked RAM (in 16K banks), 16550A UART, CF adapter and (once I've finished soldering it) a 6522 VIA for timers etc.

nick-lifx commented 5 years ago

That's good. As a bit of an old-timer I am pretty comfortable with exchanging text files and patch files and I feel that project owners / maintainers should not look a gift horse in the mouth, if the information about a bug or feature reaches the owner / maintainer by some means then that is better than it not doing so. Since the advent of github style interfaces though, I have increasingly had patches rejected for not being formal PRs -- and then critiqued and delayed multiple times once formally PRed, as if it were my responsibility to get critical fixes polished and into the style because I want them in the mainline, when in reality it makes little difference to me since I can always merge the fixes I see as critical in a private version of the repo! As a highly productive developer who throws off such patches regularly as a side effect of my work, I do not want the hassle of shepherding them through someone else's process, when I could spend that time developing! (Having said that, there is a time and a place for PRs and there are occasions where it is important to me to get stuff into mainline). So anyway, to conclude the rant, I am happy that you take a practical approach to such contributions!

I believe that all the critical bugs I found are mentioned above, i.e. those that took time to isolate; there are also quite a few hacky fixes or commentings-out for obvious problems, and some changes for my own use, such as making /dev/tty1 refer to the z80pack aux rather than lpt. These are tricky to PR since they would arguably leave the source in a messier state than I found it, but anyway I'll review my branches for any missed fixes.