bradleyeckert / chad

A self-hosting Forth for J1-style CPUs

chad

Welcome to deep embedded computing, where the best hardware is no hardware. Well, minimal hardware. How does a processor in 200 lines of Verilog sound? Pretty awful? Try pretty awesome.

It's so simple your computer will simulate it at 100 to 150 MIPS. At that speed, it can host itself on your computer. No need to turn the Verilog into hardware, just run the processor model on your desktop or laptop. That's the ideal environment for Forth, the programming language of deep embedded computing.

Forth hardware's claim to fame is its lack of a compiler in the traditional sense. It's both high-level and low-level, with both levels seamlessly integrated. Forth computers execute the language directly by implementing Forth instructions in hardware. If you disassemble code, it looks a lot like the source code. No compiler means no compiler bugs.

Now you can build computing platforms almost without a processor. If you have an FPGA or ASIC, there's your computer. chad is the at-speed simulation model.

The C and Verilog models match as closely as possible so that the generated flash memory image can run on either model. Although the project started as a simple CPU, the SPI controller morphed into a central hub for handling memory decryption, in-system-programming, and boot-from-flash. Chad is:

The sample MCU uses an external SPI flash and external USB UART. Green data is plaintext, red data is encrypted.

MCU Image

A self-hosting Forth for J1-style CPUs

James Bowman's seminal paper on the J1 CPU was presented in 2010. At under 200 lines of Verilog, the J1 was a real breakthrough in simplicity. It also happens to be a very powerful Forth processor.

The Chad CPU, like the J1, has excellent semantic density. The application of the J1 was a UDP stack in a Xilinx FPGA. The code was 70% smaller than the equivalent C on a MicroBlaze. The C version just wouldn't fit in memory, so the J1 was used instead. Admittedly, MicroBlaze is a hog. However, the J1 has a lot going for it. Calls and jumps take only a single cycle. Often a return is combined with an ALU instruction, so the return costs zero extra instructions. It's a little freaky to watch in simulation if you're used to control flow changes having to deal with pipelines. Fetch/execute pipelines impact fine factoring by making call and return waste clock cycles on synchronizing the instruction stream. The ISA of a Forth CPU should avoid such pipelining. Instructions do not select the inputs to the ALU; they are implicit. That takes the decode delay out of the critical path.
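To make the "return folded into an ALU instruction" idea concrete, here is a minimal sketch of a stack-machine step function in C. The 16-bit encoding, the opcode names, and the helper run_double are all hypothetical and chosen for illustration; they are not the J1 or Chad instruction set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit encoding, for illustration only (not the J1 or Chad ISA):
   bit 15     : 1 = call, 0 = ALU instruction
   bit 14     : return flag (honored on ALU instructions)
   bits 13..0 : call target, or ALU opcode */
enum { OP_NOP = 0, OP_ADD = 1, OP_DUP = 2 };
#define CALL(addr) (0x8000u | (addr))
#define ALU(op)    (op)
#define RET        0x4000u

typedef struct {
    uint16_t pc;
    uint16_t dstack[16]; int dsp;   /* data stack */
    uint16_t rstack[16]; int rsp;   /* return stack */
    unsigned long cycles;           /* one cycle per instruction */
} Cpu;

static void step(Cpu *c, const uint16_t *code) {
    uint16_t insn = code[c->pc];
    uint16_t next = (uint16_t)(c->pc + 1);
    c->cycles++;
    if (insn & 0x8000u) {                    /* call: push return address, jump */
        assert(c->rsp < 16);
        c->rstack[c->rsp++] = next;
        c->pc = insn & 0x3FFFu;
        return;
    }
    switch (insn & 0x3FFFu) {                /* ALU operands are implicit: top of stack */
    case OP_ADD:
        assert(c->dsp >= 2);
        c->dstack[c->dsp - 2] += c->dstack[c->dsp - 1];
        c->dsp--;
        break;
    case OP_DUP:
        assert(c->dsp >= 1 && c->dsp < 16);
        c->dstack[c->dsp] = c->dstack[c->dsp - 1];
        c->dsp++;
        break;
    default:
        break;
    }
    if (insn & RET) {                        /* return rides along with the ALU op */
        assert(c->rsp > 0);
        next = c->rstack[--c->rsp];
    }
    c->pc = next;
}

/* Calls a two-instruction word (dup, then + with return) and reports the cycle count. */
static uint16_t run_double(uint16_t x, unsigned long *cycles) {
    static const uint16_t code[] = {
        CALL(2),              /* 0: call the word at address 2 */
        ALU(OP_NOP),          /* 1: execution resumes here after the return */
        ALU(OP_DUP),          /* 2: dup */
        ALU(OP_ADD) | RET     /* 3: + and return in the same instruction */
    };
    Cpu c = {0};
    c.dstack[c.dsp++] = x;
    while (c.pc != 1) step(&c, code);
    *cycles = c.cycles;
    return c.dstack[c.dsp - 1];
}
```

In this sketch, run_double(21, &cycles) leaves 42 on the data stack after three simulated cycles: one for the call, one for dup, and one for the add that also performs the return.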

Chad improves on the J1 to facilitate bigger apps. A caching mechanism supports much larger applications than fit in code RAM. For example, a full Forth needs as little as 4K bytes of code RAM. chad protects your software investment by targeting a very simple but very powerful (for its size) stack computer. Modern desktop computers are fast enough to simulate the CPU at roughly the speed of the real hardware. It's like having a real CPU running in an FPGA, but without an FPGA. Forth should execute the code it compiles. Cross compiling, such as targeting ARM with code running on x86, adds a lot of complexity which is avoided with Chad.

You can add custom functions easily. Just edit chad.c, coproc.c, and chaddefs.h. Recompile and your simulated computer and its language have the new features. Chad comes as C source. Once you compile it, you have a Forth that can extend itself in such a way that the binaries can be output for inclusion in a SOC. You can add code to Chad's simulation to mimic your SOC so that the PC is the development environment.

More importantly, you aren't dependent on other people for long-term support. The system is simple enough that one person can understand and maintain it.

Since Chad's simulation of the CPU is its specification, which is under 200 lines of C, the processor is also called Chad. You can specify the cell size as any width between 16 and 32 bits (in the config.h file) and recompile Chad with any C compiler.
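A configurable cell width is commonly handled in C by storing cells in a 32-bit type and masking results to the chosen width. The sketch below shows that general technique only; the names CELLBITS, CELLMASK, cell_add, and cell_signed are assumptions made for illustration and may not match what config.h and chad.c actually use.

```c
#include <assert.h>
#include <stdint.h>

/* CELLBITS is a hypothetical name for the width setting. */
#ifndef CELLBITS
#define CELLBITS 18
#endif

#if CELLBITS < 16 || CELLBITS > 32
#error "cell width must be between 16 and 32 bits"
#endif

typedef uint32_t cell;                       /* wide enough for any allowed width */
#define CELLMASK ((cell)(0xFFFFFFFFu >> (32 - CELLBITS)))

/* Addition wraps at the cell width rather than at 32 bits. */
static cell cell_add(cell a, cell b) {
    return (a + b) & CELLMASK;
}

/* Interpret a cell as a two's complement signed number of CELLBITS bits. */
static int32_t cell_signed(cell x) {
    cell sign = (cell)1 << (CELLBITS - 1);
    assert((x & ~CELLMASK) == 0);            /* caller passes a masked cell */
    return (int32_t)((int64_t)(x ^ sign) - (int64_t)sign);
}
```

With the default of 18 bits in this sketch, adding 1 to the all-ones cell wraps to 0, and the all-ones cell reads back as -1 when treated as signed.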

Chad's way of working isn't fully ANS compatible, which is fine. The great thing about hosting the Forth in C is that there's not much confusion about what the Forth does. You can look at the C source.

The main source files are:

To try it out, compile chad and put it in the forth folder. cd to the myapp directory and launch it with ../chad include myapp.fs. At the ok> prompt, type 0 here dasm to disassemble everything.

For example:

ok>include forth.fs
370 instructions used
ok>25 fib .
121393 ok>stats
2792024 cycles, MaxSP=27, MaxRP=26, 155 MHz
ok>
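As a cross-check on the session above, the result 121393 is what a doubly recursive fib returns for 25 when its base case yields 1 for arguments below 2. That base case is an assumption about the fib word, whose source is not shown here. A C equivalent under that assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Doubly recursive Fibonacci with fib(0) = fib(1) = 1 (assumed base case). */
static uint32_t fib(uint32_t n) {
    assert(n <= 46);              /* larger results overflow 32 bits */
    if (n < 2) return 1;
    return fib(n - 1) + fib(n - 2);
}
```

Here fib(25) evaluates to 121393, matching the value printed at the ok> prompt.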

The instruction rate is much less when doing I/O, so running an interpreter in the simulator (by loading myapp.f and entering "cold") shows the cycle counter incrementing at a much lower rate. When code is doing useful work, this isn't a problem. The thread stays in cache.

It's also a documentation standard.

chad provides a documentation system for Forth systems. It doesn't need the ANS Forth standard; it generates a standard from source.

Your project folder has an html folder that contains documentation. chad generates hyperlinked HTML versions of each source file so that you can click on any word to get an explanation of what it does and, if necessary, a link to the source code of that word. That helps you navigate Forth source code even if you're new to Forth. The documentation is rebuilt each time you build your app.

The 20th Century was great and all, with its books and PDF equivalents. We have web browsers now.

Some interesting features of Chad

It's built for security. The ISA doesn't support random read of code memory, which makes reverse engineering and hacking the code an exercise in chip probing if it can even be done. The MCU boots from SPI flash, which is encrypted using a stream cipher. The weak spot then becomes key management: How secure are keys, how hard can you make it to probe memory busses on the ASIC die, etc.
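The general shape of stream-cipher decryption at boot is a keystream XORed over the flash image, byte by byte. The sketch below illustrates only that shape. The keystream generator is a toy xorshift32 picked for brevity; it is not cryptographically secure and is not claimed to be the cipher Chad uses. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy keystream generator (xorshift32). NOT secure; a stand-in for a real cipher. */
typedef struct { uint32_t state; } Keystream;

static void ks_init(Keystream *ks, uint32_t key) {
    assert(key != 0);             /* xorshift32 stays at zero forever if seeded with 0 */
    ks->state = key;
}

static uint8_t ks_next(Keystream *ks) {
    uint32_t x = ks->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    ks->state = x;
    return (uint8_t)(x >> 24);    /* use the top byte as the next keystream byte */
}

/* XOR the buffer with the keystream in place.
   Because XOR is its own inverse, the same call encrypts and decrypts. */
static void stream_xor(uint8_t *buf, size_t len, uint32_t key) {
    Keystream ks;
    ks_init(&ks, key);
    for (size_t i = 0; i < len; i++)
        buf[i] ^= ks_next(&ks);
}
```

Calling stream_xor twice with the same key restores the original bytes, which is why a boot loader only needs the key and the keystream generator, not a separate decryption routine.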

In-system programming (ISP) is handled by hardware state machines, not firmware. The SPI flash controller integrates a UART and processor memories so that the RAMs can be loaded from flash at boot time. The UART can also be used to program flash by any host computer with a serial port. It can also reset the processor.

The sample MCU has a Wishbone Bus Master so that you can add peripherals from sites like OpenCores.

The interrupt system uses a style that's conducive to small stacks and Forth. It trades a little extra interrupt latency (which you can control) for simpler and less error-prone interrupt handling that's similar in concept to Forth's PAUSE.

Consequences of the architecture

That arise from:

Means that it deviates from the ANS Forth model when necessary. But it's close enough to make ANS Forth usable as a testbench. Some of the RAM is used as a frame stack, which is used to:

The "unlimited flash" means SPI flash is very cheap, so data is kept there whenever possible. That includes:

Applets remove restrictions on application size, at least where code is concerned. Large apps may reside in flash yet still be supported by a small (and fast) code RAM and a CPU with a limited (8K) address range. Human-speed Forth tools are good candidates for applets so as to free up code RAM.
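One way to picture applets is as an overlay: a fixed window of code RAM that is refilled from flash whenever a different applet is needed. The sketch below models that idea in C under several assumptions. The window size, the applet count, and the names flash, code_ram, and applet_fetch are invented for illustration and do not describe Chad's actual loader.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define APPLET_WORDS 64          /* hypothetical size of the applet window in code RAM */
#define NUM_APPLETS  3           /* hypothetical number of applets stored in flash */

static uint16_t flash[NUM_APPLETS][APPLET_WORDS];  /* stand-in for SPI flash contents */
static uint16_t code_ram[APPLET_WORDS];            /* the applet window */
static int resident = -1;                          /* applet currently in the window */
static unsigned loads = 0;                         /* number of flash-to-RAM copies */

/* Make applet `id` resident, copying from flash only when it is not already loaded. */
static const uint16_t *applet_fetch(int id) {
    assert(id >= 0 && id < NUM_APPLETS);
    if (id != resident) {
        memcpy(code_ram, flash[id], sizeof code_ram);
        resident = id;
        loads++;
    }
    return code_ram;
}
```

Repeated calls for the same applet cost no flash traffic in this model; only switching applets triggers a reload, which suits tools that run at human speed.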

Status

The "myapp" demo boots and runs in chad. An ISP utility loads the boot file into an FPGA with SPI flash chip attached.

chad boots and runs an app from simulated flash memory. A minimal SoC (MCU) in Verilog demonstrates synthesis results and performs the following:

I ran it on Digilent's Arty A7-35T board: 100 MHz, 10% of the chip. Here's what text rendering looks like on a TFT LCD module:

ArtyA7 Image

To-do

The ISP utility should have the terminal code merged in. Although it's written in C, it should be translated to Forth and 8th.

Catch and Throw should use the features of frame.f to set up catch frames. Maybe leave more stack space for the frame stack in data RAM.

A cooperative multitasker can likewise use frame.f words to move hardware stacks to and from task buffers. This makes a context switch more unwieldy, but still in the microsecond range.