frank-dspeed / nasm

A cross-platform x86 assembler with an Intel-like syntax
https://nasm.us/
Other
2 stars 0 forks source link

Introductions #1

Open frank-dspeed opened 1 year ago

frank-dspeed commented 1 year ago

/*

The arrows are what we call busses. They are the transport lanes within the CPU. If you think about a factory, much of what happens inside is simply transportation of partially finished components from one functional unit to another. That could happen by trolleys, fork lifts or conveyer belts. You could think of each of these blue, green and red lines as conveyer belts shipping stuff around the CPU. Except they are moving numbers or control signals. where blue represents Data Bus, green Address Bus and red Control Lines. Funfact the ALU is what we call in CPU Design the Switch able gate if you can form a switch able gate you get binary and you created a cpu core.

DataBus(Memory,InstructionRegister,InputOutput,ProgramCounter, Registers, Decoder, ALU); // blue AddressBus(Memory,InstructionRegister,InputOutput,ProgramCounter); // green ControlLines(Memory, Decoder, InstructionRegister, Registers, ALU, InputOutput)

4-digit decimal number can hold much larger number values, than a 4-digit binary number (4-bit number). Decimal numbers have the digits 0–9. A digital computer in contrast use binary digits 0–1. That is because this is easy to represent in electrical terms. A 0 would be low voltage and 1 would mean high voltage. To not be overly pedantic most chips specify low and high as some sensible range. 0 to 1.5 volts for low. 3.5 to to 5 volts means high.

The curious can read more: Logic Levels. https://learn.sparkfun.com/tutorials/logic-levels/all

The largest number we can hold with four decimal digits is 9999 (10⁴ -1). For a binary number the largest four digit number is 1111 Remember the max digit value is 1. Translated into the decimal number system, this is 15 (2⁴-1).

Learn more about other number systems: Teaching Kids About Alternative Number Bases. https://erik-engheim.medium.com/teaching-kids-about-alternative-number-bases-36e3cc464250

Anyway you don’t need to care about the binary number system, since most operates on decimal numbers just like humans. It is able to do that, because it is an imaginary computer. Although computers using decimal number systems have existed but they where mechanical rather than electrical.

The RAM (Random Access Memory) of a computer also identifies every memory cell with a unique number, the address.

When you launch a program on your computer, it gets loaded into memory. It consists of a series of instructions. Running a program involves fetching each instruction in the program in sequence and performing that instruction.

A register is simply a collection of digits which could be used in an operation. For instance on the mechanical calculator below, one pulled on levers to specify an input number. This formed the Input Register. Once you cranked the handle on the right, that number would be added to the accumulator register at the bottom.

The Program Counter (PC) is a register containing a two digit number. It is set to address of the first instruction of the loaded program. This number is transported along the green bus, the address bus, to Memory. That gets the memory cell with this address selected. The number in the selected memory cell is transported along the blue line to the Instruction Register.

We can think of this as a guy picking up a note with the address of a mailbox in a shelf labeled Program Counter. Then he walks over to the mailboxes. Locate the mailbox with the two-digit number on his note. He copies the number he finds onto a piece of paper and brings that over to another shelf labeled Instruction Register.

Of course in reality there is no “guy” walking around. Instead there is a series of copper traces carrying electrical signals from one location of the CPU to another.

DataBus(Memory,InstructionRegister); // blue AddressBus(Memory,InputOutput,ProgramCounter); // green

After this whole sequence of events are carried out, the Program Counter will get incremented. Thus if our program started at address 3, we will increment the Program Counter to 4 and fetch the next instruction in memory.

Reading input from the user or other sources is done in a very similar fashion to reading data from memory cells. Input can be things such as a keyboard, mouse, network cable or hard drive. Usually inputs are deal with as if they where memory locations.

The instruction read from memory is just a 4-digit number. It must be interpreted to carry out a task. This is done by the Decoder. The instruction is sent to the decoder through the blue data bus.

Once the decoder as decoded an instruction it will activate and de-activate one or more of its outgoing read control lines.

These are the missing part of the puzzle, explaining how it is determined where numbers flow on the blue data bus.

The red control lines are electrical wires that toggle on and off values at the openings of each functional unit. Thus when the next step is to read the next program instruction, it is the Decoder which opens the valve to the instruction register.

So the core part is the Decoder which gets data from the Data Bus

Arithmetic Logic Unit (ALU) Of course we cannot just move numbers around. We got to actually do something with them. The Arithmetic Logic Unit (ALU) is the calculator of the microprocessor (CPU).

But it is a pretty dumb calculator. It is more similar to the mechanical calculators of old than a modern calculator. It only really knows how to perform additions, subtractions and perform shift operations.

What is a shift operation? Shifting is to add or remove digits from a number. A left shift of 40 gives 400. Two right shifts of 400 gives 4. Basically it is just multiplying or dividing by 10. Or rather it depends on the number base you use. For a binary computer shifts are multiplying or dividing by two.

These operations are relatively easy to build electronically with transistors as well as mechanically with gears. No mechanical calculator was ever able to perform multiplication or division directly. Rather multiplication was performed by repeated additions. Division by repeated subtractions.

Early microprocessors worked the same way. Calcutron-33 works this way. Advanced modern microprocessors have special functional units called multipliers which perform multiplication. A lot of early electronic calculators likely performed multiplication too by running small programs that performed repeated additions.

The decoder uses its control lines to activate two of these registers which become the inputs to the ALU. The decoder also use its control lines to select what function the ALU should perform, whether it should be an addition, subtraction or shift.

It also opens one register for input, so that the result from the calculation can flow back to the Register File. It is common to call the collection of registers from x1 to x9 for Register File. Don't mistake this for file in a filesystem. The two terms are unrelated.

Okay now we know how the various functional units interact. It is time to look at how we write instructions to use these functional units in meaningful ways.

Machine Code and Assembly Code If the decoder decodes the instruction 8243 it will cause it to load the number stored in memory address 43 into register x2.

The instruction 1123 in contrast will cause the contents of register x2 and x3 to be added and the result stored in register x1.

These numbers are on machine code form. They are just numbers which is fine for the computer, but not very practical for us humans. Thus it is common use human readable abbreviations called mnemonics. Assembly code is written using mnemonics. In assembly code the previous instructions become:

LD x2, 43 ADD x1, x2, x3 Machine code instructions are usually encoded in particular ways. Especially for RISC CPUs one tries to have a very regular pattern in how this is done.

Instruction Set Architecture (ISA) vs Micro Architecture Instructions such as LD and ADD as well as the available registers, how they are encoded in machine code format, number of digits in a register and so on are all part of what we call the Instruction Set Architecture (ISA) of a microprocessor. So if a CPU had all the same instructions as Calcutron-33, but say 4 registers or 40 registers instead of the nine Calcutron-33 has then it wouldn’t be be the same ISA. Likewise if the number of registers and the instructions where the same but memory cells contains 10-digit numbers instead of 4, then it would also be different. An ISA is basically a contract between the programmer and the microprocessor makers. It specifies the exact behavior of a microprocessor seen from the perspective of a software developer.

However internally a microprocessor can be quite different. What the busses are like internally, how many decoders, arithmetic logic units it has and so on is part of what we call the Micro Architecture. This is invisible to the programmer, but can affect the performance of the microprocessor.

Let me pick a couple of examples. Both AMD and Intel make microprocessors with the same ISA called x86–64. However their internal micro architecture is entirely different. But this doesn’t matter to programmers. A program written for an Intel processor will run just as well on an AMD processor. There may simply be some performance differences.

Calcutron-33 Machine Code Format Every instruction is 4 decimal digits. The first digit is the opcode, which says what the instruction does such as add, subtract or load. The second digit is a register operand. Usually the destination register for whatever operation is performed. The last two digits will vary in meaning depending on opcode. For arithmetic operations, they will usually be two registers, used as input. The last one may be an immediate value from 0 to 9. For branches, store and load instructions , the last two digits will be a memory address. Assembly Instruction Set Once I have explained the assembly instructions, we can actually start to write some small programs that do useful things.

I am going to give a list of assembly mnemonics, their operands (the inputs they need) and how assembly instructions get encoded into machine code.

If the first digit is a 1, then we got an add instruction. If the first digit is a 2, we got a subtraction instruction and so on. To get a more compact overview of how these instructions work, I am going to use some naming conventions which are very common in microprocessor manuals.

rd is not the name of an actual register on the CPU, but simply refers to some register from r1 to r2 which is the destination register for the operation performed. For an addition that means, the result is stored in this register. rs and rt means source registers. These are the inputs to an operation. Don't worry this will be clear with a simple example.

ADD — Addition ADD rd, rs, rt | 1dst | rd ← rs + rt What this described is that the result of the operation is stored in the first specified register. The rs register is the 3rd digit and the rt register is the 4th digit.

1dst This is a description of how the assembly code instruction is mapped into machine code. We see that the first digit must be a 1 for this to be an ADD instruction. The letters d, s and t are place holders to show which digit is the rd, rs and rt register.

rd ← rs + rt This is a description of how the operation works. The arrow ← points to where the result gets stored. Let us do a simple example to make sure you understood the notation:

ADD x2, x4, x8 ; x2 ← x4 + x8 The colon indicates a comment, and everything after it can be ignored. In this case we destination register rd is x2 and the two source register rs and rt are x4 and x8. We add the numbers in x4 and x8 and store the result in x2.

SUB — Subtraction Subtracts register rt from register rs and store result in register rd.

SUB rd, rs, rt | 2dst | rd ← rs - rt SUBI — Immediate Subtraction Subtracts value k from register rs and store result in register rd.

SUBI rd, rs, k | 2dsk | rd ← rs - k LSH — Left Shift Shift register rs left k digits and store in rd. Left shift means to add a zero digit. So 3 left shifted two digits is 300.

LSH rd, rs, k | 4dsk | Let us to an example where we assume x2 contains the number 12.

LSH x1, x2, 2 Register x1 will now contain the number 1200.

RSH — Right Shift Shift register rs right k digits and store in rd. To shift right means to chop of digits at the end. 400 shifted right two digits is 4. If shifted only one digit, it is 40.

RSH rd, rs, k | 5dsk | BRZ — Branch Zero jump to address aa if register rd is zero.

BRZ rd, aa | 6daa | BGT — Branch Greater Than jump to address aa if register rd > 0 (positive).

BGT rd, aa | 7daa | LD — Load Register load register rd with contents of memory at address aa.

LD rd, aa | 8daa | Let us do an example. of loading the contents of memory location 12 into register x2.

LD x2, 12 Basically our CPU goes to “mailbox” labeled 12, finds the number inside and puts it into register x2.

ST — Store Register store register rs in memory at address aa.

ST rs, aa | 9saa | HLT — Halt Program Stop program

HLT | 0000 Pseudo Assembly Instructions For almost all microprocessors you will find what we call pseudo assembly instructions. These are instructions which don’t really exist in the instruction-set architecture (ISA) of a CPU. Rather they are convenient shorthands for other instructions.

For instance if we wanted to move the contents of register x2 to x1 we could write:

ADD x1, x2, x0 This may seem surprising as I have not mentioned register x0 before. The reason is because it doesn’t really exist as an actual register. Register can be used as an operand anywhere where a register is expected, however x0 will always be zero, and any number moved to it will be discarded.

This is a popular trick on several RISC processor architectures such as MIPS, ARM A64 and RISC-V. Because this is so useful the Calcutron-33 assembler has a shorthand for this:

MOV x1, x2 However it is not a real instruction since the machine code is the same as for ADD.

Let us look at the other pseudo instructions:

MOV rd, rs │ ADD rd, rs, x0 │ rd ← rs CLR rd │ ADD rd, x0, x0 │ rd ← 0 DEC rd │ SUBI rd, rd, 1 │ rd ← rd + 1 BRA aa │ BRZ x0, aa │ Writing Programs and Running Programs Programs are read by the Calcutron-33 from start to finish but certain parts of the code can get repeated many times because one performs jumps from one line in the program back to a previous line.

However programs end when the reach the HLT instruction or we attempt to read from input and there is no more data to read. We can mark places in the code we can jump to with labels. Labels are written using the format: mylabel:. Let us look at a very simple program to demonstrate this:

loop: LD x1, 90 LD x2, 90 ADD x1, x2 ST x1, 91 BRA loop This program continuously reads two numbers from input and stores them in register x1 and x2. These are added and result is written to output. We could e.g. assume that on address 90 we have connected a puched taped reader. You can see an example of a modern version of this below. In our imaginary computer we assume that every time a number is read with LD, the tape advances to the next number. Hence we never read the same number twice despite reading from different memory addresses.

An optical punched taped reader. Number are represented as patterns of holes punched through a paper tape. This mimics how early computers read input data. We use the different addresses to distinguish between different devices. E.g. the tape reader is at address 90, while and LED display for numbers is on address 91.

Example of a 4-digit LED display. Imagine one of these connected to address 91. The line at the bottom of the program, which says BRA loop means we jump to the label marked loop which means the whole program starts from the beginning.

NOTE: There will be more information about how to actually run Calcutron-33 programs here in the future. This story is still a bit work in progress. My apologies in advance.

Double Output Here is another simple example which repeatedly doubles its inputs.

loop: LD x1, 90 ADD x1, x1 ST x1, 91 BRA loop Multiply By Eight Multiply each input by 8. This is the first step to understand how addition can be used for multiplication. You can see that multiplication is really about just controlling how many times you are adding.

loop: LD x1, 90 ADD x1, x1 ADD x1, x1 ADD x1, x1 ST x1, 91 BRA loop Maximizer Here is a bit more complex program where we are using conditional branching more actively. What this program does it that it read two inputs into x1 and x2 and output the larger of the two values.

loop: LD x1, 90 LD x2, 90 SUB x1, x1, x2 BGT x1, first

second: ST x2, 91 BRA loop

first: OUT x1 BRA loop In this case we perform uses SUB so that x1 ← x1 - x2. The following BGT x1, first instruction, is basically doing the equivalent of:

if x1 > 0 goto first If x1 > x2 then x1 - x2 > 0, so this is logical. So depending on whether the first or last number if bigger we goto label first or second and output result. Afterwards we jump back to the beginning with BRA loop.

Simple Multiplier Early microprocessors did not have any instructions for multiplying, just like Calcutron-33. But that was not a problem because you can implement multiplication through repeated additions.

The code below shows a simple way of doing this. The numbers to multiply are placed in x1 and x2. Say we multiple 4 with 3. That means adding 4 three times. The way we do this is by subtracting 1 from x2 on each repetition. This is to keep track of how many times we need to add x1 to the final result stored in x3.

We repeatedly jump back to the label multiply to keep adding. This is done by using BGT (Branch Greater Than) to check of x2 is greater than zero. As long as it is we keep jumping back to multiply. Since x2 is reduced by one on each repetition we it will eventually be 0 at which point we use ST to write out the result.

loop: LD x1, 90 LD x2, 90 CLR x3

multiply: ADD x3, x1 DEC x2 BGT x2, multiply ST x3, 91

BRA loop

However this approach is not very efficient. Imagine multiplying 90 with 90. You would have to perform 90 additions.

Fast Multiplier Fortunately there is a much faster approach using a combination of additions and shifts. Shift for a decimal computer is multiplication or divisions by 10. In a binary computer it is multiplications or divisions by 2. Shifts are always really fast to perform on a computer.

The core idea is that you can represent any multiplication as a combination of shifts and additions. Let me do a few simple examples.

30 × 40 = 300 + 300 + 300 + 300 = 1200 Here we shifted 30 left once and added it 4 times.

30 × 23 = 30 + 30 + 30 + 300 + 300 = 690 First we added up the ones 3 times. Then we added up the tens twice. To be able to add tens, we do a single left shift.

If you look at the code below you will see both left and right shifts. The right shifts are used to peel off individual digits. E.g. when you do 30 × 40, you need to separate out the 4 digit to keep track of how many times we add 30. If you right shift 40, you get the 4 digit isolated.

LD x2, 90  ; First number to multiply. 
LD x3, 90  ; Second number. Treated as counter
CLR x1     ; accumulator for result. Clear it out.

nextdigit: RSH x4, x3, 1 ; Push right most digit of x3 into x4

multiply: ADD x1, x1, x2 ; Add first input to accumulator
DEC x4 ; Decrement counter for number of additions BGT x4, multiply ; Repeat while x4 > 0

LSH x2, x2, 1     ; Left shift. x2 made 10x larger
BGT x3, nextdigit ; check if all digits have been processed

ST x1, 91

What Have You Learned? Let us do a kind of summary because this whole story is quite long and I don’t expect you to remember all of this. It is useful however to grasp what is of most fundamental importance here.

The key things I wanted you do get is that:

A CPU does lots of small really simple instructions. It is only stuff like moving numbers from one location to another, adding them and comparing them. Then there are control flow, such as jumping to other parts of the program to repeat things. The code that the CPU understands is just a long stream of numbers. Assembly code is a just a human readable form of this. How are Real Microprocessor Different? While other CPU architectures may seem significantly more complicated a lot of this derives from providing faster ways of what you can already achieve with this super simple CPU. You can already perform multiplication by just writing program for it.

However real modern CPUs have special instructions for performing multiplication. Our CPU doesn’t have a way of multiply floating point numbers such as 4.25 or 58.32. Again real CPUs have special functional units for this. So instead of just the ALU, there are several other functional units, such a a multiplier or FPU for floating point arithmetic.

And of course in real processors instructions are not 4 digit decimal numbers but typically 32-bit binary numbers. That means multiple bits will have to be used to encode things such as the source and destination register of an operation.

Real CPUs also have more convoluted ways of reading things from memory. Memory exists in multiple stages. You got very fast memory called L1 cache. Then slightly slower memory called L2 cache, then even slower memory called L3 cache and finally you get to main memory which is the slowest.

When the CPU gets something from memory it puts it in the cache, so that next time it is needed it can look in the cache first to see if it is already there. But because cache is small, stuff that hasn’t been used in a long time may have gotten booted out.

Related Stories Stories I have written about hardware which may be of interest.

Why Pipeline a Microprocessor? — RISC processors pioneered the use of instruction pipelines. What are they and why are they useful to increase performance? What Does RISC and CISC Mean in 2020? — Historical comparison of RISC and CISC design and a look at their convergence and the differences that still exist. RISC-V Vector Instructions vs ARM and x86 SIMD — Focused on comparing packed-SIMD and vector-SIMD instructions and why they exist. ARMv9: What is the Big Deal? — Goes into more details about the SVE2 vector-SIMD instruction set added to the ARMv9 architecture. The Genius of RISC-V Microprocessors — Nothing about SIMD but about instruction compression and macro-operation fusion in RISC-V cores.

*/

frank-dspeed commented 1 year ago

Explain how the RISC processor Architects took learnings like instruction compression and macro operations from existing coding languages and applyed that to the Chip

frank-dspeed commented 1 year ago

emulating a amd64 https://notes.eatonphil.com/emulating-amd64-starting-with-elf.html