MEGA65 / mega65-core

MEGA65 FPGA core
Other
237 stars 84 forks source link

Hardware math division: 5211/193 > 27 #786

Open dansanderson opened 4 months ago

dansanderson commented 4 months ago

Test Environment (required) You can use MEGA65INFO to retrieve this.

Describe the bug The hardware math registers are returning an inexact result for 5211/193 (= 27). This Issue is to track down whether this is expected, or whether there is a flaw in the hardware division.

This was noticed in the BASIC, which might be exacerbating the imprecision with its conversion from fixed point to floating point, or might be using the hardware math registers incorrectly in some other way. https://github.com/MEGA65/mega65-rom-public/issues/101

To Reproduce

!cpu m65
!to "tdiv.prg", cbm

* = $2001

!8 $12,$20,$0a,$00,$fe,$02,$20,$30,$3a,$9e,$20
!pet "$2014"
!8 $00,$00,$00

multina = $d770
multinb = $d774
mathbusy = $d70f
divquotient = $d76c
divremain = $d768

  ldq dividend
  stq multina
  ldq divisor
  stq multinb
loop:
  bit mathbusy
  bmi loop
  ldq divquotient
  stq $1804
  ldq divremain
  stq $1800
  brk           ; 

dividend:
!16 5211,0
divisor:
!16 193,0

This breaks to the monitor. Type M1800 to see the result. The result is DIVOUT whole (D76C-F) of 1B 00 00 00 = +27 and DIVOUT frac (D768-B) of 03 00 00 00 = +3 != 0.

Expected behavior DIVOUT frac of 0 is desired.

lydon42 commented 4 months ago

There is a python vunit testbench in tests/hardware_divider.vhdl which can be used to debug the math unit.

Rhialto commented 4 months ago

Ok, so I'm absolutely a beginner in reading VHDL. But it seems the divider in question is this one: https://github.com/MEGA65/mega65-core/blob/development/src/vhdl/fast_divide.vhdl Is there any reference where I can find the algorithm and compare it to my understanding of it, after staring at it for a bit?

But I think I get (mostly) how it is supposed to work. It is an iterative approximation of the desired result, with as invariant that the value nn / dd remains the same throughout (at least in the mathematical sense, if no bits are lost or overflow happens or that sort of thing).

This is clearly true in the "normalise" stage: both values get shifted left the same amount. Nonzero bits from nn will not fall out on the left because nn has more bits than dd.

In the "step" phase, they are both multiplied with the same value every time. (I didn't verify how this particular factor 2 - dd is supposed to bring us closer to the desired end state, but I'll assume for now that that is true). At the end, nn contains the quotient and the remainder.

The stopping condition of the step phase surprises me a bit. In particular, it seems that the desired end state here is that dd == x"FFFFFFFFF". However, I would have expected a value 1 more, i.e. x"1000000000". With this value, the division nn / dd is exactly equivalent to a shift-right of a number of bits, and the quotient is the result. The fraction would be the part that's shifted out.

Now the difference between dividing by either of those two numbers is likely to be fairly small, especially for larger values in nn. So probably in many cases the extra 4 less significant bits that are used in nn and dd plus a little bit of rounding could compensate for this. But probably not in all cases. It may explain the comment about adding one (although dividing by a number that is slightly too small would give a result that is already slightly too large).

The number of step iterations may be too low, too. I did some imperfect calculation for our dd=normalized(193) and I came to about 8 iterations until some kind of end condition arose (and it wasn't all FFs) (but I don't trust completely that python did what I wanted).

I searched around, and It seems this division is a form of Goldschmidt division ( https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division ) with fixed-point numbers (instead of floating point?). The text under the next header suggests that 5 iterations will give you 2^5 = 32 bits of precision (under the right circumstances). That sounds like enough, but I think we want at least 64 bits here (for 32 bits of quotient plus 32 bits of fraction), which is at iteration extra. 68 bits would require another iteration.