Mellvik / TLVC

Tiny Linux for Vintage Computers

Thoughts on changing ktcp timer resolution #71

Open ghaerr opened 1 month ago

ghaerr commented 1 month ago

In previous comments, I made the statement:

I would treat making a single system call per ktcp cycle as very fast. The system call overhead is almost identical to the hw interrupt processing overhead ... and of course the 100 timer calls/second are thought of as minuscule overhead.

After looking at the code generated for the kernel sys_gettimeofday routine, I would say "Oops - that's only part of the story, it's actually not fast". In order to prepare the returned system time into a struct timeval, a 32-bit divide (by calling __udivsi3) plus an additional 32-bit modulo (by calling __umodsi3) are performed, along with a couple of other 32-bit adds. This is all slow as heck. (Unbundled sources for each are here and here, and they loop.)

Then, after that, ktcp's time wrapper function timer_get_time, which returns a 32-bit time in 1/16 second resolution, goes through yet another slow call to __udivsi3 to convert microseconds into 1/16-second units by dividing by 62,500:

timeq_t timer_get_time(void)
{
    struct timezone tz;
    struct timeval  tv;

    gettimeofday(&tv, &tz);

    /* return 1/16 second ticks, 1,000,000/16 = 62500*/
    return (tv.tv_sec << 4) | ((unsigned long)tv.tv_usec / 62500U);
}

Besides being quite slow, 1/16 second = 62.5 msecs, roughly 6 times coarser than the 10ms clock resolution and the 5-10 msecs for an ACK response discussed here.

(I also notice in the above code, as I'm writing this, that the second ("&tz") param to gettimeofday is not used and wastes more time returning it - it should be NULL.)

In another post I stated:

Whether you choose 1, 10 or 100ms isn't probably a big deal if we stick with the jiffies-related (100ms) hw timer as base real time, but the lower the better

After thinking more on this and looking at the timer_get_time routine - that answer also gives only half the story. The problem is that, for efficiency reasons, any "ktcp" time needs to fit in 32 bits, not a timeval struct, in order to do a quick subtraction and not end up calling the gcc long div/mod routines each time it's used.

Thus the tradeoff: the more resolution given to micro or milliseconds, the less bits given to the "seconds" portion of a timeq_t.

In the current timeq_t, 28 bits are seconds and 4 bits are 1/16 seconds. This allows for 2^28 = 256M/86,400 = ~3106 days of timer. (A timeq_t may have to be in effect for the total duration of a TCP connection, I believe). To move to microsecond resolution would require 20 bits (1M) for microseconds, leaving only 12 bits (=4096) for seconds, a little over an hour. So that won't work, and the kernel hw timer has max 10ms (=HZ = 1/100) resolution anyways.
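To make the tradeoff concrete, here is an illustrative sketch (not code from ktcp) of how the fractional-bit choice trades wrap range for resolution in a 32-bit timeq_t:

/* Illustrative only: more fractional bits means finer resolution
   but fewer seconds before the 32-bit value wraps. */
typedef unsigned long timeq_t;

#define TIMEQ_RANGE_SECS(fracbits)  (1UL << (32 - (fracbits)))

/* TIMEQ_RANGE_SECS(4)  = 2^28 seconds, ~3106 days before wrap (current 1/16 sec ticks)
 * TIMEQ_RANGE_SECS(8)  = 2^24 seconds,  ~194 days before wrap (1/256 sec ticks)
 * TIMEQ_RANGE_SECS(20) = 2^12 seconds, a little over an hour  (microsecond ticks) */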

Assuming we stay with an approximately 10ms timer, which would at least measure ACK responses favorably on a 286/12.5MHz or V20/8MHz, how would the math work, given that the gettimeofday/timer_get_time path is already quite slow?

If we considered 1/256 second resolution (=~3.9ms) or 1/128 resolution (=~7.8ms), this would help a lot and the timer_get_time routine could be made faster (see below). If 1/256sec is used, that leaves 24 bits for seconds (=16M/86400 = ~194 days). 1/128 would be 25 bits with 388 days. But we should be time sampling with less than half the timer resolution, so I would say 1/256 second resolution would be best, and TLVC/ELKS is likely just fine with an open TCP connection for 194 days before worrying about RTT miscalculations.

One might ask, why not just use 1/100 second (=10ms) resolution directly? Well, the original timer_get_time routine is somewhat magic in that it uses the fact that 1M/16 = 62,500, so the microsecond value divides exactly into 4 bits of 1/16-second ticks. Dividing microseconds by 10,000 to get 1/100-second ticks would still be quite slow, and the 0-99 result doesn't fill 7 bits neatly. And if we're going to use 10ms resolution, why not make a special new system call and just return jiffies directly, which would be quite a bit faster, and would only need a compare and a subtract or two to compute elapsed time.
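For comparison, a jiffies-based elapsed-time computation really is just a subtraction (a sketch, not existing ktcp code):

/* Sketch: with 10ms jiffies as the time base, elapsed time needs no
   divide/mod helpers; unsigned subtraction also handles eventual wrap. */
static unsigned long last_jiffies;

unsigned long elapsed_jiffies(unsigned long now)
{
    unsigned long diff = now - last_jiffies;

    last_jiffies = now;
    return diff;
}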

Another thought would be to just use a right-shift-12 (=1M/256 = 2^20/2^8) for roughly converting struct timeval microseconds to 1/256 seconds, this would leave the last line of the routine as:

    return (tv.tv_sec << 8) | (((unsigned long)tv.tv_usec >> 12) & 0xFF);

I haven't yet figured out how to measure how long the __udivsi3 or __umodsi3 routines take, so I'm not fully sure just how important all this is, given the kernel's 10ms max resolution.

What do you think?

Mellvik commented 1 month ago

Very interesting @ghaerr, and very timely indeed. Great discussion - and I think we're converging on a solution.

Unsurprisingly I've been struggling with this over the past few days as I've been experimenting with simple ways of implementing (and measuring) slow start & congestion avoidance. There are plenty of surprises in this mess, some actually pleasant in the sense that smart things have been done and smart choices made - such as the original 1/16s now obviously being there for a good reason. Also, there is a lot to learn about how ktcp actually works, very valuable as we head into weeding out the final (?) problems and ridding it of the double ACKs, which have suddenly become an issue - if not a very practical one, at least a load concern.

Re. timer: I started with microseconds, found it messy in the sense that lots of zeroes didn't contribute to anything, 10ms being the actual resolution anyway. Also, I got the obvious wraparound problem with 12 bit seconds messing me up. Then I moved to 1ms which improved things, but didn't eliminate the wraparound issue for seconds. And of course the convergence algorithm for RTT needed adjustment as the time resolution changed. As your message arrived I was back to 1/16s, postponing the timer resolution issue until the algorithms are under control.

That said, having increased resolution was helpful in providing a clearer picture of actual round trip times and delays, a desirable change.

Your research and analysis is very educational - and interestingly lands very close to where I was sitting before we started this discussion: not being even close to the task of creating a new system call, I played with the idea of creating an ioctl in tcpdev to get the current jiffies value. The thinking was that the gettimeofday call was probably expensive, and I was (incorrectly) thinking that getting an updated time value more often than once every ktcp loop was useful.

Now we're here, and I strongly support the idea of getting jiffies via a system call. Simplicity at its best. And safe. I may be missing something, but from my perspective - here's the problem with using gettimeofday: There is always going to be a wraparound issue with seconds when we truncate - unless we subtract a constant from the seconds value before creating the combined 32bit time value. Even with 28 bits of seconds ORed in, there is a chance that those 28 bits will go from a very big to a very small number between 2 calls to gettimeofday. jiffies are guaranteed to not have that problem.

IOW - jiffies have my vote, and I'll keep the 1/16 setup until we have a solution. This is good!

ghaerr commented 1 month ago

Interesting @Mellvik, I figured you might be in the middle of a timer rewrite, glad to hear your comments!

here's the problem with using gettimeofday: There is always going to be a wraparound issue with seconds when we truncate

jiffies are guaranteed to not have that problem.

Well, yes, when truncating the 52-bit gettimeofday result to 32 bits, but we have the same wrapping problem with jiffies, just in a different way: the 32-bit jiffies value stores sub-second hw timer ticks (=10ms = 100/second) effectively in 6.5 (let's call it 7) bits (=128). So the upper 25 bits are pretty much seconds and the lower 7 bits are the count of 10ms ticks. This means that jiffies will wrap every 2^25 = 32M/86,400 = ~388 days. If you're worried about 1/256 second wrapping every 194 days, I'd be as worried about 388 days.

Nonetheless, getting jiffies directly for timing purposes saves lots of time, and is definitely more accurate, considering that __udivsi3/__umodsi3 divide by looping, thus adding a variable number of microseconds to each measurement. We would need to calculate how many instructions per second a slow 8086 executes and then count instructions in the __udivsi3 routine(s) in order to get a more accurate idea of how slow it all is. I would guess on the order of 50-200 microseconds; not much when our timer resolution is 10,000 microseconds. Still - it all adds up.

I played with the idea of creating an ioctl in tcpdev to get the current jiffies value.

Actually, I think this is a better idea than writing a new system call. Since ktcp already does some special ioctls, another ioctl would be very simple. And then code the ioctl to return the address of the kernel jiffies variable itself (a pointer), rather than the value. Then a far pointer could be used to read jiffies directly - how's that for speed?

The ioctl only needs to return the kernel address (DS offset) of jiffies, since MEM_GETDS is already implemented to get the kernel DS segment value.

Note that the kernel itself doesn't concern itself with jiffies wrapping - most kernel wait code adds a value to jiffies without checking for wraparound. However, if one really wanted to be correct, something like the following could be used:

    unsigned int kds, jaddr;
    unsigned long __far *jp;
    unsigned long lasttick = 0;
    int fd = open("/dev/kmem", O_RDONLY);
    ioctl(fd, MEM_GETDS, &kds);
    ioctl(fd, MEM_GETJIFFADDR, &jaddr);
    jp = _MK_FP(kds, jaddr);

    /* get current jiffies */
    unsigned long now = *jp;
    /* get jiffies difference */
    long diff = now - lasttick;
    /* jiffies wrap check */
    if (diff < 0)
        diff = 0xFFFFFFFFUL - lasttick + now;

    /* do something with diff */
    ...
    lasttick = now;

Another thought would be to just use a right-shift-12 (=1M/256 = 2^20/2^8) for roughly converting struct timeval microseconds to 1/256 seconds

Thinking further, my above suggestion of converting microseconds to 1/256-second ticks with a right shift by 12 (dividing by 4,096 instead of the exact 3,906.25) won't work properly - microseconds run 0-999,999 (1M), and 1M is not evenly divisible by 256 the way it is by 16. The shift effectively treats a second as 1,048,576 (=2^20) microseconds, 48,576 too many; since 1M/256 = 3,906 and 48,576/3,906 = ~12.4, there would be about 12 incorrect ticks every 256 ticks. No good.

However, 1M is exactly divisible by 64 (=15,625), like it is by 16, so a timer change to 1/64 second would work without error, even though it'd still be slow. A 1/64 second timer is ~15.6ms, and leaves 2^32/2^6 = 26 bits for seconds, so wrap occurs after 64M/86,400 = ~776 days.
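If one went the 1/64-second route, the wrapper might look like the sketch below (same structure as the current routine; the exact divide by 15,625 would still go through __udivsi3):

timeq_t timer_get_time(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);    /* timezone not needed, pass NULL */

    /* return 1/64 second ticks: 1,000,000/64 = 15,625 exactly */
    return (tv.tv_sec << 6) | ((unsigned long)tv.tv_usec / 15625U);
}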

A 15ms timer resolution using 1/64 second isn't nearly as nice as 10ms jiffies, and we still can explore the potential option of reprogramming the 8254 PIT to either timeout quicker or start another timer and read it directly. The PIT reprogramming could be turned on only when ktcp is running in debug mode for more accurate timing, on the PC platform.

Mellvik commented 1 month ago

Thank you @ghaerr - very educational indeed, a much appreciated run-through.

To me, simplicity is key and ktcp's requirements are really 'unsophisticated'. Actually, now that the 1/16s metric is back in my code, I suddenly find it more workable - probably because I now understand more about the inner workings of the entire 'package'. That said, I'm still eager to increase resolution for the same reasons as before: Better understanding, easier reading, more leeway in tuning and more.

In fact there are likely many places in the system that would benefit from the availability of a 'reasonable'-resolution, performance-wise cheap timer value. IP and TCP were created to be lax about time, and some early RFCs refer to 0.5 seconds as a reasonable resolution. I had to smile when reading that.

Back to the case: your run-through and examples make me even more convinced that jiffies is the thing. I was thinking of using tcpdev for an ioctl to get it, since tcpdev is already open. That benefit obviously becomes irrelevant when the call is made only once - to return a pointer. So /dev/kmem definitely makes more sense. As to the wrapping, my concern has not been the timespan we can store in a long, but what happens when there is a wrap between two subsequent calls to the timer (which I ran into when using microseconds and had only 12 bits for seconds). When the chance of that happening is minimal and the consequences non-catastrophic, my concern seems exaggerated. Also, jiffies being good enough for the kernel, they are good enough for ktcp - without the wraparound check. The simplicity of just accessing jiffies from a far memory location is unbeatable - and 10ms resolution is fine.

Mellvik commented 1 month ago

... we still can explore the potential option of reprogramming the 8254 PIT to either timeout quicker or start another timer and read it directly. The PIT reprogramming could be turned on only when ktcp is running in debug mode for more accurate timing, on the PC platform.

This is a very interesting idea, @ghaerr - let's keep this one warm, even while changing ktcp to use jiffies.

Mellvik commented 1 month ago

@ghaerr, let me elaborate on my last comment about making another timer available via the PIT, preferably one with microsecond resolution: the kernel - drivers in particular - contains many delay loops. However undesirable, these do-nothing-but-wait loops are required by physical interfaces, a requirement that grows with the age of the hardware. In most cases the delays occur in reset/initialization code outside what we can call 'normal operation', and as such they may seem insignificant. However, and again in proportion to age, things go astray when dealing with physical stuff, whether networks or media, and recovery is required - resets, initialization, etc. - meaning the delay loops are used much more often than may be obvious. Curious fact: the EL3 NIC actually has a timer for this purpose - unused by our driver.

The delay loops in use are typically unscaled; the delay they deliver varies wildly between processors and systems, in most cases slowing down already slow systems while having negligible effect on faster iron. Which is why I introduced the idea of reviving BOGOMIPS a while back - the ability to scale delay loops to the iron.

So there are actually two issues here: one, replace the delay loops scattered all over the place with a library routine; and two, enable exact delays instead of 'add an order of magnitude to be safe on slow iron'.

On the original issue, I'm adding the ioctl to the kmem driver per your example today, and changing ktcp accordingly, at first converting 10ms to 60ms. Then, when everything else is working, I'll move the scaling into the code.

ghaerr commented 1 month ago

Agreed that delay loops should not be "unscaled", but rather associated with a realtime timer so that delays are exact, at least to the resolution of the timer. I changed all the ELKS kernel delay loops to use a jiffie-based timer loop a while back, and IIRC the only non-scaled loops remaining are in the NIC drivers and the reset_floppy DF driver routine. The NIC drivers weren't changed since I'm unable to test them. The DF routine can be easily changed once the actual real time required for reset is known.

Which is why I introduced the idea of reviving BOGOMIPS a while back - the ability to scale delay loops to the iron.

IMO, it's better to reference a realtime timer or the volatile jiffies value than a scaled "bogomips-derived" delay count, as kernel interrupts can occur during a delay countdown and use up more real time than is accounted for in bogomips loops.
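As an illustration, a jiffies-referenced delay might look something like this sketch (assuming HZ=100 and a volatile jiffies counter; not the actual ELKS/TLVC routine):

extern volatile unsigned long jiffies;      /* incremented by the timer interrupt */

/* Spin until at least 'ticks' 10ms jiffies have elapsed; the +1 guards
   against starting partway through the current tick. */
void jiffy_delay(unsigned int ticks)
{
    unsigned long end = jiffies + ticks + 1;

    while ((long)(jiffies - end) < 0)
        ;   /* interrupts keep jiffies advancing, so real time is tracked */
}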

Mellvik commented 1 month ago

The reason your PIT reference caught my interest was the need for a high-res timer for some of these delay loops. Having access to a realtime timer with, say, 0.1ms resolution would be really valuable, including for performance measurement.

ghaerr commented 1 month ago

I've researched the 8253/8254 PIT and have found a way to get fast precision timing, although only within a certain elapsed time window, due to 32-bit overflow issues and avoiding software divide routines.

Long story short, the kernel PIT has an input clock frequency of 1,193,181.8 Hz, which divided by 100 to produce 1/100-second timer interrupts (=HZ) gives an initial timer countdown value of 11932. With the PIT you can read the current countdown value directly from the chip, which means that the time between each counter decrement is 10ms/11932 = 0.8381 usecs = 838.1 nanoseconds. (See elks/arch/i86/kernel/timer-8254.c for more detail.)

The original IBM PC master oscillator of 14.31818 MHz was divided by 3 for the CPU and by 4 for the CGA, and combining those with an AND gate yields the divide-by-12 PIT reference frequency above. For 4.77 MHz 8088s, which have a 4-cycle minimum instruction execution time, this somewhat conveniently gives a (4-cycle) base of 0.8381 usec/instruction as well. (However, I've been reading that the average instruction takes 13.67 cycles.)

In order to not overflow 32-bit longs and use fast hardware integer multiplies only, the PIT countdown can be read and multiplied by 84 (100 times 0.8381 rounded) to get "nticks", my new term denoting 10ns intervals (84 nticks would equal 838 nsecs or .838 usecs = 4 8088 CPU cycles). The kernel itself gets 10ms interrupts which we'll call "jticks", which are 1,000,000 nticks. This allows jticks and nticks to be added together to produce a total from which an ntick difference can easily be calculated.

So, the precision timer resolution is ~838 nanoseconds, or .838 microseconds (or 84 nticks), and is 11,932 times smaller than a jiffy. (This would also seem to be the max number of opcodes that could possibly be executed between hardware clock interrupts on a 4.77Mhz 8088).

The entire calculation can be performed in nticks as the following pseudocode (not tested):

#define MAX_TICK        11932       /* PIT reload value for 10ms (100 HZ) */
#define NANOSECS_TICK   838         /* nsecs per tick */

unsigned int count, lastcount = MAX_TICK;
unsigned int tickdiff;              /* elapsed PIT countdown ticks since last read */
unsigned long jiffies;              /* kernel 10ms timer */
unsigned long lastjiffies;
unsigned long nticks;               /* real time elapsed in nticks (10ns) */

    /* latch timer value for reading */
    outb(0, TIMER_CMDS_PORT);
    /* read timer countdown */
    count = inb(TIMER_DATA_PORT);
    count |= inb(TIMER_DATA_PORT) << 8;
    tickdiff = lastcount - count;
    /* handle wrap */
    if ((int)tickdiff < 0)
        tickdiff = MAX_TICK - count + lastcount;

    nticks = (jiffies - lastjiffies) * 1000000L + (tickdiff * 84L);
    lastcount = count;
    lastjiffies = jiffies;

The code produced is quite small and pretty fast, but uses 4 integer multiply instructions.

Using 32-bit longs, there is the potential for overflow in only 4294 (=2^32/1M) jiffies, which is ~42.94 seconds. This should work well for precision timing, but I'm trying to think of a way to turn high precision on and off without changing ktcp timing constants. A bit more thought on that is needed.

I plan on playing around more with this and coming up with a working example.

Mellvik commented 1 month ago

That's a great piece of research, @ghaerr - this is obviously right down your alley. 'nticks' it is. I'm really looking forward to the next instalment!

This will open up a previously unavailable tool chest for - like you say - precision timing! Combining it with jiffies is really smart. While ktcp is fine with jiffies in an operational sense, this will enable measurements and tuning along a completely new scale.

I'm curious - how do you get this (2 long multiplications) to become only 4 int mult's?

Also (BTW, thanks for the reference on instruction timings, an educational refresher on a lot more than timings :-) ), how do you get to the 4 clock cycle minimum instruction execution cycle? Is that the fetch+execution (mov reg,reg is 2 cycles IIRC)?

ghaerr commented 1 month ago

how do you get this (2 long multiplications) to become only 4 int mult's?

GCC does a nice job with it. In the case of multiplying a short by a short constant to produce a long result, only a single multiply is needed (the 8086 multiplies AX by a 16-bit reg/mem and produces DX:AX). For unsigned short tickdiff, ticknsecs = tickdiff * 838L generates:

   0:   56                      push   %si
   1:   b8 46 03                mov    $0x346,%ax
   4:   f7 26 00 00             mulw   tickdiff
   8:   a3 00 00                mov    %ax,ticknsecs
   b:   89 16 02 00             mov    %dx,ticknsecs+2
   f:   5e                      pop    %si
  10:   c3                      ret

In the case of a short times a long constant, ticknsecs = tickdiff * 0x54321L:

   0:   56                      push   %si
   1:   8b 1e 00 00             mov    tickdiff,%bx
   5:   b8 05 00                mov    $0x5,%ax
   8:   f7 e3                   mul    %bx
   a:   91                      xchg   %ax,%cx
   b:   be 21 43                mov    $0x4321,%si
   e:   93                      xchg   %ax,%bx
   f:   f7 e6                   mul    %si
  11:   01 d1                   add    %dx,%cx
  13:   a3 00 00                mov    %ax,ticknsecs
  16:   89 0e 02 00             mov    %cx,ticknsecs+2
  1a:   5e                      pop    %si
  1b:   c3                      ret

For the above two cases, I kept tickdiff to be unsigned short as it is effectively the 8254 PIT elapsed count difference and saves some instructions. In the sample precision timer code above, we still save one multiply because tickdiff remains unsigned short even though the multiply by 84L isn't stored into a variable.

For the case of a long times a non-constant long, it's more complicated; ticknsecs = jiffies * nsecs:

00000000 <x>:
   0:   56                      push   %si
   1:   a1 02 00                mov    jiffies+2,%ax
   4:   f7 26 00 00             mulw   nsecs
   8:   93                      xchg   %ax,%bx
   9:   a1 02 00                mov    nsecs+2,%ax
   c:   f7 26 00 00             mulw   jiffies
  10:   91                      xchg   %ax,%cx
  11:   01 d9                   add    %bx,%cx
  13:   a1 00 00                mov    jiffies,%ax
  16:   f7 26 00 00             mulw   nsecs
  1a:   01 d1                   add    %dx,%cx
  1c:   a3 00 00                mov    %ax,ticknsecs
  1f:   89 0e 02 00             mov    %cx,ticknsecs+2
  23:   5e                      pop    %si
  24:   c3                      ret

With the OpenWatcom compiler, the same thing is done but a helper routine __I4M is called.

ghaerr commented 1 month ago

how do you get to the 4 clock cycle minimum instruction execution cycle?

I read it somewhere. I also read that timing starts at 5 cycles for Base or Index effective address calculation and goes up to 12 cycles. The bare minimum is 4 when using register-to-register, but this can sometimes change because there is some variety to instruction prefetching. (Of course all this varies wildly per processor variation.) The interesting thing in the referenced link is that 4 more cycles are added for accessing words at odd addresses... much more of a performance hit than I had thought. I plan on looking at various produced .o files to ensure that mixed char and int globals aren't stuffed together into possibly odd addresses for ints. I know that "common" (.comm) variables aren't allocated that way (BSS only) but am not sure about DATA section variables, based on their C declaration.
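As an illustration (hypothetical declarations, not from the kernel source), declaration order alone can leave an int at an odd address if the data section is packed without padding:

/* Illustrative only: a word following an odd number of bytes may land at an
   odd offset, costing 4 extra cycles per 16-bit access on 8086/8088. */
char dbg_flag = 1;          /* 1 byte in the data section */
int  pkt_count = 0;         /* may end up at an odd address right after it */

/* Grouping words before bytes (or padding explicitly) avoids the penalty: */
int  pkt_count2 = 0;
char dbg_flag2 = 1;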

For our estimates of speed, I found it surprising that the average is 13.67 cycles/instruction. However, I'm inclined to believe it and am building the following "simple outlook" on imagining how time is spent in an 8086/8088 ELKS system:

With the kernel PIT running a 11932 countdown to produce 10ms interrupts, each countdown (stated previously) amounts to 0.838 microseconds, which is 4 cycles.

Thus, rounding to 12,000 countdowns (4 cycles each or one basic instruction) per 10ms clock interrupt would mean 12,000 instructions before a task switch. But with the real instruction cycle average being 12 (rounded down from 13.67), this amounts to 4,000 average instructions executed each time slice. If the average instruction is 12 cycles (for simplicity), this would mean each instruction is 3 x 4 cycles = 3 x 0.838 usecs = ~2.5 usecs/instruction.

Finally, at average ~2.5usecs/instruction, 4000 instructions is 10,000 usecs which equals, you got it - 10ms - one jiffie.

So there you have it. Precision timing samples starting at a very fast 4-cycle instruction, with three samples per average instruction. 4000 average instructions later we have a jiffie increment and possible task switch. This keener precision approach to imagining what is going on at the process time slice and kernel driver level should allow us to better understand what the real costs of interrupt service routines are and also quantify in which method a kernel busy-loop might be coded to allow other code to run or not (same for interrupt disabling).

Mellvik commented 1 month ago

Again @ghaerr, this is both very interesting and very useful.

Thanks for the run-down on the mul instructions, now I understand. So we have 6 muls, which is pretty good. This exercise immediately turned on the -O flag in my head and inspired me to take a closer look at the ktcp algorithm that adjusts the rtt for every ACK received - a simple 'slow convergence' algorithm that uses a weighting factor (TCP_RTT_ALPHA) to ease the changes. The algorithm uses both multiplication and division - on longs - which is why it caught my attention, the initial question being whether it can be changed to use shifts instead (a possible shift-based variant is sketched after the code comparison below).

n->cb->rtt = (TCP_RTT_ALPHA * n->cb->rtt + (100 - TCP_RTT_ALPHA) * rtt) / 100;

That question is still open, but I did change the rtt and rto variables from timeq_t (u32) to unsigned int - they never exceed 100 anyway - which changed the code dramatically.

Old code (TCP_RTT_ALPHA is currently set to 40, 0x28):

454:   e8 fe ff                call   455 <tcp_retrans_expire+0xb5>
                        455: R_386_PC16 fprintf
 457:   83 c4 08                add    $0x8,%sp
 45a:   8b 5e fa                mov    -0x6(%bp),%bx
 45d:   89 d8                   mov    %bx,%ax
 45f:   8b 4e fe                mov    -0x2(%bp),%cx
 462:   09 c8                   or     %cx,%ax
 464:   74 4b                   je     4b1 <tcp_retrans_expire+0x111>
 466:   8b 7c 12                mov    0x12(%si),%di
 469:   b8 28 00                mov    $0x28,%ax
 46c:   f7 65 16                mulw   0x16(%di)
 46f:   89 46 fe                mov    %ax,-0x2(%bp)
 472:   b8 28 00                mov    $0x28,%ax
 475:   f7 65 14                mulw   0x14(%di)
 478:   89 46 fa                mov    %ax,-0x6(%bp)
 47b:   89 56 fc                mov    %dx,-0x4(%bp)
 47e:   8b 46 fe                mov    -0x2(%bp),%ax
 481:   01 46 fc                add    %ax,-0x4(%bp)
 484:   b8 3c 00                mov    $0x3c,%ax
 487:   f7 e1                   mul    %cx
 489:   91                      xchg   %ax,%cx
 48a:   93                      xchg   %ax,%bx
 48b:   bb 3c 00                mov    $0x3c,%bx
 48e:   f7 e3                   mul    %bx
 490:   01 ca                   add    %cx,%dx
 492:   8b 4e fa                mov    -0x6(%bp),%cx
 495:   01 c1                   add    %ax,%cx
 497:   8b 5e fc                mov    -0x4(%bp),%bx
 49a:   11 d3                   adc    %dx,%bx
 49c:   31 d2                   xor    %dx,%dx
 49e:   52                      push   %dx
 49f:   ba 64 00                mov    $0x64,%dx
 4a2:   52                      push   %dx
 4a3:   53                      push   %bx
 4a4:   51                      push   %cx
 4a5:   e8 fe ff                call   4a6 <tcp_retrans_expire+0x106>
                        4a6: R_386_PC16 __udivsi3
 4a8:   83 c4 08                add    $0x8,%sp
 4ab:   89 45 14                mov    %ax,0x14(%di)
 4ae:   89 55 16                mov    %dx,0x16(%di)
 4b1:   56                      push   %si
 4b2:   e8 fe ff                call   4b3 <tcp_retrans_expire+0x113>
                        4b3: R_386_PC16 rmv_from_retrans
 4b5:   89 c6                   mov    %ax,%si

New code:

401:   e8 fe ff                call   402 <tcp_retrans_expire+0x9d>
                        402: R_386_PC16 fprintf
 404:   83 c4 06                add    $0x6,%sp
 407:   85 ff                   test   %di,%di
 409:   74 1b                   je     426 <tcp_retrans_expire+0xc1>
 40b:   8b 5c 10                mov    0x10(%si),%bx
 40e:   b8 28 00                mov    $0x28,%ax
 411:   f7 67 14                mulw   0x14(%bx)
 414:   91                      xchg   %ax,%cx
 415:   b8 3c 00                mov    $0x3c,%ax
 418:   f7 e7                   mul    %di
 41a:   01 c8                   add    %cx,%ax
 41c:   b9 64 00                mov    $0x64,%cx
 41f:   31 d2                   xor    %dx,%dx
 421:   f7 f1                   div    %cx
 423:   89 47 14                mov    %ax,0x14(%bx)
 426:   56                      push   %si
 427:   e8 fe ff                call   428 <tcp_retrans_expire+0xc3>
                        428: R_386_PC16 rmv_from_retrans
 42a:   89 c6                   mov    %ax,%si

... which seems like a pretty dramatic improvement right there. Actually, this along with the use of jiffies (the complete elimination of calls to idiv?? plus several MULs at about 115 cycles each) probably shaves off nearly 2000 cycles per main loop in ktcp.
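On the open question above about replacing the multiply/divide with shifts: a shift-only smoothing along the lines of the classic Van Jacobson approach (weight 1/8 on the new sample, a different weighting than TCP_RTT_ALPHA = 40/100) might look like this sketch, which is not the current ktcp code:

/* Sketch: srtt += (rtt - srtt)/8, using only a shift and adds.
   'srtt' and 'rtt' are in the same tick units (e.g. jiffies).
   Relies on arithmetic right shift of negative values, as gcc does. */
unsigned int rtt_smooth(unsigned int srtt, unsigned int rtt)
{
    int err = (int)rtt - (int)srtt;

    return (unsigned int)((int)srtt + (err >> 3));
}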

how do you get to the 4 clock cycle minimum instruction execution cycle?

I read it somewhere. I also read that timing starts at 5 cycles for Base or Index effective address calculation and goes up to 12 cycles. The bare minimum is 4 when using register-to-register, but this can sometimes change because there is some variety to instruction prefetching. (Of course all this varies wildly per processor variation.) The interesting thing in the referenced link is that 4 more cycles are added for accessing words at odd addresses... much more of a performance hit than I had thought.

Yes, all this matches up with my impressions (and I agree with you, there are some surprises here, in particular the one about alignment) except for the minimum cycles per instruction - unless the counting of cycles varies. The way I read it is that the fastest instructions - like reg-reg-move - execute in 2 cycles. If you add the fetching, it may be 3 or 4 - or zero if already in the prefetch queue. I'm clipping in a couple of pages from the ultimate source (!) on the issue, Hummel's Technical Reference, an extremely valuable resource these days (got it on eBay when I got involved with ELKS some 7 years ago) - recommended.

IMG_9930 X86-timing.pdf

For our estimates of speed, I found it surprising that the average is 13.67 cycles/instruction. However, I'm inclined to believe it and am building the following "simple outlook" on imagining how time is spent in an 8086/8088 ELKS system:

Agreed, this value seems reasonable and the method they used to determine it, very convincing - actually pretty smart! Valid for 8086/8088, less so for later architectures. Then again, that's what we're talking about - 4.77MHz and 86/88.

With the kernel PIT running a 11932 countdown to produce 10ms interrupts, each countdown (stated previously) amounts to 0.838 microseconds, which is 4 cycles.

Thus, rounding to 12,000 countdowns (4 cycles each or one basic instruction) per 10ms clock interrupt would mean 12,000 instructions before a task switch. But with the real instruction cycle average being 12 (rounded down from 13.67), this amounts to 4,000 average instructions executed each time slice. If the average instruction is 12 cycles (for simplicity), this would mean each instruction is 3 x 4 cycles = 3 x 0.838 usecs = ~2.5 usecs/instruction.

Finally, at average ~2.5usecs/instruction, 4000 instructions is 10,000 usecs which equals, you got it - 10ms - one jiffie.

So there you have it. Precision timing samples starting at a very fast 4-cycle instruction, with three samples per average instruction. 4000 average instructions later we have a jiffie increment and possible task switch. This keener precision approach to imagining what is going on at the process time slice and kernel driver level should allow us to better understand what the real costs of interrupt service routines are and also quantify in which method a kernel busy-loop might be coded to allow other code to run or not (same for interrupt disabling).

This is indeed enlightening. The 12-cycle average is more important than whether the fastest instruction takes 2 or 4 cycles. 4000 instructions per time slice really doesn't get much done, does it? It may seem that this is part of the challenge with very slow machines: so little gets done per time slice. And suddenly the difference between my XT and my V20/8MHz is big - twice as much gets done per slice. It would be very interesting to find the actual price of a task switch (and, like you mention, the interrupt overhead). No wonder ktcp is slow on the old iron. The challenge now is to get it to work (well) anyway.

Thank you.

ghaerr commented 1 month ago

except for the minimum cycles per instruction - unless the counting of cycles varies. The way I read it is that the fastest instructions - like reg-reg-move - execute in 2 cycles. If you add the fetching, it may be 3 or 4 - or zero if already in the prefetch queue.

I'm not sure, but I think you're going in the right direction when counting instruction fetch. IIRC a word access requires 4 cycles, so perhaps a reg/reg move completes within the 4-cycle instruction fetch. I'd like to hear more about what you figure out; I'll leave this for you to explore more fully.

4000 instructions per time slice really doesn't get much done, does it?

Yes, that's not much, is it? If we can figure out the average execution time per byte of code emitted, that might allow us to quickly estimate how long a function takes, without loops being taken into account. And yes, the point about 8Mhz being (obviously) twice as fast hits home when considering the smaller number of instructions being executed in the first place. It's nice to be getting conceptually closer to the metal in these timing discussions.

It would be very interesting to find the actual price of a task switch

That's easily done: just count instructions in _irqit (not cycles). Use ~2.5usecs as average instruction time. Note that all system calls and hardware interrupts (except CONFIG_FAST_IRQ4 serial I/O) go through a nearly identical _irqit experience. When first writing the fast serial driver 4+ years ago, I learned that the interrupt/syscall/trap overhead is too high for 38.4K serial baud.

So we have 6 muls which is pretty good

Yes, that would seem great - except that after this discussion it might still be too slow, as MUL is very slow on 8088's, right? (Of course, 8088's just can't do much which is why few run them anymore lol; the long divisions required to do the timing printf's will really knock us out!)

I did change the rtt and rto variables from timeq_t (u32) to unsigned int - they never exceed 100 anyway - which changed the code dramatically.

Very interesting! I seem to remember a quick look at the timeq_t declarations when you first brought this up, but didn't think about the code required to support it. I would say that if your assumptions are correct about RTT/RTO remaining a small number that they should definitely be changed to unsigned short for ktcp!!

Actually, this along with the use of jiffies (the complete elimination of calls to idiv?? plus several MULs at about 115 cycles each) probably shaves off nearly 2000 cycles per main loop in ktcp.

Wow. Getting a handle on ktcp's main loop elapsed time seems important to get fast back-to-back NIC packets sent w/o buffering. 2000 cycles is a lot (wow, MUL is super slow, 118-143 cycles!!), even though there are 4000 * 12 = 48,000 cycles per hardware timer interrupt. So that's the equivalent of 2000/12 = 166 instructions out of 4000 saved in ktcp's main loop per task switch = 4%, huge improvement with no change in functionality!

ghaerr commented 1 month ago

It would be very interesting to find the actual price of a task switch

That's easily done: just count instructions in _irqit (not cycles). Use ~2.5usecs as average instruction time.

This has got me thinking: I'll bet the _irqit overhead is larger than the 5% that was my previous WAG, possibly significantly so if C routines and signal processing are taken into account (but not including the actual task switch code). And all that hardware interrupt handling code for the system timer serves only to update jiffies when there's no other task to run.

So perhaps a 10ms timer is too fast for old 8088s; less interrupt overhead might allow 10% more to get done when running a single application. I've added all this to my list to look into further. It would be quite easy to change the HZ frequency, although there is a bit of a mess in checking that the clock/date code still works and doesn't make any assumptions.

ghaerr commented 1 month ago

I spent a while counting instructions and tracing _irqit, and here are the results! Depending on exactly what is going on, lots of instructions can be executed. In particular, I traced IRQ 0 in detail. Tracing was done by counting instructions in routines disassembled using ia16-elf-objdump -D -r -Mi8086 and following possibly called functions.

ELKS overhead by instruction count (approximate)

Routine                                     Instruction count

_irqit (syscall)                            54 
    stack_check                             15+
    syscall                                 5+syscall
    do_signal                               16+
total                                       90+syscall

_irqit (hw interrupt, IRQ 0 for example)    85
    do_IRQ
        timer_tick (irq handler)
            do_timer                        23
                run_timer_list
                    cli/sti                 *
                    if timers
                        del_timer               27
                        sti/cli                 *
                        call timer_func         ??
            calc_cpu_usage                  14
                if count>200                    68 every 200x
            if disk activity and dirconsole
                if count>8
                    pokeb                       7 every 8x
        if count>5
            BIOS IRQ 0  x5                      29 on ibmpc, every 5x
    if was user mode:
        schedule                            93
            cli/sti
            del_from_runqueue ?                 32
            add_timer       ?                   29
            tswitch                         13
            del_timer       ?                   27
        do_signal                           16+
            cli                                 *
            call user mode handler ?
total                                       128+
                                                106+ if sched

Average interrupt overhead 128+ instructions = 320usecs
Average interrupt w/reschedule 234+ instructions = 585usecs
Basic 8088 runs ~12 cycles/instruction, ~2.5 usecs/instruction
Basic 8088 executes 4000 instructions/timer interrupt (@100HZ = 10ms = 10,000 usecs)

Precision 8254 timer is 0.838 usecs per countdown = 4 cycles = 1/3 average instruction
8088 MUL instruction takes ~120 cycles = ~25 usecs, 10 times longer than avg instruction

8088 Kernel Interrupt overhead 3.2%
8088 Kernel Interrupt w/reschedule overhead 5.6%
    When multiple processes running, reschedule is slower
    Reschedule only occurs with timer interrupt from user mode
    Any kernel timers need processing check each timer interrupt
    Various interrupt handlers/reschedule and kernel timers disable interrupts
    All other kernel drivers may take up time (DF, HD, NIC) away from ktcp and app

Add to all of this the thought that the average MUL instruction is, say, 120 cycles (10 times an average instruction time), so my previous comments about a "fast" precision timer with only 5 multiplies mean that a precision time sampling takes ~125+ usecs minimum, which is a full ~150 of the 0.838 usec countdowns of the 8254 precision timer itself!! Basically, a "precision" timer calculation takes something less than half as long as the kernel interrupt overhead on 8088.

I think MULs are a heck of a lot faster on 386+ CPUs, so perhaps this isn't a big deal except on real 8088's themselves.

Mellvik commented 1 month ago

Wow, that's indeed a lot to devour :-) - and again both interesting and useful, not the least to understand system behaviour and possible bottlenecks.

Some observations:

... the thought that the average MUL instruction is, say, 120 cycles (10 times an average instruction time), so my previous comments about a "fast" precision timer with only 5 multiplies mean that a precision time sampling takes ~125+ usecs minimum, which is a full ~150 of the 0.838 usec countdowns of the 8254 precision timer itself!! Basically, a "precision" timer calculation takes something less than half as long as the kernel interrupt overhead on 8088.

That's indeed a good point. Maybe we need to rethink this one completely in order to make it viable on slow iron.

I think MULs are a heck of a lot faster on 386+ CPUs, so perhaps this isn't a big deal except on real 8088's themselves.

That's also true. mul reg,reg is as you said almost 120 on 86/88, while 21 on the 286, 12-25 on the 386. Then again, 86/88 is the target and the challenge...

When first writing the fast serial driver 4+ years ago, I learned that the interrupt/syscall/trap overhead is too high for 38.4K serial baud.

If this is on the 86/88, it's actually pretty good. And since ELKS and TLVC have their own drivers, we're able to take advantage of the UART FIFOs when available, which at least theoretically should make it possible to exceed this limit, right? Process several chars per interrupt.

On a general note, I think we should stick with the 10ms clock even on the slow iron, proclaiming it 'good enough', and continue along the line that this thread started - optimizing where it really matters. Actually, off the top of my head, 3.2% interrupt overhead isn't all that bad, and neither is 5.6% for interrupt & reschedule. I would be surprised if a PDP11/20 with Unix v6 was much better. Of course they had more I/O offloading, but then again they were supporting many real and concurrent users too ...

Apropos timing, the ktcp update is getting there, the magic key turned out to be (yes, timing) an improved RTT. And quite a few other things, but a 'truer' RTT turned out to be the most important one. In hindsight completely unsurprising. More about that when the PR comes.

BTW - I'm inclined to head into _irqit and actually calculate the cycles from objdump. It would be interesting to compare to the estimate to get a feel for how close we really are.

ghaerr commented 1 month ago

The instruction timings themselves vary wildly depending on factors

Yes, good point, lots of factors. I was reading another timing reference that showed a register word INC taking only 2 cycles, not including instruction fetch. So I'm not sure which table(s) to use for instruction cycles, or when one needs to apply instruction fetch timing (after every jump?). Perhaps it's best to actually try to measure it using "precision" timing (more on that below).

approx. 12 cycles, come from a collection of applications and is probably too high for an OS like TLVC/ELKS.

I tend to agree, I was thinking about that as I calculated the irqit timing and those instructions did not seem like ~12 average. However, one must consider an "average" instruction timing would have to include MUL, so that likely pulls the average lots higher. Perhaps we should research and calculate an ELKS/TLVC average instruction that does not include high-time instructions, and watch for and flag the high-time instructions separately (including the C code that generates them).

Maybe we need to rethink this one completely in order to make it viable on slow iron.

Indeed. The 5 MULs were the result of creating a virtual "ntick" and converting PIT countdowns and jiffies to the same base units so they could be added. A better approach might be just to store the 4-byte jiffies and the 2-byte countdown as a "measurement", and delay multiplications, or possibly just display jiffies and countdown separately. Also, for small measurements, the jiffies diff might be 0 and just countdown (wrap handling still needed). So some fast code could perhaps calculate jiffie and countdown diff values and just store them. Of course, once it comes time to actually see the result, the printf/printk routine will be so slow it'll wreck all runtime timing.
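A minimal sketch of that 'store now, convert later' idea (hypothetical names; assumes the PIT latch/read and MAX_TICK = 11932 shown earlier):

/* Sketch: record a raw sample as (jiffies, PIT countdown) and defer all
   multiplication and printing until after the timed region. */
struct ptime_sample {
    unsigned long jiffies;      /* kernel 10ms tick counter at sample time */
    unsigned int  count;        /* 8254 countdown value at sample time */
};

/* Convert the difference of two samples to nticks (10ns units) offline. */
unsigned long ptime_diff_nticks(struct ptime_sample *start, struct ptime_sample *end)
{
    long ticks = (long)(end->jiffies - start->jiffies) * 11932  /* full 10ms periods */
               + (int)(start->count - end->count);              /* countdown decreases */

    return (unsigned long)ticks * 84L;      /* 84 nticks (10ns) per PIT count */
}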

Which brings me to my next idea - kernel or application profiling. I am thinking that a super tool would be to store the CS:IP of executing code every timer tick into a far memory area of 64K bytes. Then, process or kernel symbol tables could be consulted to display the function being executed at each clock tick and a profile generated. The same approach could possibly be used for precision timing, where the jiffies/PITcount would just be stored somewhere and analyzed later.
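A rough sketch of the sampling side of that idea (hypothetical names; assumes the timer interrupt path can pass along the interrupted CS:IP saved on its stack frame):

/* Sketch: every timer tick, record the interrupted CS:IP into a far buffer;
   a post-processing tool maps the samples to symbols to build the profile. */
#define PROF_NSAMPLES   16384u              /* 16K samples * 4 bytes = 64K buffer */

static unsigned long __far *prof_buf;       /* far memory, allocated at profile start */
static unsigned int prof_next;

void profile_sample(unsigned int int_cs, unsigned int int_ip)
{
    if (prof_buf && prof_next < PROF_NSAMPLES)
        prof_buf[prof_next++] = ((unsigned long)int_cs << 16) | int_ip;
}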

I'm inclined to head into _irqit and actually calculate the cycles from objdump. It would be interesting to compare to the estimate to get a feel for how close we really are.

That'd be great - go for it. I'm interested in a better "average" instruction time for sequences that don't include very slow instructions. For reference, I divided _irqit into two streams, syscall and hardware interrupt, and traced the function calls, but didn't try to trace the probable execution path, instead just counting all instructions. So the most-used code path is likely smaller than my research suggested.

Mellvik commented 1 month ago

The instruction timings themselves vary wildly depending on factors

Yes, good point, lots of factors. I was reading another timing reference that showed a register word INC taking only 2 cycles, not including instruction fetch. So I'm not sure which table(s) to use for instruction cycles, or when one needs to apply instruction fetch timing (after every jump?). Perhaps it's best to actually try to measure it using "precision" timing (more on that below).

Actually, all the sources I have show the same timings - the Intel manual, the Hummel book, the one you reference which I incidentally looked at this morning. So we do have the exact timings, even for every possible addressing mode etc. Still, like Hummel points out in the pdf I included yesterday, there are more variables: a DMA transfer going on, a fetch colliding with an execution unit memory reference, an instruction exception, etc. So, what Hummel concludes is this (also in the pdf I think):

"Due to all the factors that cannot be predicted with certainty, the average program will take 5-10% [corrected] longer to execute than would be indicated by the timings." Sounds good to me.

approx. 12 cycles, come from a collection of applications and is probably too high for an OS like TLVC/ELKS.

I tend to agree, I was thinking about that as I calculated the irqit timing and those instructions did not seem like ~12 average. However, one must consider an "average" instruction timing would have to include MUL, so that likely pulls the average lots higher. Perhaps we should research and calculate an ELKS/TLVC average instruction that does not include high-time instructions, and watch for and flag the high-time instructions separately (including the C code that generates them).

It would indeed be interesting to see a top 25 list of used instructions in the kernel, which wouldn't be all that hard to do with objdump and an AWK script ...

Maybe we need to rethink this one completely in order to make it viable on slow iron.

Indeed. The 5 MULs were the result of creating a virtual "ntick" and converting PIT countdowns and jiffies to the same base units so they could be added. A better approach might be just to store the 4-byte jiffies and the 2-byte countdown as a "measurement", and delay multiplications, or possibly just display jiffies and countdown separately. Also, for small measurements, the jiffies diff might be 0 and just countdown (wrap handling still needed). So some fast code could perhaps calculate jiffie and countdown diff values and just store them. Of course, once it comes time to actually see the result, the printf/printk routine will be so slow it'll wreck all runtime timing.

Which brings me to my next idea - kernel or application profiling. I am thinking that a super tool would be to store the CS:IP of executing code every timer tick into a far memory area of 64K bytes. Then, process or kernel symbol tables could be consulted to display the function being executed at each clock tick and a profile generated. The same approach could possibly be used for precision timing, where the jiffies/PITcount would just be stored somewhere and analyzed later.

I think this is a great approach, much akin to what they did in the instruction timing document!

I'm inclined to head into _irqit and actually calculate the cycles from objdump. It would be interesting to compare to the estimate to get a feel for how close we really are.

That'd be great - go for it. I'm interested in a better "average" instruction time for sequences that don't include very slow instructions. For reference, I divided _irqit into two streams, syscall and hardware interrupt, and traced the function calls, but didn't try to trace the probable execution path, instead just counting all instructions. So the most-used code path is likely smaller than my research suggested.

OK; sounds like a printout, a pencil, a book and a sunny day on the terrace! It's on.

ghaerr commented 1 month ago

The average program will take 510% longer to execute than would be indicated by the timings.

Wow, kind of unbelievable!!! But then again, perhaps ~13.67 cycles/instruction does factor correctly when 510% is considered.

I found a copy of Hummel's book on Amazon and ordered it last night. Looking forward to reading it, thanks.

ghaerr commented 1 month ago

@Mellvik, I've created a precision timer library in https://github.com/ghaerr/elks/pull/1953 based on the discussions here. Let me know what you think of it; it should work pretty well for measuring elapsed time in the kernel, but I haven't tested it on real hardware. If you can find time to test it out, that would be great.

Coming after this will be getting the routines working outside the kernel, e.g. in ktcp. I would like to know how useful this is before going through that effort.

Mellvik commented 1 month ago

Very exciting @ghaerr - thank you. Those are some creative ways to get around the limitations imposed by reality (well, technology). Looking forward to testing it.

-M

Mellvik commented 1 month ago

@ghaerr, not much screen time recently and I admit this has been a very fast one, so I may have misunderstood how to use the precision timer routines. For convenience I'm timing a simple outb loop in init/main.c during boot. The code looks like this:

#ifdef CONFIG_PREC_TIMER
void INITPROC testloop(unsigned timer)
{
        unsigned pticks, x = timer;

        get_time_10ms();
        while (x--) outb(x, 0x80);
        pticks = get_time_10ms();
        printk("hptimer %u: %k\n", timer, pticks);
}       
#endif 

Weirdly, gcc puts the outb instruction outside the loop, at the end of the routine, and jumps to it from the loop and back - but that's a different issue.

I've calculated the expected timing of the loop itself to be ~775ns on the 40MHz 386, which means that running 10,000 loops may exceed the 10ms limit, but I tried anyway and went down from there. Here's what I got; the first number is the loop count.

hptimer 10000: 5723ms
hptimer 3000: 4731ms
hptimer 500: 790234us

I haven't looked at the code so I'm not attempting any analysis at this point. What am I missing?

ghaerr commented 1 month ago

Hello @Mellvik,

Thanks for the testing. And, you will find, it is very easy to get confused with timing and timing results - I sure have been! It looks, though, like your code is correct. I have the following comments to try to understand what's going on, as the output timing does look way off, possibly due to the potential variations described below. Of course, the PIT timer should be operating at the same input frequency on your 40Mhz 386 as an 8088, right?

[EDIT: something definitely strange is going on here, as get_time_10ms should never return a value > 11931, since it just reads two PIT values and computes the difference. The only way it could be greater is if the PIT start countdown value is > 11932, but that should not be the case, as irq_init() initializes the PIT countdown to 11932 for a 10ms interrupt.]

I added your test code, with the printk mod above and ran it on QEMU, and got the following results, different each time because of the timer interrupt delay: (serial console on):

hptimer 10000: 998.058us (1191)
hptimer 3000: 253.076us (302)
hptimer 500: 48.604us (58)

hptimer 10000: 662.858us (791)
hptimer 3000: 195.254us (233)
hptimer 500: 31.844us (38)

hptimer 10000: 1.139ms (1360)
hptimer 3000: 183.522us (219)
hptimer 500: 31.844us (38)

With the above numbers, the parenthesized numbers are the pticks, and multiplying them by 0.838 usecs does yield the displayed time value. I don't know the extent of QEMU's emulation of the 8254 PIT nor its relation to real time, so it might be interesting to see your QEMU results from the above. I'm testing on a 3.1GHz MacBook Pro.

Finally, here is the output from timer_test, which should be exactly the same on yours. It displays various ptick values and the %k result:

1 = .838us
2 = 1.676us
3 = 2.514us
100 = 83.800us
500 = 419.000us
1000 = 838.000us
1192 = 998.896us
1193 = 999.734us
1194 = 1.000ms
10000 = 8.380ms
59659 = 49.994ms
59660 = 49.995ms
59661 = 49.995ms
100000 = 83.800ms
5*59660 = 249.975ms
359953 = 301.640ms
19*59660 = 949.906ms
20*59660 = 999.901ms
21*59660 = 1.049s
60*59660 = 2.999s
84*59660 = 4.199s
600*59660 = 30.06s
3000000 = 2.514s
30000000 = 25.20s
35995300 = 30.23s
36000000 = 30.24s
51000000 = 42.84s
51130563 = 42.94s
51130564 = 0us

Mellvik commented 1 month ago

... the PIT timer should be operating at the same input frequency on your 40Mhz 386 as an 8088, right?

Right, otherwise the ISA bus wouldn't work (and BTW, the clock would be off...)

  • Where exactly is this in init/main.c?

I put it at the end of kernel_init() which is just before the idle task init, so we're fine on that one.

  • We need to ensure that the %k formatting is correctly ported, as its quite tricky internally.

printk has been (re)ported verbatim. Turns out the decimal point was missing from the first runs. That's confusing as hell.

The printk test output matches your list exactly. That's a good sign ...

[EDIT: something definitely strange is going on here, as get_time_10ms should never return a value > 11931, since it just reads two PIT values and computes the difference. The only way it could be greater is if the PIT start countdown value is > 11932, but that should not be the case, as irq_init() initializes the PIT countdown to 11932 for a 10ms interrupt.]

Makes sense - I'm getting 'sane' values back now.

  • Remember that the timed results may be wildly different, as the time tests are done with interrupts on, and a hardware timer interrupt may occur, which will add significant delay.

Actually, the numbers I'm getting have minimal variations - which of course may be because they're still wrong :-)

I added your test code, with the printk mod above and ran it on QEMU, and got the following results, different each time because of the timer interrupt delay: (serial console on):

hptimer 10000: 998.058us (1191)
hptimer 3000: 253.076us (302)
hptimer 500: 48.604us (58)

hptimer 10000: 662.858us (791)
hptimer 3000: 195.254us (233)
hptimer 500: 31.844us (38)

hptimer 10000: 1.139ms (1360)
hptimer 3000: 183.522us (219)
hptimer 500: 31.844us (38)

With the above numbers, the parenthesized numbers are the pticks, and multiplying them by 0.838 usecs does yield the displayed time value. I don't know the extent of QEMU's emulation of the 8254 PIT nor its relation to real time, so it might be interesting to see your QEMU results from the above. I'm testing on a 3.1GHz MacBook Pro.

My M1 MB Pro says:

hptimer 10000: 409.782us (489)
hptimer 3000: 118.158us (141)
hptimer 500: 18.436us (22)

hptimer 10000: 387.156us (462)
hptimer 3000: 113.968us (136)
hptimer 500: 18.436us (22)

... sensible numbers that, given the variance in your case, are surprisingly stable - and indicating the machine is about twice your speed when doing Intel emulation.

The new 'real' numbers:

(40MHZ 386SX)

hptimer 10000: 6.900ms (8234)
hptimer 3000: 5.064ms (6044)
hptimer 500: 848.894us (1013)

hptimer 10000: 6.899ms (8233)
hptimer 3000: 5.064ms (6043)
hptimer 500: 848.056us (1012)

12.5MHz 286:

hptimer 10000: 2.506ms (2991)
hptimer 3000: 9.767ms (11656)
hptimer 500: 1.707ms (2037)

The 286 numbers for 3000 and 500 loops make some sense given the loop in question (the 10000 exceeds 10ms anyway so it's invalid).

In order to get better control over expectations, I added an assembly loop (conveniently lifted from the el3 driver, temporarily 'hidden' at the end of irqtab.S):

test_udelay:

        push    %bp
        mov     %sp,%bp
        mov     4(%bp),%cx      // count
dly_loop:
        outb    %al,$0x80       // dummy write to the POST/diagnostics port
        loop    dly_loop

        pop     %bp
        ret

The loop itself has only two instructions, consuming the following cycle counts on the 386, 8088 and 286 respectively:

out: 11, 12, 3
loop: 13, 17, 10

(Interesting how fast the out instruction is on the 286 - I should look it up to make sure it isn't a typo.) So the loop time is 600ns [386], 6.1us [8088], 1.04us [286], leaving expectations at something like

386: 600ns x 3000 + 10% = 2ms
286: 1us x 3000 + 10% = 3.3ms
88: 6.1us x 3000 + 10% = 20.1ms
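In code form, the same expectation could be sketched like this - a hypothetical helper using the cycle counts above and the CPU clock in kHz to stay in integer math:

/* hypothetical expectation calculator for the test_udelay loop:
 * cycles_per_loop comes from the OUT+LOOP table above, clock_khz is the CPU clock */
unsigned long expected_us(unsigned int cycles_per_loop, unsigned long clock_khz,
                          unsigned int loops)
{
    /* us = cycles * loops / MHz = cycles * loops * 1000 / kHz */
    return (unsigned long)cycles_per_loop * loops * 1000 / clock_khz;
}
/* e.g. expected_us(24, 40000UL, 3000) ~= 1800us on the 40MHz 386, before the ~10% margin */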

What we get is this - first QEMU:

hptimer 10000: 31.006us (37)
hptimer 3000: 4.190us (5)
hptimer 500: 4.190us (5)

[386]

hptimer 10000: 118.996us (142)
hptimer 3000: 121.510us (145)
hptimer 500: 120.672us (144)

[286]

hptimer 10000: 167.600us (200)
hptimer 3000: 168.438us (201)
hptimer 500: 169.276us (202)

Bottom line is that except for QEMU, the numbers don't make much sense. What say you @ghaerr?

ghaerr commented 1 month ago

Bottom line is that except for QEMU, the numbers don't make much sense. What say you

Wow, I say this is very frustrating!

After reading your post and all the data in it, I would say it seems that reading and/or interpreting the 8254 PIT countdown register on real hardware isn't working with my code, and that QEMU possibly doesn't do a good job of emulating the PIT real time.

I checked the 80286 Intel reference manual for OUT and LOOP and it says 3 cycles for OUT and either 8 (jump taken) or 4 for LOOP, so there's some possible discrepancy with the instruction timing estimates (instead of your 10). But these are quite minor considering the radical difference in timings reported by QEMU, which runs faster and slower on different hosts, and your last two real hardware tests of test_udelay, where there's almost no difference between 500, 3000 and 10000 loops in terms of PIT timer countdowns returned (!!!).

So let's take the latter - the 286 and 386 results running test_udelay: all ~200 for the 286 and ~144 for the 386. It would seem to me that this is saying that reading the PIT timer somehow returns a constant difference between two readings - very strange. Except that when running your original testloop beforehand, the results do vary. Extremely strange that QEMU doesn't vary between 500 and 3000 using test_udelay. I must admit I can't figure that out.

Can you post your main.c, printk.c and prectimer.c files used for the test_udelay? I would like to see if I'm missing something as I can't see the actual code you're executing.

Another thought is that the timer loop may somehow be overflowing; perhaps change to using a different instruction (like XCHG or XOR instead of OUT) to keep the system bus out of the equation. Another thought would be to just display the PIT count value in an infinite loop, to check that it's decrementing. It should be decrementing every 0.838usecs on all systems using a real 8254:

    outb(0, TIMER_CMDS_PORT);       /* latch timer value */
    lo = inb(TIMER_DATA_PORT);
    hi = inb(TIMER_DATA_PORT) << 8;
    count = lo | hi;

I checked the ptick <-> us/ms conversion and that's accurate. So we're getting a bad ptick difference between readings of the PIT countdown register, if that's even working on real hardware. Another thought would be to write the above PIT read code in ASM, save it, run a known loop and read it again to see what the different values are. The PIT timer is pretty basic - you send a zero to the commands port to "latch" the current value and then the next two reads from the data port return the low and high bytes of the countdown. When the count transitions from 1 to 0, an IRQ 0 is fired and 11932 is reloaded into the countdown register by the hardware.
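As a rough sketch of that idea (in C rather than ASM, and only valid for spans shorter than one 10ms reload), the two readings and the wrap-corrected difference could look something like this - TIMER_CMDS_PORT, TIMER_DATA_PORT, outb/inb and printk are assumed from the kernel context:

/* sketch only: read the countdown before and after a known busy loop and
 * print the raw values plus the wrap-corrected ptick difference */
void pit_diff_test(unsigned int loops)
{
    unsigned int before, after;
    int diff;

    outb(0, TIMER_CMDS_PORT);               /* latch current countdown */
    before = inb(TIMER_DATA_PORT);
    before |= inb(TIMER_DATA_PORT) << 8;

    while (loops--)                         /* known busy work */
        asm("nop");

    outb(0, TIMER_CMDS_PORT);               /* latch again */
    after = inb(TIMER_DATA_PORT);
    after |= inb(TIMER_DATA_PORT) << 8;

    diff = (int)(before - after);           /* counter counts down */
    if (diff < 0)
        diff += 11932;                      /* one reload (10ms tick) occurred */
    printk("before %u after %u pticks %d\n", before, after, diff);
}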

Do both your 286 and 386 systems use a real 8254 PIT? I'm kind of grasping at straws here trying to come up with something that may be amiss...

Thanks for your testing. I'm out of town or I'd test this on my real hardware, which I can do next week.

ghaerr commented 1 month ago

@Mellvik: another thought is that perhaps my PIT counter-reading code is in fact only getting the high byte on real hardware, which could possibly account for the values changing quite slowly. When the PIT is programmed for 16-bit mode, it's supposed to return lo then hi on successive INBs, but in other modes it only returns a single value. It might be interesting to try to read the PIT counter value continuously in the kernel, and display it in hex rather than decimal. I'll re-read the 8254 data sheet and see if there's something I missed. It's possible that QEMU incorrectly emulates the real hardware in this case since most OSes just program for the clock interrupt rather than reading the countdown directly.

Mellvik commented 1 month ago

I checked the 80286 Intel reference manual for OUT and LOOP and it says 3 cycles for OUT and either 8 (jump taken) or 4 for LOOP, so there's some possible discrepancy with the instruction timing estimates (instead of your 10).

It seems they actually match: the LOOP timing is 8 cycles plus 1 per byte of the instruction, which is 2 bytes - so 10 in total.

But these are quite minor considering the radical difference in timings reported by QEMU, which runs faster and slower on different hosts, and your last two real hardware tests of test_udelay, where there's almost no difference between 500, 3000 and 10000 loops in terms of PIT timer countdowns returned (!!!).

It seems to me the difference between 3000 and 10000 on QEMU is reasonable, but the actual numbers aren't. Also, with 500 and 3000 being equal, is it possible that the loop itself is so fast on QEMU that we're measuring only the 'other' stuff?

Another thought is that the timer loop may somehow be overflowing; perhaps change to using a different instruction (like XCHG or XOR instead of OUT) to keep the system bus out of the equation.

Yes, that is a good idea! Actually, I should add more (longer running) instructions to the loop in order to make it more 'measurable'.

Another thought would be to just display the PIT count value in an infinite loop, to check that it's decrementing. It should be decrementing every 0.838usecs on all systems using a real 8254:

    outb(0, TIMER_CMDS_PORT);       /* latch timer value */
    lo = inb(TIMER_DATA_PORT);
    hi = inb(TIMER_DATA_PORT) << 8;
    count = lo | hi;

I plan to take a stab at this tomorrow.

Do both your 286 and 386 systems use a real 8254 PIT? I'm kind of grasping at straws here trying to come up with something that may be amiss...

The 386 is using an AMI south bridge, but that really shouldn't matter. They have to be compatible...

another thought is that perhaps my PIT counter-reading code is in fact only getting the high byte on real hardware, which could possibly account for the values changing quite slowly. When the PIT is programmed for 16-bit mode, it's supposed to return lo then hi on successive INBs, but in other modes it only returns a single value. It might be interesting to try to read the PIT counter value continuously in the kernel, and display it in hex rather than decimal. I'll re-read the 8254 data sheet and see if there's something I missed. It's possible that QEMU incorrectly emulates the real hardware in this case since most OSes just program for the clock interrupt rather than reading the countdown directly.

This reasoning makes a lot of sense given the symptoms. I'll hold off further experimentation until you have a verdict.

Thank you.

ghaerr commented 1 month ago

My third thought about this is that the timings are correct. That is, the non-test_udelay timings on your M1 as well as real hardware. The real hardware values all show a moderate increase with increased cycle counts, with the pointed-out exception of the 10,000-cycle 286 run, which likely overflowed. Taking into account possible cached execution of instructions, it is in fact possible that the timings are correctly measured.

As for the unexplainable test_udelay timings that seem to make no sense whatsoever, I have the following thought: the reason they're not working, that is, showing almost the same result, could be that we're running on an unfinished kernel initialization, specifically the idle process stack. Long story short, there is a possibility that the idle stack is getting trashed (via a timer interrupt occurring during kernel init) and the variables in testloop getting trashed with it. I'm looking at irqit to see whether the gint_count variable could be responsible for resetting the idle stack to the wrong location on an interrupt at this time. This could explain why pticks is always the same value, but it is strange, and I'm looking further into it. I would suggest moving the testloop calls to just in front of the idle loop as discussed, and/or using only global variables, not auto variables, in testloop, to see if that makes any difference.

The funny thing is that on my macOS laptop, testing with testloop moved before and after kernel init as above makes no difference, and the timings look plausible:

// end of kernel init
hptimer 10000: 1.933ms (2307)
hptimer 3000: 508.666us (607)
hptimer 500: 82.124us (98)
// before idle loop
hptimer 10000: 1.739ms (2076)
hptimer 3000: 536.320us (640)
hptimer 500: 84.638us (101)

Given the timings you've chosen, it's very hard to visually understand the differences without breaking out a calculator, but it seems these numbers are plausible. Using 100, 1000, and 10000 would make it easier to make some sense of it all.

So - it seems that there may be an issue with QEMU's accurate emulation of the 8254, compounded by the fact that your M1 is quite a bit faster than mine, which could affect the likelihood of the stack wipeout described above. It appears we need to add CLI/STI around the entire test routine(s) to eliminate variables at this time. I remain perplexed about the real hardware test_udelay results, unless that also has something to do with running all this in the idle task. I'll play around more with a better place to test, but with interrupts off for the entire set of function calls, perhaps this will cease to matter.

ghaerr commented 1 month ago

It seems to me the difference between 3000 and 10000 on QEMU is reasonable, but the actual numbers aren't.

Not sure, but sounds complicated when cache memory and instructions etc are taken into account.

Also, with 500 and 3000 being equal, is it possible that the loop itself is so fast on QEMU that we're measuring only the 'other' stuff?

Good thought! Who knows what QEMU might be doing; it's possible it could optimize out the entire loop, especially for OUT $0x80?

I plan to take a stab at this tomorrow.

That would be good to verify whether the AMI bridge and/or 8254 are outputting a decrementing count. I would also suggest adding clr_irq/set_irq around the entire testloop to eliminate possible stack trashes and other timing variables.

I'll hold off further experimentation until you have a verdict.

I've read the entire 8254 DS and can't find anything we're not doing correctly, other than what has just been mentioned.

Finally: the numbers seemed to make some sense prior to introducing another function call to test_udelay in testloop. This would have altered the stack usage, opened up stack wipe possibility, etc. Perhaps get rid of test_udelay and use a long counter to decrement directly in testloop, at much larger values for the speedy systems: 1000, 10000, 100000L and see if that starts making sense? If we can get back to numbers that at least increase with the counter value, we're getting somewhere...

ghaerr commented 1 month ago

More information on kernel initialization issues potentially interfering with the precision timing tests. In init/main.c::start_kernel():

    setsp(&task->t_regs.ax);    /* change to idle task stack */
// at this point SP is set to the idle task's stack as though it just made a system call
// (but doesn't save current SP location within start_kernel())
// the irqtab.S _gint_count global is initialized to 1, as though an interrupt occurred from user mode
// ret_from_syscall has never been executed which will decrement _gint_count later
// run queue has only single process in it - the idle task

// kernel continues with initializations, any interrupts will occur on kernel interrupt stack 'istack'
//  because _gint_count == 1

    kernel_init();              /* continue init running on idle task stack */

    /* fork and run procedure init_task() as task #1*/
    kfork_proc(init_task);

// kfork_proc sets ret_from_syscall as the user mode "entry point", simulating a system call exit,
//   then when scheduled actually starts running kernel mode and calls mount_root() which opens block DF driver
//   eventually exec'ing /bin/init into task slot then returns to user mode /bin/init entry point via ret_from_syscall

    wake_up_process(&task[1]);

// init task (PID 1) now added to run-queue, but hasn't started
// it will be started by first schedule() below in idle task's loop
// any interrupts run now on kernel stack and unclear what might happen if /bin/init
//  executes a system call that requires rescheduling, i.e. call to schedule()
//  this is because the idle task's stack has never been properly saved by calling schedule() itself yet
// that is how I'm guessing the idle task stack trashing might happen
    /*
     * We are now the idle task. We won't run unless no other process can run.
     */
#if TIMER_TEST
    schedule();  // I'm thinking we need to add this call here in order to save idle task SP
                 // so that kernel finishes initialization by running /bin/init and global irqit _gint_count is decremented first time
    timer_test();
    testloop(10000);
    testloop(3000);
    testloop(500);
#endif
    while (1) {
        schedule();  // the first time this is called in the idle loop, the idle task's SP is properly saved 
                     // and /bin/init finally runs
        idle_halt();    /* halt until interrupt to save power */
    }

We will have to wait until kernel init is completely finished and schedule() is called once before doing anything in the idle task that uses stack variables.

[EDIT: the above over-analysis fails to realize that if the DF driver is opened (instead of BIOS driver), there will be task switches back to the idle task as async I/O waits for completion. Thus, using schedule() won't solve the problem.]

ghaerr commented 1 month ago

@Mellvik: Sorry to bombard you with more stuff, but in thinking even more about this, I realized that there's a major difference between our platforms: you're very likely running the direct floppy driver, while on mine it's optional, and off.

The problem with interrupt latency is greatly complicated when using the DF driver, as on driver open, reset_floppy is called, which generates a sequence of work and interrupts. The reset_floppy routine has a delay(1000) loop in it and is also called on the motor-off interrupt (!) - all very problematic when trying to cleanly determine when it's safe to use precision timing without interrupts disabled, and very likely to cause unchecked timer overflows. Thus, my recommendation above - calling schedule() once before testing - likely won't work, as the kfork_proc startup of /bin/init immediately calls mount_root(), which calls the DF driver open, which starts probing I/O, etc. Also, running the timer tests after kfork_proc makes things worse for the DF driver if interrupts are disabled, so I'll change that in my source.

In order to visualize what DF driver startup looks like I added an 'X' kputchar just before the idle task's main loop, and then a '@' as the first thing in the loop:

     kputchar('X');
     while (1) {
         kputchar('@');

This was the result:

[screenshot: Screen Shot 2024-08-08 at 4 26 55 PM]

In this case, I have the testloop code running prior to the exec of /bin/init, but notice how many times the idle task is scheduled during mount_root and the subsequent probing, then what's probably a superblock read with one more idle wait.

Because of all this complexity, perhaps screenshot the exact startup sequence if timings seem off. Ultimately, I think running the timing routines with interrupts disabled is the only way to go while debugging this. After we've got this debugged, it should work OK to measure elapsed time in interrupt service routines without interrupts disabled, which should then accurately measure time elapsed including IRQ 0 interrupt handling and/or interrupts with higher priority than the one being serviced. After all that gets checked out, I was going to go ahead and build the user-mode precision timing library. Whew!

ghaerr commented 1 month ago

Last update: it seems that QEMU's emulation of the 8254 countdown readback is unreliable, and can't be used to get any meaningful measurements. For reasons I don't understand, successive calls to get_time_10ms() initially return values quite a bit higher than later ones, but only within a 0-9 for() loop. For example, I changed testloop to the following:

unsigned int a[10];
void testloop(unsigned timer)
{
        unsigned pticks;
        unsigned x = timer;
        int i;

        clr_irq();
        get_time_10ms(); // there can be 1, 4 or 10 of these, makes no difference to a[0] and a[1]
        get_time_10ms(); // below being very high, but only on the first time calling testloop!
        get_time_10ms();
        get_time_10ms();
        for (i=0; i<10; i++)
            a[i] = get_time_10ms();
        for (i=0; i<10; i++)
            printk("%d,", a[i]);
        printk("\n");
        get_time_10ms();
        while (x--) asm("nop"); 
        pticks = get_time_10ms();
        printk("hptimer %u: %k (%u)\n", timer, pticks, pticks);
        set_irq();
}

Here's the output:

8,22,1,0,1,0,1,0,2,0,
hptimer 10000: 46.090us (55)
0,1,0,1,0,1,0,1,0,2,
hptimer 1000: 3.352us (4)
1,0,1,0,0,2,0,1,0,1,
hptimer 100: .838us (1)

Note the first two numbers. It doesn't matter how many times you add get_time_10ms (to "prime the pump" so to speak, but actually to set the lastcount variable for computing the count difference), one time or ten times - the first two numbers in the for() loop, a[0] and a[1], are always large, in the first run of testloop only. Very strange. Subsequent runs of QEMU are very similar. It seems that QEMU updates the 8254 countdown mechanism asynchronously and/or depending on how often it's called. I have come to the conclusion that, unfortunately, QEMU's emulation of the PIT cannot be trusted for reliable timing, at least at the 1 to ~25 ptick level. This same PIT timing strangeness may manifest itself in other ways when using QEMU; it's very hard to tell.

I've run out of time for the day, I'm going to try all this on another emulator tomorrow and report back.

Mellvik commented 1 month ago

@ghaerr - this is turning into a real research project, isn't it?

Before heading into your discoveries from last night, here's a possibly interesting observation - for the record. I fired up the XT, made a test-kernel for it, changed the test loop to 50ms instead of 10ms since this machine has a (net) loop time of 31 cycles @ 210ns = 6.5us (4.77MHz).

Here's what I got, the expected (calculated) time to the right:

hptimer 10000: 0us (0)          [65ms, outside the range so 0 is correct]
hptimer 3000: 13.764ms (16425)  [19.5ms]
hptimer 500: 4.437ms (5295)     [3.25ms]

The numbers still don't make a lot of sense, but the 500 number is at least reasonably close and the 0 when outside the range is good. I suppose the ptick values don't raise any flags? [EDIT] I guess the ptick value 16425 may be suspicious, or was the 11931 limit for the 10ms function only?

Mellvik commented 1 month ago

The direct PIT access test confirms that the PIT is counting down as it should. Test object is the XT @ 4.77MHz. The code looks like this:

#ifdef TIMER_TEST
void testloop(unsigned timer)
{
        unsigned pticks, x = timer;

#if 0
        get_time_50ms();
        //while (x--) outb(x, 0x80); 
        test_udelay(timer);
        pticks = get_time_50ms();
        printk("hptimer %u: %k (%u)\n", timer, pticks, pticks);
#endif
        while (x--) {
                outb(0, TIMER_CMDS_PORT);       /* latch timer value */
                printk("%u; ", inb(TIMER_DATA_PORT) + (inb(TIMER_DATA_PORT) << 8));
        }
}
#endif

and the output (part of it):

Kernel text 2d0:0, ftext 1172:0, init 1329:0, data 1520:0, top a000:0, 499K free
7306; 6016; 4772; 3528; 2284; 1040; 11224; 9880; 9248; 8626; 8012; 7254; 6010; 4784; 3540; 
2296; 1052; 11213; 9866; 9252; 8620; 8006; 7266; 6022; 4778; 3534; 2290; 1046; 11209; 
9868; 9254; 8622; 8000; 7260; 6016; 4772; 3528; 2284; 1040; 11224; 9862; 9248; 8626; 8012;
 7254; 6010; 4766; 3522; 2296; 1052; 11213; 9866; 9252; 8628; 8006; 7266; 6022; 4778; 
3534; 2290; 1046; 11209; 9868; 9254; 8622; 8000; 7260; 6016; 4772; 3528; 2284; 1040; 11224; 
9862; 9248; 8626; 7994; 7254; 6010; 4766; 3522; 2278; 1034; 11213; 9866; 9252; 8620; 8006; 
7248; 6022; 4778; 3534; 2290; 1046; 11209; 9868; 9254; 8622; 8000; 7260; 6016; 4772; 
3528; 2284; 1040; 11224; 9862; 9248; 8626; ....

The decrement per loop is reasonably consistent, between 1244 and 1290 pticks, approx 1.05ms - which, given the printk overhead, makes sense.

IOW - this looks good.

Mellvik commented 1 month ago

I'm about to pop a cork, @ghaerr - I think we nailed it. Look at this one:

Kernel text 2d0:0, ftext 12a0:0, init 1457:0, data 1698:0, top 9fc0:0, 493K free
hptimer 5000: 8.467ms (10104)
hptimer 500: 848.894us (1013)
hptimer 50: 89.666us (107)
MINIX: inodes 21845 imap 3 zmap 8, first data 696

This is using the original test_loop which has a net cycle count of 31 on the 40MHz 386. We 'expect' 75us++ for the 100-loop and get 90us, which is great. Similar for the other loop-counts. So the numbers look not only reasonable, they look great!

The problem was a missing byte-mask in prectimer.c: lo = inb(TIMER_DATA_PORT) & 0xff;

Which means the difference between QEMU and real HW is that on HW, inb doesn't touch the upper 8 bits. Wow.

More testing under way, but this looks very promising.

The test_loop now looks like this:

#ifdef TIMER_TEST
void testloop(unsigned timer)
{
        unsigned pticks, x = timer;

#if 1
        get_time_10ms();
        while (x--) outb(x, 0x80);
        //test_udelay(timer);
        pticks = get_time_10ms();
        printk("hptimer %u: %k (%u)\n", timer, pticks, pticks);
#else
        while (x--) {   
                outb(0, TIMER_CMDS_PORT);       /* latch timer value */
                printk("%u; ", inb(TIMER_DATA_PORT) + (inb(TIMER_DATA_PORT) << 8));
        }
#endif
}
#endif

Mellvik commented 1 month ago

More test output, this is from the XT, using the 50ms timer routine:

hptimer 3600: 49.044ms (58526)
hptimer 360: 5.294ms (6318)
hptimer 36: 634.366us (757)

From the 286, which required drastic reduction in the loop count in order to fit in 10ms:

hptimer 2800: 9.119ms (10883)
hptimer 280: 910.906us (1087)
hptimer 28: 103.074us (123)

Congratulations, @ghaerr - we now have precision timer tools for TLVC and ELKS. A very welcome addition. Well done. Thank you.

ghaerr commented 1 month ago

The problem was a missing byte-mask in prectimer.c: lo = inb(TIMER_DATA_PORT) & 0xff;

I just tested this and the code generated is the same with and without the fix!!!

Which means the difference between QEMU and real HW is that on HW, inb doesn't touch the upper 8 bits. Wow.

Are you saying that everything is now working on real hardware, but not on QEMU? Or does it now work on both?

-outb(0, TIMER_CMDS_PORT);

I notice in your commit/change for get_time_50ms that this is removed from within the disabled-interrupts critical section. It needs to stay in so that the 8254 countdown is latched at the same time that the jiffies snapshot is taken. Is this what was required to make things work (on real hardware and/or QEMU or both)? The 8254 data sheet says you can execute an extra latch command (which would be the one seen in get_time_10ms) and the second latch is ignored. But this change has nothing to do with getting get_time_10ms to work above.

we now have precision timer tools for TLVC and ELKS

Great!!! But I am still a bit confused as to the real fix necessary - as the lo = inb(TIMER_DATA_PORT) & 0xff; generates the same code:

00000000 <get_time_10ms>:
   0:   ba 43 00                mov    $0x43,%dx
   3:   30 c0                   xor    %al,%al
   5:   ee                      out    %al,(%dx)
   6:   ba 40 00                mov    $0x40,%dx
   9:   ec                      in     (%dx),%al
   a:   88 c5                   mov    %al,%ch
   c:   ec                      in     (%dx),%al
   d:   b1 08                   mov    $0x8,%cl
   f:   d3 e0                   shl    %cl,%ax
  11:   08 e8                   or     %ch,%al
  13:   92                      xchg   %ax,%dx
  14:   a1 06 00                mov    0x6,%ax
            15: R_386_16    .bss
            15: R_386_SUB16 .bss!
  17:   29 d0                   sub    %dx,%ax
  19:   79 03                   jns    1e <get_time_10ms+0x1e>
  1b:   05 9c 2e                add    $0x2e9c,%ax
  1e:   89 16 06 00             mov    %dx,0x6
            20: R_386_16    .bss
            20: R_386_SUB16 .bss!
  22:   c3                      ret

Thank you for your debugging!!!

ghaerr commented 1 month ago

I fired up the XT, made a test-kernel for it, changed the test loop to 50ms

The numbers still don't make a lot of sense, but the 500 number is at least reasonably close and the 0 when outside the range is good. I suppose the ptick values don't raise any flags? [EDIT] I guess the ptick value 16425 may be suspicious, or was the 11931 limit for the 10ms function only?

For the get_time_50ms routine, the return value is between 1 and 5 * 11932 - 1 = 59659, so the 16425 is proper.

I'm thinking that the get_time_50ms routine should be rewritten to include the get_time_10ms code in it directly, rather than calling the function. This will also avoid the problem of having to execute a second outb(0, TIMER_CMDS_PORT); in the routine.

I'll wait to do that until I understand what you've done to get everything working - or whether, for various reasons, the routines were all working originally on real hardware and we didn't realize it, with QEMU remaining hopelessly inaccurate as shown in my testing last night?

[EDIT: I also notice that your main loop test routine is always using get_time_50ms - is that required in order for the results to show properly, that is, was get_time_10ms too short a window to provide an accurate result? The get_time_50ms routine takes into account timer interrupts and deals with the jiffies change, something that get_time_10ms can't do because of unknown overflow. Perhaps that's the real issue here.]

Mellvik commented 1 month ago

@ghaerr - wow, now I'm really confused, and not by the cork-popping...

You're right, the mask was not the fix and doesn't even show in the code. So we're in the interesting position of hunting not for the bug but for the fix. That's unusual!

You're also right about the extra call to latch the timer. Observing that it was done twice, I commented it out, found that it made no difference (without looking too closely at the exact numbers), then deleted it. I now understand the motivation for its presence. I was too fast on that one. That said, it may not make much difference anyway.

Still, reversing both changes - which were the last changes I made before it started working - and switching between 10ms and 50ms in the test_loop, I always get good results. So the bug is gone but the hunt goes on.

The other thing I did, of course, was to yank out the assembler delay loop, but the rabbit is still at large.

ghaerr commented 1 month ago

I was waiting with bated breath for your response... thanks.

I am (somewhat) glad to hear that the INB & 0xFF fix didn't change anything on your side either. I take it to mean that the get_time_50ms routine is acting properly in your real hardware tests, which is the important point!!!

I now understand the motivation for its presence. I was too fast on that one. That said, it may not make much difference anyway.

That "second" latch has to stay in the way the code is currently written, as otherwise it is possible that jiffies could increment and the timing would be wildly off and we'd not know it.

I am coming to the conclusion that, for whatever reason (certainly in part due to QEMU not properly emulating an 8254), our initial tests produced incorrect or non-understandable results. But now you're saying that you have verified that, on real hardware, we have numbers mathematically close to the expected timing results, tested on 8086, 286 and 386. That's really something! :)

As a result, I am thinking that we shouldn't use the get_time_10ms routine, as it can't detect overflow; I only originally wrote it for very precise timing possibly down to the single-digit-microsecond level. Without overflow detection, it isn't safe. Instead, I think we should only use get_time_50ms (instead of get_time_10ms), since it always detects overflow. It takes slightly longer but seems to always be correct.

I will rewrite get_time_50ms so that the actual PIT countdown reading code is within get_time_50ms, making it faster without an extra function call, and the second latch won't be necessary. I'll post this today and you can have a look at it.

The get_time_50ms routine essentially breaks down a 5-jiffie span into 59660 subincrements of 0.838us, which should work well enough for our planned measurements. The unsigned long version, get_time(), can be used for longer increments, with the resolution decreasing to 10ms accuracy when the return value is >= 59660UL.
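As a usage sketch (not from the actual source), a measurement with get_time_50ms would then look something like this, with do_something_to_measure() as a hypothetical stand-in and 0 treated as the overflow indication seen in the tests above:

/* sketch: time a section of kernel code with get_time_50ms() and reject
 * samples that fell outside the ~50ms (5-jiffie) window */
void measure_section(void)
{
    unsigned int pticks;

    get_time_50ms();                        /* arm: snapshot countdown and jiffies */
    do_something_to_measure();              /* hypothetical code under test */
    pticks = get_time_50ms();               /* elapsed 0.838us pticks, 1..59659 */

    if (pticks == 0)                        /* 0 means the window overflowed */
        printk("measurement overflowed the 50ms window\n");
    else
        printk("elapsed %k (%u pticks)\n", pticks, pticks);
}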

The other thing I did, of course, was to yank out the assembler delay loop, but the rabbit is still at large. So the bug is gone but the hunt goes on.

I still have some concerns about the kernel call stack getting trashed, which could possibly have something to do with it, but I'm willing to go with your real hardware test results for now and say that there isn't a bug in the get_time_50ms routine, and that using get_time_10ms is unsafe. I plan on testing on some other emulators to see whether I can find a more accurate 8254 emulation, which should allow us to stop chasing down a possible non-bug in the precision timer routines themselves, and start using the routines for real timing!

Thank you!

Mellvik commented 1 month ago

@ghaerr, I hadn't tested the latest version on QEMU before, but now even that looks good and stable (predictable):

Kernel text 2d0:0, ftext 127a:0, init 1431:0, data 1672:0, top 9fc0:0, 494K free
hptimer 10000: 403.916us (482)
hptimer 1000: 41.900us (50)
hptimer 100: 5.028us (6)
df0: Auto-detected floppy type 1.44M

This is using the 50ms call. And the numbers stay close to the same when rebooting a number of times. So it may seem QEMU is not to blame here either.

If it weren't for the fact that you have been getting similarly odd results in QEMU to what I got on HW, I'd blame lingering object files surviving make kclean - but since going back and forth to/from the XT means complete recompiles, that cannot be it either.

What are you getting from QEMU now?

ghaerr commented 1 month ago

I'm running QEMU v5.2. Not sure what vintage that is.

With the following testloop:

unsigned int a[10];
void testloop(unsigned timer)
{
        unsigned pticks;
        unsigned x = timer;
        int i;

        get_time_50ms();
        for (i=0; i<10; i++)
            a[i] = get_time_50ms();
        for (i=0; i<10; i++)
            printk("%d,", a[i]);
        printk("\n");
        get_time_50ms();
        while (x--) asm("nop");
        pticks = get_time_50ms();
        printk("hptimer %u: %k (%u)\n", timer, pticks, pticks);
}

I'm getting the following, which doesn't really look right (too low??). Also notice the first two calls in the array. (I planned on using an array to average the times for get_time_50ms but it skewed the results and the average was way high, thus totally unusable). BTW, on my system if you reverse the tests and run the hptimer 100 fast test first, you still get the 84, 83 times back which completely trashes that test. Perhaps I need to upgrade QEMU?

84,33,1,0,2,1,0,1,1,0,
hptimer 10000: 43.576us (52)
1,1,1,0,1,2,0,1,1,0,
hptimer 1000: 3.352us (4)
1,0,1,1,0,1,1,0,2,1,
hptimer 100: .838us (1)

I'd blame lingering object files

I am now suspecting that between our testing and understanding of what we're doing, the initial real hardware tests might have been improper, and/or get_time_10ms overflowed and we didn't know it. But QEMU doesn't seem to be working on my system, and I am willing to guess that's our only bug at this point.

Mellvik commented 1 month ago

@ghaerr, I'm off screen for the day, but could you run this testloop (the one I ran on the XT earlier today per your suggestion) on QEMU?


        while (x--) {
                outb(0, TIMER_CMDS_PORT);       /* latch timer value */
                printk("%u; ", inb(TIMER_DATA_PORT) + (inb(TIMER_DATA_PORT) << 8));
        }

Mellvik commented 1 month ago

I'm running QEMU v5.2. Not sure what vintage that is.

My M1 says

➜ ~ qemu-system-i386 -version
QEMU emulator version 7.2.0
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers

I'm getting the following, which doesn't really look right (too low??). Also notice the first two calls in the array. (I planned on using an array to average the times for get_time_50ms but it skewed the results and the average was way high, thus totally unusable). BTW, on my system if you reverse the tests and run the hptimer 100 fast test first, you still get the 84, 83 times back which completely trashes that test. Perhaps I need to upgrade QEMU?

84,33,1,0,2,1,0,1,1,0,
hptimer 10000: 43.576us (52)
1,1,1,0,1,2,0,1,1,0,
hptimer 1000: 3.352us (4)
1,0,1,1,0,1,1,0,2,1,
hptimer 100: .838us (1)

I agree that this doesn't make sense. Then again, it looks like what I was getting before things started to work. We're still in the dark as to what made everything - including QEMU - work on my end.

I'd blame lingering object files

I am now suspecting that between our testing and understanding of what we're doing, the initial real hardware tests might have been improper, and/or get_time_10ms overflowed and we didn't know it.

Agreed, we've come a long way really.

QEMU doesn't seem to be working on my system, and I am willing to guess that's our only bug at this point.

May well be the case! It will be really interesting to compare these last numbers.

Mellvik commented 1 month ago

As a result, I am thinking that we shouldn't use the get_time_10ms routine, as it can't detect overflow; I only originally wrote it for very precise timing possibly down to the single-digit-microsecond level. Without overflow detection, it isn't safe. Instead, I think we should only use get_time_50ms (instead of get_time_10ms), since it always detects overflow. It takes slightly longer but seems to always be correct.

I will rewrite get_time_50ms so that the actual PIT countdown reading code is within get_time_50ms, making it faster without an extra function call, and the second latch won't be necessary. I'll post this today and you can have a look at it.

I support this idea. The resolution is good enough and one version makes life simpler.

ghaerr commented 1 month ago

could you run this testloop

With this added to the top of testloop:

        unsigned int y = 1300;
        unsigned int lastt = 0;
        while (y--) {
                outb(0, TIMER_CMDS_PORT);       /* latch timer value */
                unsigned int t = inb(TIMER_DATA_PORT) + (inb(TIMER_DATA_PORT) << 8);
                printk("%u,%u; ", t, lastt-t);
                lastt = t;
        }

(this displays both the countdown and the difference between countdowns, uncorrected for the reload to 11932, which the real precision timer routine handles), we get:

[screenshot 1]

Notice the highlighted portion (and ignore the first difference): the difference starts as very small (30-40), then suddenly jumps to ten times that (400-1200), and stays there.

Finally, the hptimer results display (with the first two values large again!), with the difference values staying high as before:

[screenshot 2]

I'm going to display the values in hex to see if I notice anything. More coming.

ghaerr commented 1 month ago

I looked at the hex output and things look OK. I suppose the countdown could be correct, but it does seem to vary wildly, going from quite small to large.

I added yet another delay to the top of testloop:

        x = timer;
        while (x--) asm("nop");
        x = timer; // forgot to add this for later loop

and this produced the following, seemingly inaccurate result, where the 100 timer takes longer than 10000:

[EDIT: I found an error where the x loop counter was not reinitialized. Now QEMU seems to give better results, but it's still unexplained why the first two results in the array are so high:

82,22,8,1,1,0,2,0,1,1,
hptimer 10000: 46.928us (56)
1,1,1,1,0,2,1,0,1,0,
hptimer 1000: 3.352us (4)
1,1,0,1,2,0,1,0,1,1,
hptimer 100: .838us (1)

I can't make sense of it. Still thinking. I suppose if QEMU is emulating the NOP loop super fast, these 100-10000 tests could all result in a similar elapsed time (182-333)?

Another theory is that QEMU takes more or less time to update the PIT counter when it is doing I/O, or when it is emulating the serial I/O, because it appears the results could be related to QEMU output to the serial (console) device.

I put the additional wait loop in to try to "drain down" the QEMU output and am now seeing these even stranger results. Adding more waits actually increases the first two numbers shown. All very strange! I will look at upgrading QEMU.

ghaerr commented 1 month ago

I'm about to pop a cork, @ghaerr - I think we nailed it.

I think we did, finally! :) At the last moment I happened across a site saying that on macOS, one can run at near full speed if the host and emulated architectures match by adding -accel hvf to the qemu command line. I tried it, and amazingly enough, qemu is now running almost twice as fast and all the precision timer problems seem to have gone away. I had previously run the precision timing on DosBoxX, which runs at near real hardware speed (slow), and all numbers there looked good. Here's the story in screenshots:

Old QEMU v5.2, initial timing very high, then very low elapsed times (seemingly broken):

[screenshot: qemu old]

DosBoxX, working well:

[screenshot: dosboxx]

QEMU v5.2 running -accel hvf option:

[screenshot: qemu new]

The raw get_time_50ms numbers are almost the same as DosBoxX, but the large loop timings are very much faster.

Having seen all this, I think I now understand what is likely happening with QEMU when not running in hardware-accelerated mode: without native execution of instructions (e.g. macOS HVF or Linux KVM mode), qemu actually runs code using a built-in JIT translator from target to host. What is likely happening is that QEMU, being the beast that it is, keeps a cache of the instruction stream being emulated. When that stream ends up rerunning a large number of times, it qualifies for JIT execution and is replaced by a JIT stream which is much faster. That's the reason why the first few numbers are very large - the QEMU instruction cache finally triggers a JIT conversion and then everything runs fast. This is also why QEMU running JIT is unreliable as heck for precision timing. Whatcha think of that theory, huh!?!

I would say pop the cork - we have reliable precision timing on QEMU, DosBoxX and 8086, 286 and 386 real hardware. Thank you!!

I've already rewritten get_time_50ms to include the get_time_10ms code and commented out get_time_10ms as discussed. I'll push that so you can look it over and test it if you like. I'm running that code in the screenshots above so we should be good to go!

Mellvik commented 1 month ago

@ghaerr, what can I say? The world is indeed full of surprises - there is a lot to learn around every corner.

I think your analysis with respect to QEMU makes a lot of sense and is well supported by the 'evidence'. I do, however, suggest you upgrade to a more recent version - which quite possibly would have delivered correct results even without the new options. The reason I'm saying this is the following:

Your exercise triggered some serious curiosity on my part, enough to copy a floppy image of TLVC to my old Intel-based MacMini server from 2014 running QEMU 7.0. Booting the image using the 'standard' ELKS qemu.sh delivered perfect results:

Kernel text 2d0:0, ftext 127a:0, init 1431:0, data 1672:0, top 9fc0:0, 494K free
hptimer 10000: 841.352us (1004)
hptimer 1000: 92.180us (110)
hptimer 100: 10.056us (12)
df0: Auto-detected floppy type 1.44M
[screenshot: 2024-08-10 at 11:14:58]

In order to ensure we've covered all bases, I'm suggesting you boot the enclosed TLVC image in your QEMU version without the new options. It should deliver erratic results. If not, there is still a rabbit at large.

fd1440.img.zip

Thanks for the updated code, I'll adapt it right away - and test. As you've seen, I've added an hptimer setting to /bootopts as a fast (no recompile) option when timing various kernel elements. Obviously, I got tired of recompiles.

Different, but related: how do you suggest we add a microsecond-exact delay loop using this code?

Finally - indeed, the cork hasn't only been popped, the contents have already been consumed! Again - congrats and thanks for an interesting journey.

ghaerr commented 1 month ago

I'm suggesting you boot the enclosed TLVC image in your QEMU version without the new options. It should deliver erratic results. If not, there is still a rabbit at large.

I'm not quite sure what this is testing, but here are the results without the HVF option:

hptimer 10000: 959.510us (1145)
hptimer 1000: 185.198us (221)
hptimer 100: 11.732us (14)

hptimer 10000: 722.356us (862)
hptimer 1000: 138.270us (165)
hptimer 100: 15.922us (19)

hptimer 10000: 1.935ms (2310)
hptimer 1000: 147.488us (176)
hptimer 100: 20.112us (24)

The 10000 timer seems erratic, the others not.

Running with the HVF option:

hptimer 10000: 42.096ms (50235)
hptimer 1000: 5.674ms (6771)
hptimer 100: 508.666us (607)

hptimer 10000: 0us (0)
hptimer 1000: 5.205ms (6212)
hptimer 100: 536.320us (640)

hptimer 10000: 0us (0)
hptimer 1000: 7.262ms (8667)
hptimer 100: 739.116us (882)

These seem less erratic, but obviously the 10000 test is overflowing the 59660 limit.

I do, however, suggest you upgrade to a more recent version

I'm looking at doing that using homebrew, but have found various statements from people finding QEMU not working at all after an update. Combined with not knowing what version I'd be getting, and Catalina no longer being supported by homebrew, I may just leave well enough alone for the time being 🤷‍♂️.

How do you suggest we add a microsecond-exact delay loop using this code?

Getting a true precision delay loop to microsecond accuracy will be hard, given what we've learned about the huge speed differences between processors, the need to disable interrupts in a driver for accuracy (bad!), and the overhead of the setup code required for the actual timing.

The idea would be to calculate the number of 8254 PIT countdowns (= 0.838us pdiffs) needed by first dividing the microseconds wanted by 0.838, then reading the current countdown repeatedly, taking wraps into account, until the count difference >= pdiffs. Dividing by 0.838 will have to be done by multiplying the microseconds by 1000 first, then dividing by 838. This should be doable with a single 16x16=32 multiply and 32/16 divide, but I'm not sure how to factor the MUL/DIV instruction timings into the equation, unless the timeout is a constant and can be pre-calculated. I suppose the cycle counts of each of the setup instructions could be calculated for 8086/286/386+ and then subtracted from the requested microseconds based on the running cputype.
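A very rough sketch of that idea (under the assumptions above: 0.838us pticks, a countdown reload value of 11932, and ignoring the setup/MUL/DIV overhead just mentioned; sketch_udelay and pit_read are hypothetical names, not an existing API) might look like:

/* sketch only: busy-wait for approximately 'usecs' microseconds by
 * accumulating wrap-corrected PIT countdown differences */
static unsigned int pit_read(void)
{
    unsigned int count;

    outb(0, TIMER_CMDS_PORT);               /* latch countdown */
    count = inb(TIMER_DATA_PORT);
    count |= inb(TIMER_DATA_PORT) << 8;
    return count;
}

void sketch_udelay(unsigned int usecs)
{
    unsigned long target = (unsigned long)usecs * 1000 / 838;   /* us -> pticks */
    unsigned long elapsed = 0;
    unsigned int last = pit_read();
    unsigned int now;
    int diff;

    while (elapsed < target) {
        now = pit_read();
        diff = (int)(last - now);           /* counter counts down */
        if (diff < 0)
            diff += 11932;                  /* assumes at most one reload between reads */
        elapsed += diff;
        last = now;
    }
}

The per-cputype setup overhead discussed above would still need to be subtracted from the requested microseconds for this to be anywhere near exact.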

What would an accurate delay like this be required for?