Thoughts on changing ktcp timer resolution

Mellvik / TLVC

Tiny Linux for Vintage Computers

Other

13 stars 0 forks source link

Thoughts on changing ktcp timer resolution #71

Closed ghaerr closed 2 weeks ago

ghaerr commented 4 months ago

In previous comments, I made the statement:

I would treat making a single system call per ktcp cycle as very fast. The system call overhead is almost identical to the hw interrupt processing overhead ... and of course the 100 timer calls/second are thought as minuscule overhead.

After looking at the code generated for the kernel sys_gettimeofday routine, I would say "Oops - that's only part of the story, its actually not fast". In order to prepare the returned system time into a struct timeval, a 32-bit divide (by calling __udivsi3) plus an additional 32-bit modulo (by calling __umodsi3) are made, along with a couple other 32-bit adds. This is all slow as heck. (Unbundled sources for each are here and here and they loop).

Then, after that, ktcp's time wrapper function timer_get_time, which returns a 32-bit time in 1/16 second resolution, goes through yet another slow call to __udivsi3 to convert microseconds into 1/16 seconds by dividing by 16:

timeq_t timer_get_time(void)
{
    struct timezone tz;
    struct timeval  tv;

    gettimeofday(&tv, &tz);

    /* return 1/16 second ticks, 1,000,000/16 = 62500*/
    return (tv.tv_sec << 4) | ((unsigned long)tv.tv_usec / 62500U);
}

Besides being quite slow, 1/16 second = 62.5 msecs, 6 times longer than the 10ms clock resolution and the 5-10 msecs for an ACK response, discussed here.

(I also notice in the above code as I'm writing this that the second ("&tz") param to gettimeofday is not used and wastes more time returning it - should be NULL).

In another post I stated:

Whether you choose 1, 10 or 100ms isn't probably a big deal if we stick with the jiffies-related (100ms) hw timer as base real time, but the lower the better

After thinking more on this and looking at the timer_get_time routine - that answer also gives only half the story. The problem is that for efficiency reasons, any "ktcp" time needs to fit in 32 bits, not a timeval struct, in order to do a quick subtraction and not end up calling gcc long div/mod routines each time its used.

Thus the tradeoff: the more resolution given to micro or milliseconds, the less bits given to the "seconds" portion of a timeq_t.

In the current timeq_t, 28 bits are seconds and 4 bits are 1/16 seconds. This allows for 2^28 = 256M/86,400 = ~3106 days of timer. (A timeq_t may have to be in effect for the total duration of a TCP connection, I believe). To move to microsecond resolution would require 20 bits (1M) for microseconds, leaving only 12 bits (=4096) for seconds, a little over an hour. So that won't work, and the kernel hw timer has max 10ms (=HZ = 1/100) resolution anyways.

Considering that we stay with an approx ~10ms timer, which would at least measure ACK responses favorably on a 286/12.5Mhz or V20/8Mhz, how would the math work, given that the gettimeofday/timer_get_time is already quite slow?

If we considered 1/256 second resolution (=~3.9ms) or 1/128 resolution (=~7.8ms), this would help a lot and the timer_get_time routine could be made faster (see below). If 1/256sec is used, that leaves 24 bits for seconds (=16M/86400 = ~194 days). 1/128 would be 25 bits with 388 days. But we should be time sampling with less than half the timer resolution, so I would say 1/256 second resolution would be best, and TLVC/ELKS is likely just fine with an open TCP connection for 194 days before worrying about RTT miscalculations.

One might ask, why not just use 1/100 second (=10ms) resolution directly? Well, the original timer_get_time routine is somewhat magic in that it uses the fact that 1M/16 = 62,500, so a microsecond divide can exactly fit into 4 bits (=1/16 sec). Dividing by 100 will still be quite slow but won't fit neatly into 7 bits. And if we're going to use 10ms resolution, why not make a special new system call and just return jiffies directly, which would be quite a bit faster, and only use a compare and subtract or two to compute elapsed time.

Another thought would be to just use a right-shift-12 (=1M/256 = 2^20/2^8) for roughly converting struct timeval microseconds to 1/256 seconds, this would leave the last line of the routine as:

    return (tv.tv_sec << 8) | (((unsigned long)tv.tv_usec >> 12) & 0xFF);

I haven't yet figured out how to measure how long the __udivsi3 or __umodsi3 routines take, so I'm not fully sure just how important all this is, given the kernel's 10ms max resolution.

What do you think?

ghaerr commented 3 months ago

@Mellvik: I started thinking about porting the precision timing routines to the C library, for use in applications like ktcp.

When considering how to measure, say, the ktcp main loop, I realized we don't want to have to choose between various elapsed time routines, but just use one, since ktcp could wait for quite a while (seconds, minutes), or it could loop quickly. Also, applications are always subject to multitasking and a 50ms interval, 5 time slices, is pretty small when other applications are running - a routine like get_time_50ms could easily overflow.

I've come up with and am testing (in the kernel for now), a single unsigned long time_elapsed() routine, which, due to the math, will work and display properly for elapsed times between 0.838us and ~42.85 seconds, with overflow detection up to ~10.9 minutes.

When the elapsed time is below 50ms, the resolution is 0.838us, but when >= 50ms, the resolution is in 10ms (jiffie) increments. When the elapsed time is <= 42s the us/ms/s value can be displayed correctly, otherwise we'll have to display 0s or ">42s". If the elapsed time is < ~11 minutes, the overflow detection can't work well without impacting < 50ms measurements, so a bad value could be displayed.

The math doesn't matter other than the high end ~42s, but the long return value would be displayed using %lk and normally (< 42s) just work. We would probably want to use the same-named routine in the kernel as well.

Thoughts?

Mellvik commented 3 months ago

@Mellvik: I started thinking about porting the precision timing routines to the C library, for use in applications like ktcp.

When considering how to measure, say, the ktcp main loop, I realized we don't want to have to choose between various elapsed time routines, but just use one, since ktcp could wait for quite a while (seconds, minutes), or it could loop quickly. Also, applications are always subject to multitasking and a 50ms interval, 5 time slices, is pretty small when other applications are running - a routine like get_time_50ms could easily overflow.

I've come up with and am testing (in the kernel for now), a single unsigned long time_elapsed() routine, which, due to the math, will work and display properly for elapsed times between 0.838us and ~42.85 seconds, with overflow detection up to ~10.9 minutes.

When the elapsed time is below 50ms, the resolution is 0.838us, but when >= 50ms, the resolution is in 10ms (jiffie) increments. When the elapsed time is <= 42s the us/ms/s value can be displayed correctly, otherwise we'll have to display 0s or ">42s". If the elapsed time is < ~11 minutes, the overflow detection can't work well without impacting < 50ms measurements, so a bad value could be displayed.

The math doesn't matter other than the high end ~42s, but the long return value would be displayed using %lk and normally (< 42s) just work. We would probably want to use the same-named routine in the kernel as well.

Thoughts?

@ghaerr,

this makes a lot of sense to me. ktcp being my active project at the time, I'm rather looking forward to test this. A few questions/considerations come to mind:

I think capping the coverage at 42s is fine. Above that (and actually quite a bit below), milliseconds wouldn't make sense because of all the other stuff going on on the system, possibly even be misleading. Also there is still the option of using the old method of the new tool doesn't fit, gettimeofday. Presumably such capping would simplify the code somewhat.
ktcp may indeed sleep for a long time when there is no activity, and in such cases reporting '>42s' is fine. It is unlikely that the reading in such cases would be useful anyway.
A related question is what is the lowest/shortest meaningful measurement - which will depend on the complexity of the code, how the timer is accessed - and the speed (or rather, slowness!) of the machine. As for low-level access I'm guessing another /dev/mem ioctl??

We would probably want to use the same-named routine in the kernel as well.

I'm assuming that would be in addition to the get_time_50ms - right?

Thanks, M

Mellvik commented 3 months ago

I'm suggesting you boot the enclosed TLVC image in your QEMU version without the new options. It should deliver erratic results. If not, there is still a rabbit at large.

I'm not quite sure what this is testing, but here are the results without the HVF option:

This simple verifies that both systems/kernels are delivering the same results in your QEMU version, eliminating the chance that there are other elements involved. Thanks for testing. The results are vey conclusive.

I do, however, suggest you upgrade to a more recent version

I'm looking at doing that using homebrew but have found various statements from people finding QEMU not working at all after an update. Combined with not knowing what version I'll be getting and Catalina not supported anymore by homebrew, I may just leave well enough alone for the time being 🤷‍♂️.

It seems a new update qemu would work fine and will give you (at least) version 7.1. I just started such an update on my Mini, which is running an older MacOS than yours, and it seems to run fine (the update that is).

 elks git:(master)$ brew upgrade qemu
Warning: You are using macOS 10.14.
We (and Apple) do not provide support for this old version.
You will encounter build failures with some formulae.
Please create pull requests instead of asking for help on Homebrew's GitHub,
Twitter or any other official channels. You are responsible for resolving
any issues you experience while you are running this
old version.

==> Upgrading 1 outdated package:
qemu 7.0.0 -> 7.1.0

It's repeatedly complaining about Xcode and its command tools needing to be updated, but it's not a requirement (I'll update the command tools manually later).

The process takes hours (zillions of packages that need upgrades/updates), and it's not done yet. I'll let you know how it ends. The good thing is, brew doesn't break anything along the way. If the upgrade fails, the old should be fine (my experience).

How do you suggest we add a microsecond exact delay loop using this code?

Getting a true precision delay loop to microsecond accuracy will be hard, given what we've learned with the various huge speed differences between processors, the need to disable interrupts in a driver for accuracy (bad!), and while executing the setup code required for the actual timing.

The idea would be to calculate the number of 8254 PIT countdowns (= 0.838us pdiffs) needed by first dividing the microseconds wanted by 0.838. Then reading the current countdown, taking into account wraps, until the count difference >= pdiffs. Dividing by 0.838 will have to be done as multiplying the microseconds by 1000 first, then dividing by 838. This should be able to be done with a single 16x16=32 multiply and 32/16 divide, but I'm not sure how to take into account the execution of the MUL/DIV instruction timings into the equation, unless the timeout is a constant and can be pre-calculated. I suppose the cycles counts of each the setup instructions could be calculated for 8086/286/386+ and then subtracted from the required microseconds requested based on the running cputype.

There is obviously a limit to the real/practical accuracy such a function can deliver, the key goal would simply be 'as close as possible' given the constraints - one being simplicity. This would still be a hell of a lot better than the delay loops we have today, which - between the XT and the 40MHz 386 - vary by orders of magnitude. Also it would have to be a goal to avoid both MULs and DIVs, and I have an idea about how to do that. I'll take a stab and see if I can get it to work.

ghaerr commented 3 months ago

Above that (and actually quite a bit below), milliseconds wouldn't make sense because of all the other stuff going on on the system

I finally realized that we can have 0.838us accuracy, all the way from 0.838us to 42s. That is, the elapsed timer reads the wrap-adjusted PIT countdown (0-11931), then adds the jiffies difference times 11932, since jiffies is always incremented on counter reload. I got confused initially thinking we needed a very accurate, low-overhead sub-10ms timer and created three routines, which turns out is not needed when the new single routine returns an unsigned long, and sub-10ms accuracy is tricky when interrupts are enabled and on fast machines.

I'm assuming that would be in addition to the get_time_50ms

No, we just need one routine, I now have it working. The get_time_50ms name is now bad, and get_time isn't a great name for confusion with other C library clock time routines like gettimeofday. I was thinking time_elapsed, or maybe get_time_elapsed, might be a better name?

A related question is what is the lowest/shortest meaningful measurement

I'm thinking of using taking 3-4 adjacent elapsed time samples (with interrupts disabled) and then calculating an average, which could then be used as the average time the time_elapsed function needs to run. This would then be the shortest meaningful measurement, and could be subtracted from each elapsed time result in order to get more precision. The C library version will need an init routine (time_elapsed_init?, to get the kernel jiffies pointer), this could calculate the average and return it.

I have a working time_elapsed function working in the kernel now, and am thinking of submitting a PR when I get it all working and tested for the C library so we don't have to go through (yet) another revision of the routines...

Mellvik commented 3 months ago

How do you suggest we add a microsecond exact delay loop using this code?

Getting a true precision delay loop to microsecond accuracy will be hard, given what we've learned with the various huge speed differences between processors, the need to disable interrupts in a driver for accuracy (bad!), and while executing the setup code required for the actual timing.

Agreed, true precision at microsecond level is impossible given the factors you mention and the fact that an XT/8088 cannot complete even the simplest loop in 1us: Clock cycle 210ns, loop instruction requiring 17 cycles. So we're talking 10us resolution at best and semi-precision - which will have to do. Being orders of magnitude better than what we've got, it's OK.

Mellvik commented 3 months ago

I finally realized that we can have 0.838us accuracy, all the way from 0.838us to 42s. That is, the elapsed timer reads the wrap-adjusted PIT countdown (0-11931), then adds the jiffies difference times 11932, since jiffies is always incremented on counter reload. I got confused initially thinking we needed a very accurate, low-overhead sub-10ms timer and created three routines, which turns out is not needed when the new single routine returns an unsigned long, and sub-10ms accuracy is tricky when interrupts are enabled and on fast machines.

I believe your initial thinking was better than you give yourself credit for. There is a good reason for staying away from longs when possible - they are extremely expensive on pre-286 machines. Which is why I'm concerned that a single version - long only - is not going to work well/be all that useful on older systems when measuring short intervals.

Also there is the concern about complexity in the time_elapsed routines, which - since I haven't seen them - may of course be entirely superfluous. For example, while latching the timer as early as possible when getting the time elapsed is good, it works the opposite way when creating the starting point using the same routine.

A related question is what is the lowest/shortest meaningful measurement

I'm thinking of using taking 3-4 adjacent elapsed time samples (with interrupts disabled) and then calculating an average, which could then be used as the average time the time_elapsed function needs to run. This would then be the shortest meaningful measurement, and could be subtracted from each elapsed time result in order to get more precision. The C library version will need an init routine (time_elapsed_init?, to get the kernel jiffies pointer), this could calculate the average and return it.

Yes, I think this is an important step - incidentally my calibrated delay loop does the same, although no multiple calls to get an average yet. The calibration happens at boot time, but it may still be good (even necessary) to make multiple calls to even out 'noise'. I get differences between two runs just calling get_time_50ms twice (on the 286) exceeding 100us.

As to naming, I agree that the current naming is not ideal. Given that using a timer always requires two steps, set and get, I suggest two different calls, even if they end up calling the exact same routine. Like 'set_ptimer()' and 'get_ptimer()- 'p' meaning/indicating 'precision'. And finallyinit_ptimer(). Ifset_ptimer()andget_ptimer()` were indeed different routines, latching the timer at the optimal point would be possible for both cases.

ghaerr commented 3 months ago

@Mellvik: I've been traveling again, but managed to push PR https://github.com/ghaerr/elks/pull/1962 showing the working refactored code for combining the previous 3 precision timer routines into a single routine: get_ptime. I think this gives us the best of all the variations we're looking for (more discussion on that below).

Which is why I'm concerned that a single version - long only - is not going to work well/be all that useful on older systems when measuring short intervals.

Good point. In get_ptime, for measurements < 10ms (that is, less than a clock interrupt), all arithmetic is done in 16-bit, and no longs are manipulated or stored. If you know the measurement will be small, then you can just cast the result and get a very fast measurement: unsigned int ptick = (unsigned int)get_ptime().

If you're concerned about elapsed times between 10ms and 50ms, there's always going to be a clock interrupt (at least in real execution) within the measurement, and the kernel IRQ 0 handling has shown to vary elapsed times by quite a bit, so there wouldn't be a super-accurate measurement being made anyways. In the new routine, there is only a single MUL, which is executed only when elapsed time > 10ms (through 42s).

So, the same routine can be used for sub-10ms or 0-42s measurements with no extra overhead for < 10ms. The only thing to watch for is that %lk needs to be used when displaying an unsigned long ptick vs before.

For example, while latching the timer as early as possible when getting the time elapsed is good, it works the opposite way when creating the starting point using the same routine.

In general, moving the latch code as first or last complicates matters and would necessitate having two almost-identical routines, which also would have the disadvantage of not being able to be used to compute a more reliable average execution time, and also double the amount of kernel code, just for a few cycles difference.

If measuring elapsed time really really matters (something I have yet to see as really important given the accuracy of our varied results so far as well as the fact that ELKS/TLVC are not real time systems), I suggest another way: writing the small portion of the 8254 counter-reading routine as a GCC static inline routine, which would then be inserted directly into the code being measured. This would remove all procedure call overhead and read the PIT directly from the measured function. More on that later should we really want/need it.

I suggest two different calls, even if they end up calling the exact same routine.

I opted against that idea, as it adds another function call to every measurement and there's no real need for a separately named start and stop function, especially if the elapsed time function is used in a loop, where this idea wouldn't work.

Mellvik commented 3 months ago

@ghaerr,

what an educating and interesting journey it has been since I asked this question about 3 weeks ago - in https://github.com/Mellvik/TLVC/pull/69#issuecomment-2241622226:

As to the time a get_packet call inside a driver takes, jiffies don't have the resolution to measure that. @ghaerr, what's the smart way to get that ...

... including - off all things - a short tutorial on system startup in https://github.com/Mellvik/TLVC/issues/71#issuecomment-2276627078 :-)

And don't worry about timing (pun intended). This side project is not holding up anything and I'm having very limited screen time these days too.

It's easy - and we've been there several times - to lose perspective on reality when digging around low level like this. As you point out, there are severe limits to the useable accuracy we can get given the way the system operates - clock, interrupts, scheduling etc. - and not the least the capabilities of the systems at hand. An XT has a cycle time of 210ns which means that even the tightest loop - a nop and a loop instruction - will take (3+17)*210ns = 4.2us (measured@1000 loops: 5.6us). Also, a minimum call/return without even looping is 26us (measured: 30us). Which makes the idea of a microsecond semi-exact delay loop, 'unviable' to say the least. Even a u10delay() is a stretch if the delay count is low, but I may be going for that anyway, thinking that the XT cannot dictate the choice. Different discussion for a different place though.

That said, my work on the delay idea has turned into a great testing ground for the precision timer tools - for several reasons, including the variations between runs and time consumed by the tools themselves, which must be recorded and corrected for when establishing the metric for the delay loop. A great baseline when testing the 'old' 50ms routine against the new get_ptime(). I have recorded comparable numbers for call duration and various loop times for all 4 system types - XT/4.77, XT-V20/8MHz, 286/12.5MHz, 386SX/40MHz. It's indeed very educational. And I'm getting a 'somewhat' accurate u10delay function across the platforms, which for obvious reasons becomes more accurate the longer the delay. A boot time calibration generates a speed-index for the inner loop of the delay routine.

If measuring elapsed time really really matters (something I have yet to see as really important given the accuracy of our varied results so far as well as the fact that ELKS/TLVC are not real time systems), I suggest another way: writing the small portion of the 8254 counter-reading routine as a GCC static inline routine, which would then be inserted directly into the code being measured. This would remove all procedure call overhead and read the PIT directly from the measured function. More on that later should we really want/need it.

This is a good idea - somewhat similar to what I was thinking when getting frustrated with the variations in the delay function, but discarded for lack of insight into how to do it properly.

Back to where I started, we're getting close to having reliable and reasonably accurate timing for the likes of get_packet(). That's something!

Mellvik commented 3 months ago

@ghaerr, an update on updating QEMU:

It was ultimately unsuccessful - and it didn't destroy anything except the symlinks in /usr/local/bin.

Your system is newer, it may succeed and it's relatively risk free. I can send you a tar file with the 7.0.0 tree which should work if you run brew update to get the rest of the system as up to date as possible.

ghaerr commented 3 months ago

@Mellvik, I've added user mode precision timing in https://github.com/ghaerr/elks/pull/1963, and it all seems to work. Initial testing was done by timing ktcp's main loop. It will be very interesting to see what this looks like on real NIC hardware!

Mellvik commented 3 months ago

Thank you @ghaerr - looking forward to it. I haven't pulled across the final get_timer yet, so that comes first.

I kinda went to town on the scaled delay idea, encountering so many interesting things along the way, and lost track of time (pun intended) in the process. With 4 different processors and clock speeds to test and compare on, there is indeed a lot to be learned when diving all the way down to the clock cycle level (not to mention how to do the math without floats and with the least possible # of MUL/DIVs in the calibration, plus the effect on timings when using INITPROC vs local for the calibration).

Anyway, here's the one that hit me hardest in the tit took quite some time to figure out. Again it's QEMU and my version is 7.2.0. The following C statement works - temp is unsigned long, d0 is unsigned:

        temp = d0*838UL/10000UL;

while this does not, that is, it fails on QEMU, works fine on real HW, all types:

        temp = d0*838L/10000L;

No comments at all from gcc, and here's the thing: The difference in the object file:

➜  TLVC git:(misc) ✗ diff /tmp/fails.out /tmp/works.out     
124c124
<           d3: R_386_PC16  __divsi3
---
>           d3: R_386_PC16  __udivsi3
➜  TLVC git:(misc) ✗

Clearly the problem is in QEMU, and just as clearly, this is good to know for a 'heavy' QEMU user like yourself (the symptom is a stack screwup). It might be of interest to 'our' compiler guru too...

Mellvik commented 3 months ago

@ghaerr - there is apparently more to the 'div problem' described above than initially met the eye - and I'm coming back to that later. What led to the problem in the first place is equally interesting: There are important experiences to be made when using get_ptime on QEMU. Unsurprisingly - as we've discovered before - they don't go together well. In fact there may be cases in which they don't go together at all except in the context of loops - getting averages over many runs. The same piece of code will have vastly different execution times in different contexts - apparently related to caching.

In short, the caching effects in QEMU (possibly in the ia64 processors being emulated) are nothing short of amazing. We can time the same piece of code twice and get results that differ by a factor of 4. Not every once in a while, but consistently. Example: I have placed this piece of code at 3 different places in a routine, the latter 2 are adjacent:

        get_ptime();
        u10delay(0);
        d0 = (unsigned) get_ptime();
        printk("d0=%u (%k)\n", d0, d0);

The values I get are 46, 14 and 9 pticks. When running the following code:

        //-----
        get_ptime();
        u10delay(0);
        d0 = (unsigned) get_ptime();
        printk("d0=%u (%k)\n", d0, d0);
        //-----
        i = 0;
        for (d0 = 0; d0 < 3; d0++) {
                get_ptime();
                i += (unsigned)get_ptime();
        }
        //printk("l=%u (%k)\n", i, i);
        adj = i/3;
        printk("adj %u -- ", adj);

first without, then with the part delineated by //-------, the value of adj is 10 and 2 pticks respectively. Attempting to pre-cache the calls didn't have much effect. Testing on the Intel Mac w QEMU 7.0 gave similar results, sometimes with even higher difference between runs - order of magnitude (using -accel hvf).

These seemingly random differences in measured times were the root cause of the div problem reported above, causing div by zero, which apparently is not tested for in the long div routines, whether signed or unsigned.

My conclusion from this experience is that precision time measurement on QEMU can never be reliable - and that's OK. QEMU is not the environment in which such measurements are meaningful anyway. The same goes for the delay stuff I've been working on. I just have to catch the presence of QEMU (from the speed) to avoid the weird values screwing up the calculations. The actual delays don't matter on QEMU.

ghaerr commented 3 months ago

I'm in agreement with you about QEMU, the differences between execution times with and without HVF or KVM acceleration for its internal JITter along with what you've seen show that show that "elapsed" time measurements using the 8254 timer are unreliable at best, I would say +/- ~10 pticks (sometimes way more). But I think the real hardware measurement are pretty accurate (more on this below), right?

With regards to trying to accurately measure delays on real hardware, agreed an average over several loops should be made. Interrupts have to be disabled around get_ptime or the kernel timer interrupt code might be counted on one of the loops, wildly throwing off the results. I would also recommend using a power of two (4 or 8) for the loop count so that a simple right shift can be used instead of a possibly much longer DIV executed.

were the root cause of the div problem reported above, causing div by zero, which apparently is not tested for in the long div routines, whether signed or unsigned.

Are you talking about the i/3 or something else? That should be a 16-bit divide. For 32-bit divides signed divides are slower than unsigned, and Line 11 says nothing is done about divide-by-zero. Was the divide by zero occurring because of integer overflow on the averaging routine? Not sure how this could happen otherwise.

Mellvik commented 3 months ago

Are you talking about the i/3 or something else? That should be a 16-bit divide. For 32-bit divides signed divides are slower than unsigned, and Line 11 says nothing is done about divide-by-zero. Was the divide by zero occurring because of integer overflow on the averaging routine? Not sure how this could happen otherwise

It's as simple as

unsigned long i, j, k;
i = 80000UL;  // any value really
j = 0;
k = i/j;

Which either cause a hang or sends the the kernel into some interesting places.

The code above has no relation to this problem - I tested the div_by_zero separately after discovering the huge timing discrepancies which cause 'guaranteed' positive values to become negative - and divisors to become 0.

I'm not sure what to make of this. On the one hand, a compiler library gcc level should handle divide by zero, on the other the programmer (which would be me in this case) should make damned sure it cannot happen.

ghaerr commented 3 months ago

It's as simple as Which either cause a hang or sends the the kernel into some interesting places.

I see what you mean now.

I'm looking at divmodsi3.s function __udivsi3 (which is called from the above code). It essentially pulls the arguments off the stack and prepares to divide BX:AX by CX:DI in ldivmod.s function ludivmod. Go ahead and look at it. On lines 72-75 CX:DI is tested for zero, and will jump to divzero, where the input sign bit (stored in DX) is tested (it will be 0, or NS not signed), and jumps to return.

So the GCC library routine (if its the same as this one, they were once in the kernel but are now in a GCC library separately) does check for dividing by zero (and returns the input as output).

You might test this by adding back in ldivmod.s and divmodsi3.s into the kernel build, to get to the very bottom of this.

ghaerr commented 3 months ago

to get to the very bottom of this.

Guess what!? Turns out we're no longer running the soft-divide I referenced above. Further digging shows that the updated ia16-elf-gcc uses a different routine which uses actual DIV instructions. So you've just identified another kernel problem - apparently the 8086 hardware divide-by-zero interrupt vector isn't initialized by the BIOS (our kernel doesn't set s trap vector) and bad stuff is happening!!!

There is _irqit special handling for hardware traps like this, but the kernel doesn't write the divide by zero vector on startup, so yep - on your system the system is crashing on hardware divide by zero.

I'll look into a fix.

ghaerr commented 3 months ago

@Mellvik: I added https://github.com/ghaerr/elks/pull/1969 to catch user and kernel mode DIV/IDIV divide by zero and overflow hardware fault handling. It is able to precisely determine the CS:IP of the offending instruction, although that probably isn't necessary, as fault recovery isn't very useful.

Mellvik commented 3 months ago

That's great news, @ghaerr - another 'cavity' filled, another step toward adulthood for both systems. I'll surely panic alot tomorrow! Thanks!

ghaerr commented 2 months ago

On the business of displaying precision timer values (or any other long or short decimal value) in the kernel: I've been deep into comparing software vs hardware DIV routines as a result of the rabbit hunt in #76 as well as some DIV/MUL speed issues we're looking into now that someone has ported Doom to ELKS.

When GCC sees a 32-bit divide or modulo, it calls a signed or unsigned long divide return (divsi3 or udivsi3), and when needed modulo, the same (modsi3 or umodsi3) to get the job done. This is for both the kernel and user mode. As you know, these routines used to be soft with no DIV/MULs but have been replaced by versions executing one or two DIVs and MULs each, depending on the input. And the kernel and libc source have the old versions laying around but not used, adding some confusion.

In many cases, including our own printk and libc printf, a number-to-ascii routine will be written taking the modulo base 10, then a division base 10, in order to generate the ascii digits in a number:

    unsigned long val = something;
    do {
        int c = val % 10;   
        val = val / 10;
        output(c + '0');   // in real life more complicated because this generates in reverse order
    } while (val != 0);

As it turns out, real HW DIV instructions generate both the division result as well as the remainder in one instruction. But GCC generates two separate calls for the above code, one to umodsi3, then another to udivsi3. What would be nice would be a single divmod routine that calculates both in a single call, thus saving lots of CPU time. What would be even nicer would be a divmod10 routine that is optimized for dividing/modulo 10. I'm working on that, but GCC won't generate calls to them directly, although I have found that when compiling with -Ofast instead of -Os, it can produce a long divide by 10 not using any DIVs or MULs at all. Research continues.

For cases where we want maximum speed, e.g. precision timing output, we might want something like this, with __divmod10 optimized in ASM:

    unsigned long val = something;
    do {
        int rem;
        val = __divmod10(val, &rem);  // return division and store remainder in rem
        output(rem + '0'); 
    } while (val != 0);

So, looking at the kernel printk and guess what?? The way its written (looks like all the way back to the 1990's, it operates in the worse possible case and is guaranteed slow: it uses an algorithm to avoid reversing the decimal output pointed out above that starts by dividing by 1,000,000,000 and then div 10 mod 10 and so on in a for loop for TEN TIMES, calling both udivsi3 and umodsi3 every time!!! This same approach is also used to display 16-bit hex numbers, using a variable for base = 16 vs base = 10, thus disallowing any optimizations and also running worst possible case. It is taking as long to display the number 0 or 0x1234 as a number with value 4GB and 10 digits.

So, the kernel is taking absolute worst time to display calculated number in any base. No wonder printk's change the timing of the kernel, they are taking ridiculously long every time and we never even knew it!!!

I'm working on a rewrite for this as well as the user mode printf, although the latter is not nearly as bad. However, %k ptick output is subject to this massive slowdown in printk as well.

I thought you'd get a kick out of this.

Mellvik commented 2 months ago

Incredible. Yes you're right, this is up my alley (I'm still tweaking the delay calibration, unbelievable). This is research plus creativity. Looking forward to the next episode! Thanks.

ghaerr commented 2 months ago

I have written a very short, custom ASM __divmod10 which runs as fast as possible, and executes one DIV if it can, otherwise two, to compute both the 32-bit quotient and 16-bit remainder of a 32-bit number divided by 10 in the same routine. The next step is testing it and then rewriting printk followed by printf, ultostr, ltostr and ltoa so they're as fast as possible.

This will speed up printk immensely, which is probably very good for older systems, although we may lose the delays previously used to affect timing-specific system behavior - we'll see just how slow it was and whether that mattered.

In particular I have figured out how to combine the %k processing which required comma output for thousands grouping with the mainline numeric decimal conversion routine, which should reduce code size and also give us thousands grouping for other applications, which is something I've wanted for a while.