Subtle combination causing printk to go astray

Mellvik / TLVC

Tiny Linux for Vintage Computers

Other

9 stars 0 forks source link

Subtle combination causing printk to go astray #76

Closed Mellvik closed 1 week ago

Mellvik commented 3 weeks ago

A rather interesting issue related to the discussion in #71, the implementation of a precision timer and a fix for divide-by-zero.

The problem occurs only in QEMU, which means that QEMU itself may be the culprit. That said, it takes either a very fast or very recent version (maybe both) of QEMU for the problem to occur. Which makes the analysis even harder. The symptoms simply don't make sense. A classic start of a real rabbit chase.

The problem: A special sequence of actions cause printk to either cause the current function to exit or the system to hang, a classic stack pollution case.

The situation leading to the problem:

using the value returned from get_ptime
a long divide involving that returned value
a call to printk using the result of that division
and encapsulating it all, a very fast QEMU - or possibly just a very fast processor, which may be a clue to where the rabbit hides, a race condition??

@ghaerr, it would be interesting if you run this code snippet on ELKS on your QEMU which is likely quite a bit faster than my Mac Mini, which does not show the problem.

Anyway, this code snippet (not doing anything other than demonstrating the problem) demonstrates the problem: Either the statements from the second printk to the end of the function are skipped (like an immediate return) - or we have a hang (the experimental code resides in init/main.c):

{               unsigned long temp;
                int dd;
                unsigned d5, d6;

                d5 = 50;
                get_ptime();
                temp = get_ptime();
                printk("Direct %lu $$", temp);       <--- this is printed, value between 3 and 6 (11 on the MacMini)
                //temp = 3UL;
                dd = temp*838UL/1000UL;
                printk("XXXX %lu", temp);                 <---- this prints too, showing that the div is OK and temp us unaffected
                printk("UL dd=%d $$ ", dd);    <--- this and any printk below in the same function never show, the function just exits - or the system hangs right here
                dd = ((long)temp * 100L)/(long)d5;
                printk("Signed dd=%d $$ ", dd);
                ...
}

Some observations from a seemingly infinite number of testvariants:

Activating the //temp = 3UL; line eliminates the problem - which is very interesting because it means that...
```
int d5;
get_ptime();
d5 = get_ptime();
```
... is not the same as
```
d5 = 4351;
```
... even if that number is exactly what was returned from get_ptime();. So it's not the number returned but the call to get_ptime() that seems to be important. That said,
```
get_ptime();
get_ptime();
d5 = 4351;
```
... doesn't cut it either. The error cannot be provoked unless the actual return value from get_ptime() is used. Very strange indeed.
Still, the variable itself is unimportant - as is its size. Setting d6 = temp; where d6 is an unsigned int and subsequently use d6 instead of temp in the division still triggers the problem.
Adding delays between the two get_ptime() calls to increase the # of pticks to > 64k makes no difference, IOW whether get_ptime returns a long or a short doesn't make any difference.
This observation is supported by changing temp to unsigned int and casting the 2nd call to get_ptime(), still no difference.
Changing the dd (int) in the first divide statement to d6 (unsigned) changes behaviour slightly, from 'exit function and continue' to 'hang right there' - which means nothing other than that a 'stray' return happens to 'land' on a slightly different address.
The only way to make the routine 'work' with the return value from get_ptime() is to AVOID long divide.
```
            d5 = d6*838/1000;       /* THIS ONE WORKS */
            printk("2: Direct %x $$", d5);
```
... which obviously is no good if d6 > 78 - but we're closer to a conclusion: Somehow the interaction between get_ptime() and long div (signed and unsigned) cause a stack pollution which in turn cause printk to malfunction on very fast machines - or there is a bug in QEMU 7.2.0.

And the hunt continues.

ghaerr commented 3 weeks ago

I'll take a look and try to analyze the problem. But first - are you running this code in the function start_kernel? The function has the following comment at the top of it:

/* this procedure called using temp stack then switched, no temp vars allowed */
void start_kernel(void)
{
    ...
    setsp(&task->t_regs.ax);    /* change to idle task stack */
    ...
// are your variables being declared here? If so, that's disallowed as the stack has been switched within this function.
}

It seems from your description that local variables are being declared within a block, which the compiler will then expect them in certain registers or stack locations. The setsp call above switches stacks midway through the function, which is required long story short for the way the dynamically allocated task array works. Thus, if any local variables are declared within kernel_start, the variables may or may not be accessible and/or overwritten, depending on system activity. This is all part of the trickiness of running code within kernel init (the complex startup sequences having been previously described).

If the locals are declared within start_kernel, then this is likely the issue. The entire set of testing needs to be performed in a separate function, much like was done with testloop, etc. If this is not the issue, let me know and I'll investigate further.

[EDIT: I probably need to change "no temp variables allowed" to "no local variables allowed" to be more clear.]

Mellvik commented 3 weeks ago

Thank you @ghaerr , The test routine is called next to testloop after kernel_banner in kernel_init.

Possibly the most weird thing is that everything works fine except in this very fast instance of QEMU.

ghaerr commented 3 weeks ago

This is strange, but I still have my theories about stack over/underflow. Can you post your init/main.c as main.zip so I can see it exactly for duplication on my side?

Mellvik commented 3 weeks ago

main.c.zip

Mellvik commented 3 weeks ago

Probably obvious, everything is in the calibrate_delay routine at the end.

ghaerr commented 3 weeks ago

I'm just starting to look at this... I am also wondering whether this behavior might have something to do with the DIV trap handler you just copied over a few days ago... and your statement that things only go bad when the long div routine(s) are called. The DIV instructions can also throw an overflow fault - if by chance the DIV handler isn't working and an overflow is occurring, is that possible something funny might be happening there? This could be tested by executing a DIV by 0 (make sure the compiler doesn't optimize that out) to test the fault handler, which should immediately panic. I never tested DIV overflow, and am not entirely sure when it happens.

I'll take your calibrate_delay routine and look at the code generated, and add it to my main.c and see what happens.

ghaerr commented 3 weeks ago

Just ran calibrate_delay on my system/qemu (v5.2, MacBook Pro x64 3.1Ghz) and it is not failing. Still looking into code generated, not finding anything obvious. Will set console to serial see if that makes any difference...

ghaerr commented 3 weeks ago

Looking at this hard, but not getting it to fail on my QEMU, I have the following thoughts and suggestions:

First, add a clr_irq() at the start of calibrate_delay(), just in case there's something strange with interrupts trashing the stack early on during kernel init. It doesn't seem this should help, but try it.

Second, it appears that GCC is very much rearranging the statements in calibrate_delay, and what you think is happening is likely not. In order to be precise, I saw that GC doesn't even compile statements whose result is not passed to another function or used in another calculation. So for instance, the dd = (temp*838UL/1000UL); can be entirely deleted and it doesn't change the code output. (I am performing a ia16-elf-objdump -D -r -Mi8086 main.o > file to save the results, then diffing that with a previous disassembly).

In order to be more precise with what I'm looking at, I then also removed all "unused variables" and statements that did nothing as described above. This ended up with the following, which produced the exact same code as your version:

void calibrate_delay(void)
{
    unsigned long temp = 0;

    sys_dly_base = 0;
    sys_dly_index = 1;
#if 1
    { /* div test */
        unsigned d5;
        unsigned d6 = 33;

        d5 = 50;
        get_ptime();
        d6 = (unsigned)get_ptime();
        printk("Direct %u $$", d6);
        d5 = (unsigned)((long)d6*838L/1000L);   /* this fails */
        printk("YYY;");
        printk("2: Direct %u $$", d5); /* HANGS HERE */
        d6 = (temp*838UL/1000UL);
        printk("XXXX %lu\n", temp);
        printk("New (UL) dd=%x $$", d6);
    }
#endif
    printk("\n");
}

Now, disassemble the output. Notice that the printk of YYY is compiled BEFORE the "d6 this fails" statement! So for sure what is compiled is out of order with the statements. This means you can't trust what order things are occurring in, if GCC deems it better to reorder function calls! (All variables can be declared volatile to stop this, but first perhaps try turning off optimization as suggested below).

In order to get around this, I suggest turning off optimization for this function. This produces lots more code, and probably will affect what seems to be a timing-depending problem. This can be done by changing the function declaration to:

void __attribute__((optimize(0))) calibrate_delay(void) { ...

You will see the compiled code now orders the statements you're writing.

In summary, I would try:

testing divide by zero as described above to see panic
disable interrupts at beginning of function to remove possible interrupt handler issues
run test on console, not on serial console, as QEMU might be getting ahead/behind of the action with serial emulation lead/lag
turn off optimization for function to see whether you can get problem to repeat
declare variables volatile and keep optimization on just to see what happens

If this fixes things, then we need to take a hard look at the GCC optimizations that are taking place in this function. I traced it a bit, but its complicated. There is a small chance there could be an issue with the compiler, but I still can't see a place where the stack gets clobbered unless its through a kernel hardware interrupt or task switch, neither of which would take place with the above suggestions.

Mellvik commented 3 weeks ago

Just ran calibrate_delay on my system/qemu (v5.2, MacBook Pro x64 3.1Ghz) and it is not failing. Still looking into code generated, not finding anything obvious. Will set console to serial see if that makes any difference...

Thank you @ghaerr, I really needed some fresh eyes in on this, after having banged my head against the wall for a few days.

FWIW - my wife's iMac is a 2020 3.1GHz 6 core Core i5, quite similar in terms of performance to your Macbook Pro, but with the most recent OS. I just installed homebrew, then qemu on it, got version 9.0.2 (!!)(why don't I get this on my M1??) and ran the same floppy image I sent to you, which fails on the M1. Lo and behold, it fails in exactly the same way. (BTW - on this version -accel hvf is not available.)

Now, how useful is that? At least we know the problem is not unique to my machine/config and unrelated to whenever the physical platform is Intel or not. But it may still be related to QEMU versions newer than 7.0 and it may still be related to speed (assuming the iMac is slightly faster than the laptop at the same clock rate) and it may be related to the host OS, which is the latest Sonoma on this iMac.

Mellvik commented 3 weeks ago

I suggest turning off optimization for this function. This produces lots more code, and probably will affect what seems to be a timing-depending problem. This can be done by changing the function declaration to:
void __attribute__((optimize(0))) calibrate_delay(void) { ...

@ghaerr, thanks for this one!! Having pondered too much for too long over object listings these past days, this was the natural first one to check out. And - possibly surprising, then maybe not - the problem goes away.

How interesting! Like you say, a lot more code but here we are, and I'm back at chewing object code, this time with a purpose other than hoping for the needle to magically fall out of the haystack.

Mellvik commented 3 weeks ago

More findings and observations, @ghaerr:

clr_irq() doesn't change anything
div-by-zero works as expected - on real HW too
Forcing QEMU to use 'real' console doesn't change anything
Optimization off means it works, as reported above
your observation about the sequence of things, that the print of 'YYY' actually comes before the bad div, is important. It means that printk is likely not part of the problem, and that it is actually the divsi3 that sends us to oblivion, not a printk.
volatile d6 didn't change anything. Adding volatile d5 (not removing the first) is an entirely different story: Boot fails after the L1/L2 buffer message. Optimization is active.
```
... 
24 ext buffers (24576B ram), 10 L1 buffers, 10240B ram
HEAP: no memory (10240 bytes)
```

panic: No buf mem SYSTEM HALTED - Press CTRL-ALT-DEL to reboot:


... which is surprising: WTF? This sounds like a compiler problem, doesn't it? Unsurprisingly, removing the volatile on d6 changes the back to normal, the bad division hangs in the same way as before, which way is affected only by the amount of code inserted or deleted ahead of it. So one `volatile` is ok in the sense it neither breaks anything not changes anything, two bombs the system. Hmmm

BTW, and possibly interesting, the volatiles have no effect when running on real hardware. Now, that's interesting...

The hunt continues.

ghaerr commented 3 weeks ago

Very interesting. So what is the current prognosis from the test results? It never fails on real hardware, right?

Only certain code sequences cause failures on QEMU, but those sequences cause a fault every time with the same version of QEMU (or higher), right? Are we really dealing with a speed issue here, or just a possible QEMU bug (like DIV trashing some wrong registers or the stack in certain QEMU versions)?

I'm trying to determine if the problem is definitively in QEMU at or above a certain version number, or whether there is a speed/stack trashing issue as well.

If you can determine that the problem only occurs in QEMU and during or as a result of the __divsi3 routine, we may have to pull that from the GCC library (its at cross/build/gcc-src/libgcc/config/ia16/) and possibly comment out some of that code (like the DIV instructions) as a kernel .S file just to see what happens. I am not sure the kernel uses much of long division, but I would bet some of the time routines do. We will need to be sure the problem is within that code first before though. I would bet the kernel would likely function with a broken __divsi3 routine for debugging.

If the problem only occurs on QEMU and never on real hardware, I would say it is not a compiler issue but possibly QEMU interpretation issue with library code...

the volatiles have no effect when running on real hardware.

All a volatile will do is stop optimization and cause a reload from memory instead of using a cached register value in most cases. So if the problem never occurs on real hardware, you won't see a difference with volatile.

Mellvik commented 3 weeks ago

What we know:

happens on QEMU only
is related to either (host) speed or QEMU version (it would be useful to get QEMU 9 running on some really slow hardware, but I don't know if that is possible)
requires divisor to have been returned from get_ptime() (really makes no sense)
the error manifests itself in divsi3 (and unsigned version) but the root cause may be somewhere else
does not happen when optimization is turned off

The effect of using volatile is worth pondering. Adding the 2 volatile declarations I mentioned, changed system behaviour completely: The kernel bombs long before it get to the point where the calibrate_delay is called. This makes even less sense than the rest of the picture, so I need to head back into this one to verify.

My hunch is that we have a processor/compiler problem. That the ia64 architecture that QEMU emulates regardless of the switches we set telling it otherwise, implements some stack related instruction(s) different from its older brethren (like the difference between the 8088/86 and the 286 and later in the way the SP works (vague memory on this, may be slightly different). Speed getting fast enough, possibly caching, pipeline and other effects, I don't know how far down emulation really goes, and we get this effect.

Seems it may be worthwhile to continue my analysis of the optimised vs. unoptimized code. What do you think, @ghaerr ?

Mellvik commented 3 weeks ago

Update: There is a chance I can get qemu 9.0.2 running on a VERY slow machine. My 10 year old MacBook 12" has a 1,3GHz 2core CoreM processor runs MacOS BigSur and brew says it will install 9.0.2 if the latest version of Xcode command line tools is installed. That may take a while though. If successful we can eliminate of confirm the speed issue.

ghaerr commented 3 weeks ago

Adding the 2 volatile declarations I mentioned, changed system behaviour completely: The kernel bombs long before it get to the point where the calibrate_delay is called.

Wow - this needs more investigation for sure! We need to be looking at the differences in code generation - especially if you're saying that volatile declarations in a function that hasn't been called yet cause a crash. If that's the case, my long-though issue of the kernel init stack not being safe might be the real issue.

This is beginning to sound like the issue isn't in ldivmod.S, but possibly something to do with the early kernel stack setup and situation. I have spent lots of time tracing and thinking about this, and haven't come up with anything. I would suggest taking the entire set of routines and sticking them in another place, perhaps the system timer interrupt timer_tick. While not a great place, this would take it kernel init issues out of the equation. Another place to try calling the procedures would be in the kernel file read or open routine, for instance.

Seems it may be worthwhile to continue my analysis of the optimised vs. unoptimized code. What do you think

Yes - we should definitely continue to chase this down. Bugs or misunderstood behavior like this alway comes back to bite, and its fun learning about the system through this process anyways :)

the difference between the 8088/86 and the 286 and later in the way the SP works

There are some differences with "PUSH SP" and CS:IP with the divide-by-zero interrupt along with a few others, but I have checked that none of those affect the kernel, so this is not likely the issue.

is related to either (host) speed or QEMU version

I thought you were saying this occurs with your laptop and QEMU every time. Remind me why are you thinking this is a speed issue when your version of QEMU always shows the problem?

does not happen when optimization is turned off

If volatile alone keeps the problem happening, even before the "bad" function is called, this is a sign of another problem, as described above.

ghaerr commented 3 weeks ago

Adding volatile d5 (not removing the first) is an entirely different story: Boot fails after the L1/L2 buffer message. 24 ext buffers (24576B ram), 10 L1 buffers, 10240B ram HEAP: no memory (10240 bytes)

I was reading too quickly this morning and completely missed this. It seems you may be running out of heap and not know it. The particular failure is occurring as the kernel is trying to allocate 10k bytes for the L1 buffers.

The previous dynamic task array PR allocates the task array out of the heap early on, and it is possible I suppose that the heap is running out but not showing it due to a bug in the heap manager. This could manifest itself by allocating less than required or another allocation error or overlap due to 16-bit overflow, etc. BTW the task array allocation occurs as the second line in start_kernel, while the buffer allocations occur after inodes and files have been allocated (using inodes= and files= in /bootopts, not sure all of what you ported over) in kernel_init and prior to calling calibrate_delay (as you said). This all could be caused by the way the kernel code or data segment is laid out and exactly what code the compiler generates in various places, and sounds like this is getting closer to the real problem.

Check whether you're still running 17 tasks and reduce it to 16 or lower, and check other dynamic requirements in /bootopts. These may be helping to expose a problem. We should definitely not be running out of heap early on in the system, unless the task array is too large. Also try reducing EXT buffers for the time being to see if that makes a difference, although likely not as IIRC they use only ~26 bytes in the near heap/buffer

Take a screenshot of your boot screen so we can see how much heap memory is available in the banner line(s). Then also run meminfo to see the heap, when the system is not crashing. These may get us more evidence as to what's happening.

Finally, I have built two images of the system with your calibrate_delay (cleaned up but no change to object code) which both work on my system, optimize calibrate_delay in and no optimize. You might want to try them on your QEMU to see whether they run as that would give us another data point if the above task= doesn't change anything. ELKS.zip

Mellvik commented 3 weeks ago

Thank you @ghaerr - that was revealing indeed: Both ELKS images boot fine on my QEMU - and my first thought was that that makes the problem TLVC only. This however, is not necessarily the right conclusion. We're facing a memory corruption/stack pollution issue at a seemingly fixed memory location. Which is why rearranging code, adding/removing optimization, adding volatile etc. always cause different behaviour without being part of the problem.

This means that (my analysis) a different kernel configuration will place something else at the location where my calibration_delay is, and the problem will not show.

The setting reminds me of the endless chase we had some years ago which turned out to be a boot sector virus modifying fixed locations in memory (DOS memory size vector). Not that I think we have a virus, just a similar problem.

IOW - it's unlikely that the suspected DIV routines have anything to do with it. The dynamic task structures implementation should also be in the clear. The # of tasks is 16 on the emulator image used in all QEMU instances. The physical systems get only the kernel on every test-round, and use their own /bootopts. One of them is set to 20 tasks, the other to 16, and we never push the # of tasks in any of the test.

Somehow the fact that the problem shows up only on these particular (fast and recent) QEMU instances must be the key. Timing is something the trigger or at least the enabler. Apropos - I ran into some trouble with the QEMU brew installation on the very slow machine, so that one is still 'in the works'.

I need to think more about this - and I'd certainly like to hear your thoughts too - before proceeding. Maybe mapping the memory in question - dump before and after like in the case mentioned above, would be a useful idea.

ghaerr commented 3 weeks ago

I would like to learn more about the kind of QEMU problem this seems it could be - can you send me a compressed image of TLVC that always fails (but works on real hardware)? In particular, I really like the version that crashes very early and always at a known spot - the L1 buffer allocation failure. There are good reasons to use that version to do a deep dive.

Also, since the /bootopts are different on your testing QEMU that fails and real hardware, the differences there would be nice to know. I am also looking for the heap value displayed in the banner at startup.

I'll try running your failed QEMU image over here and see what happens.

ghaerr commented 3 weeks ago

More ideas:

which turned out to be a boot sector virus modifying fixed locations in memory (DOS memory size vector).

The divide-by-zero vector was very recently changed, which also changes the low core 0:0 location. Perhaps comment out the actual set of that vector in irq.c: int_handler_add(IDX_DIVZERO, 0x00, _irqit).

Other things that might change memory locations are the DF driver reading a stray sector somehow. Can you still turn off the DF driver and substitute BIOS, or an HD boot.

I'm not sure whether we're dealing with a code overwrite or data segment overwrite (or neither). Console the kernel's system.map file to see where your routines are - IIRC you'll find the non-INITSEG sections of the kernel init code in low memory. This could be an issue if that region is being overwritten.

Check the memory size given to QEMU on startup, vary that and see what happens.

Mellvik commented 3 weeks ago

@ghaerr, it turns out my report on the consequences of adding two volatile declarations was exaggerated. Nevertheless, they're interesting. Here's the boot sequence when those two are added, look closely at the last few lines: They were the only lines I saw because the HEAP: no memory was repeated infinitely.

Direct console, scan kbd 80x25 emulating ANSI (3 virtual consoles)
ttyS0 at 0x3f8, irq 4 is a 16550A
24 ext buffers (24576B ram), 10 L1 buffers, 10240B ram
eth: ne0 ignored
eth: wd0 at 0x280, irq 2, ram 0xcc00 not found
eth: ee0 at 0x360, irq 11, (ee16) (bad EEPROM xsum 0xffc0)
ee0: bad memory size (0k).
 not found
ath0: AT/IDE controller at 0x1f0
athd0: IDE CHS: 1006/16/63 serial# MQ0000 1            
athd0: Multisector I/O, max 16 sects
athd1: IDE CHS: 1006/16/63 (from /bootopts) serial# MQ0000 2            
athd1: Multisector I/O, max 16 sects
athd: found 2 hard drives
athd: /dev/dhda: 16 heads, 1006 cylinders, 63 sectors (~507MB)
athd: /dev/dhdb: 16 heads, 1006 cylinders, 63 sectors (~507MB)
Floppy drive(s) [CMOS]: df0 is 1.44M (4)6, df1 is 1.2M (2)2
df: Direct floppy driver, FDC (82077) @ irq 6, DMA 2, DIR active
Partitions: dhda:(0,1014048)  dhda1:(63,880929)  dhda2:(880992,134064) 
 dhdb:(0,1014048)  dhdb1:(63,1013985) 
PC/AT class machine, syscaps 0xff, 639K base ram, 16 tasks.
TLVC 0.6.0 (64240 text, 16288 ftext, 12224 data, 16944 bss, 36366 heap)
Kernel text 2d0:0, ftext 127f:0, init 1436:0, data 1679:0, top 9fc0:0, 494K free
Direct 34 $$Total L2 buffers inuse 28492/24 (24 free), 10k L1 (map 250 unmap 53504 remap 19456)
            ^-------- this is where it goes off the deep end.
Direct console, scan kbd 80x25 emulating ANSI (3 virtual consoles)    <---- this was the top line on my screen when I misread it :-)
ttyS0 at 0x3f8, irq 4 is a 16550A
24 ext buffers (24576B ram), 10 L1 buffers, 10240B ram
HEAP: no memory (10240 bytes)

panic: No buf mem
SYSTEM HALTED - Press CTRL-ALT-DEL to reboot:

Interesting but not sensational. What may be even more interesting is that the first '#if 0' was added recently, but didn't change much. It should because its presence changes the addresses of the rest significantly. I'll have a close look at that too tomorrow. Not much screen time today.

BTW removing the DIV0 vector didn't change anything, it should be in the clear.

ghaerr commented 3 weeks ago

Oh now I see what might be happening, thanks for the boot screenshot - the kernel seems to be restarting from CS:0, or something close to that. A near ret with a small value would do that. I had thought it was the initial heap_alloc that was failing, what is happening is that the kernel is restarting and the heap_alloc is failing a second time - as it should.

What is happening is that right after the system goes bad, it displays the "Total L2 buffers..." message which is in list_buffer_status, then proceeds to return from that function into possibly CS:0 where it looks like the kernel is restarting from CS:_start (CS:0).

My kernel doesn't have the debug turned on so list_buffer_status isn't included, but check your system.map file - I'll bet its at a low IP address.

I have some ideas which I'll post, including turning off interrupts with cli as the first thing in either setup.S or crt0.S. I will post them after looking at this a bit more. I am wondering whether your QEMU might be running a slightly different BIOS which has interrupts enabled at boot for some reason. ELKS/TLVC never actually disables interrupts.

Your initial heap value does look kind of low - I'm wondering whether moving task=8 to test might be interesting as well. I'll post more thoughts in a bit.

I would like to see your .config file and bootopts, as well as this same image, so I can try to duplicate over here.

Mellvik commented 3 weeks ago

Thank you, @ghaerr , 'offscreen' but I can get to the files:

## boot options max 511 bytes
console=ttyS0,57600 debug net=ne0 3 # sercons, multiuser, networking
#root=df0
#QEMU=1                 # to use ftp/ftpd in qemu
#TZ=MDT7
LOCALIP=10.0.2.15
HOSTNAME=tlvc15
ne0=12,0x00,,0x80
wd0=2,0x280,0xCC00,0x80
#3c0=11,0x330,,0x80
ee0=11,0x360,,0x80
le0=9,0x325,,0x80
comirq=,,7,10
cache=10        # L1 cache
bufs=24         # L2 
hdparms=306,4,17,128,614,4,17,-1
netbufs=2,0
#sync=30                # autosync secs
#init=/bin/init 3 n     # multiuser serial no /etc/rc.sys
#init=/bin/sh           # singleuser shell

fd1440.img.zip

config.zip

Mellvik commented 3 weeks ago

System.map excerpt

00010000 T _start
00010031 T int3
00010033 T early_putchar
00010042 t init_task
00010048 T testloop
00010073 T u2delay
00010083 T calibrate_delay
0001011a T start_kernel
0001016c t put_last_lru
000101b6 t find_buffer
000101dd T list_buffer_status
00010312 T wait_on_buffer
0001034a t sync_buffers
00010396 T lock_buffer
000103aa T unlock_buffer

ghaerr commented 3 weeks ago

Ok, talk about a wild rabbit hole... this has been one!!! Spent way too long on this, but good things happen to those who work hard, right!?

I took the main.c you gave me a while back for which I copied the calibrate_delay, as discussed above. I could not get this to fail with my QEMU v5.2.0. I then decided to rebuild my ELKS with your config file, from which I found they're not really compatible (we knew that) but then ran into a bunch of ELKS linking issues from even the compatible config options that were or were not set. Long story, these issues I placed in the work push-stack for later.

Having finally got an ELKS build going with close to your config, I tried again: no fail with QEMU.

Then, I decided to copy your /bootopts over and run that with my ELKS build. Crash! Except the crash was resulting in something we hadn't ever seen: /bin/init was declared missing, and the single user shell was exec'd instead. Very strange. After way too much fooling around, I found that, with your bootopts, a major data overwrite was occurring in both our main.c argv_init and envp_init variables, in THREE different places! Wow, the rabbit was cornered.

This problem is brought to a head in /bootopts when either 1) more than 8 arguments are tried to be passed to /bin/init, or 2) more than 8 environment variables (e.g. any x=y that is not an otherwise parsed /bootopts variable setting), or 3) the sum total of all the argv vector, the envp vector, and all the strings is greater than 80 words (i.e. the static char *argv_init[80] is used as both an argv array and a temp array for the stack bytes passed to run_init_process_sptr).

Sounds complicated, and it kind of is. So I fixed it all (see following patch) and then took your TLVC image and changed up /bootopts (since QEMU wasn't crashing) and changed it such that it should crash given what I now knew was happening. No crash.

I then realized this was a little monster rabbit, so brought out the big guns, and a second laptop running macOS Ventura. I decided to pull down and build QEMU 9.2, which had its own share of problems, but finally built. Then tried building ELKS, but the ELKS source tree was 2 years old and GCC had to be rebuilt... which took forever with its own problems: makeinfo was somehow missing from macOS and had to be copied from my Catalina copy, then after a full half hour of GCC building, the build complained that the multilibs were out of date and a bunch of directories had to be manually removed and GCC rebuilt from scratch, again. (Hint: just mv cross cross.org and rebuild from total scratch, much easier).

Finally, a new QEMU, new GCC, new ELKS, time to run your TLVC image. I had -accel hvf in the QEMU command line. No crash. I removed -accel hvf on a whim and ran the TLVC image: Crash! Exactly duplicated what you're seeing!!

The bad news is the rabbit is still at large, kind of. I couldn't get it to not crash, but I did start seeing the crash operates a bit differently sometimes, depending on what's in /bootopts. Of course, I had found a major /bootopts main.c variable overwrite bug, but am not able to rebuild TLVC (more on that below). So I don't actually know what the problem is.

However, I did start looking very carefully at your main.c, and found some things that are goofed up. I then pulled down a full TLVC repo, and confirmed that things are screwy: basically, the main.c that you gave me can't compile using any of the branches in the TLVC repo. That's because ETH_LANCE seems to be entirely missing from the source tree. I would assume that ETH_LANCE is defined to be 4, because it would come directly after ETH_EE16, which is 3.

The problem is, your tree and main.c use ETH_LANCE to parse the le0= /bootopts option using a call to arse_nic(line+4, &netif_parms[ETH_LANCE]). But the declaration of netif_parms is struct netif_parms netif_parms[MAX_ETHS], and in the tree, MAX_ETHS is 4 (unless you've changed it somewhere else). This means that the ETH_LANCE initialization is out of bounds for the netif_parms array. What I'm thinking is that if all this is the case and you forgot to increase MAX_ETHS, then the parsing of le0= in /bootopts would cause main.c variable overwrites, exactly like what I found with argv/envp processing, except for overrunning the netif_parms[] array. This would trash portions the data segment. You would need to look at system.map (I don't have a copy) in order to know which variables are where, which is very informative.

So - my hunch is that your source tree is possibly out of sync with the repo, and that could be causing these kinds of data overwriting very early on in the kernel startup. My advice is to put everything in a branch, then pull down a fresh clone somewhere else and do a build - as I mentioned ETH_LANCE is undefined so the main.c you gave me won't compile.

I then also think this is not a QEMU bug per-se, but instead something quite strange with the way acceleration works with JIT enabled vs not, etc. While still very strange, I am starting to think that the issue is TLVC (or further ELKS) startup bugs, not QEMU. A way to test this, other than making sure the repo is cleaned up for a fresh build, would be to see that, interestingly enough, opts = parse_options() is called in early_kernel_init(), giving exactly the chance to screw up main.c's data before even the task array is allocated in kernel_init, and before calibrate_delay is even called.

If the above doesn't fix the problem, then I also suggest commenting out the init (probe) routines for all of the ethernet drivers, and also adding a CLI as the first instruction after _start in tlvc/arch/i86/boot/crt0.S. This is because I can't find a guarantee that the boot code nor setup nor the kernel actually disables interrupts during init.

Here's the bug fix for the /bootopts argv/envp overwrite problem:

diff --git a/elks/init/main.c b/elks/init/main.c
index 9c1b53a8..5cf2f4bc 100644
--- a/elks/init/main.c
+++ b/elks/init/main.c
@@ -29,8 +29,15 @@

 #include <linuxmt/debug.h>

-#define MAX_INIT_ARGS  8
-#define MAX_INIT_ENVS  8
+#define MAX_INIT_ARGS  6       /* max # arguments to /bin/init or init= program */
+#define MAX_INIT_ENVS  12      /* max # environ variables passed to /bin/init */
+#define MAX_INIT_SLEN   80      /* max # words of args + environ passed to /bin/init */
+
+#define STR(x)          __STRING(x)
+/* bootopts error message are duplicated below so static here for space savings */
+char errmsg_initargs[] = "bootopts: init args > " STR(MAX_INIT_ARGS);
+char errmsg_initenvs[] = "bootopts: env vars > " STR(MAX_INIT_ENVS);
+char errmsg_initslen[] = "bootopts: init args+env > " STR(MAX_INIT_SLEN) " words";

 int root_mountflags;
 struct netif_parms netif_parms[MAX_ETHS] = {
@@ -108,7 +115,7 @@ void testloop(unsigned timer)
 }
 #endif

-/* this procedure called using temp stack then switched, no temp vars allowed */
+/* this procedure called using temp stack then switched, no local vars allowed */
 void start_kernel(void)
 {
     early_kernel_init();        /* read bootopts using kernel interrupt stack */
@@ -161,6 +168,9 @@ static void INITPROC early_kernel_init(void)
 #endif
 }

+unsigned sys_dly_index;
+unsigned sys_dly_base;
+void calibrate_delay(void);
 void INITPROC kernel_init(void)
 {
     /* set us (the current stack) to be idle task #0*/
@@ -212,6 +222,7 @@ void INITPROC kernel_init(void)
 #endif

     kernel_banner(membase, memend, s, e - s);
+    calibrate_delay();
 }

 static void INITPROC kernel_banner(seg_t start, seg_t end, seg_t init, seg_t extra)
@@ -559,13 +570,13 @@ static int INITPROC parse_options(void)
                 */
                if (!strchr(line,'=')) {    /* no '=' means init argument*/
                        if (args >= MAX_INIT_ARGS)
-                               break;
+                               panic(errmsg_initargs);
                        argv_init[args++] = line;
                }
 #if ENV
                else {
                        if (envs >= MAX_INIT_ENVS)
-                               break;
+                               panic(errmsg_initenvs);
                        envp_init[envs++] = line;
                }
 #endif
@@ -578,11 +589,15 @@ static void INITPROC finalize_options(void)
 {
        int i;

+#if ENV
        /* set ROOTDEV environment variable for rc.sys fsck*/
        if (envs < MAX_INIT_ENVS)
                envp_init[envs++] = root_dev_name(ROOT_DEV);
        if (running_qemu && envs < MAX_INIT_ENVS)
                envp_init[envs++] = (char *)"QEMU=1";
+       if (envs >= MAX_INIT_ENVS)
+               panic(errmsg_initenvs);
+#endif

 #if DEBUG
        printk("args: ");
@@ -624,6 +639,8 @@ static void INITPROC finalize_options(void)
 #endif
        /*argv_init[args+2+envs] = NULL;*/
        argv_slen = q - (char *)argv_init;
+       if (argv_slen > sizeof(argv_init))
+               panic(errmsg_initslen);
 }

 /* return whitespace-delimited string*/

In summary:

QEMU 9.2 with -accel hvf works, crash duplicated without -accel hvf
QEMU operates quite different with and without JIT caching, very hard to determine what its actually doing
add /bootopts bug fix
clean up repo source and check MAX_ETHS >= 5
remove opts = parse_options(); from early_kernel_init if the above doesn't fix problem
possibly comment out call to eth_init in drivers/char/init.c
add cli as first instruction executed after _start in crt0.S

Mellvik commented 3 weeks ago

Wow, @ghaerr - that's a wild ride indeed. Much of it all too familiar - but the good thing that hopefully makes it worth it is that a number of thus far unknown bugs have been discovered and fixed.

This is great. About the time spent - yeah, sometimes it feels completely useless. So much time, so little progress. Except neither of us seem capable of letting go. The rabbit just have to be found, otherwise it is bound to come back and hit when least expected. So all and all, this is good. I must have been at this one on and off more more than two weeks now. It would be really good to put it at rest. Seems we're getting close.

BTW - thanks for the main.c fixes. They balance at least some of the effort, right?

Finally, a new QEMU, new GCC, new ELKS, time to run your TLVC image. I had -accel hvf in the QEMU command line. No crash. I removed -accel hvf on a whim and ran the TLVC image: Crash! Exactly duplicated what you're seeing!!

Very interesting indeed. And GREAT! Finally we have a setting where you can duplicate the problem. That simplifies the process dramatically. And it's interesting for an entirely different reason: Your QEMU doesn't barf at the hvf accellerator, like mine does. So I went back to the Intel system on which I had duplicated the problem and tried again. Still barfing:

qemu-system-i386: -accel hvf: invalid accelerator hvf

Turns out the clue is in the message above. I've intentionally been running qemu-system-i386 while the ELKS qemu.sh script prefers qemu-system-x86_64. Changing that back to x64 I find that -accle hvf is accepted. Apparently, the two command lines are more different than I had anticipated. Given this experience, it seems the -i386 command line is good for ELKS and TLVC. According to the docs, the key difference is 64 vs 32 bit emulation.Further, on the M1 (and later M-series processors) there wouldn't be a hvf accelerator option anyway.

This 'discovery' still does not answer the question about whether speed or version (or if the host OS version) has anything to do with it. On my faster Intel Mac w/QEMU 9, the x64 command line fails just like yours. On the slower MacMini w QEMU v7, it does not (no difference when -accel is removed). So the jury is still out on that (at least until I get QEMU9 running on the old MacBook).

The bad news is the rabbit is still at large, kind of. I couldn't get it to not crash, but I did start seeing the crash operates a bit differently sometimes, depending on what's in /bootopts. Of course, I had found a major /bootopts main.c variable overwrite bug, but am not able to rebuild TLVC (more on that below). So I don't actually know what the problem is.

However, I did start looking very carefully at your main.c, and found some things that are goofed up. I then pulled down a full TLVC repo, and confirmed that things are screwy: basically, the main.c that you gave me can't compile using any of the branches in the TLVC repo. That's because ETH_LANCE seems to be entirely missing from the source tree. I would assume that ETH_LANCE is defined to be 4, because it would come directly after ETH_EE16, which is 3.

That's right, I've been sloppy (lazy is more honest) about including some of the changes in recent commits - too many interrelated projects, and jumping between them became - well, complicated. Anyway, the real problem is I forgot to check in the updated config.in which excludes all the LANCE stuff. In the system I've been running, MAX_ETHS has been correct all the way. Apologies for that one.

So - my hunch is that your source tree is possibly out of sync with the repo, and that could be causing these kinds of data overwriting very early on in the kernel startup. My advice is to put everything in a branch, then pull down a fresh clone somewhere else and do a build - as I mentioned ETH_LANCE is undefined so the main.c you gave me won't compile.

It is definitely out of sync, thanks for the push. It's overdue, and admittedly it may reveal a problem other than the MAX_ETHS.

I then also think this is not a QEMU bug per-se, but instead something quite strange with the way acceleration works with JIT enabled vs not, etc. While still very strange, I am starting to think that the issue is TLVC (or further ELKS) startup bugs, not QEMU. A way to test this, other than making sure the repo is cleaned up for a fresh build, would be to see that, interestingly enough, opts = parse_options() is called in early_kernel_init(), giving exactly the chance to screw up main.c's data before even the task array is allocated in kernel_init, and before calibrate_delay is even called.

It's an interesting thought even though it's hard to make a connection to what speed may have to do with it. So let's check it out. BTW my 'trick' when suspecting init or bootopts is renaming the latter to get it out of the way. I haven't done that in this case. Will test that.

In summary:

QEMU 9.2 with -accel hvf works, crash duplicated without -accel hvf

QEMU operates quite different with and without JIT caching, very hard to determine what its actually doing

add /bootopts bug fix

clean up repo source and check MAX_ETHS >= 5

remove opts = parse_options(); from early_kernel_init if the above doesn't fix problem

possibly comment out call to eth_init in drivers/char/init.c

add cli as first instruction executed after _start in crt0.S

Great summary, I'll be back on that.

Thanks.

Mellvik commented 3 weeks ago

+#if ENV
       /* set ROOTDEV environment variable for rc.sys fsck*/
       if (envs < MAX_INIT_ENVS)
               envp_init[envs++] = root_dev_name(ROOT_DEV);
       if (running_qemu && envs < MAX_INIT_ENVS)
               envp_init[envs++] = (char *)"QEMU=1";
+       if (envs >= MAX_INIT_ENVS)
+               panic(errmsg_initenvs);
+#endif

It seems to me the last if will become true and panic if one of the above ifs add an entry to the last legitimate spot in the env_init table. I do not however, see an easy fix to that...

ghaerr commented 2 weeks ago

It seems to me the last if will become true and panic if one of the above ifs add an entry to the last legitimate spot

Yep, caught that one before I committed my fix.

I do not however, see an easy fix to that...

Here's the fix for that: https://github.com/ghaerr/elks/pull/1976/commits/ed269046fa05b5ca68bf57282b62692f419b4ec3.

A few other changes were made to the final PR, thanks!

Mellvik commented 2 weeks ago

@ghaerr, I just managed to accidentally kill the Safari window with the notes from the last two days of testing, but the bottom line is that none of the changes on the table have had much I impact on the problem, except changing the behaviour slightly. Disabling eth_init() had immediate effect in that the problem went away, but again it was only changing addresses relative to where the problem is thus bypassing it as we've seen before. Possibly useful to pinpoint the address we're looking for - I'm still on that.

ghaerr commented 2 weeks ago

Disabling eth_init() had immediate effect in that the problem went away

Wow, looking at system.map I see that chr_dev_init where eth_init was commented out is at a high address, so this should have nothing to do with the code being executed at start_kernel and calibrate_delay, which are at very low addresses.

but again it was only changing addresses relative to where the problem is thus bypassing it as we've seen before.

How is it you can make that determination? Given that eth_init and all the driver probe code is executed prior to calibrate_delay, I would say this should not be an assumption. Instead, I suggest going putting eth_init back in and commenting out each of the network driver init routines one by one to see what happens. Doing this in the INITSEG segment (which is within eth_init) rather than within the driver code itself (which might be in the lower near kernel CS segment) will help not move around memory contents too much so the area of code causing the problem can be more easily found.

For some reason I had gotten confused thinking that only those items very early in the kernel init process were possible problem areas for tracking this bug down. But now after looking at boot screen, I see almost the entirety of the kernel has been initialized before the problem shows up in calibrate_delay. Since you've been working on the network code recently, it is highly suspect as a possible contributor. It would be possible to move the calibrate_delay routine around, but I think that risks the much bigger problem of the problem just going away and not being reproducible - it is entirely possible this bug has been around for quite some time and could be in any of the init routines prior to kernel banner, just just the NIC routines. Thus sticking in returns in various INITPROC functions might yield good results while continuing solid reproducibility.

I also suggest looking at all issues with system.map in hand, so that we move from symbolic to numeric analysis, so to speak.

Mellvik commented 2 weeks ago

but again it was only changing addresses relative to where the problem is thus bypassing it as we've seen before.

How is it you can make that determination?

I can make that determination with 100% certainty by doing what I suppose most coders would do in this situation: commenting in and out every single line of eth_init(), moving them around and analyzing the results.

Admittedly I was elated at first. It seemed likely I'd have the rabbit cornered. Then - some 30 rounds, maybe 50, later all driver init calls could produce the error when placed in the right spot/sequence in the eth_init() function. No cigar. Whether this exercise may be useful to locate the rabbit or getting closer to its whereabouts, is a different story.

ghaerr commented 2 weeks ago

I can make that determination with 100% certainty

Dang! Well I was hoping. But this is one helluva rabbit. I'm going to analyze the calibrate_delay routine code generate with optimization on to see whether we might be dealing with a compiler code generation issue then.

all driver init calls could produce the error when placed in the right spot/sequence in the eth_init()

You mean that, when eth_init is out, the problem goes away, but adding it back in and taking any one of the individual ee0_init etc routines out still causes the error? What about when eith_init is in and all three NIC init routines are out?

I can't tell for certain what changes are really happening - we need to diff the system.map files between these cases and ensure that the 0x10000 - 0x1FFFF block (.text) of addresses remains unchanged, while only 0x20000+ changes (.fartext).

The above is going on the premise that the low offsets of the CS segment are getting trashed somehow, which seems possibly relevant. But perhaps I'm totally of - as the code in low CS may be all good, just the stack itself is getting overwritten with a low address so we end up there multiple times. This seems more likely, so perhaps we're dealing with a compiler bug after all. The part that doesn't make sense about the compiler bug is that the error is only occurring on a single QEMU version, never on real hardware.

When you get the TLVC repo committed and able to do a "make" which reproduces the problem immediately, with a default .config file and committed bootopts, I may jump in and compile TLVC on my other laptop to see if it reproduces there (as it should), and then try to debug TLVC locally.

Mellvik commented 2 weeks ago

It's easy to forget, I do it all the time, that we're looking for something that is somehow time sensitive. Very high speed apparently changes the speed of things. Somewhat akin to the situation I had in the ee16 driver where the TX completion interrupt happened almost immediately (next codeine) after the final clearance to send was passed to the NIC, even on the 40MHz machine. I know I mentioned this before, and it was an eyeopener. Don't assume things take time, some times - on fast iron - they don't, and it creates havoc if we haven't prepared for it.

ghaerr commented 2 weeks ago

we're looking for something that is somehow time sensitive.

I had in the ee16 driver where the TX completion interrupt happened almost immediately

Yes, good point. So if you can get the problem to remain while NOT executing all the NIC startup routines, so much the better,. We may need to comment out as much of kernel init as possible, then go right into calibrate_delay, then halts, just to catch this bug. Execute as little code as possible while continuing reproducibility. I don't know how much you're able to rearrange the low CS code in order to keep the bug showing, but it would help to move calibrate_delay to way earlier in kernel_init as possible. Just don't lose the ability to duplicate the problem!! (Easier said than done, I'm sure).

Also, when code is commented out, that may be the actual buggy code!

Mellvik commented 2 weeks ago

I can make that determination with 100% certainty

all driver init calls could produce the error when placed in the right spot/sequence in the eth_init()

You mean that, when eth_init is out, the problem goes away, but adding it back in and taking any one of the individual ee0_init etc routines out still causes the error? What about when eith_init is in and all three NIC init routines are out?

Same thing. All 4 commented out and all is good as expected. We could look at system.map but we don't know what to look for yet. So I'm back to calibrate_delay and the DIV routines, there is a tiny hole in there that the rabbit made it into.

I can't tell for certain what changes are really happening - we need to diff the system.map files between these cases and ensure that the 0x10000 - 0x1FFFF block (.text) of addresses remains unchanged, while only 0x20000+ changes (.fartext).

The above is going on the premise that the low offsets of the CS segment are getting trashed somehow, which seems possibly relevant. But perhaps I'm totally of - as the code in low CS may be all good, just the stack itself is getting overwritten with a low address so we end up there multiple times. This seems more likely, so perhaps we're dealing with a compiler bug after all. The part that doesn't make sense about the compiler bug is that the error is only occurring on a single QEMU version, never on real hardware.

That must be the speed issue - discussed separately. BTW, I've had a feeling all along that we have a local unsigned being treated as a long, overwriting the next word on the stack which happens to be the return address. However, getting that to match with the speed sensitivity is not easy.

When you get the TLVC repo committed and able to do a "make" which reproduces the problem immediately, with a default .config file and committed bootopts, I may jump in and compile TLVC on my other laptop to see if it reproduces there (as it should), and then try to debug TLVC locally.

The repo compiles fine now, unless xms is activated. I haven't pushed the .config I'm using, but you could use the one I sent you. The bootopts file should be fine. The low bufs= and cache= are leftovers from our buffer excise this spring, and a reminder that there is still some optimization work do to in that department.

Thanks.

ghaerr commented 2 weeks ago

All 4 commented out and all is good as expected.

I thought what we're looking for is that when the routines are commented out, the system does NOT crash. That means one of the routines might be causing the problem? Or do you mean that "all is good" that the system remains crashing?

I want to be sure I understand what is happening when the NIC routines are commented out - a good result means that the error/crash still occurs when the routines are commented out, and the commented out init calls are ABOVE the kernel_init INITPROC address. We want to ensure that you can still see the crash when each of the NIC routines are commented out, which ensures that the NIC routines have nothing to do with the bug.

We could look at system.map but we don't know what to look for yet.

Look for the address of the routine that you are changing the code in. If it is above the values of the kernel_init code then that means the lower kernel areas are unaffected, meaning less likely to change the continued reproducibility.

I've had a feeling all along that we have a local unsigned being treated as a long, overwriting the next word on the stack which happens to be the return address

I am developing a new theory - that the __divsi3 routine or others modifies a register that is compiler-expected to be saved. That is, the compiler ABI exception for functions. For instance, AX, CX and DX are assumed scratch between functions. IIRC if BX is or not, I think it's also scratch. SI, DI, BP, DS and ES must be saved. So - if a badly written ASM function were called that trashed a register, this would affect something later in time. Same issue if an interrupt handler calls an ASM routine that is non-ABI conformant and trashes a register - it will only be noticed later in time. For interrupts, all register are saved restored to the interrupting process, but any function-to-function call could result in failure, depending on the timing of the code execution, or sometimes the result of an uninitialized stack variable.

I'm disassembling calibrate_delay now, and will post it for observation later. We need to fully understand the optimized version to see if that routine itself is the culprit.

ghaerr commented 2 weeks ago

Here is the result of the analyzed version of optimized calibrate_delay. It looks correct and leaves no stack variable issues that I can see. Especially concerning is the code looks fantastic right up to the supposed failure just before printk("YYY"). I have not analyzed __divsi3 again yet, but that routine is called AFTER the crash just mentioned.

00000015 <calibrate_delay>:
  15:   56                      push   %si
  16:   55                      push   %bp
  17:   89 e5                   mov    %sp,%bp
  19:   1e                      push   %ds              d5 uninitialized
  1a:   1e                      push   %ds              d6 uninitialized
  1b:   c7 06 00 00 00 00       movw   $0x0,0x0         sys_dly_base
  21:   c7 06 00 00 01 00       movw   $0x1,0x0         sys_dly_index
  27:   e8 fe ff                call   28 <calibrate_delay+0x13> get_ptime
  2a:   e8 fe ff                call   2b <calibrate_delay+0x16> get_ptime
  2d:   50                      push   %ax              push d6 for printk
  2e:   89 46 fc                mov    %ax,-0x4(%bp)    =d6
  31:   ba ac 01                mov    $0x1ac,%dx       "Direct %u $$"
  34:   52                      push   %dx              &"Direct %u $$"
  35:   e8 fe ff                call   36 <calibrate_delay+0x21> printk "Direct %u $$",d6
// crash occurs here, on return from printk? Very strange. None of the subsequent code has executed yet
// and very straightforward code above this point.
  38:   ba b9 01                mov    $0x1b9,%dx       "YYY"
  3b:   52                      push   %dx              &"YYY"
  3c:   e8 fe ff                call   3d <calibrate_delay+0x28> printk "YYY"
  3f:   31 f6                   xor    %si,%si
  41:   bb 46 03                mov    $0x346,%bx
  44:   8b 4e fc                mov    -0x4(%bp),%cx    save d6 in CX
  47:   91                      xchg   %ax,%cx
  48:   f7 e3                   mul    %bx
  4a:   89 46 fc                mov    %ax,-0x4(%bp)    save result in long temp
  4d:   89 56 fe                mov    %dx,-0x2(%bp)
  50:   56                      push   %si              SI, CX, DX, AX args to __divsi3
  51:   b9 e8 03                mov    $0x3e8,%cx
  54:   51                      push   %cx
  55:   52                      push   %dx
  56:   50                      push   %ax
  57:   e8 fe ff                call   58 <calibrate_delay+0x43> __divsi3
  5a:   83 c4 08                add    $0x8,%sp         remove args passed to __divsi3
  5d:   50                      push   %ax              push low div result for printk
  5e:   b8 be 01                mov    $0x1be,%ax       "2: Direct %u $$"
  61:   50                      push   %ax
  62:   e8 fe ff                call   63 <calibrate_delay+0x4e> printk "2:...", d5
  65:   56                      push   %si              ARGS for temp?
  66:   56                      push   %si
  67:   b8 ce 01                mov    $0x1ce,%ax       "XXX %lu"
  6a:   50                      push   %ax
  6b:   e8 fe ff                call   6c <calibrate_delay+0x57> printk "XXX...",temp
  6e:   56                      push   %si              d6 or temp?
  6f:   b8 d8 01                mov    $0x1d8,%ax       "New (UL)..."
  72:   50                      push   %ax
  73:   e8 fe ff                call   74 <calibrate_delay+0x5f> printk "New...", d6
  76:   b8 ea 01                mov    $0x1ea,%ax       "\n"
  79:   50                      push   %ax
  7a:   e8 fe ff                call   7b <calibrate_delay+0x66> printk
  7d:   89 ec                   mov    %bp,%sp
  7f:   5d                      pop    %bp
  80:   5e                      pop    %si
  81:   c3                      ret

I am still very concerned about building a reproducible crash with the ethernet init routines commented out. I don't quite yet understand what your testing showed. If you can build a TLVC kernel that crashes with the NIC init routines OUT, then we have ruled that case out, otherwise it is still suspect. I am still thinking that somehow either low CS: code is getting trashed, or a stack location is getting clobbered. The likeliest place for that to happen in this codebase is ANYTHING with ASM in it. This includes NIC drivers and all kernel ASM init code. This is also the possible place for ABI register nonconformance as mentioned above. These are just ideas, I'm trying hard to keep coming up with something!!!

I will also look at get_ptime to see whether the couple of ASM macros could possibly be doing something incorrectly.

Mellvik commented 2 weeks ago

All 4 commented out and all is good as expected.

I thought what we're looking for is that when the routines are commented out, the system does NOT crash. That means one of the routines might be causing the problem? Or do you mean that "all is good" that the system remains crashing?

Sorry, I don't see what's unclear. Commenting out the call or commenting out the contents of the callee deliver the same result, as expected. Normal boot.

I want to be sure I understand what is happening when the NIC routines are commented out - a good result means that the error/crash still occurs when the routines are commented out, and the commented out init calls are ABOVE the kernel_init INITPROC address. We want to ensure that you can still see the crash when each of the NIC routines are commented out, which ensures that the NIC routines have nothing to do with the bug.

ok, I see, the definition of 'good' is unclear. Good means boots normally. I'll be more concise in coming postings. :-)

Mellvik commented 2 weeks ago

we're looking for something that is somehow time sensitive.

I had in the ee16 driver where the TX completion interrupt happened almost immediately

Yes, good point. So if you can get the problem to remain while NOT executing all the NIC startup routines, so much the better,. We may need to comment out as much of kernel init as possible, then go right into calibrate_delay, then halts, just to catch this bug. Execute as little code as possible while continuing reproducibility. I don't know how much you're able to rearrange the low CS code in order to keep the bug showing, but it would help to move calibrate_delay to way earlier in kernel_init as possible. Just don't lose the ability to duplicate the problem!! (Easier said than done, I'm sure).

Also, when code is commented out, that may be the actual buggy code!

Good points! Related, and another point I keep reminding myself: in the original setting, the bug would manifest itself as an immediate return from calibrate_delay(), skipping the rest of the routine. The system would then continue to boot normally. Whether it would also run normally or eventually run unto some damaged memory location is unknown. Given what we know today, that experience may have new significance.

I will try to get that situation back tomorrow.

Mellvik commented 2 weeks ago

I am still very concerned about building a reproducible crash with the ethernet init routines commented out. I don't quite yet understand what your testing showed. If you can build a TLVC kernel that crashes with the NIC init routines OUT, then we have ruled that case out, otherwise it is still suspect. I am still thinking that somehow either low CS: code is getting trashed, or a stack location is getting clobbered. The likeliest place for that to happen in this codebase is ANYTHING with ASM in it. This includes NIC drivers and all kernel ASM init code. This is also the possible place for ABI register nonconformance as mentioned above. These are just ideas, I'm trying hard to keep coming up with something!!!

I will also look at get_ptime to see whether the couple of ASM macros could possibly be doing something incorrectly.

I'll continue with this tomorrow. Thank you.

ghaerr commented 2 weeks ago

Something very strange is going on with your build. I have the problem reproduced building TLVC from scratch, and am using your bootopts and your config. BTW, with your default bootopts we are getting envp overflow, not sure why that is right now. And your __STRING macro is incorrect, should be #x using token stringify operator. Your version displays an incorrect error message. So I realized that your bootopts is 510 bytes, very close to the limit. I commented out hdparms=, netbufs= and le0= and the error went away.

But that's not the problem. Continuing to search over at the NIC stuff, just for my own satisfaction, I got to this point:

void INITPROC eth_init(void)
{
    //register_chrdev(ETH_MAJOR, "eth", &eth_fops);

#ifdef CONFIG_ETH_NE2K
    eths[ETH_NE2K].ops = &ne2k_fops; // <- leave this line in, CRASH. Comment out, OK
    //ne2k_drv_init();
#endif
#ifdef CONFIG_ETH_WD
    eths[ETH_WD].ops = &wd_fops;
    //wd_drv_init();
#endif
#ifdef CONFIG_ETH_EL3
    eths[ETH_EL3].ops = &el3_fops;
    //el3_drv_init();
#endif
#ifdef CONFIG_ETH_EE16
    eths[ETH_EE16].ops = &ee16_fops;
    //ee16_drv_init();
#endif
}

So I tracked down eths[] and it looks fine. Everything looks fine.

Then, went into init/main.c and added some `returns debugging some more. Very strange things happening. Then the bug went away entirely and I had to re-edit eth.c differently to get the bug to come back.

Finally, more calibrate_delay() just under console_init(); way prior to where it was previously:

void INITPROC kernel_init(void)
{
    /* set us (the current stack) to be idle task #0 */
    sched_init();
    irq_init();

    /* set console from /bootopts console= or 0=default */
    set_console(boot_console);

    /* init direct, bios or headless console */
    console_init();
    calibrate_delay();

Look what we got (!!!):

The rabbit may be getting cornered... looks like the calibrate_delay routine may? be generating a divide by zero somehow in __divsi3, and then QEMU somehow catching up generating the actual emulated hardware fault at a later time? Not sure. Also perhaps for some reason my latest DIV 0 handler isn't been called correctly later on, not sure. But getting a DIV 0 panic I would say is getting somewhere... we will see!!

Mellvik commented 2 weeks ago

Wow, how interesting. Yes, this is progress- - can smell the rabbit now. But wait, how's that possible? I mean, there is no zero divisor, are we seeing a short-to-long expansion problem after all?

I thinking back to the starting point again, where I found that the 2 requirements for getting the error was the call to get_ptime and subsequently using the result in a long div. Now that you have changed the environment of the 'trigger'-call completely, it may be useful to take another look at the parameters to the div?

also, at this new position, will the same routine trigger a fail in ELKS?

ghaerr commented 2 weeks ago

I've got the damn rabbit, a real SOB. Finally understand what's happening and it's complicated and crazy. Fix coming after I figure out how, but here's what's happening. All these things need to be wrong in order for this to occur: (!!!!)

calibrate_delay is passing zero 0 __divsi3. This is likely the result of an optimization of d5 = (unsigned)((long)d6*838L/1000L); /* this fails */ with the last two constants being optimized to 0. I haven't tried putting parens around the 'd6*838L', but I bet that fixes it. Nonetheless absolutely sure DIV0 happening in __divsi3.
The DIV in __divsi3 is generating an INT 0 hardware divide by zero interrupt (fault).
do_irq catches the interrupt, but there's a bug in my irq pre-handler routine where do_irq thinks its IRQ 0 and doesn't call the div0_handler (its probably calling the timer IRQ handler instead). Unsure, as later on after kernel init this was tested and works on ELKS. Could be because only partial port of DIV0 handler code, still checking.
(the good part>): The 8086 and 80386 handle the DIV0 fault differently: on 8086, the pushed CS:IP is after the DIV instruction, while on the 286+, the CS:IP is not incremented and still points to the DIV instruction.
Upon (buggy) interrupt handler return, the registers are restored and the DIV instruction is executed again. This is probably why QEMU faults and real hardware doesn't. But it could be a speed issue. Nonetheless, on QEMU, debugging shows we get a continuous stream of INT 0's.
QEMU 5.2, 7.0 and 9.2 operate differently in this regard, as well as operating differently with regards to whether HW acceleration is on
The screenshot below shows that, even when normally running, the business of generating hardware faults is OUT OF SYNC with code execution, somehow!!!!????? I forced a call to div0_handler and immediate panic, and it is showing that AFTER the panic, we get ANOTHER DIV0 fault!! This will be a good reason to clear interrupts on panic, because the interrupt system continues to run even when kernel is "halted". I am thinking this is async nature of fault generation might be what is causing the unexplainable stack overflow. But screenshot shows two faults, that should never happen, as the kernel is supposed to be in panic and for(;;) halt()).
My div0_handler has a problem: it was coded to do different things depending on whether we're in kernel mode or user mode. In kernel mode, it panics, but in user mode, it sends a signal to the "current" process. The problem is, that early in the kernel init, the logic is wrong and the handler thinks we're in user mode, even though we're really in the idle task executing on a kernel stack but _gint_count == 1, not > 1, as is otherwise normally the case. So no kernel panic occurs, and instead calls sys_kill to current->pid, which is 0, which operates different than expected, etc.

So I've got the rabbit by the ears and he's not escaping, but not sure yet what kind of cage to build...

To be continued!!! Fix coming when ready.

ghaerr commented 2 weeks ago

Update: You're not going to believe this, but the rabbit has, in fact, jumped out of my hands. Not all of the above analysis is correct, turns out I was confusing the divide by zero (INT 0) interrupt with the IRQ 0 interrupt in do_IRQ, and some of what I said is not the case, as it was the real timer interrupt not DIV0.

At present, treating IRQ 0 (not yet INT 0) properly, the system is now hanging in the calibrate_delay routine. This may be the result of a continuous cycle of INT 0 DIV instructions being reset to the original DIV, not sure yet. Can't seem to catch the INT 0 interrupt.

Stay tuned.

ghaerr commented 2 weeks ago

Ok - this has been one giant bug hunt, but I think finally I have what's happening identified, fixed, and double-checked. It has been a long day! :)

First, I think now that there is NOT a divide by zero happening in calibrate_delay, even though another developer just reported a divide-by-zero fault on an unrelated issue (see below).

Very long story short, the problem was found to be in the GCC library embedded under cross/* in the __divsi3 routine. This is the newer routine which lives in cross/build/gcc-src/libgcc/config/ia16/ia16-divmodsi.h, and the associated divsi3.S. This newer routine uses DIV instructions and is not to be confused with either the older kernel routine in arch/i86/lib/unused/divmodsi3.s or the C library routine of the same name (which matches the older unused kernel routine and does not use DIV instructions) in libc/gcc/divmodsi3.s.

Yes - if it sounds confusing, it is. The kernel and C library use different __divsi3 and associated unsigned and mod routines than does the GCC library. The GCC library routines in cross/build/gcc-src/libgcc/config/ia16 support cross-model compilations for small vs medium model, as well as __cdecl, __stdcall, and __regparm C calling sequences.

ELKS and TLVC build the kernel in small model, and use INITPROC or other far text specification segment naming to place specific functions into a named fartext segment, rather than using medium model, which would compile all functions in medium model and sort them willy-nilly.

The actual problem appears to be very basic: the GCC library routine for __divsi3 seems to have been assembled for medium model only, not small model. I can't fully prove the problem without literally rebuilding GCC which takes a half-hour each time, for which I have no time. What is happening is that when calibrate_delay needs a long divide, it calls __divsi3 which ends up using lret rather than ret for returning. Both instructions pop IP from top of stack, but LRET pop CS as well. So the system effectively crashes on return from __divsi3.

I have pulled the source for all the div/mod/signed/unsigned routines from the GCC source tree and put them in the TLVC/ELKS tree, and built them properly for small model - and everything works. Then I purposely built our copy of the routines specifying a medium model build - and we get varied results, none working: either the same hang in calibrate_delay after the first printk as described in this issue, or sometimes a "Divide by Zero" fault, very interestingly described the same here with a seemingly unrelated issue on ELKS. Here's the screenshot for the Divide by Zero startup problem:

BTW, it is not inconceivable that the BIOS itself could be generating a divide-by-zero fault, for which we now catch and panic, or our kernel could have other DIV0 faults that we didn't previously notice. So adding a DIV0 fault handler other than the BIOS default could be risky when using BIOS calls. Something to think about.

As to why this seems to always work on real hardware, I don't get it (yet). Since the entire problem is quite time and code-placement sensitive, it could well be this is not the problem but everything is working because the location of the long div routine has changed.

However, it isn't proven that the problem is just an LRET vs RET. But when I modify the copied routines and use an LRET instead of RET, I get the exact same hang experienced by this issue:

Finally, there are some other issues I've run into chasing this rabbit: the kernel itself is linked using various *.a archives of kernel .o files in the final link line. This means that it is possible, depending on what has been modified and how the ia16-elf-ar works, that the finally-linked kernel may or may be in the exact same order of routines, depending on the order of the .o files in the .a archives. This could be why certain times the kernel acts differently on rebuilds not using a full "make kclean".

Another issue is that the kernel itself is never built exactly the same, as evidenced by running cmp on one version and another built with no changes. I think its the date being embedded into its, but haven't had time to check.

I'm running out of time to post the updated files, but will do so ASAP. After using our own copies of the new DIV routines, I can't get it to fail.

ghaerr commented 2 weeks ago

The actual problem appears to be very basic: the GCC library routine for __divsi3 seems to have been assembled for medium model only, not small model.

I can't fully prove the problem

I finally realized the way to "prove" that the __divsi3 routine was in fact assembled using the wrong model was to disassemble the routine in a running kernel. In order to do that, I left calibrate_delay with the crash code calling __divsi3, but then didn't actually call calibrate_delay. Then, I looked at the system.map file and found the locations for calibrate_delay and __udivsi3:

...
000100d5 T calibrate_delay
...
0001fa17 t __divsi3

Given that the startup banner mentions that the text segment starts at 2d0, I then used my nifty disasm program from the ELKS debug library to disassemble the running kernel:

tlvc15# ./disasm 2d0:fa17
Disassembly of fa17:
02d0:fa17  89 e3             mov     %sp,%bx
02d0:fa19  8b 47 02          mov     0x02(%bx),%ax
02d0:fa1c  8b 57 04          mov     0x04(%bx),%dx
02d0:fa1f  8b 4f 08          mov     0x08(%bx),%cx
02d0:fa22  8b 5f 06          mov     0x06(%bx),%bx
02d0:fa25  e9 16 00          jmp     fa3e // fa3e

tlvc15# ./disasm 2d0:d5#100
Disassembly of 00d5:
02d0:00d5  56                push    %si
02d0:00d6  55                push    %bp
02d0:00d7  89 e5             mov     %sp,%bp
02d0:00d9  83 ec 08          sub     $0x8,%sp
02d0:00dc  c7 06 32 43 00 00 movw    $0x0,(4332)
02d0:00e2  c7 06 34 43 01 00 movw    $0x1,(4334)
02d0:00e8  c7 46 fe 21 00    movw    $0x21,-0x02(%bp)
02d0:00ed  c7 46 fc 32 00    movw    $0x32,-0x04(%bp)
02d0:00f2  e8 28 83          call    841d // 841d
02d0:00f5  e8 25 83          call    841d // 841d
02d0:00f8  89 46 fe          mov     %ax,-0x02(%bp)
02d0:00fb  8b 46 fe          mov     -0x02(%bp),%ax
02d0:00fe  50                push    %ax
02d0:00ff  b8 20 02          mov     $0x220,%ax
02d0:0102  50                push    %ax
02d0:0103  e8 e8 2f          call    30ee // 30ee
02d0:0106  8b 4e fe          mov     -0x02(%bp),%cx
02d0:0109  bb 46 03          mov     $0x346,%bx
02d0:010c  91                xchg    %ax,%cx
02d0:010d  f7 e3             mul     %bx
02d0:010f  89 46 f8          mov     %ax,-0x08(%bp)
02d0:0112  89 56 fa          mov     %dx,-0x06(%bp)
02d0:0115  31 f6             xor     %si,%si
02d0:0117  56                push    %si
02d0:0118  b8 0a 00          mov     $0xa,%ax
02d0:011b  50                push    %ax
02d0:011c  52                push    %dx
02d0:011d  ff 76 f8          push    -0x08(%bp)
02d0:0120  e8 f4 f8          call    fa17 // fa17
02d0:0123  83 c4 08          add     $0x8,%sp

In the last listing, it can be seen that a near call is made to fa17 (__divsi3), and in the first listing it can be seen that that routine is in the near text section, and it is NOT assembled for far text!!! Thus, my theory in the last comment is in fact proved incorrect.

What this means is that, despite a very long day of chasing this down, it appears that the routines are in fact correct. After all of this, I can reach no other conclusion than what @Mellvik thought in the first place - this is a QEMU bug in v9.2 (at least) when NOT running in hardware acceleration mode. Both @Mellvik and I are running on macOS, not Linux, so the QEMU acceleration options are different.

Thus, and I have to admit to completely have run out of ideas and have thoroughly checked the compiler libraries, the ELKS and TLVC kernel, and a lot more is that on macOS x86, one has to use the following option when running v9.2:

qemu-system-x86_64 -accel hvf ...

This option is included in ELKS qemu.sh as #MACOS="-accel hvf" and commented out. Because the option is invalid on Linux, we will need a better shell script to automatically add this option for macOS.

This entire run-around being a QEMU bug puts to rest the long story of this issue: this same code always runs on real hardware, slower hosts, and QEMU v5.2 without problems. But when QEMU v9.2 is run on macOS without -accel hvf, the system may crash due to stack corruption that looks amazingly similar to executing an LRET instead of a RET in certain circumstances. I have no idea why!

I notice the QEMU BIOS is quite a few revs forward with QEMU v9.2, but this likely isn't a BIOS emulation issue, given that it fails only when the QEMU internal (likely original) x86 code generator is emulating the system, rather than the direct-in-hardware emulation that I believe hvf is.

@vkoskiv, you might try running your QEMU with and without an acceleration option to see whether that changes https://github.com/ghaerr/elks/pull/1977 and your new divide-by-zero problem. We'd like to know which version of QEMU you're running as well as what OS you're running. I've tested the kernel divide-by-zero pretty well, and you shouldn't be getting a DIV0 unless the emulated or real hardware BIOS actually generates one. That's why I believe your issue is likely a QEMU bug as well, since I've seen a divide-by-zero in this rabbit chase.

Mellvik commented 2 weeks ago

Again a big WOW, @ghaerr. What an incredible journey. And a rabbit slippery enough to wiggle itself out of a good grip by the ears, that's a rarity indeed. And it still being at large ... unique.

Regardless of the final outcome, quite a few bugs have been cornered and fixed along the way. And new perspectives on a number of things. For example, I'd like to verify the finding that sequential builds of the same kernel may in fact be different. If this is the case, my method of checking which kernel I'm using on different machines (using sum) is in fact useless (cmp would be better). I use it quite a bit to track what I did last on each of four physical machines.

Back to the issue at hand, I agree with your conclusion - that we're likely facing a QEMU problem - possibly a bug. The symptoms/experiences are confusing though. Here's a summary:

Ideally, we should all be using qemu-system-i386 which emulates the 32 bit Intel architecture which is closer to our real hardware target than the x64 version and platform independent: Identical emulation across host processor architectures.
However, this variant does not have the hardware accelerator option (-accel hvf) which seems to alleviate the problem on Intel-based systems running 64bit processors.
The problem is not present on older version of QEMU. V7.0 is the oldest find to be 'safe' in that regard. However, speed is still a joker: The pre-7.2 versions tested are also the slowest in the pack (7.2 is running on my Apple M1 and fails). The fact that -accel hvf helps when available indicates that speed has nothing to do with it. After all this option is there to speed up emulation, right? Using hardware acceleration introduces other 'variables' to the equation too, like caching, which QEMU doesn't do. Like one contributor pointed out on a forum (in 2020):
```
QEMU does not simulate timing of execution. It doesn't know anything about CPU caches etc., 
so it can't be accurate even if it is desirable. It simply executes guest code, as fast as it can. 
No acceleration or deceleration.
```

The responsible ( :-) ) thing to do if our conclusion is that a bug was introduced in the x86 emulator between 7.0 and 7.2 would be to create an issue on the right issue tracker. The question is, do we have a sequence of instructions that definitively will trigger the problem? It seems to me the answer is no so the real question becomes 'how close are we' (that would be you, @ghaerr)? Not that I'm all that optimistic about getting a response, the load of new issues (user space) seems to be > 20 per day ... but saying nothing after having spent all this time on the issue seems like - well, no good.

Anyway - the conclusion - It may seem the rabbit escaped over to the QEMU repository where there is many more places to hide (and I'm hosed since I cannot cure the issue by adding an option to QEMU :-( ).

Enough curiosity remains though that I'll keep the effort to get QEMU 9.0.2 running on my old 1.3MHz MacBook going. It would be good to know for sure that speed has nothing to do with it.

Regardless, the journey has left ELKS and TLVC 'safer' and thus better 'places'.

ghaerr commented 2 weeks ago

What an incredible journey. And it still being at large ... unique.

Yes, that was one long, somewhat agonizing but interesting day. I just couldn't stop working on the chase. While we still don't know the real reason for the result, I at least don't feel the need to fervently chase it down, since I'm pretty sure its not in our code, and there is a workaround. I'm still curious as to whether there's some kind of connection with DIV instructions, kernel startup code, and whether this kind of thing could occur again, but for me, I'm always going to run -accel hvf to ensure I never see it.

my method of checking which kernel I'm using on different machines (using sum) is in fact useless

I checked this and the date is built into the kernel data UTS_VERSION variable. I'm not sure what to do - this was added to uniquely identify a kernel for user bug reporting, but every kernel build is now different. The project version number is also added in UTS_RELEASE.

we should all be using qemu-system-i386 which emulates the 32 bit Intel architecture which is closer to our real hardware target than the x64 version

I'm not sure about that? The real mode emulation in 32- vs 64- bit Intel processors is suppose to be identical? And I believe our XMS extensions using the shadow limit registers are also operate identically.

However, this variant does not have the hardware accelerator option (-accel hvf)

My take is we have to run with hvf enabled, until more is known about the actual bug. I've zipped up an entire copy of the failing repo for which one day I'll try on Linux just to see whether this is macOS only or not.

V7.0 is the oldest find to be 'safe' in that regard.

That's good to know. Perhaps I'll read the QEMU changelogs to see what occurred after 7.0

Regardless, the journey has left ELKS and TLVC 'safer' and thus better 'places'.

Yes, for sure. And I am truly amazed at your ability to find bugs! :) Given what we know about this issue, how hard it is to reproduce, how the bug is so dependent on the exact sequence of instructions in calibrate_delay, where it is executing in memory, and changing the contents of unrelated memory (like commenting in or out eths[] array members) makes me wonder - what are the odds you could have just ran into that issue?! I'm just as puzzled at that!

I'll be going through my notes on this issue and elsewhere and publishing fixes or other possible enhancements.

Thank you!

Mellvik commented 2 weeks ago

but for me, I'm always going to run -accel hvf to ensure I never see it.

That's the thing - I don't have it, so I'll have to use other workarounds. Also, and going back to where this all started, my attempt at getting a reasonable delay loop calibration fails dramatically on QEMU without the hvf accelerator (actually even on the 386 the variations are 'serious' bordering on useless thanks to (probably) pipelining and caching). The easy fix is to detect QEMU (which is easy) and use the RDTSC instruction, which unfortunately isn't present on the 386. (For 286 and lower the delay calibration works fine.)

my method of checking which kernel I'm using on different machines (using sum) is in fact useless

I checked this and the date is built into the kernel data UTS_VERSION variable. I'm not sure what to do - this was added to uniquely identify a kernel for user bug reporting, but every kernel build is now different. The project version number is also added in UTS_RELEASE.

It's OK, I had forgotten all about that and the reasons for keeping it that way are more than good enough. I just have to keep it in mind.

My take is we have to run with hvf enabled, until more is known about the actual bug. I've zipped up an entire copy of the failing repo for which one day I'll try on Linux just to see whether this is macOS only or not.

I'm ahead of you on that one. :-) QEMU is being updated on a raspberrypi 4 (running raspbian) - in a day or two it might just be up and running (it's a headless machine, I'm working on getting a console screen via the system builtin VNC server). Will let you know how it goes...

V7.0 is the oldest find to be 'safe' in that regard.

That's good to know. Perhaps I'll read the QEMU changelogs to see what occurred after 7.0

Good idea, the problem is definitely present in 7.2.

Regardless, the journey has left ELKS and TLVC 'safer' and thus better 'places'.

Yes, for sure. And I am truly amazed at your ability to find bugs! :)

I'm not sure whether it's a blessing or a curse attracting rabbits like that. Single day projects tend to last for weeks. Alwas behind, and not just a little ... :-)

Given what we know about this issue, how hard it is to reproduce, how the bug is so dependent on the exact sequence of instructions in calibrate_delay, where it is executing in memory, and changing the contents of unrelated memory (like commenting in or out eths[] array members) makes me wonder - what are the odds you could have just ran into that issue?! I'm just as puzzled at that!

That makes it two of us, thanks for helping out!! My head was really sore from banging the wall at the time I posted the issue.

I'll be going through my notes on this issue and elsewhere and publishing fixes or other possible enhancements.

Great, Thanks.

ghaerr commented 2 weeks ago

That's the thing - I don't have it, so I'll have to use other workarounds.

Oh, geez. I see. That's because you're running on M1, right? So this means that you can't in fact run this cut of TLVC on QEMU. That's bad. Perhaps its time for us to find another (fast) emulator, I'd be up for that. That might be worth discussing in a new issue.

I have a previously stacked desire to learn a bit more about how QEMU actually works, I believe the unaccelerated mode is the original mode/design. So we're talking about a likely mainline QEMU bug/issue here. Definitely interested in hearing more about your rPI tests.

use the RDTSC instruction, which unfortunately isn't present on the 386.

Really, when did RDTSC debut? I thought for sure it was in all 386+? That means we need a 386/486+ CPU differentiation algorithm since my plan is to use RTDSC in my ELKS debugger library more extensively.

(For 286 and lower the delay calibration works fine.)

You mean that get_ptime works well on 286, but not 386+? On QEMU only or real hardware?

Now that we both know a hell of a lot more about timing routines and hardware issues, I'm thinking of another get_rtime or something like that which uses and requires RDTSC and whose output can be used with printf/printk. This would only really be useful for very small timings, or what purpose would that serve?

Single day projects tend to last for weeks.

It's tenacity then. Experience shows leaving a problem unknown usually comes back to bite!

Mellvik commented 2 weeks ago

That's the thing - I don't have it, so I'll have to use other workarounds.

Oh, geez. I see. That's because you're running on M1, right? So this means that you can't in fact run this cut of TLVC on QEMU. That's bad. Perhaps it's time for us to find another (fast) emulator, I'd be up for that. That might be worth discussing in a new issue.

That's always an interesting discussion, but I don't think this issue should be the deal breaker. It's simply too esoteric, and I have a workaround (aka hack) already (it's actually in the code you have): If the machine is too fast, use some hardcoded values instead of the calculations. It may turn out to be too far off, but I got distracted by a rabbit :-)

I have a previously stacked desire to learn a bit more about how QEMU actually works, I believe the unaccelerated mode is the original mode/design. So we're talking about a likely mainline QEMU bug/issue here. Definitely interested in hearing more about your rPI tests.

PI test results are in. The problem is there, so the speed question is answered: This has nothing to do with speed. It's version 7.2.50 BTW. So it has to be a QEMU bug.

use the RDTSC instruction, which unfortunately isn't present on the 386.

Really, when did RDTSC debut? I thought for sure it was in all 386+? That means we need a 386/486+ CPU differentiation algorithm since my plan is to use RTDSC in my ELKS debugger library more extensively.

According to the sources I could find (probably Wikipedia IIRC), the RDTSC came with the 486.

(For 286 and lower the delay calibration works fine.)

You mean that get_ptime works well on 286, but not 386+? On QEMU only or real hardware?

'Works well' - well, that depends. What I'm saying is that the results from my calibration are almost on the mark for the 286, too low for the 386 because of the way I do the calibration. 8088 is ok, but less accurate for the opposite reason than the 386, the effect of the clock interrupts (and on a fully running system other interrupts) bite harder. In both cases it's really hard to compensate for the inaccuracies because the real world is, well, inaccurate. OTOH - all (are well within broadly acceptable limits given the (IN)accuracy of the delay loops we have today.

Given the time I ended up spending on this (the calibrate delay 'project') I really regret having started. Then there is all the good we got done ... this is the nature of things.

Now that we both know a hell of a lot more about timing routines and hardware issues, I'm thinking of another get_rtime or something like that which uses and requires RDTSC and whose output can be used with printf/printk. This would only really be useful for very small timings, or what purpose would that serve?

You know, I would definitely use that for the delay calibration on faster (than 386) systems!!!

Single day projects tend to last for weeks.

It's tenacity then. Experience shows leaving a problem unknown usually comes back to bite!

TOUCHÈ!