Mellvik / TLVC

Tiny Linux for Vintage Computers
Other
9 stars 0 forks source link

Direct floppy driver #13

Closed Mellvik closed 1 year ago

Mellvik commented 1 year ago

The main component of this PR is the new direct floppy driver - first cut. Based on the code that has been lingering in ELKS and now TLVC since the early nineties, this rewrite adapts the code to the present day environment which entails lots of changes in many different places. As of this PR the driver works well, boots, and needs more testing:

While developing the driver, a number of bugs have been encountered and fixed, most notably

The block floppies have major device # 2, moving the ssd major to 7. The device names are /dev/f0 and /dev/f1 to distinguish them from the BIOS floppy devices. The update also prepares for the upcoming direct hd driver and includes updates to some commands, such as df, dd, sys- the latter still not working pending a geometry IOCTL in the floppy driver.

Mellvik commented 1 year ago

Thank you @ghaerr, It took me quite some time to figure this one out. I’ve been looking for a way to stop the motor timeout counter at 0x440, no success. So I ended up with this band aid solution. How can that BIOS callback be removed?

Thanks, and yes I notices the size bug in the poke a few days ago. It was included in the PR posted earlier today. BTW - out of curiosity I compared this method with using the pokeb() call. Turns out the memory footprint is the same, so I kept this one.

—M

  1. jul. 2023 kl. 02:20 skrev Gregory Haerr @.***>:

@ghaerr commented on this pull request.

In tlvc/arch/i86/drivers/block/directfd.c https://github.com/Mellvik/TLVC/pull/13#discussion_r1272874527:

  • if (debug) printk("flpON");
  • fl_timeout = 0; / Reset BIOS motor timeout counter, neccessary on some machines */ The kernel IRQ 0 hardware timer interrupt calls a saved copy of the original BIOS' interrupt vector every 5 clock ticks. The BIOS usually uses this interrupt to decrement floppy motor timeouts, amongst other things.

It would be interesting to remove that BIOS callback entirely and see whether that 1) affects anything workings of the direct FD and HD drivers, thus removing another BIOS dependency, as well as 2) eliminate the need for poking a zero in the Bios Data Area to reset the motor timeout counter.

Also: given that the declaration of fl_timeout is long __far , using fl_timeout = 0 zeros not only the byte location of the BIOS motor timeout at 0x440, but also three other BDA variables at 0x441-0x443!

— Reply to this email directly, view it on GitHub https://github.com/Mellvik/TLVC/pull/13#pullrequestreview-1544450492, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGODGFZ4UQ7ZEVOWHT6DXR4GOJANCNFSM6AAAAAAXBCMZY4. You are receiving this because you modified the open/close state.

ghaerr commented 1 year ago

How can that BIOS callback be removed?

The code is in arch/i86/kernel/irqtab.S, find this code and comment it out to quickly test:

//
//      IRQ 0 (timer) has to go on to the bios for some systems
//
        decw    bios_call_cnt   // Will call bios int?
        jne     do_eoi
        movw    $5,bios_call_cnt
        pushf
        lcall   *org_irq0
        jmp     was_trap        // EOI already sent by bios int

After testing, a few more lines would also be removed for a final version. For BIOS FD/HD, it would want to remain in, and out for direct FD/HD. It sounds like TLVC is at the point where one should select only BIOS or direct, never both. I am guessing the only reason BIOS is available is for compatibility testing with machines that might not yet work with direct FD/HD.

I compared this method with using the pokeb() call. Turns out the memory footprint is the same, so I kept this one.

Interesting that the amount of code it takes to push three parameters and call a function is the same as loading a far pointer and dereferencing... and certainly the far pointer dereference is faster. I have found that having at least a function wrapper around nonstandard C extensions (e.g. __far) allows for superior portability, especially when a compiler changes ("Our compiler won't ever change", famous last words I know :)

Mellvik commented 1 year ago

Thanks @ghaerr, I certainly didn’t catch that one. Removing it does the trick.

It’s on my list to track down and remove/ifdef low level bios dependencies (e.g. the drive parameter table stuff), but I want to get the system stable first. That said, in this case keeping it made the system unstable, so the argument is not a good one. My priority list just changed.

The BIOS block IO stuff may eventually go away, but like you say it's useful to keep around for comparison for a while.

C extensions: Right - then there is the third option: «Better call @tkchia» :-)

Code size (poke): Your comment made me curious, it does indeed sound too good to be true. And it is, but just barely: I was 2 bytes off :-) a28: 30 c0 xor %al,%al a2a: 50 push %ax a2b: b8 40 00 mov $0x40,%ax a2e: 50 push %ax a2f: 50 push %ax a30: e8 fe ff call a31 <floppy_on+0x19> a31: R_386_PC16 pokeb

Moot now, but interesting anyway.

-M

  1. jul. 2023 kl. 23:46 skrev Gregory Haerr @.***>:

How can that BIOS callback be removed?

The code is in arch/i86/kernel/irqtab.S, find this code and comment it out to quickly test:

// // IRQ 0 (timer) has to go on to the bios for some systems // decw bios_call_cnt // Will call bios int? jne do_eoi movw $5,bios_call_cnt pushf lcall *org_irq0 jmp was_trap // EOI already sent by bios int After testing, a few more lines would also be removed for a final version. For BIOS FD/HD, it would want to remain in, and out for direct FD/HD. It sounds like TLVC is at the point where one should select only BIOS or direct, never both. I am guessing the only reason BIOS is available is for compatibility testing with machines that might not yet work with direct FD/HD.

I compared this method with using the pokeb() call. Turns out the memory footprint is the same, so I kept this one.

Interesting that the amount of code it takes to push three parameters and call a function is the same as loading a far pointer and dereferencing... and certainly the far pointer dereference is faster. I have found that having at least a function wrapper around nonstandard C extensions (e.g. __far) allows for superior portability, especially when a compiler changes ("Our compiler won't ever change", famous last words I know :)

— Reply to this email directly, view it on GitHub https://github.com/Mellvik/TLVC/pull/13#issuecomment-1650604987, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOFVRPO3ZT5RPDJYXR3XSA5D5ANCNFSM6AAAAAAXBCMZY4. You are receiving this because you modified the open/close state.

ghaerr commented 1 year ago

the timers in the scheduler (del_timer, add_timer) had a bug causing the system to crash if deleting a non-existent timer. The timer code has been fixed and simplified.

I know the timer code simplification was later reverted, but was there a bug when deleting a non-existent timer? For some reason, I thought I saw a fix somewhere like this but now can't find it. IIRC it was a null pointer de-reference?

Mellvik commented 1 year ago

You remember (almost) correctly. I had to look it up. The fix was simply initializing the static pointer next_timer:

static struct timer_list *next_timer = NULL;

The (rather elusive) problem happened only if del_timer was called before ANY add_timer call. Which is exactly what the direct floppy driver did/does.

ghaerr commented 1 year ago

The fix was simply initializing the static pointer next_timer:

Could you do me a big favor and remove the initialization = NULL and see whether the problem persists? If so, a much bigger problem is at hand, all uninitialized kernel data variables (static or not) are placed in the .bss data section, and cleared to 0 at the time of kernel startup. If that isn't working, we have a much bigger problem.

Here's the code in crt0.s that clears .bss:

/*
! Setup.S already initialized DS and ES (but not SS)
! In addition, registers contain:
!   BX, Text size
!   DI  Far text size
!   SI, Data size
!   DX, BSS size
*/
        mov     %bx,_endtext
        mov     %di,_endftext
        mov     %si,_enddata
        add     %dx,%si
        mov     %si,_endbss

// Start cleaning BSS. Still using setup.S stack

        mov     _enddata,%di    // start of BSS
        mov     %dx,%cx         // CX = BSS size
        xor     %ax,%ax
        shr     $1,%cx
        cld
        rep
        stosw

// End cleaning BSS

The only thing the initialization does in the struct time_list code is move the location of the stored pointer to .data rather than .bss. That is, it's address will be different, but the initialized value the same.

In summary, the above fix shouldn't need to happen and needs more investigation. All uninitialized kernel variables are set to 0. If you can find another case where this isn't true, that will be proof of problem. Otherwise, all that initialization of a variable to 0 does is increase the size of /linux in its .data section on disk.

Mellvik commented 1 year ago

I'll do that - but bear with me while I try to catch this new rabbit and gasp for air every once in a while. I have a buffer pollution problem - probably self-inflicted in the sense that there is a typo somewhere in all the changes I made (and imported) over the past few days. Of course it only happens when booting off of a (slow) floppy. git diff is my friend...

I remember the del_timer() issue well, I went as far as to print the initialized value to make sure I was on the right track. I got random data. If initialization works as you describe - which is likely otherwise we'd likely have all kinds of other problems - I had or have a memory leak or similar. Interesting times.

ghaerr commented 1 year ago

I had or have a memory leak or similar.

I'll wager on memory overrun. A quick way to get real on this is to look at the system.map file in arch/i86/boot/system.map. That will show the variables in front and back of the one getting trashed, which usually points you in the right direction. Since initializing the variable stuck it in a different section, it doesn't get trashed... but the one that took its place could still be. Look at the map file locations for all the direct FD driver variables (both .data and .bss).

probably self-inflicted in the sense that there is a typo somewhere in all the changes I made (and imported) over the past few days. Of course it only happens when booting off of a (slow) floppy.

Or this could be the same original problem, just trashing different variables since the codebase changed...