fvdi + nvdi5 + freemint leads to memory violation when calling v_opnvwk()

fVDI documentation states:

If you have NVDI too, you can make NVDI use fVDI for the actual screen output by making sure that NVDI runs last. This will be slower, but you'll be able to use NVDI's outline fonts and printer drivers (if you don't need the outline fonts, NVDI can run before fVDI and then there will be no speed penalty for screen output).

Then there is nvdifix in fvdi.sys which somehow tries to move NVDI in the trap chain so it's not first:

Modify a loaded NVDI so that it will never try to move itself forward in the Trap 2 chain.

but its description is not very encouraging:

With this uncommented fVDI can use NVDI as a background VDI (for dealing with non-screen devices). Some strange problems still remain, though.

FreeMiNT's INSTALL file says:

NVDI must always be started before MiNT.

So this would imply that the best supported combination of those three is:

FVDI.PRG (with booted set and nvdifix not set)
NVDI.PRG (with screen drivers disabled)
MINT.PRG

However this means that both NVDI and FVDI are outside of FreeMiNT's control. If I understand it correctly, they even use TOS supervisor stack (right?) and are not listed as separate processes. This seems to create a strange issue.

Repro steps:

take the latest bootable snapshot
create a separate drive image for C:
copy all files from hostfs's C: to image's C:
install NVDI (5.03 in my case) to image's C:
disable / delete all NVDIDRV?.SYS files
make sure that the programs in AUTO are in order as described above

Now start QED (/opt/GEM/qed/qed.app). Depending on the actual setup, sometimes it runs at the first time. Quit it. Run again. It wont run again. Sometimes it wouldn't run even on the first try, sometimes TosWin2 wouldn't run.

If started from TosWin2, one can see a simple "Bus error" message. When mint.prg compiled with debug infos, one can see in Aranym's console:

pid   8 (qed): [1][ PROCESS  "qed"  KILLED: | MEMORY VIOLATION.  (PID 008) | Type: free      PC: 01008F52 | Addr: 01397FF0  BP: 01318000 |         
pid   8 (qed): signal #10 raised [syscall_pc 0x1008F52, exception_pc 0x1008F52]

When looking where exactly it crashes, it goes all the way to gemlib's v_opnvwk. It wouldn't return from the trap. FreeMiNT's trap 2 handler is not even called in this case (i.e. it's not a bug related to freemint), fVDI's https://github.com/freemint/fvdi/blob/c19b1b0c6429c2d4cd0aa94ab002b56cb9883c94/fvdi/engine/workstn.c#L167

seems to successfully enter and exit so it must be crashing somewhere in NVDI's trap handler. It seems to depend also on the resolution, if I set fvdi.sys with 640x480x8, it doesn't crash, if 640x480x16, it does.

It's very hard to debug as I can't find a way to reproduce this with anything else than aranym.sys.

Strangely, when following https://atari-forum.com/viewtopic.php?p=247471#p247471, i.e. all the possible recommendations thrown away:

NVDI started after MINT.PRG
FVDI started after NVDI with nvdifix enabled

Printing and vector fonts not only work but also the crashing v_opnvwk is gone.

Could be the issue caused by a stack overflow in NVDI.PRG ? Symptoms seems to be very similar to https://github.com/freemint/freemint/pull/269 however when I started investigating this one, I had the 'wrong' setup (NVDI.PRG after MINT.PRG, FVDI.PRG in mint.cnf with nvdifix enabled) so the stack was wrong on a different place. Ironically, after fixing that I tried also the correct setup and everything seemed to work: https://atari-forum.com/viewtopic.php?p=435414#p435414 ... so the problem can be easily hidden as it seems.

Ironically, after fixing that I tried also the correct setup and everything seemed to work: https://atari-forum.com/viewtopic.php?p=435414#p435414

OK, at least this mystery has been solved: it's really important that it is NVDI 4.x that's installed. @wongck68 had been using NVDI 4.11 when the XaAES crashes started to occur and XaAES was booting just fine with NVDI 5: https://groups.google.com/g/aranym/c/iyeFnaoSMeU/m/goacblKtAgAJ. However he didn't want to use NVDI 5 ("NVDI 5 has issues with non-NVDI printer drivers") so that's why I kept focusing on the 4.11 and fixed that. If I tested NVDI 5 with the same setup, I would have seen the same QED crash for sure.

I was able to replicate both scenarios:

latest snapshot with EmuTOS + FVDI + NVDI 4.11 + MINT.PRG opens QED just fine
latest snapshot (more precisely any snapshot, even the ones mentioned to be working with 4.11 before any changes, e.g. https://github.com/freemint/freemint/commit/2b8d5dbc) with EmuTOS + FVDI + NVDI 5 + MINT.PRG leads to QED being killed

So it's a NVDI5 issue specifically and it's not a regression of any kind.

Not happening on TOS 4.04: same setup (an IDE image in fact) works with TOS 4.04 (QED starts), when ROM changed to EmuTOS (in aranym's config file), QED doesn't work again.

Thanks to @czietz there has been some progress: this is not issue specific to EmuTOS. The reason why it doesn't crash in TOS 4.04 is that TOS, when booted in Aranym, it patched, i.e. all of its 030 MMU code is disabled, essentially booting Aranym without memory protection.

As soon as one executes set_mmu.prg from AUTO folder before fvdi.prg, qed crashes, too.

According to Christian's findings, fvdi crashes somewhere in https://github.com/freemint/fvdi/blob/master/fvdi/engine/utility.c.

More precisely, the MEMORY VIOLATION crash occurs here:

[010082b0] 0839 0000 0106 33ea       btst      #0,$010633EA
[010082b8] 677c                      beq.s     $01008336
[010082ba] 4a79 0106 1ed2            tst.w     $01061ED2
[010082c0] 6726                      beq.s     $010082E8
[010082c2] 2079 0106 334c            movea.l   $0106334C,a0
[010082c8] b0fc 0000                 cmpa.w    #$0000,a0
[010082cc] 6700 00be                 beq       $0100838C
[010082d0] 2890                      move.l    (a0),(a4)
[010082d2] 2948 0004                 move.l    a0,4(a4)
[010082d6] 70fe                      moveq.l   #-2,d0
[010082d8] c090                      and.l     (a0),d0
[010082da] 2240                      movea.l   d0,a1
[010082dc] 234c 0004                 move.l    a4,4(a1)  <===== CRASH
[010082e0] 200c                      move.l    a4,d0
[010082e2] 7201                      moveq.l   #1,d1
[010082e4] 8081                      or.l      d1,d0
[010082e6] 2080                      move.l    d0,(a0)

This code sequence is within FVDI.PRG, but I did not investigate far enough to point to the exact line of code.

With Christian's nudge I was able to uncover it a bit more as well. The crash happens exactly here: https://github.com/freemint/fvdi/blob/c19b1b0c6429c2d4cd0aa94ab002b56cb9883c94/fvdi/engine/utility.c#L782

As a proof see my custom-compiled fvdi.prg with debug symbols:

000085f8 <.L743>:
    85f8:       2079 0001 6da8  moveal 16da8 <_mblocks>,%a0
    85fe:       200a            movel %a2,%d0
    8600:       7201            moveq #1,%d1
    8602:       8081            orl %d1,%d0
    8604:       4a88            tstl %a0
    8606:       6700 00b6       beqw 86be <.L728>
    860a:       2490            movel %a0@,%a2@
    860c:       2548 0004       movel %a0,%a2@(4)
    8610:       72fe            moveq #-2,%d1
    8612:       c290            andl %a0@,%d1
    8614:       2241            moveal %d1,%a1
    8616:       234a 0004       movel %a2,%a1@(4)             <---- crash
    861a:       2080            movel %d0,%a0@

0000861c <.L727>:
    861c:       254d 0008       movel %a5,%a2@(8)
    8620:       254c 000c       movel %a4,%a2@(12)
    8624:       d4fc 000c       addaw #12,%a2

00008628 <.L718>:
    8628:       200a            movel %a2,%d0
    862a:       4cdf 3c1c       moveml %sp@+,%d2-%d4/%a2-%a5
    862e:       4e75            rts

Some context:

pid   9 (qed): [1][ PROCESS  "qed"  KILLED: | MEMORY VIOLATION.  (PID 009) | Type: private   PC: 01008726 | Addr: 01370004  BP: 015A4000 |         
pid   9 (qed): signal #10 raised [syscall_pc 0x1008726, exception_pc 0x1008726]

Registers immediately before crash:

 A0: 01018134 A4: 00003fc0   D0: 01406001 D4: 01128b70  USP=0161a3b4
 A1: 01370000 A5: 00003fcc   D1: 01370000 D5: 015f2472  ISP=04e365aa
 A2: 01406000 A6: 04e3662a   D2: 00000008 D6: 015f5de8  MSP=00000000
 A3: 00006814 A7: 04e365aa   D3: 00000003 D7: 015f1a9c  VBR=00000000
T=00 S=1 M=0 X=0 N=0 Z=0 V=0 C=0           TC=8000
CACR=80008000 DTT0=00000000 ITT0=00000000 SRP=013e8000  SFC=000
CAAR=00000000 DTT1=00000000 ITT1=00000000 URP=013e8000  DFC=000
FP0: -nan FP1: -nan FP2: -nan FP3: -nan N=0 Z=0
FP4: -nan FP5: -nan FP6: -nan FP7: -nan I=0 NAN=0
[01008722] 234a 0004                 move.l    a2,4(a1)
next PC: 01008726

I'm not sure what's the deal with those | 1 (there's a few other occasions of using it across the engine), it would seem that it assumes that Mxalloc never returns an odd number and therefore the bit 0 is used as some kind of flag ((a0) does point to 0x01370001).

I put some debug outputs into around Mxalloc and Mfree, it doesn't seem that the affected pointer has been released. When crashing at accessing 0x01488004 (different address due to debug stuff), I can see that

Allocation at $01488000, 16320 bytes
       Malloc at $0148800c
       Splitting
       Allocating at $0148800c (next at $01489fec)
       6: 0/0 0/0 2/0 7/0 1/1 0/1 0/0 3/0 9/1 0/0

and there is no Mfree with any such address so I don't know. Maybe some of the diagnostic functions in utility.c can be of use. Maybe @mfro0 , @th-otto or @vinriviere know more?

freemint / fvdi

fvdi + nvdi5 + freemint leads to memory violation when calling v_opnvwk() #50