MEN-Mikro-Elektronik / 13MD05-90

MDIS5 System Package for Linux (including drivers)
Other
4 stars 4 forks source link

m_open to 13Z051-06 driver at F401 leads to hanging PCI bus #311

Open dpfeuffer opened 1 year ago

dpfeuffer commented 1 year ago

@mad-jsanjuan @mad-jrodriguez Please fill this issue with your discovered details.

mad-jrodriguez commented 1 year ago

Hi all,

I'm going to fill the issue with the discovered details.

First of all, tell that issue where found when we tried to run the command

$ sudo m_open dac_2

The first time, it executed properly, but the second time, we got an estrange error. To be sure that we did everything ok, we just reboot the setup and then we try again, getting the same error. Please, take a look:

men@men-MEN-men@men-MEN-F026L00:~$ sudo m_open dac_2
open dac_2
path=3 opened
close path

men@men-MEN-F026L00:~$ sudo m_open dac_2
open dac_2
*** can't open: ERROR (MDIS) 0x0709:  BBIS: illegal board slot

At this point, if we try to run the same command one more time, the system get stuck, completely unresponsive, so a hard reboot is required (using a relay box we have in the lab).

Besides, the system also got stuck in the same way if instead of running the same command, we run the command "lspci" in order to list the PCI devices connected.

Reading the dmesg, we got that, the error "0x0709: BBIS: illegal board slot" is thrown because it cannot init the PCI device, telling "illegal pci-vendor-id"

[87203.440418]  no mem mapped table found, try to find io mapped table
[87203.440420] CHAM - InitPci: pci-device domain/bus/dev/func=0/14/14/0
[87203.440424] OSS_PciGetConfig domain 0 bus e dev e func 0 which 1
[87203.490537]   value=0x0000ffff
[87203.490542] *** CHAM - InitPci: illegal pci-vendor-id=0xffff at domain/bus/dev/func 0/14/14/0
[87203.490545] *** CHAMELEON_BrdInit: CHAM_InitPci error 0xc0! (PciBus 0xe, PciDev 0xe)
[87203.490549] *** BBIS:OpenDevice: can't init board on mezz_cham_2
[87203.490551] OSS_Exit
[87203.490553] BB - CHAMELEON_Exit
[87203.490555] BB - CHAMELEON_Cleanup
[87203.490556] OSS_MemFree (Lin): addr=0x00000000a816a222 size=0x28
[87203.490560] <mezz_cham_2> :OSS_SpinLockRemove: entered
[87203.490563] OSS_MemFree (Lin): addr=0x000000008c7bc104 size=0x16f0
[87203.490566] bbis_open ex: mezz_cham_2 ret=-1801 -0x709
[87203.490570] *** MK:InitialOpen: : dac_2 can't open BBIS dev mezz_cham_2

It seems like something goes wrong during the deinit, leaving the PCI bus subsystem in a weird state, because any further operation in the PCI bus after get this state, leave the system completely unresponsive.

The next step was to check how Z051 is designed.

After an quick analysis, we saw the the hardware deinit is done in _Z051Exit( ) function but the hardware init is not done in _Z051Init( ) but in _Z051Write( ) function. We don't know if this was designed as it is due to FPGA restrictions but it is not correct. Moreover, as the _Z051Write( ) function is called more than time, the chunk of core that do the hardware initialization is protected with a flag to only run it once.

Z055_Write( ) function:

    /*
     * If this is the first write we have to initialize the communication
     * between the FPGA and the DAC. This also connects the DAC's outputs to
     * the output drivers of the module.
     * So before the first access we have always 0mA output current (on F401).
     */
    if( llHdl->initDac ) {
        MWRITE_D32( ma, DAC_SCLK_REG, DAC_SCLK_DEFAULT );

        /*
         * In order to avoid an unwanted interrupt we have to wait for the
         * watchdog circuit to release the IRQ input before we can enable
         * the interrupt.
         */
        if( llHdl->irqEnable ) {
            DBGWRT_3((DBH, "delay for watchdog to come up...\n"));
            OSS_Delay( OSH, 1010 );  /* worst case = 1000ms */
            MWRITE_D32( ma, DAC_IER_REG, DAC_IRQ_MASK );
        }
        llHdl->initDac = 0;
    }

Z055_Exit( ) function:

    /*------------------------------+
    |  de-init hardware             |
    +------------------------------*/

    /* set DAC outputs to zero */
    MWRITE_D32( ma, DAC_CTRL_REG,
                DAC_CMD_BUF_A | 0 );
    OSS_MikroDelay(OSH, 1);
    MWRITE_D32( ma, DAC_CTRL_REG,
                DAC_CMD_LOAD_AB | DAC_CMD_BUF_B | 0 );
    OSS_Delay( OSH, 100 );

    /* disable interrupt */
    MWRITE_D32( ma, DAC_IER_REG, 0 );

    /* turn off outputs using F401 watchdog mechanism */
    MWRITE_D32( ma, DAC_SCLK_REG, 0 );

In our opinion, the crash happens because the binary m_open just executes the functions _Z051Init( ) and then, _Z051Exit( ) and, something bad happens if someone tries to deinit the hardware if it was not initialized first.

The probe that, we did several tests, isolating writing operations in _Z051Exit( ) function and we got insteresting results.

If we run the binary z51_simp that is designed to test the IP Core and actually perform write operations, it works fine if it is run more than once, however, if after that, we run the m_open binary, then second time it crashes as always. Check the logs:

men@men-MEN-F026L00:/usr/local/bin$ sudo ./z51_simp dac_2 0 555
Hit any key to finish program

men@men-MEN-F026L00:/usr/local/bin$ sudo ./z51_simp dac_2 0 555
Hit any key to finish program

men@men-MEN-F026L00:/usr/local/bin$ sudo ./z51_simp dac_2 0 555
Hit any key to finish program

men@men-MEN-F026L00:/usr/local/bin$ sudo ./z51_simp dac_2 0 555
Hit any key to finish program

men@men-MEN-F026L00:/usr/local/bin$ sudo ./z51_simp dac_2 0 555
Hit any key to finish program

men@men-MEN-F026L00:/usr/local/bin$ sudo m_open dac_2
open dac_2
path=3 opened
close path

men@men-MEN-F026L00:/usr/local/bin$ sudo m_open dac_2
open dac_2
*** can't open: ERROR (MDIS) 0x0709:  BBIS: illegal board slot

So it probes our theory, because if hardware is initialized, the deinitialization is done OK, however, if the hardware is not initialized, the deinitialization makes something in a weird state.

We have created a work arround to fix the issue. Ideally a full refactor should be done, but with our fix, the problem cannot be reproduced.

Additionally, we have removed below lines as they are no longer necessary.

MWRITE_D32( ma, DAC_CTRL_REG, DAC_CMD_BUF_A | 0 );
OSS_MikroDelay(OSH, 1);

Our proposed patch:

--- a/DRIVERS/MDIS_LL/Z051/DRIVER/COM/z51_drv.c
+++ b/DRIVERS/MDIS_LL/Z051/DRIVER/COM/z51_drv.c
@@ -112,6 +112,7 @@ typedef struct {
     u_int32         gain[2];        /**< gain parameter */
     u_int32         powerdown[2];   /**< powerdown mode */
     int             initDac;        /**< init data communication and IRQ */
+    int             hwInit;         /**< hardware initialized */
     OSS_SIG_HANDLE  *hwSig;         /**< signal for hardware malfunction */
 } LL_HANDLE;

@@ -302,6 +303,7 @@ static int32 Z51_Init(

     /* tell write routine to init DAC communication and IRQ */
     llHdl->initDac = 1;
+    llHdl->hwInit = 0;
     llHdl->powerdown[0] = llHdl->powerdown[1] = 0;

     DBGWRT_3((DBH, "Z51_Init: offset=%d,%d  gain=%d,%d\n", 
@@ -342,18 +344,19 @@ static int32 Z51_Exit(
     +------------------------------*/

     /* set DAC outputs to zero */
-    MWRITE_D32( ma, DAC_CTRL_REG,
-                DAC_CMD_BUF_A | 0 );
-       OSS_MikroDelay(OSH, 1);
-    MWRITE_D32( ma, DAC_CTRL_REG,
-                DAC_CMD_LOAD_AB | DAC_CMD_BUF_B | 0 );
-    OSS_Delay( OSH, 100 );
-
-    /* disable interrupt */
-    MWRITE_D32( ma, DAC_IER_REG, 0 );
+    if (llHdl->hwInit) {
+        MWRITE_D32( ma, DAC_CTRL_REG,
+                    DAC_CMD_LOAD_AB | DAC_CMD_BUF_B | 0 );
+        OSS_Delay( OSH, 100 );
+
+        /* disable interrupt */
+        MWRITE_D32( ma, DAC_IER_REG, 0 );

-    /* turn off outputs using F401 watchdog mechanism */
-    MWRITE_D32( ma, DAC_SCLK_REG, 0 );
+        /* turn off outputs using F401 watchdog mechanism */
+        MWRITE_D32( ma, DAC_SCLK_REG, 0 );
+
+        llHdl->hwInit = 0;
+    }

     /*------------------------------+
     |  clean up memory               |
@@ -430,6 +433,7 @@ static int32 Z51_Write(
             MWRITE_D32( ma, DAC_IER_REG, DAC_IRQ_MASK );
         }
         llHdl->initDac = 0;
+        llHdl->hwInit = 1;
     }

     /* dependant on channel set output A, B or both */
mad-jsanjuan commented 1 year ago

This is still an issue. Although we fixed the issue with m_open at the initialization of the HW, the problem still persists when running the example program to set the analog outputs.

We tried a few updates of the FPGA program file and the problem still persists. The last file we load was:

F401-00IC001A13_lh.rbf (md5sum: 618f68d7b1c5ddb1a0a6b75c1a0367d8)