MEN-Mikro-Elektronik / 13MD05-90

MDIS5 System Package for Linux (including drivers)
Other
4 stars 4 forks source link

G028 mcb uart driver can only initialize one out of 5 instances #244

Closed duagon-rvw closed 1 year ago

duagon-rvw commented 1 year ago

02G028I02, Thales version with 229, mcb driver fails for 3 out of 4 instances on standard Ubuntu 22.04.1 live system

uname -a
Linux ubuntu 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
dmesg | grep 8250
[    0.825042] pci 0000:00:1c.6: PTM enabled (root), 4ns granularity
[    1.107855] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
[   31.169864] 8250_men_mcb mcb0-16z125-0:0:0: can't request region for resource [mem 0x74500100-0x7450010f]
[   31.201391] 8250_men_mcb: probe of mcb0-16z125-0:0:0 failed with error -16
[   31.201473] 8250_men_mcb mcb0-16z125-1:0:0: can't request region for resource [mem 0x74500110-0x7450011f]
[   31.208991] 8250_men_mcb: probe of mcb0-16z125-1:0:0 failed with error -16
[   31.209008] 8250_men_mcb mcb0-16z125-2:0:0: can't request region for resource [mem 0x74500120-0x7450012f]
[   31.212033] 8250_men_mcb: probe of mcb0-16z125-2:0:0 failed with error -16
[   31.212046] 8250_men_mcb mcb0-16z125-3:0:0: can't request region for resource [mem 0x74500130-0x7450013f]
[   31.215779] 8250_men_mcb: probe of mcb0-16z125-3:0:0 failed with error -16
[   31.215826] 8250_men_mcb mcb0-16z125-4:0:0: mcb0-16z125-4:0:0 on board
[   31.215828] 8250_men_mcb mcb0-16z125-4:0:0: board not detected, using default uartclk
[   31.216252] 8250_men_mcb mcb0-16z125-4:0:0: found MCB UART: ttyS5
./eeprodprog -d

eeprod2 data at i2c-2, addr=0x57, offset=0x00
----------------------------------------------
HW article    : G028I02
HW revision   : 01.00.00
HW serial-nbr : 21
HW production : 2022-03-21
HW repair     : -

similar effects on G022:

dmesg | grep 8250
[    0.368092] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
[   30.252472] 8250_men_mcb mcb0-16z125-0:0:0: can't request region for resource [mem 0xc2800100-0xc280010f]
[   30.257713] 8250_men_mcb: probe of mcb0-16z125-0:0:0 failed with error -16
[   30.257754] 8250_men_mcb mcb0-16z125-1:0:0: can't request region for resource [mem 0xc2800110-0xc280011f]
[   30.262312] 8250_men_mcb: probe of mcb0-16z125-1:0:0 failed with error -16
[   30.262332] 8250_men_mcb mcb0-16z125-2:0:0: can't request region for resource [mem 0xc2800120-0xc280012f]
[   30.265892] 8250_men_mcb: probe of mcb0-16z125-2:0:0 failed with error -16
[   30.265919] 8250_men_mcb mcb0-16z125-3:0:0: can't request region for resource [mem 0xc2800130-0xc280013f]
[   30.269396] 8250_men_mcb: probe of mcb0-16z125-3:0:0 failed with error -16
[   30.269427] 8250_men_mcb mcb0-16z125-4:0:0: mcb0-16z125-4:0:0 on board
[   30.269430] 8250_men_mcb mcb0-16z125-4:0:0: board not detected, using default uartclk
[   30.269724] 8250_men_mcb mcb0-16z125-4:0:0: found MCB UART: ttyS5
grep -i uarts /boot/config-5.15.0-43-generic
CONFIG_SERIAL_8250_NR_UARTS=48
CONFIG_SERIAL_8250_RUNTIME_UARTS=32
CONFIG_SERIAL_UARTLITE_NR_UARTS=1
CONFIG_SERIAL_RP2_NR_UARTS=32

[Florian Rauh: ] pls have a look in the corresponding FPGA design. The failed UARTS 0-4 are grouped. The working one is a single one.

G028I_sn21_dmesg_8250_mcb_fail.log G022_sn370_dmesg_8250_mcb_fail.log

duagon-rvw commented 1 year ago

Possibly linked with #235

duagon-rvw commented 1 year ago

https://elixir.bootlin.com/linux/v5.15.76/source/drivers/tty/serial/8250/8250_men_mcb.c https://elixir.bootlin.com/linux/v5.15.76/source/drivers/mcb

dpfeuffer commented 1 year ago

I assigned this Issue to the upcoming release because of an urgent customer request.

duagon-rvw commented 1 year ago

First check with Ubuntu 22.04 without MDIS Installation. If this does not reproduce the issue, install MDIS without blacklisting mcb and see if this occurs than. If this is then happening it is an issue of #248

mad-jsanjuan commented 1 year ago

We can reproduce this issue with other CPU boards. We are using a F26L connected to F215 and G215 boards (test setup 1). Without MDIS installed, the same issue appears if we do not blacklist the upstream mcb driver. The 8250_men_mcb driver fails to probe with the same error:

[   32.381290] 8250_men_mcb mcb0-16z125-0:0:0: can't request region for resource [mem 0xa8200100-0xa820010f]
[   32.381320] 8250_men_mcb: probe of mcb0-16z125-0:0:0 failed with error -16
[   32.381339] 8250_men_mcb mcb0-16z125-1:0:0: can't request region for resource [mem 0xa8200110-0xa820011f]
[   32.381348] 8250_men_mcb: probe of mcb0-16z125-1:0:0 failed with error -16
[   32.381367] 8250_men_mcb mcb1-16z125-0:0:0: can't request region for resource [mem 0xaa600100-0xaa60010f]
[   32.381375] 8250_men_mcb: probe of mcb1-16z125-0:0:0 failed with error -16
[   32.381388] 8250_men_mcb mcb1-16z125-1:0:0: can't request region for resource [mem 0xaa600110-0xaa60011f]
[   32.381396] 8250_men_mcb: probe of mcb1-16z125-1:0:0 failed with error -16

This was reproduced on:

mad-jrodriguez commented 1 year ago

We can consider the issue is already fixed. However, as the the patch must be submited to the upstream linux kernel, no change is added to our repo.

The patch contain several changes that are not exclusively related with this issue but are wrong as well.

The problem reported here was caused by an memory overlapping. The mcb-pci tries to allocate a memory region with a fixed size. Such size corresponds to the maximum size of the chameleon table can have, however, if the chameleon table of the FPGA is smaller, the memory region allocated can overlaps the memory regions of one or more IP Cores.

Check the following logs:

[   31.016972] 8250_men_mcb mcb0-16z125-0:0:0: can't request region for resource [mem 0xa8200100-0xa820010f]
[   31.016994] 8250_men_mcb: probe of mcb0-16z125-0:0:0 failed with error -16
[   31.017010] 8250_men_mcb mcb0-16z125-1:0:0: can't request region for resource [mem 0xa8200110-0xa820011f]

However, if we dump the file /proc/iomem to check the memory regions allocated:

user@host:$ sudo /proc/iomem
...
a8200000-a82001ff : mcb_pci
...

We can see that the region is already allocated by the module mcb_pci

To fix it, we have decided to not allocate memory region in the mcb_pci as it only uses such memory region to parse the chameleon table and for only reading the memory it is sufficient to perform an ioremap, thus, the IP Core have their memory regions available.

Additionally, we have added some correction in other 2 mcb modules.

8250_men_mcb: In this serial driver, have added some refactoring to simplify the function get_num_ports(), passing only one argument instead of 2. Additionally, we have changed the way the device is configured. Now, the driver is auto-configured, leaving the responsibility to the 8250/serial subsystem (even the responsibility of allocating/deallocating the associated memory regions).

Now, the driver works in the same way of the MDIS driver.

mcb-core: During the unloading operation of the mcb-pci module, we detected a crash/warning while trying to call to ida_free() with an identifier that was not allocated.

[  286.691693] ------------[ cut here ]------------
[  286.691695] ida_free called for id=1 which is not allocated.
[  286.691714] WARNING: CPU: 0 PID: 1719 at lib/idr.c:523 ida_free+0xe0/0x140
[  286.691715] Modules linked in: snd_hda_codec_hdmi amd64_edac_mod snd_hda_intel edac_mce_amd snd_intel_dspcfg kvm_amd snd_hda_codec amdgpu nls_iso8859_1 ccp snd_hda_core snd_hwdep amd_iommu_v2 kvm snd_pcm gpu_sched crct10dif_pclmul crc32_pclmul snd_seq_midi snd_seq_midi_event ghash_clmulni_intel ttm snd_rawmidi aesni_intel snd_seq binfmt_misc crypto_simd cryptd glue_helper drm_kms_helper snd_seq_device snd_timer drm snd k10temp fb_sys_fops syscopyarea sysfillrect sysimgblt snd_rn_pci_acp3x mcb_pci(-) snd_pci_acp3x soundcore altera_cvp fpga_mgr mcb spi_nor mtd 8250_dw mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mmc_block nvme ahci i2c_piix4 libahci i2c_amd_mp2_pci igb nvme_core i2c_algo_bit dca video sdhci_acpi sdhci [last unloaded: 8250_men_mcb]
[  286.691752] CPU: 0 PID: 1719 Comm: modprobe Not tainted 5.4.702+ #11
[  286.691753] Hardware name: MEN F027/n/a, BIOS 1.03 04/20/2021
[  286.691756] RIP: 0010:ida_free+0xe0/0x140
[  286.691759] Code: a8 31 f6 e8 12 f7 00 00 eb 4b 4c 0f a3 28 72 21 48 8b 7d a8 4c 89 f6 e8 8e ad 02 00 89 de 48 c7 c7 e8 02 83 b5 e8 b0 7a 5d ff <0f> 0b e9 67 ff ff ff 4c 0f b3 28 48 8d 7d a8 31 f6 e8 da e0 00 00
[  286.691761] RSP: 0018:ffff9a56c38f7bd8 EFLAGS: 00010282
[  286.691763] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000006
[  286.691764] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d881fa1c8c0
[  286.691765] RBP: ffff9a56c38f7c30 R08: 0000000000000487 R09: 0000000000000004
[  286.691766] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[  286.691767] R13: 0000000000000001 R14: 0000000000000202 R15: 0000000000000001
[  286.691769] FS:  00007fb78e303540(0000) GS:ffff8d881fa00000(0000) knlGS:0000000000000000
[  286.691770] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  286.691771] CR2: 00007ffe92b2ce98 CR3: 000000079fd9c000 CR4: 00000000003406f0
[  286.691772] Call Trace:
[  286.691781]  mcb_free_bus+0x2b/0x40 [mcb]
[  286.691785]  device_release+0x2c/0x80
[  286.691787]  kobject_put+0xb9/0x1d0
[  286.691790]  put_device+0x13/0x20

At first time, we thought that could be a memory leak or a double-free bug, however, after a deeper analysis, we found that the issue as being caused by a bad management of the mcb_bus component.

The mcb_bus component has the responsibility of grouping all the devices (IP Cores) of an FPGA (so we have 2 mcb_bus actually, one for the G215 and other for the F215). It adds all devices within a list, associating them with the same bus_type (mcb_bus_type). However, the mcb_bus itself was being added to the same list so when the module iterates in the list during the add_devices operation, it call to the function __mcb_bus_add_devices() for all devices, including for mcb_bus

However, this issue only happens when the identifiers generated with ida_alloc( ) starts in 0. In such case, the identifier was changed from 0 to 1 at a certain place, having 2 mcb_bus with the same Id (1). If it starts in 1, it does not happen.

Check the following structs:

struct mcb_bus {
    struct device dev;
    struct device *carrier;
    int bus_nr;
    u8 revision;
    char model;
    u8 minor;
    char name[CHAMELEON_FILENAME_LEN + 1];
    int (*get_irq)(struct mcb_device *dev);
};

struct mcb_device {
    struct device dev;
    struct mcb_bus *bus;
    bool is_added;
    struct mcb_driver *driver;
    u16 id;
    int inst;
    int group;
    int var;
    int bar;
    int rev;
    struct resource irq;
    struct resource mem;
    struct device *dma_dev;
};

As you can see, the mcb_bus::bus_nr and mcb_device::is_added are placed at the same offset in the structure.

Now, check the following function. Such is function is called for all devices in the list (including the mcb_bus).

static int __mcb_bus_add_devices(struct device *dev, void *data)
{
    struct mcb_device *mdev = to_mcb_device(dev);
    int retval;

    if (mdev->is_added)
        return 0;

    retval = device_attach(dev);
    if (retval < 0)
        dev_err(dev, "Error adding device (%d)\n", retval);

    mdev->is_added = true;

    return 0;
}

So, when the function is called for mcb_bus (that is wrong) it transforms the mcb_bus to mcb_device, so if bus_nr is 0, the is_added is 0 as well, so it does not meet the if clause. At the end, it sets is_added to 1. For this reason, if bus_nr starts in 0, the function changes the value to 1.

For fixing it, we just configured the mcb_bus to not be part of the list (removing the part when the mcb_bus is set with mcb_bus_type).

issue_244_patches.tar.gz

dpfeuffer commented 1 year ago

Great work and well explained! 👍

I downloaded the patch and went quickly through it. I'm thinking about this change:

    else if (strncmp(mdev->bus->name, "G215", 4) == 0)
        clkval = 1843200;
+   else if (strncmp(mdev->bus->name, "215", 3) == 0)
+       clkval = 1843200;

It is not wrong, but the first else matches G215 only, the second one matches G215 an F215. So you could remove the first else (or change the second one to match F215 only "F215", 4).

mad-jrodriguez commented 1 year ago

Hi Dieter,

Thank you so much.

The issue I faced here is that the name for the F205 is only 205, not F205. An approach could be using 205 for matching both boards, however, I should remove the first character in G205.

I'm going to check other functions such as strstr( ).

If this function covers all possible cases, I will replace it.

Thank you Dieter for you suggestions. :)

mad-jrodriguez commented 1 year ago
static u32 men_lookup_uartclk(struct mcb_device *mdev)
{
    /* use default value if board is not available below */
    u32 clkval = 1041666;

    dev_info(&mdev->dev, "%s on board %s\n",
        dev_name(&mdev->dev),
        mdev->bus->name);
    if  (strncmp(mdev->bus->name, "F075", 4) == 0)
        clkval = 1041666;
    else if (strncmp(mdev->bus->name, "F216", 4) == 0)
        clkval = 1843200;
    else if (strncmp(mdev->bus->name, "F210", 4) == 0)
        clkval = 115200;
    else if (strstr(mdev->bus->name, "215"))
        clkval = 1843200;
    else
        dev_info(&mdev->dev,
             "board not detected, using default uartclk\n");

    clkval = clkval  << 4;

    return clkval;
}

I have checked that the function strstr works. The function search for a substring within a string. Returning a pointer to the first character of the first appearance of the substring, or NULL if the substring is not found.

I have moved the check to the end to avoid possible issues if other boards names could contain the substring (as the bus->name could be longer than only the board name).

So first it finds exact board name match, then find if the substring is contained within the board name.

The function works fine for G215, F215 and 215 board names.

According with the data dumped with fpga_load, the boards names are:

G215: G215-00IC001 F215: 215-00IC001A

mad-jrodriguez commented 1 year ago

Hi all,

Here, the final patches with all fixes, including the fix for reading from global register in IP Cores Z025 and Z057.

Besides, an ERRATA file is included within the compressed file.

issue_224.tar.gz

duagon-rvw commented 1 year ago

Looks fine! Thanks :)

mad-jrodriguez commented 1 year ago

Closing the issue as it is fixed.