aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

Arista 7170-64c Setup #78

Closed jimthedj65 closed 1 year ago

jimthedj65 commented 1 year ago

Hi All,

When I run the setup for the 7170-64c platform, I get the following errors.

Platform Info
Platform: x86_64-arista_7170_64c
HwSKU: Arista-7170-64C
ASIC: barefoot
ASIC Count: 1
Serial Number: SGD19451927
Model Number: DCS-7170-64C
Hardware Revision: 02.00

ASY: ASY0279504A0
HwApi: 02.00
HwRev: 11.01
KVN: 111
MAC: fc:bd:67:62:bc:d0
MfgTime: 20191204003043
PCA: PCA0100004A0
SID: Alhambra
SKU: DCS-7170-64C
SerialNumber: SGD19451927

show platform barefoot profile
Error response from daemon: Container 235f5ca1f92e31e92513f8930de7e9d163a3e602ee2acb7c4c9ab2438764ccc4 is not running
Current profile: default

show platform pcieinfo
==============================Display PCIe Device===============================
bus:dev.fn 00:1f.6 - dev_id=0x8c24, Intel Corporation 8 Series Chipset Family Thermal Management Controller 8086:8c24
bus:dev.fn 00:1c.0 - dev_id=0x8c10, Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 8086:8c10
bus:dev.fn 06:00.0 - dev_id=0x0001, Arista Networks, Inc. (Device) [3475:0001]
bus:dev.fn 00:1c.4 - dev_id=0x8c18, Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #5 8086:8c18
bus:dev.fn 07:00.0 - dev_id=0x0010, (Vendor) (Device) [1d1c:0010]
bus:dev.fn ff:0b.3 - dev_id=0x0001, Arista Networks, Inc. (Device) [3475:0001]


```
ERROR: writeConfig path=/sys/devices/pci0000:ff/0000:ff:0b.3 data={'new_object': 'smbus_master 0x8000 0 4\nsmbus_master 0x8080 1 4\nsmbus_master 0x8100 2 4\nsmbus_master 0x8180 3 4\nsmbus_master 0x8200 4 4'} error=Device or resource busy
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: PSU 1 unknown, discovery failed
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: PSU 2 unknown, discovery failed
INFO: Ucd90160(addr=82-004e) version: SFT005530103 UCD90160 2.3.4.0010 160329
INFO: Ucd90160(addr=82-004e) time: 2023-02-07 23:45:09.013000
INFO: Ucd90120A(addr=91-004e) version: SFT006590221 UCD90120A 2.3.4.0011 160720
INFO: Ucd90120A(addr=91-004e) time: 2023-02-07 23:45:09.029000
```

Staphylo commented 1 year ago

This is an odd error message to see. It's a message that usually happens when trying to initialize the drivers twice in a row without deinitializing them in between. However during normal operations this is not supposed to be happening. Did you try to see if a cold reboot solves the issue? I currently don't have enough logs to properly understand what might be causing this. I would need the full syslog since the boot and the associated arista.log.
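When the output is too large to share in full, a condensed view of the repeated ERROR lines can be a useful first step. The following is a minimal sketch, assuming the `ERROR: ...` / `INFO: ...` line format shown in the report; the function name `summarize_errors` is hypothetical and not part of the Arista tooling.

```python
import re
from collections import Counter

# Matches lines such as "ERROR: PSU 1 unknown, discovery failed".
LINE_RE = re.compile(r"^(ERROR|INFO): (.*)$")


def summarize_errors(log_text):
    """Count identical ERROR messages, ignoring INFO lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(1) == "ERROR":
            counts[match.group(2)] += 1
    return counts


# Lines copied from the report above.
sample = """\
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: something happened while trying to detect the psu: [Errno 5] Input/output error
ERROR: PSU 1 unknown, discovery failed
INFO: Ucd90160(addr=82-004e) version: SFT005530103 UCD90160 2.3.4.0010 160329
"""

for message, count in summarize_errors(sample).most_common():
    print(f"{count}x {message}")
```

This only groups messages; it does not replace the full syslog and arista.log requested above.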

jimthedj65 commented 1 year ago

Hi Staphylo,

The cold boot helped and pcieinfo now shows properly. What's the best way to reinitialise the Arista setup to see whether this is still an issue?

From what I can see, the Tofino/Barefoot platform stopped updating about 10 months ago, and I suspect the latest master may be a problem. Should I cycle back to 202205 or 202111?

The drivers may have been initialised twice by mistake; that's a good call.

I can send you the logs, but they are very large (over 130MB). Let me know where to send them.

jimthedj65 commented 1 year ago

One more thing: the first time I run show ip int it shows all the devices, but the second time it only shows the management interface.

Staphylo commented 1 year ago

@jimthedj65 to reinitialize the platform drivers you'll need to run systemctl restart platform-arista.target. However, be aware that this command might reset your dataplane. There are some manual steps that could prevent this, but they are subject to change, so I'd prefer that folks not have expectations there.

Depending on your needs, I would advocate using release branches, as they tend to be more stable than master. 202205 and 202211 would be good choices. I would probably recommend looking at which branch receives the most attention, based on commit frequency, and using that one.

What you're describing is usually a symptom of a swss/syncd crash. It's going to initialize, create the host interfaces for the ASIC, do some stuff, crash, and remove the host interfaces. You should check for worrying messages tied to syncd in your syslog (segfault, ERR messages, ...).
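A rough way to do that check is to filter the syslog for lines that mention syncd together with one of those keywords. This is only a sketch: the function name `worrying_syncd_lines` is hypothetical, the keyword list comes from the comment above, and the sample lines are made up for illustration rather than taken from a real switch.

```python
import re

# Keywords taken from the comment above; extend as needed.
WORRYING = re.compile(r"segfault|\bERR\b")


def worrying_syncd_lines(lines):
    """Return lines that mention syncd and contain a worrying keyword."""
    return [
        line.rstrip("\n")
        for line in lines
        if "syncd" in line and WORRYING.search(line)
    ]


# Hypothetical placeholder lines, NOT real SONiC log output.
sample = [
    "placeholder-1 INFO syncd: started\n",
    "placeholder-2 ERR syncd: example failure message\n",
    "placeholder-3 kernel: syncd: segfault (example)\n",
    "placeholder-4 ERR swss: unrelated example message\n",
]

for line in worrying_syncd_lines(sample):
    print(line)
```

On a switch, the same function could be fed an open file object for the syslog, assuming it lives at the usual /var/log/syslog path.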

jimthedj65 commented 1 year ago

Thanks, that's helpful. dmesg in Aboot doesn't throw any errors, so it must be a software issue. I will load 202205 and see where we go; my feeling is that Barefoot updates stopped 10 months ago and it may well be 202111 that is stable.

When installing new images I see the output below. Can it be ignored, or is there a setting somewhere to allow TPM boot from flash?

Error: TPM_BAD_PARAMETER
TPM: failed to deassert physical presence
Error: TPM_BAD_PARAMETER
TPM: failed to lock physical presence command
unzip: short read

Staphylo commented 1 year ago

Aboot doesn't have much to do with SONiC; it's just a combined BIOS/bootloader.

Trying a stable branch is a good option. FYI, though, our internal testing is doing well on master from a few weeks ago.

On this platform the TPM messages are harmless and you can safely ignore them. The TPM mostly matters for secureboot, which this product does not support.

jimthedj65 commented 1 year ago

Thanks, Staphylo that's great; once I get these running I can start my encryption acceleration dev. Thanks for all your help.

jimthedj65 commented 1 year ago

As a sanity check, I am installing from Aboot by doing the following:

Aboot:~# cd /mnt/flash
Aboot:~# cp -a ./* /mnt/usb1/
Aboot:~# cp /mnt/usb1/sonic-aboot-barefoot.swi .
Aboot:~# boot /mnt/flash/sonic-aboot-barefoot.swi

Am I missing any steps? It should be this straightforward, and it creates the boot-config for the image once it loads, right?

Staphylo commented 1 year ago

Your solution works but you have to leave /mnt/flash before running boot. Otherwise the boot process will not be able to umount /mnt/flash and exit early.

The way I usually do it when I have to is:

cd /mnt/flash
wget http://some.url/to/sonic-aboot-barefoot.swi
echo SWI=flash:sonic-aboot-barefoot.swi > boot-config
sync
reboot

The reason I prefer reboot over just boot is that boot doesn't work on secureboot platforms, so I prefer having only one workflow.

jimthedj65 commented 1 year ago

Thanks, I will document your process; it seems more robust. This explains my earlier post: my first attempt was not saving the config because I was using boot only, so it all makes sense now.

Staphylo commented 1 year ago

Just a quick note, if you have a dhcp server you can run udhcpc to get a management IP address in Aboot. Otherwise you'll need to configure your network statically via the console.

jimthedj65 commented 1 year ago

Oh, that's really helpful, and then I could just scp it over, right? I was about to come over and ask if that was possible.

jimthedj65 commented 1 year ago

I get the interfaces below. I tried udhcpc -q -i ma1 with no joy from my dnsmasq server. Are there any settings I need to set in dnsmasq?


```
1: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ma2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 0e:c9:53:be:84:d9 brd ff:ff:ff:ff:ff:ff
3: ma1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether fc:bd:67:62:bc:d0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::febd:67ff:fe62:bcd0/64 scope link
       valid_lft forever preferred_lft forever
```

jimthedj65 commented 1 year ago

The image size seems to get corrupted when the USB stick is inserted into the switch. I have tried unmounting and remounting. The stick constantly shows up as HPFS/NTFS. Does it have to be formatted with that type? I have been formatting it as FAT.

Staphylo commented 1 year ago

Yes, you should be able to use ssh/scp as a client from Aboot. There's no server, however. I've only ever run udhcpc without further parameters. I'm not sure whether any specific setup is required; I would have expected things to just work with basic DHCP options (ip, mask, gw, nameservers). If you have networking devices between your management port and your server, you'll need to set up your acls/dhcp_relay properly.

jimthedj65 commented 1 year ago

ok let me try the relay option

Staphylo commented 1 year ago

I believe vfat or ext2-3 should be fine. ext4 might be a bit problematic if your mkfs.ext4 uses recent options that might be unsupported by Aboot. But technically SONiC uses the flash as ext4 so that should be fine.

jimthedj65 commented 1 year ago

Hi Staphylo, I configured KexAlgorithms diffie-hellman-group1-sha1 on my ssh server side and added an IP to ma1 on the switch with ifconfig, and boom, I was able to scp the file over and it loaded 202205. Great, no more USBs lol

thanks for your help.

jimthedj65 commented 1 year ago

admin@sonic:~$ show platform pcieinfo
==============================Display PCIe Device===============================
bus:dev.fn 00:1f.6 - dev_id=0x8c24, Intel Corporation 8 Series Chipset Family Thermal Management Controller 8086:8c24
bus:dev.fn 00:1c.0 - dev_id=0x8c10, Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 8086:8c10
bus:dev.fn 06:00.0 - dev_id=0x0001, Arista Networks, Inc. (Device) [3475:0001]
bus:dev.fn 00:1c.4 - dev_id=0x8c18, Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #5 8086:8c18
bus:dev.fn 07:00.0 - dev_id=0x0010, (Vendor) (Device) [1d1c:0010]
bus:dev.fn ff:0b.3 - dev_id=0x0001, Arista Networks, Inc. (Device) [3475:0001]

jimthedj65 commented 1 year ago

success a working switch

jimthedj65 commented 1 year ago

I loaded 202205, ran sudo config hostname myhostname, then ran sudo config interface ip add eth0 192.168.1.2/24 192.168.1.254, then ran sudo aristasetup, which ran fine but complained about my PSU not being available. I then ran sudo config save and rebooted.

After the reboot I ran sudo show int status:

  Interface    Lanes    Speed    MTU    FEC    Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  -------  -------  -----  -----  -------  ------  ------  -------  ------  ----------

Now my interfaces don't show and the error on pcieinfo is back. It looks like I need to go back to 202111?

show platform pcieinfo
Traceback (most recent call last):
  File "/usr/local/bin/pcieutil", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1134, in invoke
    Command.invoke(self, ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pcieutil/main.py", line 78, in cli
    load_platform_pcieutil()
  File "/usr/local/lib/python3.9/dist-packages/pcieutil/main.py", line 52, in load_platform_pcieutil
    platform_path, _ = device_info.get_paths_to_platform_and_hwsku_dirs()
  File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 266, in get_paths_to_platform_and_hwsku_dirs
    hwsku_path = os.path.join(platform_path, hwsku)
  File "/usr/lib/python3.9/posixpath.py", line 90, in join
    genericpath._check_arg_types('join', a, *p)
  File "/usr/lib/python3.9/genericpath.py", line 152, in _check_arg_types
    raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'

jimthedj65 commented 1 year ago

Before I ran the Arista setup, pcieinfo was fine. My config_db.json changed after the setup, and all ports have been deleted.
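The traceback ends in os.path.join receiving None for the hwsku. One plausible explanation, which is an assumption and not confirmed in this thread, is that the hwsku is no longer present in the configuration after it was overwritten. The sketch below reproduces that failure mode with a clearer error; the helper name `hwsku_dir` is hypothetical and the DEVICE_METADATA layout is assumed from the usual config_db.json schema.

```python
import json
import os


def hwsku_dir(config, platform_path):
    """Build the HwSKU directory path, failing clearly when hwsku is absent."""
    # Assumed layout: DEVICE_METADATA -> localhost -> hwsku.
    hwsku = config.get("DEVICE_METADATA", {}).get("localhost", {}).get("hwsku")
    if hwsku is None:
        # os.path.join(platform_path, None) would raise the TypeError seen above.
        raise ValueError("DEVICE_METADATA|localhost has no hwsku field")
    return os.path.join(platform_path, hwsku)


platform_path = "/usr/share/sonic/device/x86_64-arista_7170_64c"

good = json.loads('{"DEVICE_METADATA": {"localhost": {"hwsku": "Arista-7170-64C"}}}')
print(hwsku_dir(good, platform_path))

try:
    hwsku_dir({}, platform_path)
except ValueError as exc:
    print(f"config problem: {exc}")
```

Checking for that field in /etc/sonic/config_db.json is a cheap first test before reinstalling a different release.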

jimthedj65 commented 1 year ago

ok this is down to me not understanding when to hard code the config and when to use the config cli command

Staphylo commented 1 year ago

Just to be sure: you should never use the arista binary under regular operations. It is mostly a tool for us to develop and debug during escalations. Our platform is initialized automatically at boot via the systemd services platform-arista*, and all platform information is reported via the sonic platform API.

The first time you boot SONiC, it enters a particular mode based on /host/image-XXX/platform/firsttime. This mode will generate a default configuration for your product and then delete that file. However, I don't believe SONiC saves the configuration, so you need to run config save; that's probably your issue. You can also provide your own configuration for the first boot by putting either minigraph.xml or config_db.json on your flash.

Regarding the eth0 management ip, I'm not sure if you can use this CLI to configure it. (Maybe you can) If you want to configure a static mgmt IP you need to edit MGMT_INTERFACE and MGMT_PORT table in config_db.json. You can also do this dynamically by using sonic-db-cli CONFIG_DB hset 'MGMT_INTERFACE|eth0|<ip>' gwaddr <gwip> forced_mgmt_routes@ "<subnet1>,<subnet2>,..."
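For the config_db.json route, the two tables could look roughly like the output of the sketch below. The key format and the gwaddr field follow the hset command above; the MGMT_PORT fields (admin_status, alias) and the helper name `mgmt_tables` are assumptions for illustration, and the addresses are the ones used earlier in this thread.

```python
import json


def mgmt_tables(ip_with_prefix, gateway, port="eth0"):
    """Build MGMT_PORT and MGMT_INTERFACE fragments for config_db.json."""
    return {
        # Assumed minimal MGMT_PORT entry; check your release's schema.
        "MGMT_PORT": {port: {"admin_status": "up", "alias": port}},
        # Key is "<port>|<ip/prefix>", matching the hset key shown above.
        "MGMT_INTERFACE": {f"{port}|{ip_with_prefix}": {"gwaddr": gateway}},
    }


print(json.dumps(mgmt_tables("192.168.1.2/24", "192.168.1.254"), indent=4))
```

The printed fragment would be merged into the existing file by hand, followed by a configuration reload.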

jimthedj65 commented 1 year ago

Hi Staphylo, thanks for the updates. I got all that done, and now I have two stable switches, one on 202205 and one on a recent master. Do you have any guidelines on breaking out a port on the Arista? I have two ports to break out from 100GbE to 2 x 50GbE.

I tried sudo config interface breakout Ethernet8 '2x50G'

Do you want to Breakout the port, continue? [y/N]: y
[ERROR] Breakout feature is not available without platform.json file
Aborted!

I have confirmed that there is a platform.json file in my device directory.

thanks for all your help

jimthedj65 commented 1 year ago

Also, for the record, I did exactly this and it worked, although the CLI sometimes stalls after I issue this command.

sonic-db-cli CONFIG_DB hset "PORT|Ethernet256" alias Ethernet65/1 fec none index 65 lanes 256 mtu 9100 pfc_asym off speed 10000 admin_status up

sonic-db-cli CONFIG_DB hset "PORT|Ethernet257" alias Ethernet66/1 fec none index 66 lanes 257 mtu 9100 pfc_asym off speed 10000 admin_status up

jimthedj65 commented 1 year ago


Regarding the eth0 management IP: I ran sudo config interface ip add eth0 192.168.1.2/24 192.168.1.254 and it gave me local access via ssh.

Staphylo commented 1 year ago

We did not implement DPB (Dynamic Port Breakout) for this product, so you won't be able to use the helpers for that purpose. It's not in our plans to invest time doing this. It essentially requires 2 things:

  • platform.json should have the interfaces section filled in with the list of possible breakouts
  • The HwSku folder should have a hwsku.json configuration file with the default breakouts.

config interface breakout will read platform.json to check the breakout parameter and will update the BREAKOUT_CFG table in CONFIG_DB. However, on barefoot DUTs you should be able to perform the breakouts you want manually by populating the correct information in the PORT table of CONFIG_DB and reloading your configuration.

To convert a 1x100G to 2x50G you need to split the lanes and keep the index value:

Ethernet0:
   lanes: 1,2,3,4
   index: 1
   alias: Ethernet1/1
   speed: 100000

would become

Ethernet0:
  lanes: 1,2
  index: 1
  alias: Ethernet1/1
  speed: 50000
Ethernet2:
  lanes: 3,4
  index: 1
  alias: Ethernet1/3
  speed: 50000
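The lane split above can be sketched as a small helper. This is an illustration only: the function name `split_port` is hypothetical, and it assumes the host interface number and the alias suffix both follow the lane offset, which is what the Ethernet0/Ethernet2 and Ethernet1/1/Ethernet1/3 example suggests.

```python
def split_port(name, entry, parts):
    """Split one PORT entry into `parts` equal sub-ports, keeping the index."""
    lanes = entry["lanes"].split(",")
    if len(lanes) % parts != 0:
        raise ValueError("lane count must divide evenly between sub-ports")
    per_port = len(lanes) // parts
    base = int(name[len("Ethernet"):])
    speed = int(entry["speed"]) // parts
    result = {}
    for i in range(parts):
        offset = i * per_port
        # Assumption: host interface number = parent number + lane offset.
        result[f"Ethernet{base + offset}"] = {
            "lanes": ",".join(lanes[offset:offset + per_port]),
            "index": entry["index"],
            # Assumption: alias suffix = lane offset + 1, as in the example above.
            "alias": f"Ethernet{entry['index']}/{offset + 1}",
            "speed": str(speed),
        }
    return result


parent = {"lanes": "1,2,3,4", "index": "1", "alias": "Ethernet1/1", "speed": "100000"}
for port, fields in split_port("Ethernet0", parent, 2).items():
    print(port, fields)
```

The resulting entries still need the remaining PORT fields (mtu, admin_status and so on) before they are written to CONFIG_DB.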

Glad to know that config interface ip add eth0 works, thanks for sharing.

jimthedj65 commented 1 year ago

Thanks for your help. That's great. I was working through this.


/usr/share/sonic/device/x86_64-arista_7170_64c

```
"Ethernet8": {
    "index": "1,1,1,1",
    "lanes": "8,9,10,11",
    "breakout_modes": {
        "1x100G[40G]": ["Eth1"],
        "2x50G": ["Eth1/1", "Eth1/2"],
        "4x25G[10G]": ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"],
        "2x25G(2)+1x50G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"],
        "1x50G(2)+2x25G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"]
    }
},

"Ethernet28": {
    "index": "1,1,1,1",
    "lanes": "28,29,30,31",
    "breakout_modes": {
        "1x100G[40G]": ["Eth1"],
        "2x50G": ["Eth1/1", "Eth1/2"],
        "4x25G[10G]": ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"],
        "2x25G(2)+1x50G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"],
        "1x50G(2)+2x25G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"]
    }
}
```

jimthedj65 commented 1 year ago

We did not implement DPB (Dynamic Port Breakout) for this product so you won't be able to use the helpers for that purpose. It's not in our plans to invest time doing this. It essentially requires 2 things:

  • platform.json should have the interfaces section filled in with the list of possible breakouts
  • The HwSku folder should have a hwsku.json configuration file with the default breakouts.

config interface breakout will read platform.json to check the breakout parameter and will update the BREAKOUT_CFG table in CONFIG_DB. However on barefoot duts, you should be able to perform the breakouts you want manually by populating the correct information in the PORT table of CONFIG_DB and reloading your configuration.


Thank you for all the responses. I had the 7170 looped to my layer 2 UniFi switch, so it allowed me local access at least. I will post my port breakout updates; good to know Barefoot seems simpler.

jimthedj65 commented 1 year ago

Hi Staphylo, I am getting a bit lost on the alias:

```
        "Ethernet8": {
            "admin_status": "up",
            "alias": "Ethernet3/1",
            "index": "3",
            "lanes": "8,9",
            "mtu": "9100",
            "speed": "50000"
        },
        "Ethernet9": {
            "admin_status": "up",
            "alias": "Ethernet3/2",
            "index": "3",
            "lanes": "10,11",
            "mtu": "9100",
            "speed": "50000"
        },
```

Do I have that correct, or is it sequential after the lanes like below?

```
        },
        "Ethernet8": {
            "admin_status": "up",
            "alias": "Ethernet3/1",
            "index": "3",
            "lanes": "8,9",
            "mtu": "9100",
            "speed": "50000"
        },
        "Ethernet9": {
            "admin_status": "up",
            "alias": "Ethernet3/10",
            "index": "3",
            "lanes": "10,11",
            "mtu": "9100",
            "speed": "50000"
        },
```

Staphylo commented 1 year ago

So the alias column used to be the vendor-specific interface name. I believe it was intended to smooth the transition from the proprietary vendor NOS (in our case EOS) to SONiC. That's still mostly true; however, the host interface renaming work that is happening now uses the alias column to help with the transition.

So you can pretty much do what you want in the alias column as long as it doesn't conflict with a host interface name. The EOS naming convention is:

In your example you want Ethernet3/1 and Ethernet3/2. For the host interfaces you want to use Ethernet8 and Ethernet10, because if you later want to break out to 4x25G, Ethernet9 being already allocated would be a problem.
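The host interface naming rule described here can be checked mechanically. The following is a hedged sketch: `misnamed_subports` is a hypothetical helper, and it assumes that ports sharing an index belong to the same front-panel port, that their lanes are consecutive, and that the lowest-numbered host interface owns the lowest lane.

```python
def misnamed_subports(ports):
    """Map each badly numbered host interface to the name it should have."""
    groups = {}
    for name, entry in ports.items():
        number = int(name[len("Ethernet"):])
        first_lane = int(entry["lanes"].split(",")[0])
        groups.setdefault(entry["index"], []).append((number, first_lane, name))
    problems = {}
    for members in groups.values():
        # The lowest-numbered interface of the group is taken as the reference.
        base_number, base_lane, _ = min(members)
        for number, first_lane, name in members:
            expected = base_number + (first_lane - base_lane)
            if number != expected:
                problems[name] = f"Ethernet{expected}"
    return problems


attempt = {
    "Ethernet8": {"index": "3", "lanes": "8,9"},
    "Ethernet9": {"index": "3", "lanes": "10,11"},
}
print(misnamed_subports(attempt))
```

With the Ethernet8/Ethernet9 attempt from the question, the helper reports that Ethernet9 should be Ethernet10, which leaves Ethernet9 and Ethernet11 free for a later 4x25G split.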

jimthedj65 commented 1 year ago

Thanks for the sanity check, all makes sense now.

jimthedj65 commented 1 year ago

I updated the ports section of config_db.json to break out port 3 into 2 x 50G, as below:

        "Ethernet4": {
            "admin_status": "up",
            "alias": "Ethernet2/1",
            "index": "2",
            "lanes": "4,5,6,7",
            "mtu": "9100",
            "speed": "100000"
        },
        "Ethernet8": {
            "admin_status": "up",
            "alias": "Ethernet3/1",
            "index": "3",
            "lanes": "8,9",
            "mtu": "9100",
            "speed": "50000"
        },
        "Ethernet10": {
            "admin_status": "up",
            "alias": "Ethernet3/2",
            "index": "3",
            "lanes": "10,11",
            "mtu": "9100",
            "speed": "50000"
        },
        "Ethernet12": {
            "admin_status": "up",
            "alias": "Ethernet4/1",
            "index": "4",
            "lanes": "12,13,14,15",
            "mtu": "9100",
            "speed": "100000"
        }

And added the necessary port commands for their breakout, and whilst the config loads, the interfaces don't come up.

sudo vi /etc/sonic/config_db.json
sync
sudo reboot

Am I missing a step or do you think the platform.json and HWSKU files need to be populated on the 7170?
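Independently of the platform.json question, a hand-edited config_db.json is worth checking before a reboot. The sketch below only verifies that the text parses as JSON and that no lane is claimed by two PORT entries; `shared_lanes` is a hypothetical helper and the sample strings are cut-down illustrations, not a full configuration.

```python
import json
from collections import Counter


def shared_lanes(config_text):
    """Parse config_db.json text and list lanes used by more than one port."""
    # json.loads raises json.JSONDecodeError on syntax slips such as a
    # stray trailing comma.
    config = json.loads(config_text)
    usage = Counter()
    for entry in config.get("PORT", {}).values():
        usage.update(entry["lanes"].split(","))
    return sorted((lane for lane, count in usage.items() if count > 1), key=int)


good = '{"PORT": {"Ethernet8": {"lanes": "8,9"}, "Ethernet10": {"lanes": "10,11"}}}'
overlap = '{"PORT": {"Ethernet8": {"lanes": "8,9,10,11"}, "Ethernet10": {"lanes": "10,11"}}}'
broken = '{"PORT": {"Ethernet8": {"lanes": "8,9",}}}'

print(shared_lanes(good))
print(shared_lanes(overlap))
try:
    shared_lanes(broken)
except json.JSONDecodeError as exc:
    print(f"invalid JSON: {exc}")
```

A clean result here does not guarantee that the ports will come up; it only rules out two common editing mistakes.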