Mellanox / docker-sriov-plugin

Docker networking plugin for SRIOV and passthrough interfaces
Apache License 2.0

[question]: sriov CreateEndpoint failure #4

Open joaomsoares opened 6 years ago

joaomsoares commented 6 years ago

Hi, I have a Mellanox Innova IPsec card and I am trying to set up docker with SR-IOV. I managed to start a container via passthrough, however I get the following error when trying to boot one with SR-IOV:

"docker: Error response from daemon: failed to create endpoint kind_heisenberg on network mynet-sriov: NetworkDriver.CreateEndpoint: All devices in use [ f53229e321b1a7fdce364b6e8b7c749f34000b40075cd13839dc7d6eb98326ab ].."

Any help to overcome this would be appreciated. I tried to understand the problem and, according to the code, it seems to be related to the MAC address assignment. Below is the plugin log:

time="2018-03-08T16:57:16Z" level=debug msg="CreateNetwork IPv4Data len : [ 1 ]\n"
time="2018-03-08T16:57:16Z" level=debug msg="parseNetworkGenericOptions map[mode:sriov netdevice:enp4s0]"
max_vfs = 4 cur_vfs = 0 max_vfs = 4
time="2018-03-08T16:57:25Z" level=debug msg="DiscoverVF vfDev list length : [4]"
time="2018-03-08T16:57:25Z" level=debug msg="SRIOV CreateNetwork : [f53229e321b1a7fdce364b6e8b7c749f34000b40075cd13839dc7d6eb98326ab] IPv4Data : [ &{AddressSpace:LocalDefault Pool:194.168.1.0/24 Gateway:194.168.1.1/24 AuxAddresses:map[]} ]\n"
time="2018-03-08T16:57:38Z" level=debug msg="CreateEndpoint Called: [ &{NetworkID:f53229e321b1a7fdce364b6e8b7c749f34000b40075cd13839dc7d6eb98326ab EndpointID:ebb3c7d220ade467b8174e70ebe39232faecb98ce0bee7369e48851896173d5c Interface:0xc4201b20c0 Options:map[com.docker.network.endpoint.exposedports:[] com.docker.network.portmap:[]]} ]"
time="2018-03-08T16:57:38Z" level=debug msg="r.Interface: [ &{Address:194.168.1.2/24 AddressIPv6: MacAddress:} ]"

And here is the output of "ip link show":

6: enp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:ad:54:f2 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:22:33:44:55:66, spoof checking off, link-state auto
    vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

paravmellanox commented 6 years ago

Hi,

I need some more information. Did you assign the MAC address using the "ip link set vf" command before starting the container? 00:22:33:44:55:66 looks like a manually assigned MAC address.

Did you start the container with the --mac-address= option? Currently the plugin considers the MAC address of the VF's netdevice, not the one assigned using the ip link set command.

Something looks wrong, since the rest of the VF MAC addresses are zero.
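To make the MAC question concrete, here is a minimal Python sketch that pulls the per-VF administrative MAC addresses out of "ip link show" text. It assumes the "vf N MAC xx:xx:xx:xx:xx:xx" line format seen in the output posted in this thread (other iproute2 versions may print it differently), and the function names are made up for illustration. A non-zero value only means that something set an administrative MAC on that VF; it does not say whether a user or a tool did it.

```python
import re

# Matches the per-VF entries printed under a PF by `ip link show`, e.g.
# "vf 0 MAC 00:22:33:44:55:66, spoof checking off, link-state auto".
# Assumption: this is the format shown in this thread.
VF_ENTRY = re.compile(r"vf (\d+) MAC ([0-9a-fA-F:]{17})")

ZERO_MAC = "00:00:00:00:00:00"


def vf_admin_macs(ip_link_output):
    """Return {vf_index: mac} for every VF entry found in the text."""
    return {int(idx): mac.lower() for idx, mac in VF_ENTRY.findall(ip_link_output)}


def vfs_with_admin_mac(ip_link_output):
    """Return the sorted VF indexes whose administrative MAC is not all zeros."""
    macs = vf_admin_macs(ip_link_output)
    return sorted(idx for idx, mac in macs.items() if mac != ZERO_MAC)


if __name__ == "__main__":
    sample = (
        "6: enp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500\n"
        "    link/ether 24:8a:07:ad:54:f2 brd ff:ff:ff:ff:ff:ff\n"
        "    vf 0 MAC 00:22:33:44:55:66, spoof checking off, link-state auto\n"
        "    vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto\n"
    )
    print(vfs_with_admin_mac(sample))
```

The PF's own link/ether line is ignored on purpose, because only the "vf N MAC" entries describe the virtual functions.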

Some notes: When using the plugin, the user should never modify the MAC addresses of the VFs. The plugin assigns the MAC addresses. If you wish to pick a specific VF by MAC address, run ip link show. This lists the netdevs of all the VFs; pick the MAC address of one of those netdevs.

To avoid such hassle, you can use this support script: https://github.com/Mellanox/container_scripts/blob/master/docker_sriov_roce_mgmt

For example:
docker network create -d passthrough --subnet=194.168.1.0/24 -o netdevice=enp4s0 -o mode=sriov nw1
(you already did this successfully)

Now:
./docker_sriov_roce_mgmt list_netdevs enp4s0

Now that you know which netdev you are interested in:
./docker_sriov_roce_mgmt netdev2mac
This gives you the MAC address of the netdev you want to use.
docker run --mac-address= --net=nw1

Or you can skip the steps above and use this wrapper:

./docker_sriov_roce_mgmt run --netdev= --net=nw1

If you are not particular about which VF to use, then you can rely entirely on the plugin to find a free VF for you. In simpler configurations, you can just do:
docker run --net=nw1
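For readers who want to see where the netdev names and MAC addresses come from, the lookup can be sketched in Python by walking sysfs directly. This is not the support script's code; it is a rough sketch that assumes the usual Linux layout /sys/class/net/<pf>/device/virtfnN/net/<netdev>/address, and the function name and the sysfs_root parameter are hypothetical (the parameter only exists so the function can be pointed at a test directory).

```python
import os


def list_vf_netdevs(pf, sysfs_root="/sys"):
    """Return {vf_index: [(netdev_name, mac), ...]} for a physical function.

    A VF whose driver did not create a netdevice has no "net" directory
    and is reported with an empty list.
    """
    device_dir = os.path.join(sysfs_root, "class", "net", pf, "device")
    result = {}
    for entry in sorted(os.listdir(device_dir)):
        suffix = entry[len("virtfn"):]
        if not entry.startswith("virtfn") or not suffix.isdigit():
            continue  # not a virtfnN link
        net_dir = os.path.join(device_dir, entry, "net")
        netdevs = []
        if os.path.isdir(net_dir):
            for name in sorted(os.listdir(net_dir)):
                # Each netdev exposes its current MAC in the "address" file.
                with open(os.path.join(net_dir, name, "address")) as f:
                    netdevs.append((name, f.read().strip()))
        result[int(suffix)] = netdevs
    return result


if __name__ == "__main__":
    for vf, netdevs in sorted(list_vf_netdevs("enp4s0").items()):
        print(vf, netdevs)
```

A MAC taken from this listing is the kind of value that would be passed to docker run --mac-address=, assuming the VF actually has a netdevice.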

joaomsoares commented 6 years ago

Hi, thanks for the super quick answer, but I still can't make it work. You were right: the first address had been assigned manually via ip link set, which was a mistake of mine. I tried to set it back to 0 (ip link set ... mac 0) and then ran docker run, but the error is still there. In fact, I even removed the VFs and brought them back up, but still no luck.

docker run --net=mynet-sriov -it a1a3b055c1f9 bin/bash
docker: Error response from daemon: failed to create endpoint relaxed_almeida on network mynet-sriov: NetworkDriver.CreateEndpoint: All devices in use [ e4df4be7ea439460dc16a9952e4b3e508482abd49700603d5ae307da3b918769 ]..

I even tried to run the manual script, and also found another "error", which makes me think there might be some other issue (?)

./docker_sriov_roce_mgmt list_netdevs enp4s0
list_netdevs enp4s0
ls: cannot access '/sys/class/net/enp4s0/device/virtfn0/net': No such file or directory
ls: cannot access '/sys/class/net/enp4s0/device/virtfn1/net': No such file or directory
ls: cannot access '/sys/class/net/enp4s0/device/virtfn2/net': No such file or directory
ls: cannot access '/sys/class/net/enp4s0/device/virtfn3/net': No such file or directory

Any further hints to overcome this are most welcome!

paravmellanox commented 6 years ago

This seems like an issue that is not related to this plugin. Can you please share the output of:

  1. uname -a
  2. ls -l /sys/class/net/enp4s0/
  3. ls -l /sys/class/net/enp4s0/virtfn0/
  4. ip link show

joaomsoares commented 6 years ago

Here they are:

  1. uname -a

Linux ct-analytcis-2 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  2. ls -l /sys/class/net/enp4s0/

total 0
-r--r--r-- 1 root root 4096 Mar 9 16:40 addr_assign_type
-r--r--r-- 1 root root 4096 Mar 9 16:40 address
-r--r--r-- 1 root root 4096 Mar 9 16:40 addr_len
-r--r--r-- 1 root root 4096 Mar 9 16:40 broadcast
-rw-r--r-- 1 root root 4096 Mar 9 16:40 carrier
-r--r--r-- 1 root root 4096 Mar 9 16:40 carrier_changes
drwxr-xr-x 2 root root 0 Mar 9 16:40 debug
lrwxrwxrwx 1 root root 0 Mar 9 16:40 device -> ../../../0000:04:00.0
-r--r--r-- 1 root root 4096 Mar 9 16:40 dev_id
-r--r--r-- 1 root root 4096 Mar 9 16:40 dev_port
-r--r--r-- 1 root root 4096 Mar 9 16:40 dormant
-r--r--r-- 1 root root 4096 Mar 9 16:40 duplex
drwxr-xr-x 4 root root 0 Mar 9 16:40 ecn
-rw-r--r-- 1 root root 4096 Mar 9 16:40 flags
-rw-r--r-- 1 root root 4096 Mar 9 16:40 gro_flush_timeout
-rw-r--r-- 1 root root 4096 Mar 9 16:40 ifalias
-r--r--r-- 1 root root 4096 Mar 9 16:40 ifindex
-r--r--r-- 1 root root 4096 Mar 9 16:40 iflink
-r--r--r-- 1 root root 4096 Mar 9 16:40 link_mode
-rw-r--r-- 1 root root 4096 Mar 9 16:40 mtu
-r--r--r-- 1 root root 4096 Mar 9 16:40 name_assign_type
-rw-r--r-- 1 root root 4096 Mar 9 16:40 netdev_group
-r--r--r-- 1 root root 4096 Mar 9 16:40 operstate
-r--r--r-- 1 root root 4096 Mar 9 16:40 phys_port_id
-r--r--r-- 1 root root 4096 Mar 9 16:40 phys_port_name
-r--r--r-- 1 root root 4096 Mar 9 16:40 phys_switch_id
drwxr-xr-x 2 root root 0 Mar 9 16:40 power
-rw-r--r-- 1 root root 4096 Mar 9 16:40 proto_down
drwxr-xr-x 2 root root 0 Mar 9 16:40 qos
drwxr-xr-x 66 root root 0 Mar 9 16:40 queues
drwxr-xr-x 2 root root 0 Mar 9 16:40 settings
-r--r--r-- 1 root root 4096 Mar 9 16:40 speed
drwxr-xr-x 2 root root 0 Mar 9 16:40 statistics
lrwxrwxrwx 1 root root 0 Mar 9 16:40 subsystem -> ../../../../../../class/net
-rw-r--r-- 1 root root 4096 Mar 9 16:40 tx_queue_len
-r--r--r-- 1 root root 4096 Mar 9 16:40 type
-rw-r--r-- 1 root root 4096 Mar 9 16:40 uevent

  3. I assume you meant ls -l /sys/class/net/enp4s0/device/virtfn0/

total 0
-rw-r--r-- 1 root root 4096 Mar 9 16:42 broken_parity_status
-r--r--r-- 1 root root 4096 Mar 9 16:42 class
-rw-r--r-- 1 root root 4096 Mar 9 16:42 config
-r--r--r-- 1 root root 4096 Mar 9 16:42 consistent_dma_mask_bits
-rw-r--r-- 1 root root 4096 Mar 9 16:42 d3cold_allowed
-r--r--r-- 1 root root 4096 Mar 9 16:42 device
-r--r--r-- 1 root root 4096 Mar 9 16:42 dma_mask_bits
-rw-r--r-- 1 root root 4096 Mar 9 16:42 driver_override
-rw-r--r-- 1 root root 4096 Mar 9 16:42 enable
-r--r--r-- 1 root root 4096 Mar 9 16:42 irq
-r--r--r-- 1 root root 4096 Mar 9 16:42 local_cpulist
-r--r--r-- 1 root root 4096 Mar 9 16:42 local_cpus
-r--r--r-- 1 root root 4096 Mar 9 16:42 modalias
-rw-r--r-- 1 root root 4096 Mar 9 16:42 msi_bus
-rw-r--r-- 1 root root 4096 Mar 9 16:42 numa_node
lrwxrwxrwx 1 root root 0 Mar 9 16:42 physfn -> ../0000:04:00.0
drwxr-xr-x 2 root root 0 Mar 9 16:42 power
--w------- 1 root root 4096 Mar 9 16:42 reset
-r--r--r-- 1 root root 4096 Mar 9 16:42 resource
-rw------- 1 root root 2097152 Mar 9 16:42 resource0
-rw------- 1 root root 2097152 Mar 9 16:42 resource0_wc
lrwxrwxrwx 1 root root 0 Mar 9 11:08 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root 4096 Mar 9 16:42 subsystem_device
-r--r--r-- 1 root root 4096 Mar 9 16:42 subsystem_vendor
-rw-r--r-- 1 root root 4096 Mar 9 11:08 uevent
-r--r--r-- 1 root root 4096 Mar 9 11:08 vendor
-rw------- 1 root root 32768 Mar 9 16:42 vpd

  4. ip link show

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 54:9f:35:20:8f:f8 brd ff:ff:ff:ff:ff:ff
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 54:9f:35:20:8f:f9 brd ff:ff:ff:ff:ff:ff
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 54:9f:35:20:8f:fa brd ff:ff:ff:ff:ff:ff
5: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 54:9f:35:20:8f:fb brd ff:ff:ff:ff:ff:ff
6: enp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:ad:54:f2 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
7: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:e4:81:98:33 brd ff:ff:ff:ff:ff:ff

paravmellanox commented 6 years ago

From the output of command 4, it appears that the netdevices for the VFs were not created for some reason. I suggest that you talk to Mellanox tech support first to make sure that these netdevices show up. You should share /var/log/messages along with the output of ls -l /sys/class/net/enp4s0/device/

joaomsoares commented 6 years ago

Thanks for the reply. You mean output of command 4 or command 3? What should be the expected outcome of the command? In the meantime I'll reach out to Mellanox tech support as well.

paravmellanox commented 6 years ago

The 4th command, ip link show. It needs to show the list of netdevices that belong to these VFs. Sometimes the vfio driver takes over the VFs if a past KVM setup/configuration exists. In that case the netdevices may not be created. So let us first confirm that the netdevices of the VFs are created. If you can share /var/log/messages, it will give a quick hint.

paravmellanox commented 6 years ago

If you share the (pretty long) output of lspci -vvv, it will show which driver (mlx5_core or vfio) owns the VFs, which might shed light on why the netdevices are not created.
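As a shorter alternative to reading the full lspci -vvv output, the driver bound to each VF can also be read from sysfs, where a bound PCI device has a "driver" symlink pointing at its driver directory. The Python sketch below assumes that standard layout under /sys/class/net/<pf>/device/virtfnN/; the function names and the sysfs_root parameter are hypothetical and only serve the illustration.

```python
import os


def bound_driver(pci_device_dir):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = os.path.join(pci_device_dir, "driver")
    if not os.path.islink(link):
        return None
    # The symlink target ends in the driver name, e.g. ".../drivers/mlx5_core".
    return os.path.basename(os.readlink(link))


def vf_drivers(pf, sysfs_root="/sys"):
    """Return {vf_index: driver_name_or_None} for every VF of a physical function."""
    device_dir = os.path.join(sysfs_root, "class", "net", pf, "device")
    drivers = {}
    for entry in os.listdir(device_dir):
        suffix = entry[len("virtfn"):]
        if entry.startswith("virtfn") and suffix.isdigit():
            drivers[int(suffix)] = bound_driver(os.path.join(device_dir, entry))
    return drivers


if __name__ == "__main__":
    for vf, driver in sorted(vf_drivers("enp4s0").items()):
        print("vf %d: %s" % (vf, driver or "no driver bound"))
```

A result of None for a VF means no driver is bound to it at all, which is a different situation from vfio owning it, and in either case no netdevice would be expected.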

joaomsoares commented 6 years ago

Trying to shorten the output (including the native card and one VF; it seems mlx5_core owns both):

04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Interrupt: pin A routed to IRQ 66
    Region 0: Memory at 33ffc000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [48] Vital Product Data
        Product Name: Innova IPsec 4 Lx EN Adapter, single-port QSFP, 10/40GbE, PCIe3.0 x8, HHHL, tall bracket, ROHS R6
        Read-only fields:
            [PN] Part number: MNV101511A-BCIT
            [EC] Engineering changes: A6
            [V2] Vendor specific: MNV101511A-BCIT
            [SN] Serial number: MT1712X01617
            [V3] Vendor specific: bef70ecc3f0fe7118000248a07ad54f2
            [VA] Vendor specific: MLX:MODL=CX4732A:MN=MLNX:CSKU=V2:UUID=V3:PCI=0
            [V0] Vendor specific: PCIeGen3 x8
            [RV] Reserved: checksum good, 0 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
        AERCap: First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 4, Total VFs: 4, Number of VFs: 4, Function Dependency Link: 00
        VF offset: 1, stride: 1, Device ID: 1016
        Supported Page Size: 000007ff, System Page Size: 00000001
        Region 0: Memory at 0000033ffe000000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [1c0 v1] #19
    Capabilities: [230 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
    Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Region 0: [virtual] Memory at 33ffe000000 (64-bit, prefetchable) [size=2M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [48] Vital Product Data
        Product Name: Innova IPsec 4 Lx EN Adapter, single-port QSFP, 10/40GbE, PCIe3.0 x8, HHHL, tall bracket, ROHS R6
        Read-only fields:
            [PN] Part number: MNV101511A-BCIT
            [EC] Engineering changes: A6
            [V2] Vendor specific: MNV101511A-BCIT
            [SN] Serial number: MT1712X01617
            [V3] Vendor specific: bef70ecc3f0fe7118000248a07ad54f2
            [VA] Vendor specific: MLX:MODL=CX4732A:MN=MLNX:CSKU=V2:UUID=V3:PCI=0
            [V0] Vendor specific: PCIeGen3 x8
            [RV] Reserved: checksum good, 0 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable- Count=12 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Kernel modules: mlx5_core

joaomsoares commented 6 years ago

As for /var/log/messages ... which one are we talking about?

ls -l /var/log/
total 42472
-rw-r--r-- 1 root root 0 Feb 1 07:35 alternatives.log
-rw-r--r-- 1 root root 49904 Jan 31 15:22 alternatives.log.1
-rw-r--r-- 1 root root 1731 Jun 1 2015 alternatives.log.2.gz
-rw-r--r-- 1 root root 3401 May 27 2015 alternatives.log.3.gz
-rw-r----- 1 root adm 0 Mar 10 07:35 apport.log
-rw-r----- 1 root adm 113 Mar 9 11:05 apport.log.1
-rw-r----- 1 root adm 354 Jul 2 2015 apport.log.2.gz
-rw-r----- 1 root adm 341 Jun 23 2015 apport.log.3.gz
-rw-r----- 1 root adm 305 Jun 17 2015 apport.log.4.gz
-rw-r----- 1 root adm 339 Jun 16 2015 apport.log.5.gz
-rw-r----- 1 root adm 270 Jun 15 2015 apport.log.6.gz
-rw-r----- 1 root adm 430 Jun 3 2015 apport.log.7.gz
drwxr-xr-x 2 root root 4096 Mar 1 07:35 apt
-rw-r----- 1 syslog adm 91118 Mar 14 18:17 auth.log
-rw-r----- 1 syslog adm 97196 Mar 11 07:30 auth.log.1
-rw-r----- 1 syslog adm 8516 Mar 5 07:30 auth.log.2.gz
-rw-r----- 1 syslog adm 1662 Feb 25 07:30 auth.log.3.gz
-rw-r----- 1 syslog adm 2167 Feb 19 07:30 auth.log.4.gz
-rw-r--r-- 1 root root 141 Mar 14 18:03 boot.log
-rw-r--r-- 1 root root 61499 Feb 18 2015 bootstrap.log
-rw------- 1 root utmp 4992 Mar 14 17:49 btmp
-rw-rw---- 1 root utmp 768 Feb 27 12:29 btmp.1
drwxr-xr-x 2 root root 4096 Mar 14 18:35 containers
drwxr-xr-x 2 root root 4096 Mar 14 07:35 cups
drwxr-xr-x 3 root root 4096 Jan 31 11:37 dist-upgrade
-rw-r----- 1 root adm 107486 Jan 31 10:52 dmesg
-rw-r----- 1 root adm 109481 Jan 31 09:37 dmesg.0
-rw-r----- 1 root adm 20574 Dec 16 13:52 dmesg.1.gz
-rw-r----- 1 root adm 20393 Sep 21 09:13 dmesg.2.gz
-rw-r----- 1 root adm 20927 Sep 4 2017 dmesg.3.gz
-rw-r----- 1 root adm 20854 Mar 2 2017 dmesg.4.gz
-rw-r--r-- 1 root root 507015 Mar 13 11:47 dpkg.log
-rw-r--r-- 1 root root 12259 Feb 27 14:46 dpkg.log.1
-rw-r--r-- 1 root root 216904 Jan 31 15:27 dpkg.log.2.gz
-rw-r--r-- 1 root root 431 Sep 17 2015 dpkg.log.3.gz
-rw-r--r-- 1 root root 17023 Jun 1 2015 dpkg.log.4.gz
-rw-r--r-- 1 root root 117158 May 27 2015 dpkg.log.5.gz
-rw-r--r-- 1 root root 32288 Jan 31 11:28 faillog
-rw-r--r-- 1 root root 4303 Jan 31 11:37 fontconfig.log
drwxr-xr-x 2 root root 4096 Feb 18 2015 fsck
-rw-r--r-- 1 root root 1860 Mar 14 18:03 gpu-manager.log
drwxr-xr-x 3 root root 4096 Feb 18 2015 hp
drwxrwxr-x 2 root root 4096 May 26 2015 installer
-rw-r----- 1 syslog adm 2518583 Mar 14 18:35 kern.log
-rw-r----- 1 syslog adm 2399042 Mar 11 07:33 kern.log.1
-rw-r----- 1 syslog adm 101856 Mar 4 07:29 kern.log.2.gz
-rw-r----- 1 syslog adm 2766 Feb 27 15:00 kern.log.3.gz
-rw-r----- 1 syslog adm 900 Feb 22 15:39 kern.log.4.gz
-rw-rw-r-- 1 root utmp 294628 Mar 14 18:04 lastlog
drwxr-xr-x 2 root root 4096 Mar 14 07:35 lightdm
-rw-r--r-- 1 root root 0 Feb 1 07:35 pm-powersave.log
-rw-r--r-- 1 root root 16078 Jan 31 10:52 pm-powersave.log.1
-rw-r--r-- 1 root root 870 Dec 16 13:52 pm-powersave.log.2.gz
-rw-r--r-- 1 root root 870 Sep 21 09:13 pm-powersave.log.3.gz
-rw-r--r-- 1 root root 841 Sep 4 2017 pm-powersave.log.4.gz
drwxr-xr-x 9 root root 4096 Mar 14 16:43 pods
drwxr-xr-x 2 root root 4096 Jun 1 2015 rstudio-server
drwxr-x--- 2 root adm 4096 Jan 29 2015 samba
drwx------ 2 speech-dispatcher root 4096 Feb 19 2014 speech-dispatcher
-rw-r----- 1 syslog adm 10109449 Mar 14 18:36 syslog
-rw-r----- 1 syslog adm 20227869 Mar 14 07:35 syslog.1
-rw-r----- 1 syslog adm 1032663 Mar 13 07:35 syslog.2.gz
-rw-r----- 1 syslog adm 868848 Mar 12 07:35 syslog.3.gz
-rw-r----- 1 syslog adm 872004 Mar 11 07:35 syslog.4.gz
-rw-r----- 1 syslog adm 1055977 Mar 10 07:35 syslog.5.gz
-rw-r----- 1 syslog adm 947589 Mar 9 07:35 syslog.6.gz
-rw-r----- 1 syslog adm 765619 Mar 8 07:35 syslog.7.gz
-rw-r--r-- 1 root root 631464 Jan 31 10:51 udev
drwxr-x--- 2 root adm 4096 May 26 2015 unattended-upgrades
drwxr-xr-x 2 root root 12288 Feb 2 07:35 upstart
-rw-rw-r-- 1 root utmp 95616 Mar 14 18:04 wtmp
-rw-rw-r-- 1 root utmp 5376 Feb 27 18:30 wtmp.1
-rw-r--r-- 1 root root 24489 Mar 14 18:03 Xorg.0.log
-rw-r--r-- 1 root root 24721 Mar 14 17:53 Xorg.0.log.old

paravmellanox commented 6 years ago

/var/log/syslog and /var/log/dmesg should have some driver failure logs for the VFs.

joaomsoares commented 6 years ago

Right...dmesg shows some errors:

Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.310139] (0000:04:00.0): E-Switch: E-Switch enable SRIOV: nvfs(4) mode (1)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.458403] (0000:04:00.0): E-Switch: SRIOV enabled: active vports(5)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.562856] pci 0000:04:00.1: [15b3:1016] type 00 class 0x020000
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.563383] pci 0000:04:00.1: Max Payload Size set to 256 (was 128, max 512)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.563963] iommu: Adding device 0000:04:00.1 to group 48
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564165] mlx5_core 0000:04:00.1: enabling device (0000 -> 0002)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564634] mlx5_core 0000:04:00.1: firmware version: 14.98.3410
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564674] mlx5_core 0000:04:00.1: mlx5_pcie_print_link_status:411:(pid 143482): PCIe width is lower than device's capability
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564678] mlx5_core 0000:04:00.1: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564681] mlx5_core 0000:04:00.1: PCIe link width is x0, device supports x8
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564751] DMAR: 64bit 0000:04:00.1 uses identity mapping
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.237041] mlx5_core 0000:04:00.1: mlx5_cmd_check:731:(pid 143482): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x5a98c0)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.237048] mlx5_core 0000:04:00.1: FPGA: mlx5_fpga_device_load_check:152:(pid 143482): Failed to query status: -22
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.237051] mlx5_core 0000:04:00.1: fpga device start failed -22
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.259140] mlx5_core 0000:04:00.1: tools char device 243:2 destroyed
Mar 15 14:21:59 ct-analytcis-2 kernel: [73145.637372] mlx5_core 0000:04:00.1: mlx5_load_one failed with error code -22
Mar 15 14:21:59 ct-analytcis-2 kernel: [73145.637538] mlx5_core: probe of 0000:04:00.1 failed with error -22
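The last line of this log is the one that matters: the driver probe for the VF failed. For anyone hitting something similar, a small Python sketch like the following can pick such lines out of a long kernel log. It assumes the "<driver>: probe of <PCI address> failed with error <n>" wording seen in this log, and the function name is made up for illustration.

```python
import re

# Matches kernel messages such as
# "mlx5_core: probe of 0000:04:00.1 failed with error -22".
PROBE_FAILED = re.compile(
    r"(?P<driver>\w+): probe of "
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"failed with error (?P<errno>-?\d+)"
)


def find_probe_failures(log_text):
    """Return a list of (driver, pci_address, errno) for every failed probe."""
    return [
        (m.group("driver"), m.group("bdf"), int(m.group("errno")))
        for m in PROBE_FAILED.finditer(log_text)
    ]


if __name__ == "__main__":
    # On this system the kernel messages were found in /var/log/syslog.
    with open("/var/log/syslog", errors="replace") as f:
        for driver, bdf, errno in find_probe_failures(f.read()):
            print("%s failed to probe %s (error %d)" % (driver, bdf, errno))
```

Error -22 corresponds to EINVAL on Linux, which matches the "bad parameter" status reported a few lines earlier in the same log.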

paravmellanox commented 6 years ago

Now it makes sense. It seems the driver fails to load on the VF with the given error. This is helpful. I suggest you contact tech support to get this error resolved, leaving the plugin and containers out of the picture to get faster results. Once that is done, it is likely that the plugin will work. I do not have access to Innova cards, and this piece of software is not in my domain.

I will add more checks at the plugin level to make sure that network creation fails if it encounters this kind of unexpected error (instead of failing at container creation time). Thanks for the logs.
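Such an early check could look roughly like the Python sketch below. It is only an illustration of the idea, not the plugin's actual code (the plugin is a separate code base). It assumes the standard sysfs attribute sriov_numvfs and the virtfnN/net layout; the exception class, the function name, and the sysfs_root parameter are hypothetical.

```python
import os


class VfNetdevMissing(RuntimeError):
    """Raised when an enabled VF has no netdevice."""


def check_vf_netdevs(pf, sysfs_root="/sys"):
    """Return the number of enabled VFs, or raise if any VF lacks a netdevice."""
    device_dir = os.path.join(sysfs_root, "class", "net", pf, "device")
    # sriov_numvfs holds the number of currently enabled VFs.
    with open(os.path.join(device_dir, "sriov_numvfs")) as f:
        num_vfs = int(f.read().strip())

    missing = []
    for idx in range(num_vfs):
        net_dir = os.path.join(device_dir, "virtfn%d" % idx, "net")
        # No "net" directory (or an empty one) means the VF driver did not
        # create a netdevice, for example because its probe failed.
        if not os.path.isdir(net_dir) or not os.listdir(net_dir):
            missing.append(idx)

    if missing:
        raise VfNetdevMissing("%s: no netdevice for VF(s) %s" % (pf, missing))
    return num_vfs


if __name__ == "__main__":
    try:
        print("all %d VFs have netdevices" % check_vf_netdevs("enp4s0"))
    except (VfNetdevMissing, OSError) as exc:
        print("check failed: %s" % exc)
```

Running a check of this kind at network creation time would surface the problem before any container is started, which is the behaviour described above.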