CumulusNetworks / ptm

Prescriptive Topology Daemon
Eclipse Public License 1.0

ptmd stops with certain lldp output #4

Closed cpalmer9 closed 9 years ago

cpalmer9 commented 9 years ago

Hi there, I am finding that my ptmd service dies whenever the lldpd client output includes something like this:

Interface:    fpti1_0_3, via: LLDP, RID: 14, Time: 0 day, 02:23:49
  Chassis:
    ChassisID:    mac 00:1e:67:c5:66:2e
  Port:
    PortID:       mac 00:1e:67:c5:66:2e

(No other fields are seen.) My current workaround is to ignore the affected interfaces using LLDPD; however, I'm more interested in seeing whether PTM can withstand this. Thanks

kanrajag commented 9 years ago

Hi Chris, can you show me what your topology file looks like? A snapshot of the gdb core stack trace and ptmd.log would be great.

cpalmer9 commented 9 years ago

Hi, can you instruct me on how to generate the gdb core stack trace?

kanrajag commented 9 years ago

Chris, when you said the process is dying, I assumed that a core dump is generated. If so, you can open the core file in gdb and run "bt"; that would tell me which part of the code is crashing the process.

If there is no core file, then just give me the ptmd.log and I will try to figure it out.

Please send me the topology file and ptmd.log to begin with.

thanks

cpalmer9 commented 9 years ago

kanrajag, Can I email you these files (and I need your address)?

kanrajag commented 9 years ago

Chris, please just attach them to this thread. You should see the "selecting them" link at the bottom of the edit box. Thanks

cpalmer9 commented 9 years ago

I just get: Unfortunately, we don't support that file type. Try again with a PNG, GIF, or JPG. Can I email them? I can provide a core file.

kanrajag commented 9 years ago

OK. I might not be able to receive them via email if the file attachments are too big. Please use our FTP site to upload the tarball (logs + core): ftp.cumulusnetworks.com, authenticating with user "anonymous" and providing your email address as the password.

thanks

cpalmer9 commented 9 years ago

Thanks - uploaded MD5 (issue4.tgz) = 5c3f8894e3bfe4faf99137362de00ce5

kanrajag commented 9 years ago

Chris, I looked at the topology file. I am not sure why you are putting in two entries per "myswitch":"port". We only act on one of the entries. It would help if you explained your workflow here a bit.

For example:
"xxx.tor001.xxx":"fpti1_0_52" -- "xxx.ln007.xxx":"swp1";
"xxx.tor001.xxx":"fpti1_0_52" -- "xxx.ln007.xxx":"fpti1_0_1";

We recently fixed a similar issue in our Cumulus repo when we detect "duplicate" edge descriptions. I will try and repro your situation in-house as well.

If you need help making this work on Cumulus switches, please go through our regular support channels and you will get the latest code from our repo. If you need this on a non-Cumulus platform, what is the urgency for a fix? Can it wait until we push the latest from our repo to GitHub?

By the way, we are way behind on updating GitHub. I plan on updating it when we are done with the current work items, hopefully by the first week of April.

Thanks for testing PTM; feedback is welcome!

-k

cpalmer9 commented 9 years ago

Hi kanrajag, I don't think the topology file is the issue. The crash only happens when lldpd reports neighbors like the one I provided in the original comment.

The reason we included duplicate entries was that we don't always know the interface name of the adjacent host (can be different vendor, with different naming), so we include the variations. ptmd seems quite happy with this and ptmctl shows a 'pass' whenever it matches one of the endpoints (it doesn't show a 'fail' for the duplicates). In my opinion, this is great. If that behavior will change, then we'll need to figure out something else.

Thanks, Christopher

kanrajag commented 9 years ago

fixed with commit c9e9605e6d4f8d8eb3db7997152dcc313d4f4378

Basically, lldpd was returning a NULL portdescr (which is why the output was blank in your original post). This caused ptm to crash while processing it. I added a few checks to handle this condition.
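
The kind of guard described here can be sketched as follows. This is a minimal illustration in Python, not ptmd's C code and not the contents of the commit above; the `Neighbor` type and the `remote_label` helper are hypothetical names. The idea is only that a neighbor record with a missing field is reported as incomplete instead of being used as if it were always present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Neighbor:
    """One LLDP neighbor as seen on a local port (hypothetical shape)."""
    local_port: str
    sys_name: Optional[str] = None
    port_id: Optional[str] = None
    port_descr: Optional[str] = None


def remote_label(nbr: Neighbor, use_port_descr: bool = False) -> Optional[str]:
    """Return "sysname.port" for comparison, or None if a needed field is absent."""
    port = nbr.port_descr if use_port_descr else nbr.port_id
    if nbr.sys_name is None or port is None:
        # Sparse neighbor (e.g. only ChassisID and PortID): nothing to compare.
        return None
    return f"{nbr.sys_name}.{port}"


# A neighbor like the one in the original report: no SysName, no PortDescr.
sparse = Neighbor(local_port="fpti1_0_3", port_id="00:1e:67:c5:66:2e")
# A fully populated neighbor.
full = Neighbor("fpti1_0_49", "ln001.example.com", "fpti1_0_1", "fpti1_0_1")

print(remote_label(sparse))
print(remote_label(full))
```

With a check like this, a sparse neighbor simply yields no label, and the caller can mark the port as unverifiable rather than crash.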

Regarding the duplicate interfaces in topology.dot, I would like to understand how it is working for you. Based on the current design, PTM stores only one of the entries in its database and compares that entry with the LLDP neighbor. So depending on which entry is stored, the LLDP check could pass or fail when the nbr info is retrieved.

cpalmer9 commented 9 years ago

Thanks for fixing the issue I reported.

Regarding duplicate entries in topology.dot, this is how it's working for me currently:

[christopher@tor001 ~]$ ptmctl | grep 49
fpti1_0_49  pass    N/A     N/A   
[christopher@tor001 ~]$ grep 49 /etc/ptm.d/topology.dot 
         "tor001.example.com":"fpti1_0_49" -- "ln001.example.com":"swp1";
         "tor001.example.com":"fpti1_0_49" -- "ln001.example.com":"fpti1_0_1";
[christopher@tor001 ~]$ sudo lldpcli show neighbors | grep -A13 49
Interface:    fpti1_0_49, via: LLDP, RID: 5, Time: 8 days, 20:19:02
  Chassis:     
    ChassisID:    mac 00:e0:ec:31:4d:ae
    SysName:      ln001.example.com
    SysDescr:     ICOS Linux
    MgmtIP:       10.143.16.32
    Capability:   Bridge, off
    Capability:   Router, off
    Capability:   Wlan, off
    Capability:   Station, on
  Port:        
    PortID:       ifname fpti1_0_1
    PortDescr:    fpti1_0_1
-------------------------------------------------------------------------------

... as we don't always know what OS the remote end will be running (but will know the interface number). So we configure for both options. This is working great.

Would it be possible to allow a topology.dot config that supports 'duplicate' entries like this?

kanrajag commented 9 years ago

Can you show me your outputs for swp1 as well (the same way you showed me for fpti1_0_1)? The above works because PTM is storing the nbr info "ln001.example.com:fpti1_0_1" (the second line in the topo file overrides the first), so I am curious to see how the "swp1" entry works for you.

Also, each time the nbr changes, do you do anything on the host-side PTM (restart, reconfig, etc.)?

Do you need this to be supported just on the GitHub version, or on the Cumulus platform as well? If on Cumulus, then please raise an FR (feature request) through the support channels and we will take a look at it (it will eventually make its way into GitHub).

But to be clear, this is not a supported scenario and I am not sure (yet) how it is working for you.

cpalmer9 commented 9 years ago

Hi, you made a good comment. I reversed the 2 topology.dot entries for fpti1_0_49 then restarted ptmd. ptmctl fails as ptm seems to be using the "last seen" entry (to your point).

1428690900.895435 2015-04-10 18:35:00 ptm_lldp.c:499 Port fpti1_0_49 NOT matched with remote - Expected [ln001.example.com.swp1] != [ln001.example.com.fpti1_0_1]

Out of curiosity, does topology.dot support pattern matching or wildcards?

kanrajag commented 9 years ago

No, it does not support wildcarding/pattern matching.

So how did it work for you in the first place? I am not able to figure that out

cpalmer9 commented 9 years ago

I think it worked because we were hitting the 'last duplicate' entry for an interface, which happened to match what was actually there.
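
The "last duplicate wins" behavior discussed in this thread can be modeled with a small sketch. It assumes, as described above, that only one expected neighbor is kept per local port and that a later topology line overrides an earlier one. The `load_expected` helper and its regular expression are hypothetical, handle only edges with the local host on the left-hand side, and are not how ptmd actually parses topology.dot.

```python
import re

# Matches edges of the form "host":"port" -- "host":"port"
EDGE = re.compile(r'"([^"]+)":"([^"]+)"\s*--\s*"([^"]+)":"([^"]+)"')


def load_expected(dot_text: str, local_host: str) -> dict:
    """Map each local port to its expected (remote host, remote port)."""
    expected = {}
    for host, port, rhost, rport in EDGE.findall(dot_text):
        if host == local_host:
            # A later edge for the same local port replaces the earlier one.
            expected[port] = (rhost, rport)
    return expected


topo = '''
"tor001.example.com":"fpti1_0_49" -- "ln001.example.com":"swp1";
"tor001.example.com":"fpti1_0_49" -- "ln001.example.com":"fpti1_0_1";
'''

print(load_expected(topo, "tor001.example.com"))
```

With the lines in this order, the stored entry for fpti1_0_49 is fpti1_0_1, which matches the actual neighbor. Swapping the two lines stores swp1 instead, which is consistent with the "NOT matched" log message seen after reversing the entries.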

kanrajag commented 9 years ago

So Chris, were you (re)generating the topo file each time the nbr changed, and the matching entry just happened to be the last one?

cpalmer9 commented 9 years ago

It would just happen to be the last entry. We were not regenerating the topo file. We had even put hostnames on both sides of the "--" in the topo file that didn't belong to the local host. PTM didn't seem to mind that, so I think I assumed duplicate interface lines were OK too.

cpalmer9 commented 9 years ago

Does ptm's topology dot try to match against PortID or PortDescr ?

  Port:        
    PortID:       ifname fpti1_0_1
    PortDescr:    port_1

1428708291.968670 2015-04-10 23:24:51 ptm_lldp.c:499 Port fpti1_0_52 NOT matched with remote - Expected [ln007.example.com.port_1] != [ln007.example.com.fpti1_0_1]

[christopher@tor001 ~]$ sudo lldpcli show neighbors port fpti1_0_52
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    fpti1_0_52, via: LLDP, RID: 9, Time: 0 day, 00:37:25
  Chassis:     
    ChassisID:    mac 00:e0:ec:27:bd:42
    SysName:      ln007.example.com
    SysDescr:     ICOS Linux
    MgmtIP:       10.143.16.42
    Capability:   Bridge, off
    Capability:   Router, off
    Capability:   Wlan, off
    Capability:   Station, on
  Port:        
    PortID:       ifname fpti1_0_1
    PortDescr:    port_1
-------------------------------------------------------------------------------

kanrajag commented 9 years ago

The default is ifName. If you want to compare on PortDescr instead, you need to add this to the edge description: LLDP="match_type=portdescr"
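
The effect of the match type can be illustrated with a short sketch. Since DOT writes edge attributes in square brackets after the edge, the line would presumably look like `"tor001.example.com":"fpti1_0_52" -- "ln007.example.com":"port_1" [LLDP="match_type=portdescr"];` (an assumption about placement, not a tested config). The `port_matches` function below is a hypothetical Python model of the comparison, not ptmd's code.

```python
from typing import Optional


def port_matches(expected_port: str, port_id: Optional[str],
                 port_descr: Optional[str], match_type: str = "ifname") -> bool:
    """Compare the expected remote port with the LLDP field chosen by match_type."""
    if match_type == "portdescr":
        actual = port_descr
    elif match_type == "ifname":
        actual = port_id
    else:
        raise ValueError(f"unknown match_type: {match_type}")
    # A missing field never matches.
    return actual is not None and actual == expected_port


# Values from the fpti1_0_52 neighbor: PortID fpti1_0_1, PortDescr port_1.
print(port_matches("port_1", "fpti1_0_1", "port_1"))
print(port_matches("port_1", "fpti1_0_1", "port_1", match_type="portdescr"))
```

With the default, the expected port_1 is compared against the PortID fpti1_0_1 and fails, as in the log line above; with match_type=portdescr it is compared against PortDescr and passes.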

cpalmer9 commented 9 years ago

Ah, that's really good to know. I'm looking at using that in addition to 'ip link set dev XXX_1 alias YYY_1' to match on the PortDescr that 'ip link' can control.

kanrajag commented 9 years ago

So you will be able to set the alias on your upstream switch port to the same PortDescr value (irrespective of the upstream switch OS)?

cpalmer9 commented 9 years ago

That's the intention. For upstream switches where we don't control the description, we'll try to use an LLDP template to match on ifName.