Juniper / open-nti

Open Network Telemetry Collector build with open source tools
Apache License 2.0

Logical interface sensor doesn't work any more #125

Closed WhistMo closed 6 years ago

WhistMo commented 7 years ago

I upgraded open-nti to the latest version, and the logical interface sensor no longer works. It used to produce GUI output, but with the latest open-nti no data shows in the Grafana GUI. tcpdump shows that the packets from the routers reach the open-nti server. If I change the sensor to the interface level, the Grafana GUI shows data.

I tried the methods in Issue #106 ("No entries in influxdb"), but they don't seem to work because there is no devel tag (only master and consul). I tried changing it to consul in the Makefile, but the logical interfaces still don't work.

Software versions:
16.1R2.11 (logical interface usage used to work with the older version of open-nti)
16.2R1.6 (latest version of Junos software, but no luck)

Please help me out.

Many thanks

3fr61n commented 7 years ago

Hi @WhistMo

A few questions,

1.- Are you using other sensors? Are they working OK?

2.- Are you using the netconf collector? Is it working OK?

3.- In your upgrade, did you keep the InfluxDB database, or is this a fresh install? Could you please try deleting the influxdb database (via the InfluxDB web UI) and then creating it again?

Regards Efrain

WhistMo commented 7 years ago

Hello Efrain,

Please see my answers below:

1: The other sensors used to work, but after I upgraded to the latest open-nti they became very unstable. For example, if I add new sensors on the routers, the GUI stops showing any data, and I have to restart the open-nti services to make it work again.

2: The netconf collector is working. It defines a cron job that runs several commands to pull data from the routers and decode it. It works as before.

3: Deleting the influxdb database doesn't help. (Also, I don't start and stop open-nti in persistent mode, so I think that data is deleted after I stop open-nti anyway.)

Looking forward to your feedback and help!

Many thanks

ghost commented 7 years ago

Hi WhistMo,

2: Does the RPM parser work for you? I tried the latest version and the RPM parser does not work for me.

thx Al

3fr61n commented 7 years ago

Hi Guys,

Sorry for the late answer. I was trying to reproduce the issue, and so far I am having issues with Docker itself: somehow the sensor information is not reaching the jti container.

Could you please try the following steps to see whether we are facing the same issue?

1.- Open a terminal inside the jti container

docker exec -i -t opennti_input_jti /bin/bash

2.- Capture jti packets using tcpdump

sudo tcpdump -i eth0 -nn -v host

3.- Wait and see if any jti packets arrive

4.- Repeat Step 2/3 for your host (where containers are running)

The main idea is to check whether Docker port forwarding is working properly. In my setup, either no packets arrive or NAT is not done properly.
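The same port-forwarding check can be approximated without a router by sending a test datagram to a UDP port and seeing whether a listener receives it. What follows is only a minimal Python sketch: the helper names `open_listener`, `send_probe` and `wait_for_datagram` are hypothetical and not part of open-nti. Inside the real container fluentd already binds port 50000, so `open_listener` would only be usable on a spare port (or with the collector stopped); against the real port, `send_probe` can be run from another machine while tcpdump is watching.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def send_probe(host, port, payload=b"jti-probe"):
    """Send one UDP datagram to host:port and return the number of bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


def wait_for_datagram(sock, timeout=2.0):
    """Return (payload, sender_address), or None if nothing arrives in time."""
    sock.settimeout(timeout)
    try:
        return sock.recvfrom(65535)
    except socket.timeout:
        return None
```

For example, calling `send_probe("<docker-host>", 50000)` from a second machine while tcpdump runs in the container should show one small UDP packet if forwarding works. The probe is not valid JTI data, so fluentd may log a parse error for it.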

PS: If packets arrive, the next step is to enable debugging in fluentd.

regards Efrain

WhistMo commented 7 years ago

Hi Efrain,

Yes, the packets arrived.

bash-4.3$ sudo tcpdump -i eth0 -nn -v host 172.20.98.234
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:10:38.312445 IP (tos 0xc0, ttl 254, id 8969, offset 0, flags [+], proto UDP (17), length 1500)
    172.20.98.234.1000 > 172.17.0.4.50000: UDP, bad length 3199 > 1472
12:10:38.312447 IP (tos 0xc0, ttl 254, id 8969, offset 1480, flags [+], proto UDP (17), length 1500)
    172.20.98.234 > 172.17.0.4: ip-proto-17
12:10:38.312449 IP (tos 0xc0, ttl 254, id 8969, offset 2960, flags [none], proto UDP (17), length 267)
    172.20.98.234 > 172.17.0.4: ip-proto-17
12:10:38.315608 IP (tos 0xc0, ttl 254, id 8970, offset 0, flags [+], proto UDP (17), length 1500)
    172.20.98.234.1000 > 172.17.0.4.50000: UDP, bad length 3199 > 1472
12:10:38.315610 IP (tos 0xc0, ttl 254, id 8970, offset 1480, flags [+], proto UDP (17), length 1500)
    172.20.98.234 > 172.17.0.4: ip-proto-17
12:10:38.315612 IP (tos 0xc0, ttl 254, id 8970, offset 2960, flags [none], proto UDP (17), length 267)
    172.20.98.234 > 172.17.0.4: ip-proto-17
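A note on reading this capture: each telemetry datagram is larger than a 1500-byte packet, so it arrives as three IP fragments that share one IP id (8969, then 8970). The Python sketch below adds the fragment sizes back up to check that they match the 3199-byte UDP payload tcpdump reports. It assumes a 20-byte IPv4 header with no options, and `reassembled_udp_payload` is a hypothetical helper name.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def reassembled_udp_payload(fragments):
    """fragments: list of (offset, total_length) pairs for one IP id.

    Returns the UDP payload size once the fragments are put back together.
    """
    end = 0
    for offset, total_length in sorted(fragments):
        if offset != end:
            raise ValueError(f"gap or overlap at offset {offset}")
        # Each fragment carries total_length minus its own IP header.
        end = offset + (total_length - IPV4_HEADER)
    return end - UDP_HEADER


# IP id 8969 from the capture: (offset, length) of each fragment
fragments = [(0, 1500), (1480, 1500), (2960, 267)]
print(reassembled_udp_payload(fragments))  # 3199
```

Since 1480 + 1480 + 247 = 3207 bytes, which is 3199 bytes of payload plus the 8-byte UDP header, the "bad length 3199 > 1472" message looks like tcpdump describing the first fragment on its own rather than a corrupted packet. Fragmentation could still matter if something between the router and the container drops fragments.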

Could you please tell me how to debug at fluentd?

Many thanks.

3fr61n commented 7 years ago

1.- In the docker-compose.yml file you're using, change the OUTPUT_STDOUT flag from false to true:
....
input-jti:
  image: $INPUT_JTI_IMAGE_NAME:$IMAGE_TAG
  container_name: $INPUT_JTI_CONTAINER_NAME
  environment:
   - "INFLUXDB_ADDR=opennti"
   - "OUTPUT_INFLUXDB=true"
   - "OUTPUT_STDOUT=true"  <---------------
  ports:
   - "$LOCAL_PORT_JTI:50000/udp"
   - "$LOCAL_PORT_ANALYTICSD:50020/udp"
  volumes:
   - /etc/localtime:/etc/localtime
  links:
   - opennti
.....

2.- Rebuild the container 

juniper@naraku:~/scripts/open-nti$ make build

3.- Restart the container(s)

juniper@naraku:~/scripts/open-nti$ make start
echo "Use docker compose file : docker-compose.yml"
Use docker compose file : docker-compose.yml
IMAGE_TAG=consul docker-compose -f docker-compose.yml up -d
Starting opennti_con
Starting opennti_input_syslog
Starting opennti_input_jti
juniper@naraku:~/scripts/open-nti$ 

4.- Check the logs from the container console and wait for the sensor logs.

juniper@naraku:~/scripts/open-nti$ docker logs opennti_input_jti -f
...

2017-02-07 12:23:36 +0100 [info]: reading config file path="/tmp/fluent.conf"
2017-02-07 12:23:36 +0100 [info]: starting fluentd-0.12.29
2017-02-07 12:23:36 +0100 [info]: gem 'fluent-plugin-juniper-telemetry' version '0.3.0'
2017-02-07 12:23:36 +0100 [info]: gem 'fluentd' version '0.12.29'
2017-02-07 12:23:36 +0100 [info]: adding match pattern="jnpr.**" type="copy"
2017-02-07 12:23:36 +0100 [info]: adding match pattern="debug.**" type="stdout"
2017-02-07 12:23:36 +0100 [info]: adding match pattern="fluent.**" type="stdout"
2017-02-07 12:23:36 +0100 [info]: adding source type="forward"
2017-02-07 12:23:36 +0100 [info]: adding source type="udp"
2017-02-07 12:23:36 +0100 [info]: adding source type="udp"
2017-02-07 12:23:36 +0100 [info]: adding source type="monitor_agent"
2017-02-07 12:23:36 +0100 [info]: adding source type="debug_agent"
2017-02-07 12:23:37 +0100 [info]: using configuration file: <ROOT>
  <source>
    @type forward
    @id forward_input
  </source>
  <source>
    @type udp
    tag jnpr.jvision
    format juniper_jti
    port 50000
    bind 0.0.0.0
  </source>
  <source>
    @type udp
    tag jnpr.analyticsd
    format juniper_analyticsd
    port 50020
    bind 0.0.0.0
  </source>
  <match jnpr.**>
    type copy
    <store>
      @type stdout
      @id stdout_output
    </store>
    <store>
      type influxdb
      host opennti
      port 8086
      dbname juniper
      user juniper
      password xxxxxx
      value_keys ["value"]
      buffer_type memory
      flush_interval 2
    </store>
  </match>
  <source>
    @type monitor_agent
    @id monitor_agent_input
    port 24220
  </source>
  <source>
    @type debug_agent
    @id debug_agent_input
    bind 127.0.0.1
    port 24230
  </source>
  <match debug.**>
    @type stdout
    @id stdout_output
  </match>
  <match fluent.**>
    @type stdout
  </match>
</ROOT>
2017-02-07 12:23:37 +0100 [info]: listening fluent socket on 0.0.0.0:24224
2017-02-07 12:23:37 +0100 [info]: listening udp socket on 0.0.0.0:50000
2017-02-07 12:23:37 +0100 [info]: listening udp socket on 0.0.0.0:50020
2017-02-07 12:23:37 +0100 [info]: listening dRuby uri="druby://127.0.0.1:24230" object="Engine"

5.- Sensor logs look like this:

2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":6,"type":"egress_queue_info.rl_drop_bytes","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":6,"type":"egress_queue_info.red_drop_packets","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":6,"type":"egress_queue_info.red_drop_bytes","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":6,"type":"egress_queue_info.avg_buffer_occupancy","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":6,"type":"egress_queue_info.cur_buffer_occupancy","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":6,"type":"egress_queue_info.peak_buffer_occupancy","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":6,"type":"egress_queue_info.allocated_buffer_size","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.packets","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.bytes","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.tail_drop_packets","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.rl_drop_packets","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.rl_drop_bytes","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.red_drop_packets","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.red_drop_bytes","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.avg_buffer_occupancy","value":0}
2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.cur_buffer_occupancy","value":0}
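
To tell quickly whether logical-interface (sub-interface) records are present in output like this, the stdout lines can be filtered by interface name. This is a minimal Python sketch, assuming each line has the form shown above (timestamp, tag, then a JSON record) and that a logical interface is one whose name carries a unit suffix such as `ge-0/0/8.0`. The function names are hypothetical, and the second sample line (including its `type` value) is invented for illustration.

```python
import json


def parse_sensor_line(line):
    """Split '<date> <time> <tz> <tag>: {json}' into (tag, record)."""
    head, sep, body = line.partition(": {")
    if not sep:
        return None  # not a sensor record, e.g. a fluentd [info] line
    parts = head.split()
    tag = parts[-1] if parts else ""
    try:
        return tag, json.loads("{" + body)
    except ValueError:
        return None  # the part after the tag was not valid JSON


def logical_interface_records(lines):
    """Yield records whose interface name carries a unit, e.g. ge-0/0/8.0."""
    for line in lines:
        parsed = parse_sensor_line(line)
        if parsed is None:
            continue
        _tag, record = parsed
        if "." in record.get("interface", ""):
            yield record


sample = [
    '2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8","egress_queue":7,"type":"egress_queue_info.packets","value":0}',
    # Hypothetical logical-interface record, invented for this example:
    '2017-02-02 18:03:25 +0100 jnpr.jvision: {"device":"R1:1.1.1.1","interface":"ge-0/0/8.0","type":"ingress_stats.packets","value":0}',
]
print([r["interface"] for r in logical_interface_records(sample)])  # ['ge-0/0/8.0']
```

An empty result while physical-interface records keep arriving would point at the logical interface sensor specifically, rather than at the Docker network path.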

WhistMo commented 7 years ago

Hi Efrain,

Thanks for the support. Below is the output; none of the expected logs were seen.

BTW, the latest open-nti seems very strange: most of the functions don't work for me (I also have issue #129). The only workaround that makes open-nti (including the RPM probe parser) work again is to use the old open-nti. And I can't use it directly (just starting and stopping the old open-nti won't work): I need to start the latest one first, stop it, and then start the old open-nti. After I stop the old open-nti, I have to repeat the same sequence to make it work again: start the latest one, stop it, and start the old one. -_- Very, very strange. I tried on two servers, with the same problem. tcpdump shows that the packets have arrived at the server.

But with this workaround, the sub-interface doesn't work. -_-

2017-02-07 11:36:34 -0800 [info]: using configuration file: <ROOT>
  <source>
    @type forward
    @id forward_input
  </source>
  <source>
    @type udp
    tag jnpr.jvision
    format juniper_jti
    port 50000
    bind 0.0.0.0
  </source>
  <source>
    @type udp
    tag jnpr.analyticsd
    format juniper_analyticsd
    port 50020
    bind 0.0.0.0
  </source>
  <match jnpr.**>
    type copy
    <store>
      type influxdb
      host opennti
      port 8086
      dbname juniper
      user juniper
      password xxxxxx
      value_keys ["value"]
      buffer_type memory
      flush_interval 2
    </store>
  </match>
  <source>
    @type monitor_agent
    @id monitor_agent_input
    port 24220
  </source>
  <source>
    @type debug_agent
    @id debug_agent_input
    bind 127.0.0.1
    port 24230
  </source>
  <match debug.**>
    @type stdout
    @id stdout_output
  </match>
  <match fluent.**>
    @type stdout
  </match>
</ROOT>
2017-02-07 11:36:34 -0800 [info]: listening fluent socket on 0.0.0.0:24224
2017-02-07 11:36:34 -0800 [info]: listening udp socket on 0.0.0.0:50000
2017-02-07 11:36:34 -0800 [info]: listening udp socket on 0.0.0.0:50020
2017-02-07 11:36:34 -0800 [info]: listening dRuby uri="druby://127.0.0.1:24230" object="Engine"
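
One detail that may be worth checking in a dump like this one: the `jnpr.**` match appears to list only the influxdb settings, with no `@type stdout` store next to them, unlike the configuration Efrain posted. That could mean `OUTPUT_STDOUT=true` did not take effect (for example, if the container was not rebuilt), in which case no sensor lines would be printed even when packets are parsed. Here is a rough Python sketch for checking a well-formed fluentd config dump, one with its `<match>` and `<store>` tags intact. `store_types` is a hypothetical helper, not part of fluentd or open-nti.

```python
import re


def store_types(config_text, pattern="jnpr.**"):
    """Return the plugin type of every <store> inside <match pattern>."""
    block = re.search(
        r"<match " + re.escape(pattern) + r">(.*?)</match>",
        config_text,
        re.DOTALL,
    )
    if block is None:
        return []
    stores = re.findall(r"<store>(.*?)</store>", block.group(1), re.DOTALL)
    types = []
    for store in stores:
        # Accept both the 'type x' and '@type x' spellings seen in the dumps.
        found = re.search(r"^\s*@?type\s+(\S+)", store, re.MULTILINE)
        if found:
            types.append(found.group(1))
    return types


config = """
<match jnpr.**>
  type copy
  <store>
    @type stdout
    @id stdout_output
  </store>
  <store>
    type influxdb
    host opennti
  </store>
</match>
"""
print(store_types(config))  # ['stdout', 'influxdb']
```

If `"stdout"` is missing from the result for the running container's config, the sensor records are not being echoed to `docker logs`, whatever is happening on the network side.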

3fr61n commented 7 years ago

Hi @WhistMo

Thanks for doing this troubleshooting with us, I have a few questions.

1.- Which release/branch are you using?

2.- Regarding #129, do you have the problem with all commands or just with RPM-related measurements?

3.- In the logs you posted I did not see sensor (jti) packets arriving at the jti container, so it seems they are dropped somewhere.

4.- Can we have a live troubleshooting session?

3fr61n commented 7 years ago

Hi @WhistMo

Could you try to test this branch and give any feedback?

https://github.com/Juniper/open-nti/tree/debugging-jti

WhistMo commented 7 years ago

Hi Efrain,

Thank you very much for your effort.

The debugging-jti branch is working (at least sometimes, but it is still very unstable). The logical interfaces work on this release. But it may have some other issues: for example, after modifying the sensor on the router (switching from logical interfaces to interface or vice versa), it stops working and I need to stop and restart open-nti. Also, it doesn't work every time, and it may stop working very quickly.

Yes, a live troubleshooting session is welcome. Just ping me when you are free.

Many thanks!

3fr61n commented 7 years ago

Hi @WhistMo

I sent you an email yesterday regarding the troubleshooting session. Meanwhile, when you mention the 'latest' version, do you really mean the 'consul' branch?

Regards