M0r13n / mikrotik_monitoring

Monitor your Mikrotik router with Prometheus and Grafana
MIT License

Multiple node #2

Closed adammau2 closed 1 year ago

adammau2 commented 2 years ago

How do I add multiple nodes? Will this feature be available in the future?

M0r13n commented 2 years ago

Hey @adammau2,

what do you mean by multiple nodes?

adammau2 commented 2 years ago

Hi, I mean adding several Mikrotik IPs to monitor. Is there a parameter to add another IP?

M0r13n commented 2 years ago

I see. I am only using a single device and therefore the project supports only a single device. But I will take a look at this.

M0r13n commented 2 years ago

I looked into this. In my opinion it should be enough to change the MKTXP configuration file mktxp/mktxp.conf.

You can add as many devices as you want:


[Router_1]
    capsman_clients = True
    firewall = True
    ...

[Router_2]
    capsman_clients = True
    firewall = True
    ...

[Router_3]
    capsman_clients = True
    firewall = True
    ...

These should then show up under different IPs inside Prometheus. You should then be able to select the different nodes inside Grafana:

[screenshot: node selector in Grafana]

Because I only own a single Mikrotik device that offers an API, I can not verify it myself. So I would love to hear if it works for you @adammau2 ! :-)

inakhai commented 2 years ago

> I looked into this. In my opinion it should be enough to change the MKTXP configuration file mktxp/mktxp.conf. […]

I tried multiple nodes, but only the first node shows up.

DemianBerg commented 2 years ago

I have more than 10 MikroTik devices (a router and switches). When I add more than 4 devices to the list, the nodes list in my dashboard is empty... All the MT device credentials are correct... How can I fix this?

M0r13n commented 2 years ago

@DemianBerg

Puuuh. I need to think about a simple way to simulate at least five devices to reproduce this issue. Personally, I only monitor two distinct devices.

DemianBerg commented 2 years ago

@M0r13n

I noticed that when I added more than 2 devices (an MT router and a switch), the Grafana dashboard started loading device information incorrectly... And when I added all 10 or so devices to the list, all devices disappeared. I even tried adding one device separately, but after 2 devices the conflicts begin anyway.. At first I thought it might be because the device names/models are the same, so I changed those too (there are several switches of the same model), but even after adding a different switch model the problem remained... The system/idea itself is very cool, I just need to understand how to add a large number of devices 😅

M0r13n commented 2 years ago

@DemianBerg Currently, I run this setup with two devices. But I had it running with three devices. I never had any hiccups, once I configured everything correctly. So it is definitely possible to use multiple devices.

Could you post an example configuration for me? A copy of your mktxp/mktxp.conf? But be sure to redact any passwords :-D

M0r13n commented 2 years ago

Another thing worth trying is to update the dependencies. I created a PR that bumps Grafana and Prometheus to the latest stable version. See #4

DemianBerg commented 2 years ago

@M0r13n, did you receive my email?

M0r13n commented 2 years ago

Yes. It landed in my spam folder, but I finally found it. Your configuration looks alright.

But I thought about another potential solution: Do your different devices share the same identity?

You can get the system's identity by executing /system/identity print in the Mikrotik terminal.
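If several devices do share one identity, each can be given a unique one. A sketch of the RouterOS terminal commands (the name here is a placeholder):

```
/system identity print
/system identity set name=router-1
```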

DemianBerg commented 2 years ago

Oh, okay 😅 Each device's identity is the same as the name in square brackets in its configuration section..

M0r13n commented 2 years ago

[screenshot]

I am not able to reproduce the issue. I added five devices. Each device is individually configured in mktxp.

When I open Grafana I can see each device as a single node. And for each node the correct metrics are displayed.

DemianBerg commented 2 years ago

@M0r13n, which Python version is installed on your system?

M0r13n commented 2 years ago

The Python version of the system is irrelevant, because this project uses Docker.

I think that you refer to the version of Python used to run mktxp, right? I use my own pre-built image leonmorten/mktxp:latest, which uses 3.10.5

/mktxp # python --version
Python 3.10.5
DemianBerg commented 2 years ago

Yes, exactly, I meant the version you use for mktxp.. 😄 Hmm, interesting... But why is it that the whole list of devices appears in Grafana, but after about a minute the graphs disappear, reappear, and then disappear completely... What could be the problem? 🤔

M0r13n commented 2 years ago

Good question. I am guessing that something gets messed up.

Could you run mktxp standalone and share its output? You can access the data via port 49090

DemianBerg commented 2 years ago

I ran mktxp standalone and saw a Python error in the logs: Address already in use... Then I grepped for python and saw many mktxp processes... What do you think? Could that be the problem, and the reason the dashboard doesn't get stable metrics from the exporter and has conflicts?

[screenshot]

M0r13n commented 2 years ago

Then there might be an instance of mktxp already running. Otherwise, it is normal to have multiple processes running. It looks the same for me:

leon       92883   92570  2 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92897   92883 11 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92898   92883  0 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92899   92883  0 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92900   92883  0 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92901   92883  0 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92902   92883  0 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92903   92883  0 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
leon       92904   92883  0 16:09 pts/0    00:00:00 python3 mktxp/cli/dispatch.py export
M0r13n commented 2 years ago

But I thought about another source of trouble. The exporter (mktxp) collects all metrics for all routers at once.

I can imagine that this takes too long. Depending on the speed of your local network, the CPU of the Mikrotik devices, and the CPU of the device running mktxp, this may take multiple seconds.

Does your setup work reliably if you only configure two or three devices?

DemianBerg commented 2 years ago

As for network speed, this is a 10 Gbps office LAN 😅, and as for the resources of the computer where I run it, I have already tried 6 GB, 8 GB, and 12 GB of RAM and more processor cores... When I run it with 2 or 3 devices, everything works! When I run it with all devices, the mktxp container logs show a BrokenPipeError 🥲

[screenshot]

M0r13n commented 2 years ago

Well, that points in the same direction. A broken pipe error usually occurs when a request is blocked or takes too long: after the client-side timeout, the client closes the connection, and when the server then tries to write to the socket, it throws a broken pipe error.

In your example above only 172.20.0.2 seems to be affected. Does this error occur for all monitored devices or only for some?

M0r13n commented 2 years ago

Fittingly, mktxp comes with a socket timeout configuration parameter. You could try increasing the timeout and see if anything improves.
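As a sketch, the timeout lives in mktxp's system configuration (the _mktxp.conf file); the parameter name below is taken from the mktxp README to the best of my knowledge, and the value is an example:

```ini
[MKTXP]
    # seconds to wait on the RouterOS API socket before giving up
    socket_timeout = 5
```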

Could you also share the HTTP response of the exporter? You could get this text by opening localhost:49090 in your browser.

DemianBerg commented 2 years ago

Okay, I tried all sorts of options with different values. Now I realize that mktxp works stably and updates all device values, but Prometheus itself conflicts with mktxp, since a scrape_interval of 5 seconds is too small, so no values reach Grafana because of Prometheus. When I set the value to 30 seconds, everything works, but then the Grafana graphs are not displayed correctly, apparently because they expect the data to be updated every 5 seconds 🧐

[screenshots]

DemianBerg commented 2 years ago

So, after 12 hours, Prometheus showed it could not cope even with the 30-second value, while mktxp is still updating data from the devices, so the problem here is in Prometheus...

[screenshot]

M0r13n commented 2 years ago

I think that neither of the two services is responsible for the issue alone. Rather, the interaction of the two services has some hiccups. The way I see it, the exporter (mktxp) takes too long to collect the metrics. The exporter is designed to collect the metrics from each device as soon as an HTTP request comes in. Collecting metrics for many devices then takes too long and Prometheus thinks that the exporter is down.

I notice this delay myself when I call http://mktxp:49090. Even with three devices, there is a noticeable delay before the response appears in the browser/terminal. Unfortunately, the runtime of the exporter scales linearly with the number of monitored devices.

The only real workaround is to partially redesign the exporter or to switch to another exporter. Neither of which is easy to implement.

M0r13n commented 2 years ago

You could increase the value of scrape_timeout in Prometheus to something like 20s. This could help you get the metrics before Prometheus gives up.
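A sketch of the relevant part of prometheus.yml (the job name and target are assumptions based on the port mentioned earlier; note that Prometheus requires scrape_timeout to be no larger than scrape_interval):

```yaml
scrape_configs:
  - job_name: 'mktxp'
    scrape_interval: 30s   # give the exporter enough headroom
    scrape_timeout: 20s    # must not exceed scrape_interval
    static_configs:
      - targets: ['mktxp:49090']
```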

M0r13n commented 2 years ago

I am curious whether things would work better if the exporter parallelized things. I implemented a simple proof of concept using a ThreadPoolExecutor that fetches five (5) devices simultaneously.

Could you try to use the version of mktxp that I modified?:

git clone https://github.com/M0r13n/mktxp.git
cd mktxp
git checkout poc-fetch-multiple-devices-in-parallel 
docker build . -t leonmorten/mktxp:latest
cd /path/to/your/mikrotik-monitoring
docker-compose down 
docker-compose up -d
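The idea behind the proof of concept can be sketched like this (fetch_metrics and the device names are placeholders for illustration, not mktxp's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_metrics(device: str) -> str:
    # Placeholder for the per-device RouterOS API round trip.
    return f"metrics for {device}"

def collect_all(devices: list[str], max_workers: int = 5) -> list[str]:
    # Query the devices concurrently: total wall time approaches the
    # slowest single device instead of the sum over all devices.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_metrics, devices))

print(collect_all(["Router_1", "Router_2", "Router_3"]))
```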
chunyianliew commented 2 years ago

I am also experiencing issues after about a day. When I restart mktxp running in a Docker container, the metrics retrieval time for 2 devices is less than 6 seconds:

[screenshot]

But after a day the metrics retrieval time has increased to 1 minute:

[screenshot]

Seems to be similar to this #issue

I will give the poc version a try later this week.

DemianBerg commented 2 years ago

@M0r13n, I've been thinking about the problems with a large number of devices and have done a lot of testing. By the way, your latest mktxp version fixed a bug with the CPU temperature display. So, in general, I was wondering: maybe there is a way to run one mktxp worker per device, so that each device has its own port serving its data? I noticed that if I add one device to mktxp, it reloads all the data in 3-4 seconds, and the more devices there are, the longer the page takes to load, so I thought maybe this would help? Then in Prometheus each port would be added as a target, one device per port, and metrics would be pulled from there 🤔

M0r13n commented 2 years ago

@DemianBerg That's exactly what I said here: https://github.com/M0r13n/mikrotik_monitoring/issues/2#issuecomment-1223863338 😅

But I don't think that this is actually the cause of the problem. I experienced this problem myself after 6 weeks of uptime, and my setup only monitors two devices at the moment. For me, the cause seems to be related to these two issues:

M0r13n commented 2 years ago

I may have an idea: there could be some kind of request stacking at play here. If the scrape interval of Prometheus is smaller than the worst-case time that mktxp needs to answer, the application may become non-responsive. The reason is that mktxp processes requests synchronously: the HTTPServer instance handles one request at a time. So when the server receives a request, every other request has to wait until the previous one is finished: first come, first served. But when that happens, the next request is processed with a few seconds of delay and thus has less time to finish before the request after it arrives.

Example

Let scrape_interval=5s and average_processing_time=3s. As long as mktxp finishes each request in less than 5s, everything is fine. But if something causes one or more answers from mktxp to be delayed, it may get stuck:

gantt
    title Example Gant diagram
    dateFormat s
    axisFormat %s

section Prometheus
scrape1: 0, 5s
scrape2: 5, 5s
scrape3: 10, 5s
scrape4: 15, 5s
scrape5: 20, 5s
scrape6: 25, 5s
scrape7: 30, 5s

section MKTXP
answer1: 0, 18s
answer2: 18, 12s
answer3: 30, 8s
answer4: 38, 3s
answer5: 41, 3s

Prometheus sends a request every 5 seconds and closes the socket afterwards. If the first few answers take more time than that, mktxp is never going to catch up. By the time it starts to process request 3, it already has 4 additional requests in its queue. And because Prometheus closes the socket after 5s, all responses are worthless, because they never reach Prometheus.

I was able to actually reproduce this issue. I added an artificial delay of 20 seconds to mktxp for every fifth request. All other requests were set to 3s. If I send a request every 5 seconds (while sleep 5; do curl localhost:49090 --max-time 5; done), mktxp is deadlocked.
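The queue build-up in the diagram can be checked with a small sketch (a toy model of a single-threaded, first-come-first-served server, not mktxp itself):

```python
def simulate(arrival_interval, service_times, timeout):
    # Toy model of a single-threaded FCFS server: each request starts
    # only after the previous one has finished.
    server_free_at = 0.0
    delivered = []
    for i, service in enumerate(service_times):
        arrival = i * arrival_interval
        start = max(arrival, server_free_at)
        finish = start + service
        server_free_at = finish
        # The client closes the socket `timeout` seconds after sending,
        # so late responses never reach it.
        delivered.append(finish - arrival <= timeout)
    return delivered

# Service times from the diagram above, 5s scrape interval, 5s timeout:
print(simulate(5, [18, 12, 8, 3, 3], 5))
# → [False, False, False, False, False]  (every response arrives too late)
```

Even the later, fast requests are lost, because the backlog from the first slow answers delays them past the client's timeout.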

M0r13n commented 1 year ago

@DemianBerg

One part of the solution would be to reduce the absolute time mktxp requires to fetch all metrics. As you correctly guessed: The optimal solution would be to have a single worker for each device.

Another option is to let mktxp fetch multiple devices in parallel. This is what I tried to say here: https://github.com/M0r13n/mikrotik_monitoring/issues/2#issuecomment-1223863338

This is a proof of concept with five worker threads. Each thread fetches the metrics for one monitored device. Therefore, the total time mktxp needs to serve the request should be reduced to roughly one fifth. I strongly suggest that you follow the steps outlined above. If it works for you, I will release a more polished version.

DemianBerg commented 1 year ago

Heyo, @M0r13n! I tried your variant for monitoring several devices. I set it up for five devices and tried all sorts of variations, which is why I didn't answer for so long 😄. So, after 2 days of running... it failed me again, and now I noticed that Prometheus itself is losing the connection to mktxp, since the interval mktxp needs to update the metrics keeps increasing 🤨

M0r13n commented 1 year ago

It might be caused by this: https://github.com/akpw/mktxp/issues/34

There was a bug which caused a slow but steady deterioration of performance. It has been fixed since yesterday. You might try your luck with that.

DemianBerg commented 1 year ago

Yaay, it works! In Prometheus everything also started successfully with a 5-second interval, and with all 10+ devices! So far I have only run it for a couple of minutes of testing; I left it running now and will see if it breaks again 😄 ... But now another problem has appeared: when I start it with the DHCP metrics, it shows an error... With the rest of the parameters, everything works as it should!

[screenshot]

M0r13n commented 1 year ago

Let's see whether it keeps working for you. It would be nice to be able to finally close this issue. 😁

Regarding your second question: Please open a dedicated issue in the mktxp project. Otherwise we mix and match different problems.

DemianBerg commented 1 year ago

And so everything seems to be working, but sometimes Prometheus loses the connection for one or two minutes (perhaps because the wait time when loading metrics is a little more than 5 seconds) and then everything starts working again.

[screenshot]

M0r13n commented 1 year ago

@DemianBerg

These gaps are most likely caused by Prometheus canceling a request because it took too long. You might be able to fix that by tweaking your configuration. Something like:

scrape_interval: 15s
scrape_timeout: 15s

(Note that Prometheus requires scrape_timeout to be no larger than scrape_interval.) That should fix the issue, given that you say loading the metrics takes a little more than 5 seconds.

I am closing this issue now, because the underlying bug is fixed. Feel free to open a new issue for additional bugs or features. :-)