Extending Trippy - GitHub Issues

c-git commented 1 year ago

Hi,

I am looking for a custom network monitoring tool. I find trippy very interesting. It meets a lot of my needs and I was looking to see if there was some easy way to extend trippy with the additional features that I need.

I need features like:

- notifications on loss of connectivity
- a background service mode which will support remote realtime monitoring
- support for alternative front ends like GUI
- logging of connection statistics to detect changes over time

Initially I was looking to try to use trippy as a library and build a separate application on top of it. My initial challenge was that a lot of the code is attached to main.rs instead of lib.rs and I don't know how to easily import code that is part of main.rs.

After thinking about it more I was wondering if some of these things would be beneficial to be added directly to Trippy with maybe runtime flags to avoid changing the current runtime behavior. Let me know what you think so I can know what the best way to proceed would be.

Thanks for taking the time to read my message.

c-git commented 1 year ago

I was looking at the modes and wondering if some of these could be "supported" by using the streaming mode.

fujiapple852 commented 1 year ago

Hi there @c-git,

I agree that adding this type of monitoring & alerting functionality to trippy would be useful. It would also be possible to build a custom tool using the trippy library; we can revisit that option later if needed.

The existing stream mode is mostly there as a development and debugging aid, sometimes it is easier to run trippy "headless" and emit custom diagnostic logging. We could look to extend the capabilities of that mode or perhaps introduce a new mode, i.e. monitor or event or similar.

The way I imagine this would work is that this new mode would emit a stream of events, where the criteria for emitting each event would be configurable. An example of an event type might be that the latency or packet loss of the target hop crosses a threshold. Another example event type is one which fires if the path between the source and target hosts changes between rounds. Many such event types could be defined.
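As a rough illustration of the two example event types described above, here is a minimal sketch that compares two consecutive rounds and reports a threshold crossing on the target hop plus any per-hop path change. The names `HopSample`, `Event` and `detect_events` are hypothetical and are not part of Trippy; the last element of each round is assumed to be the target hop.

```rust
use std::net::IpAddr;

/// One hop's summary for a single round (hypothetical shape, not a Trippy type).
#[derive(Debug, Clone, PartialEq)]
pub struct HopSample {
    pub addr: Option<IpAddr>,    // None when the hop did not respond
    pub latency_ms: Option<f64>, // None when no latency was measured
}

/// Events a monitoring mode might emit (illustrative names only).
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    LatencyAboveThreshold { latency_ms: f64, threshold_ms: f64 },
    PathChanged { ttl: u8, previous: Option<IpAddr>, current: Option<IpAddr> },
}

/// Compare two consecutive rounds and return the events that fire.
/// Assumption: the last element of each slice is the target hop.
pub fn detect_events(prev: &[HopSample], curr: &[HopSample], threshold_ms: f64) -> Vec<Event> {
    let mut events = Vec::new();

    // Threshold event: fires only when the target was not above the threshold
    // in the previous round and is above it in the current round.
    let above = |round: &[HopSample]| {
        round.last().and_then(|h| h.latency_ms).filter(|l| *l > threshold_ms)
    };
    if let (None, Some(latency_ms)) = (above(prev), above(curr)) {
        events.push(Event::LatencyAboveThreshold { latency_ms, threshold_ms });
    }

    // Path-change events: any position whose responding address differs between rounds.
    let hops = prev.len().max(curr.len());
    for i in 0..hops {
        let previous = prev.get(i).and_then(|h| h.addr);
        let current = curr.get(i).and_then(|h| h.addr);
        if previous != current {
            events.push(Event::PathChanged { ttl: (i + 1) as u8, previous, current });
        }
    }
    events
}
```

Comparing only adjacent rounds keeps the sketch stateless; a real implementation would probably want per-event configuration and some damping, since intermediate hops often drop or rate limit probes.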

Relating this back to your list:

notifications on loss of connectivity

This should be possible; we would have to define precisely what loss of connectivity means. Perhaps it would mean a one-time event which is emitted when a change in state occurs for the target hop from "good" to "bad" (the meaning of which varies per protocol).
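The "one-time event on a change of state" idea can be sketched as an edge-triggered tracker: it remembers the last observed health and reports only transitions, so a long outage produces a single event rather than one per round. `Health` and `ConnectivityTracker` are hypothetical names, and how "good" and "bad" are decided per protocol is deliberately left to the caller.

```rust
/// Coarse health of the target hop (hypothetical; the caller decides what
/// counts as good or bad for a given protocol).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Good,
    Bad,
}

/// Remembers the last observed health and reports only transitions.
#[derive(Debug, Default)]
pub struct ConnectivityTracker {
    last: Option<Health>,
}

impl ConnectivityTracker {
    /// Returns `Some((from, to))` only when the health changes.
    /// The first observation and repeated identical observations return `None`.
    pub fn observe(&mut self, now: Health) -> Option<(Health, Health)> {
        let transition = match self.last {
            Some(prev) if prev != now => Some((prev, now)),
            _ => None,
        };
        self.last = Some(now);
        transition
    }
}
```

Feeding it `Good, Good, Bad, Bad, Good` yields exactly two transitions: one when connectivity is lost and one when it returns.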

a background service mode which will support remote realtime monitoring

I think the existing stream mode and any new similar mode we create would meet this requirement? i.e. you can run trippy remotely (via ssh or rexec etc) and headless (no tui).

support for alternative front ends like GUI

I think this is something that would have to be built against the trippy library rather than any new streaming mode.

logging of connection statistics to detect changes over time

This should be possible to encode as some form of event, we'd have to define the details of what sort of changes we wanted to detect.

Let me know what you think.

c-git commented 1 year ago

Hi @fujiapple852 ,

Thanks for taking the time to respond. Really appreciate it.

I agree that adding this type of monitoring & alerting functionality to trippy would be useful.

I'm very happy that you think so because I think it will be substantially easier to extend trippy as a long term solution instead of doing something separate and I think it would make it easier for the next person to come along to be able to make use of it. I've been very impressed with trippy so far, really good work.

It would also be possible to build a custom tool using the trippy library, we can revisit that option later if needed.

Agreed we can table this for now. I don't think there is a need if we do the event emitting that you suggested.

The way I imagine this would work is that this new mode would emit a stream of events, ....

I really like the idea of it emitting events, it keeps it very open for how the events are handled. And makes alternate options for handling the events a user choice. We can provide some ready made options. Like I'd need sending emails and sending discord messages so that's two we could provide ready made. And they could serve as reference implementations so others can extend. Maybe like a plugin system🤔? How would the events be able to be consumed?

Another example event type is one which fires if the path between the source and target hosts changes between rounds. Many such event types could be defined.

I like the idea of keeping the options open for types of triggers.

With regard to your responses to my list, they all seem very reasonable to me.

Happy to get started on this soon. Let me know what you think would be the best way to begin. Would firming up the requirements and what actually needs to be done be the best way to start? In the meantime I need something in the short term: a notification system was not a top priority, but circumstances changed and I need a solution sooner rather than later. So I'm currently working on a simplified notification system which I'm hoping will tide me over until this can get done. We can maybe use it to test out options for how we would want it to work, kind of like a playground. I've started trying to document what I need but it's still early days.

fujiapple852 commented 1 year ago

I really like the idea of it emitting events, it keeps it very open for how the events are handled. And makes alternate options for handling the events a user choice. We can provide some ready made options. Like I'd need sending emails and sending discord messages so that's two we could provide ready made.

The idea I had in mind was that trippy would emit events (e.g. as logs or metrics) such that they can be consumed and handled externally, following the https://opentelemetry.io approach. So rather than trippy having the ability to do things like send emails or discord messages directly, those actions are instead handled by your chosen observability tool (e.g. Prometheus or similar), with trippy being the source of the event.

c-git commented 1 year ago

Ok thanks, I'll look into the example they have and try to understand how it would work.

fujiapple852 commented 1 year ago

To add some more detail; trippy currently uses tracing to emit logs and spans.

You can enable trace logging (-v), output it in json format (--log-format json) and use silent mode (-m silent), which is similar to stream mode except it doesn't output anything except any tracing output (and only runs for N rounds):

trip example.com -m silent -C 5 -v --log-format json

You can imagine a new mode that is similar to that except that we output log entries that represent the various event types discussed above and nothing else (and it runs forever like stream mode).
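To make "output log entries that represent events and nothing else" concrete, here is a minimal sketch that writes one JSON object per line to any `std::io::Write` sink. The `EventRecord` fields and the `write_event` function are hypothetical and do not reflect Trippy's actual log schema; the values are assumed not to need JSON escaping, which a real implementation would delegate to a JSON library.

```rust
use std::io::{self, Write};

/// A minimal event record (hypothetical fields; not Trippy's log schema).
pub struct EventRecord<'a> {
    pub round: u64,
    pub kind: &'a str,
    pub target: &'a str,
}

/// Write one event as a single JSON object followed by a newline.
/// Assumption: `kind` and `target` contain no characters that need escaping.
pub fn write_event<W: Write>(out: &mut W, ev: &EventRecord<'_>) -> io::Result<()> {
    writeln!(
        out,
        "{{\"round\":{},\"kind\":\"{}\",\"target\":\"{}\"}}",
        ev.round, ev.kind, ev.target
    )
}
```

Passing `std::io::stdout()` as the sink would give line-delimited output that external tools can read from a pipe, while a `Vec<u8>` sink is convenient in tests.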

It should be possible to use something like tracing-opentelemetry to export these to an opentelemetry compatible consumer rather than simply outputting to stdout as json. However I think that can be a separate piece of work, to start I'd focus on defining and emitting the events in a new mode.

Is this something you're keen to work on yourself? If so then feel free to ping me on discord (fujiapple852) to get into the details.

c-git commented 1 year ago

Yeah, I'd be happy to work on it as I need it. Also I'm happy to do it in a way where other people can benefit. I'll reach out on discord.

ShaguarWKL commented 1 year ago

@fujiapple852 I'll try to give my 2 cents, but here's a bit of background on me so you can better understand why my perspective will differ from that of others with different specialties. Please note that my information is probably outdated in terms of today's methodology and workflows, since I made a full exit from tech almost 10 years ago (3 IPOs and an acquisition, so I decided I was done with being on call 24/7/365).

The operations I oversaw (as one of their architects, and handling customer issues mainly because I speak several languages proficiently) encompassed literally layer 1-7 products and services. We had everything from submarine cables, payment processing, SaaS and DDoS mitigation/network security to MMO publishing, so core infrastructure all the way to consumer products. We didn't plan the company's growth that way, but after an IDC acquisition the adage "if you build it, they will come" proved true, and every other product or service revolved and evolved around having a Tier 3 IDC located in the banking districts of major cities in 2 countries. A Tier 4 was being built when I sold and exited.

Most of our customers were mission critical types so monitoring was a massive part.

Here in bullet point form are the things I was most concerned with when it was my day job to worry about these things.

- False positives. Ping Plotter was notorious for them (that is where my annoyance with it comes from, along with the licensing structure). This was never really resolved; tuning it was still a constant work in progress when I left. False positives waste a lot of engineer time. This wasn't just a <1s link-down thing: we had contracted latency and throughput tolerances to monitor, and sometimes the borderline cases would trigger an alarm and engineers would have to spend time responding to it. Given the scale of our ops and the nature of our customers, this was the biggest waste of resources, and I was never able to resolve it to my satisfaction.
- Analytics. This helps us structure our resources in a better way, especially putting different customers on different links to match their usage patterns and the overall resource consumption of our infrastructure.
- Security of monitoring systems. We would periodically rotate our monitoring systems to a different /24 on different ASNs to make it harder for malicious actors to attack them. Not perfect, but it's something.
- One input source able to multicast to multiple stations instead of opening new instances. We dealt with this by using remote-desktop features, i.e. screen sharing to multiple stations. It's nice to have a wall of screens in the NOC, but it's much easier to read and decipher on the engineer's own workstation. Of course, that opens up a whole slew of other security concerns, since we had to have external links due to having multiple NOCs on 3 continents.
- The ability to structure/customise alert messages. We were unable to integrate all the different tools we used into a single coherent interface (we tried developing one, but we weren't a dev studio, so it never got to a point where we could deploy it). Also the ability to alert through multiple methods, although I think the tools should be there by now compared with my time. For example, most alert messages going through SMS resulted in off-site or off-duty engineers having to call the NOC for more information, which created bad social-life situations for me and my team; we lost count of the times and have some very interesting stories around those.

I think for Trippy,

- Having the ability to export data should suffice so that other frontend tool platforms can easily parse that information. It wouldn't expand the scope of Trippy beyond what it was intended for, but would add some nice-to-haves.
- Colour coding tolerances could be something nice as well. Like say I start a Trip to 8.8.8.8 but I can specify that anything above 30ms outputs in Red or Green (Red isn't a Negative Situation colour in quite a few Asian countries, so it would be a nice touch to acknowledge cultural differences and what people associate as a colour denoting a negative situation), or don't "alert" unless X times happen in Y period.
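The "don't alert unless X times happen in Y period" rule can be sketched as a sliding window over breach timestamps. `AlertWindow` is a hypothetical helper rather than an existing Trippy feature, and it assumes timestamps are passed in non-decreasing order.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Fires only when at least `min_count` breaches fall within the trailing `period`.
pub struct AlertWindow {
    min_count: usize,
    period: Duration,
    hits: VecDeque<Instant>,
}

impl AlertWindow {
    pub fn new(min_count: usize, period: Duration) -> Self {
        Self { min_count, period, hits: VecDeque::new() }
    }

    /// Record a breach observed at `at` and report whether an alert should fire.
    /// Assumption: callers pass timestamps in non-decreasing order.
    pub fn record(&mut self, at: Instant) -> bool {
        self.hits.push_back(at);
        // Drop breaches that have aged out of the window.
        while let Some(&oldest) = self.hits.front() {
            if at.duration_since(oldest) > self.period {
                self.hits.pop_front();
            } else {
                break;
            }
        }
        self.hits.len() >= self.min_count
    }
}
```

With `AlertWindow::new(3, Duration::from_secs(10))`, two isolated spikes stay silent and only a third breach within ten seconds fires, which is one way to damp the borderline false positives described above.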

Hope I made some sense and there's useful information for you.

Cheers!

PS this post has not been proof-read.

fujiapple852 commented 1 year ago

Thanks for the details @ShaguarWKL and congratulations on the IPOs and exit!

False Positives. Ping Plotter was notorious for it

Can you give some examples of the types of false positives you observed with Ping Plotter?

Having the ability to export data should suffice so that other frontend tool platforms can easily parse that information

I'm glad you agree with that as I'm keen to avoid adding too much "bloat" to Trippy (like an email client, slack client and so forth). The trick here is going to be defining a good set of "events" that can be consumed by these tools.

Colour coding tolerances could be something nice as well.

Trippy currently has some (very naive...) status colours per hop that boil down to this logic:

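fn render_status_cell(hop: &Hop, is_target: bool) -> Cell<'static> {
    // Number of probes sent to this hop that never received a response.
    let lost = hop.total_sent() - hop.total_recv();
    Cell::from(match (lost, is_target) {
        // Target hop: every probe was lost.
        (lost, target) if target && lost == hop.total_sent() => "🔴",
        // Target hop: some probes were lost.
        (lost, target) if target && lost > 0 => "🟡",
        // Non-target hop: every probe was lost.
        (lost, target) if !target && lost == hop.total_sent() => "🟤",
        // Non-target hop: some probes were lost.
        (lost, target) if !target && lost > 0 => "🔵",
        // No loss.
        _ => "🟢",
    })
}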

Like say I start a Trip to 8.8.8.8 but I can specify anything above 30ms outputs in Red or Green ... or don't "alert" unless X times happen in Y period.

Yes, we'd certainly want to have configurable conditions on events. As a slight aside, we do have to be mindful of the significant difference between responses from the target hop and non-target hops, especially for ICMP tracing, given how routers treat this traffic (i.e. in software) and their tendency to rate limit and drop such traffic, which can certainly lead to false positives.

Red isn't a Negative Situation colour in quite a few Asian countries, so it would be a nice touch to acknowledge cultural differences and what people associate as a colour denoting a negative situation

A good point; these status colours should be made configurable as part of the theme.
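One way to make the status symbols configurable is to separate classification from presentation: a pure function decides the status from loss counts, and a theme maps each status to whatever symbol the user prefers. The sketch below mirrors the match order of the snippet quoted earlier in this comment; `HopStatus`, `classify` and `StatusTheme` are hypothetical names, not Trippy's theme API.

```rust
/// Loss-based status of a hop, independent of how it is displayed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HopStatus {
    TargetAllLost,
    TargetSomeLost,
    HopAllLost,
    HopSomeLost,
    NoLoss,
}

/// Classify a hop from its probe counters, in the same order as the original match.
pub fn classify(total_sent: usize, total_recv: usize, is_target: bool) -> HopStatus {
    let lost = total_sent.saturating_sub(total_recv);
    match (is_target, lost) {
        (true, l) if l == total_sent => HopStatus::TargetAllLost,
        (true, l) if l > 0 => HopStatus::TargetSomeLost,
        (false, l) if l == total_sent => HopStatus::HopAllLost,
        (false, l) if l > 0 => HopStatus::HopSomeLost,
        _ => HopStatus::NoLoss,
    }
}

/// User-configurable symbols for each status (hypothetical theme section).
#[derive(Debug, Clone)]
pub struct StatusTheme {
    pub target_all_lost: String,
    pub target_some_lost: String,
    pub hop_all_lost: String,
    pub hop_some_lost: String,
    pub no_loss: String,
}

impl Default for StatusTheme {
    /// Defaults copy the symbols used in the snippet from this thread.
    fn default() -> Self {
        Self {
            target_all_lost: "🔴".to_string(),
            target_some_lost: "🟡".to_string(),
            hop_all_lost: "🟤".to_string(),
            hop_some_lost: "🔵".to_string(),
            no_loss: "🟢".to_string(),
        }
    }
}

impl StatusTheme {
    /// Look up the symbol configured for a status.
    pub fn symbol(&self, status: HopStatus) -> &str {
        match status {
            HopStatus::TargetAllLost => &self.target_all_lost,
            HopStatus::TargetSomeLost => &self.target_some_lost,
            HopStatus::HopAllLost => &self.hop_all_lost,
            HopStatus::HopSomeLost => &self.hop_some_lost,
            HopStatus::NoLoss => &self.no_loss,
        }
    }
}
```

A user who does not associate red with a negative state could override just `target_all_lost` and keep the remaining defaults via struct update syntax.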

c-git commented 1 year ago

I'm really quite eager to get started on this in earnest but have been restricted by time and recent medical setbacks. The POC that I made to meet my urgent needs worked out quite well IMO. It's a bit quickly thrown together due to time constraints, but it worked as well as I'd have hoped and is sufficiently modularized that I think building the actual one to interface with trippy should work out well. The state machine based approach to handling notifications made implementation simple, and it only depends on receiving individual timestamped events from trippy (or in my case the Linux ping program, just a quick drop-in for trippy to get a POC working). All the state is managed by the state machine. I actually think it's easy enough that it doesn't need to be hard coded and could be configured by the user (but with sensible defaults, because it does require handling all possible cases in each state). That said, if we can expose it as a user configurable option then it should allow a great deal of flexibility to determine how to handle weird edge cases that do not apply to me, and would relieve the maintainers from having to handle cases they "don't care about". I don't really want to say they don't care, but rather that it doesn't apply to them. Speaking for myself, at this point I wouldn't want to have to maintain additional states that my use case doesn't call for.

Full disclosure: I drew a lot of ideas from PA Server Monitor, which I used to use at my previous job. It was sufficiently flexible and gave me what I wanted (as far as I can remember). However, it only works on Windows, and the cost of running Windows servers is not practical for my current needs, especially since I only really need the network monitoring capabilities. I've had pretty bad experiences trying to run Windows VMs compared to Linux VMs. Windows VMs are just so resource hungry and they put so much strain on the hypervisor.
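A state machine driven by timestamped probe results might look roughly like the sketch below. It is not the POC described in this comment: the states, the `max_misses` parameter and all type names are assumptions made for illustration. The point is that every (state, input) pair is handled explicitly, so the compiler flags any missing case.

```rust
/// Result of a single probe (hypothetical input type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Probe {
    Reply,
    Timeout,
}

/// Monitoring state; all bookkeeping lives here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Up,
    Suspect { misses: u32 },
    Down { since_secs: u64 },
}

/// Notifications produced on selected transitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Notification {
    WentDown { at_secs: u64 },
    Recovered { at_secs: u64, outage_secs: u64 },
}

/// Advance the machine by one timestamped probe result.
/// `max_misses` is the number of consecutive timeouts that counts as an outage.
pub fn step(state: State, at_secs: u64, probe: Probe, max_misses: u32) -> (State, Option<Notification>) {
    match (state, probe) {
        (State::Up, Probe::Reply) => (State::Up, None),
        (State::Up, Probe::Timeout) => miss(1, at_secs, max_misses),
        (State::Suspect { .. }, Probe::Reply) => (State::Up, None),
        (State::Suspect { misses }, Probe::Timeout) => miss(misses + 1, at_secs, max_misses),
        (State::Down { since_secs }, Probe::Timeout) => (State::Down { since_secs }, None),
        (State::Down { since_secs }, Probe::Reply) => (
            State::Up,
            Some(Notification::Recovered {
                at_secs,
                outage_secs: at_secs.saturating_sub(since_secs),
            }),
        ),
    }
}

/// Handle a timeout: stay suspicious until the miss budget is exhausted.
fn miss(misses: u32, at_secs: u64, max_misses: u32) -> (State, Option<Notification>) {
    if misses >= max_misses {
        (State::Down { since_secs: at_secs }, Some(Notification::WentDown { at_secs }))
    } else {
        (State::Suspect { misses }, None)
    }
}
```

Because `step` is a pure function of the current state and one input, the transition table could in principle be replaced by user-supplied configuration while keeping these transitions as the default.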

ShaguarWKL commented 1 year ago

@fujiapple852 please be aware I'm trying to recall all of these from memory that's over a decade ago now.

False Positives. Ping Plotter was notorious for it

Can you give some examples of the types of false positives you observed with Ping Plotter?

This was mainly due to Ping Plotter's default setting of 1s intervals when 3-5s would more than suffice, and I'm sure you have seen that the bulk of "engineers" just stick to default settings. This created excessive consumption of resources when you have thousands of ping instances from your edge (customers) coming to your core. As I mentioned, most of our customers were mission critical types, so with jitter or a sudden spike in latency, alarms would go off and customers would start calling the NOC. By the time the NOC answered, those conditions were gone, and those false positives had 0 impact on end user experience (customers of my customers). So we had to deal with junior guys who were just doing a CYA (cover your ass), which wasted our precious NOC engineer time.

I never did try to figure out why PP reported so much jitter variance when other platforms didn't.

I didn't have much luck getting customers to use Linux or BSD to handle the monitoring, so Ping Plotter was used by, iirc, 80-90% of my customers' engineers, and most of them didn't actually qualify to be network engineers; they were mostly sysadmin types.

Let's just say the situation was worse than having to deal with FPS or RTS or MOBA gamers when their latency goes up by 5ms.

Having the ability to export data should suffice so that other frontend tool platforms can easily parse that information

I'm glad you agree with that as I'm keen to avoid adding too much "bloat" to Trippy (like an email client, slack client and so forth). The trick here is going to be defining a good set of "events" that can be consumed by these tools.

Yes bloat sucks. As for what constitutes a good set of events, it really differs on the end user. I think a JSON or some sort of config file based on an API might be better so people can just tweak and also let other platforms integrate it.

I'm trying to say, just provide the framework and let other tools or users do the rest.

Colour coding tolerances could be something nice as well.

Trippy currently has some (very naive...) status colours per hop that boil down to this logic:

I think it serves its purpose very well. What you are trying to gather comments and opinions on right now is expanding its flexibility and hopefully it results in better functionality.

Like say I start a Trip to 8.8.8.8 but I can specify anything above 30ms outputs in Red or Green ... or don't "alert" unless X times happen in Y period.

Yes, we'd certainly want to have configurable conditions on events. As a slight aside, we do have to be mindful of the significant difference between responses from the target hop and non-target hops, especially for ICMP tracing, given how routers treat this traffic (i.e. in software) and their tendency to rate limit and drop such traffic, which can certainly lead to false positives.

Yup, and most network "engineers" can't even tell the difference. On a tangent, when my previous company was growing and I was trying to hire more network guys, I gave up and just hired fresh grads with or without Cisco certification, since a huge number of the applicants I interviewed myself couldn't answer the question "Please give me 1 reserved /24 range", CCIE applicants included. Much easier to just teach fresh grads the ropes.

I think starting with a base framework that accounts for the bulk of the use cases is more than enough. Apply the 80-20 rule, then iterate on a need or interest basis; that is probably how I would go about it.

Red isn't a Negative Situation colour in quite a few Asian countries, so it would be a nice touch to acknowledge cultural differences and what people associate as a colour denoting a negative situation

A good point; these status colours should be made configurable as part of the theme.

Don't do it if it takes more work than it needs to however.

I think @c-git seems to be of the same view I have. Framework, let users config themselves. I would even venture to say that 90% of the users will just use defaults and 10% of the users will want that extra functionality. So with that, you would be doing a lot of work for a small subset of Trippy users.

I am way too rusty to be of any help beyond high level feedback since the last time I actually wrote any code was in Pascal in 1991 and the last time I was in a professional NOC was in 2013.

And lastly, adding all this functionality to Trippy also means expanding and maintaining the documentation. That is a huge enterprise in and of itself =P

fujiapple852 commented 11 months ago

May be of interest: https://cloudevents.io/

c-git commented 11 months ago

Does indeed look interesting but might be a bit heavier on the trippy side. I think we should at least look at it. It might give people another option to connect to for notifications instead of the program I'm going to write. For my use case I'd still need to send the messages to another computer on the local network, but I don't mind giving other users more options, especially if there is no additional "cost" (performance, size, or other relevant metric).
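For a sense of how heavy a CloudEvents envelope would be, here is a minimal sketch that wraps an already serialized JSON payload in the attributes CloudEvents 1.0 requires (`specversion`, `id`, `source`, `type`) plus the optional `datacontenttype`. The struct, the example `source` and `type` values, and the hand-rolled formatting are all assumptions; the values are assumed not to need JSON escaping, and a real implementation would use a JSON library or a CloudEvents SDK.

```rust
/// A CloudEvents 1.0 envelope in JSON form (hypothetical helper type).
pub struct CloudEvent<'a> {
    pub id: &'a str,
    pub source: &'a str,
    pub event_type: &'a str,
    /// Pre-serialized JSON payload, assumed to be valid JSON.
    pub data_json: &'a str,
}

impl CloudEvent<'_> {
    /// Render the envelope as a single JSON object.
    /// Assumption: attribute values contain no characters that need escaping.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"specversion\":\"1.0\",\"id\":\"{}\",\"source\":\"{}\",\"type\":\"{}\",\"datacontenttype\":\"application/json\",\"data\":{}}}",
            self.id, self.source, self.event_type, self.data_json
        )
    }
}
```

The envelope adds a handful of fixed fields around the payload, so the extra cost on the emitting side is mostly choosing stable `source` and `type` names.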

fujiapple852 / trippy

Extending Trippy #636