MONROE-PROJECT / Maintenance

MONROE Maintenance procedures, and mostly an issue tracker.

Add a "surveillance system" for detecting anomalies #4

Open jonakarl opened 7 years ago

jonakarl commented 7 years ago

I think a mail or similar should go out if eg:

  1. We receive or import no "jsons" in an hour
  2. Disk space runs out on the nodes
  3. The "well known" docker containers are not running
  4. ....

Let's discuss options.

For point 1, I can easily implement this in the importer, and I am sure the other checks are just as easy to implement (although the nodes might have trouble sending emails when they have no internet connectivity; maybe we can get some of this info from the inventory?).

The other way is to implement a more generic option that parses the database and searches for the relevant info, if it exists in the db(s).

jonaswerme commented 7 years ago

For nodes we have finished an alarm system that can send email when a node's modem stops working and/or a node is gone (alarms trigger at the same intervals as the node and link colors in the maintenance portal/inventory interface). This is probably only useful for stationary nodes though. The alarms can be sent directly to both users of the maintenance portal and the contact for the node. We also have Slack integration, but I am not sure whether I broke it in the last round of changes. If someone is using Slack for monitoring I can have a look; if not, I will spend the time on other things. Changes will be pushed once I'm back in Norway.

relet commented 7 years ago

I think it would also be good to have a human-readable overview (in the visualization?) of which nodes are running the base experiments and reporting results. If we have to check the database, then no one will check with the frequency we need.

jonakarl commented 7 years ago

Can your system be extended to report the running docker containers, how many files were last synced, etc.?

I do not think Slack integration is top priority right now. In my view, a minimal alert system that can inform me (monroe-devel?) that things have gone pear-shaped, without me having to monitor a surveillance system (which I will forget), would satisfy my needs (others may have other needs).

relet commented 7 years ago

The thing with alert systems is that you have to teach them what your pears look like. I'm thinking that if we define metrics (e.g. number of ping results/24h), we can either show them on a dashboard, send daily reports, or set alarm limits to send alerts based on the metric. It seems like the inventory can do some of this. The metrics are what we have to implement in any case.
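To make the "metric plus alarm limit" idea concrete, here is a minimal sketch. The `MetricLimit` class, the metric name `ping_results_24h` and the threshold value are all hypothetical and only illustrate how one definition could feed a dashboard, a daily report or an alert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricLimit:
    """Alarm limits for one metric; None means no limit on that side."""
    name: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def evaluate(limit: MetricLimit, value: float) -> str:
    """Return 'low', 'high' or 'ok' for a measured value."""
    if limit.minimum is not None and value < limit.minimum:
        return "low"
    if limit.maximum is not None and value > limit.maximum:
        return "high"
    return "ok"


def daily_report(limits, values):
    """Build one report line per metric; metrics without a value say so."""
    lines = []
    for limit in limits:
        if limit.name not in values:
            lines.append(f"{limit.name}: no data")
            continue
        value = values[limit.name]
        lines.append(f"{limit.name}: {value} ({evaluate(limit, value)})")
    return lines


if __name__ == "__main__":
    # Hypothetical threshold, chosen only for illustration.
    limits = [MetricLimit("ping_results_24h", minimum=40000)]
    print("\n".join(daily_report(limits, {"ping_results_24h": 12000})))
```

The same `evaluate` result could drive an alert (anything other than `ok`) or simply be shown next to the number on a dashboard.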

jonakarl commented 7 years ago

I understand the problem, but to start with I think we have some "pears" that are pretty easy to spot.

Off the top of my head I come up with these:

  1. Docker containers that always must be running after boot (noop, metadata-subscriber, 3xping, tstat?)
  2. Ping produces 1 json per second and metadata god knows how many, so if we do not continually get some output from at least one of these two containers, something is wrong.
  3. If we cannot sync data to the repository within 1 hour, raise a flag
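The first item in that list could be checked roughly as follows. This is only a sketch: the container base names, the count of three ping instances, and the idea that the list of running names is already available (from whatever source) are assumptions.

```python
from collections import Counter

# Names taken from the list above; how they actually appear on a node,
# and the count of three ping instances, are assumptions.
REQUIRED = Counter({"noop": 1, "metadata-subscriber": 1, "ping": 3})


def missing_containers(running_names, required=REQUIRED):
    """Return {name: how_many_missing} for required containers not running.

    running_names holds one base name per running container instance.
    Extra, non-required containers are ignored.
    """
    # Counter subtraction keeps only positive counts, i.e. the shortfall.
    return dict(required - Counter(running_names))


if __name__ == "__main__":
    print(missing_containers(["noop", "ping", "ping"]))
```

An empty result means every required container is up; anything else is a candidate for an alert.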


kristrev commented 7 years ago

None of these will be implemented in the inventory alert system, as the inventory alert system is only concerned with node state (i.e., node up, modems up + connectivity).

However, implementing your own alert system shouldn't be too hard. Doesn't Cassandra have all these nice triggers/events you can listen to?

relet commented 7 years ago

Yes, I think you should be able to monitor all of these from the database side. The nodes can send SYSEVENT metadata, e.g. whenever they try to restart a container (when it is not up). If you get these events once a minute, the container is crashing.

kristrev commented 7 years ago

Be careful about depending on timing ("once a minute"), since you might have a slow, congested, ... connection. It is better to say "not received within X".
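A "not received within X" check can be sketched like this. The function name and the shape of the input (a mapping from node or container to the time of its newest record) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, now, max_silence):
    """Return the sources whose newest record is older than max_silence.

    last_seen maps a source name to the datetime of its latest record.
    Nothing is assumed about how regularly records arrive; only the
    time since the last one matters.
    """
    return sorted(
        source for source, seen in last_seen.items()
        if now - seen > max_silence
    )


if __name__ == "__main__":
    now = datetime(2016, 8, 30, 12, 0)
    seen = {
        "node-a": now - timedelta(minutes=5),
        "node-b": now - timedelta(hours=2),
    }
    print(silent_sources(seen, now, timedelta(hours=1)))
```

Because the check looks only at the age of the newest record, a congested link that delivers data in bursts does not trigger it as long as something arrives within the chosen X.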

mikepeon commented 7 years ago

I can write a small script to count the number of pings/https/etc. that entered the DB in the last 30 minutes. If the numbers are way too low, we can generate a warning directly. We could even publish this on a page on the web server, for example by writing the number of events to a file...
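Such a counting script might look roughly like the sketch below. It leaves out the database query itself and assumes the rows are already available as `(data_type, timestamp)` pairs; the function names, minimum counts and file format are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_recent(rows, now, window=timedelta(minutes=30)):
    """Count rows per data type whose timestamp lies within the window.

    rows is an iterable of (data_type, timestamp) pairs, however they
    were fetched from the database.
    """
    cutoff = now - window
    return Counter(kind for kind, ts in rows if cutoff <= ts <= now)


def low_count_warnings(counts, minimums):
    """Return one warning line per data type below its minimum count."""
    return [
        f"WARNING: {kind}: {counts[kind]} rows, expected at least {low}"
        for kind, low in sorted(minimums.items())
        if counts[kind] < low
    ]


def write_status(path, counts, warning_lines):
    """Write counts and warnings to a text file a web server could serve."""
    with open(path, "w", encoding="utf-8") as handle:
        for kind in sorted(counts):
            handle.write(f"{kind}: {counts[kind]}\n")
        for line in warning_lines:
            handle.write(line + "\n")


if __name__ == "__main__":
    now = datetime(2016, 8, 30, 12, 0)
    rows = [("ping", now - timedelta(minutes=1)),
            ("http", now - timedelta(minutes=10))]
    counts = count_recent(rows, now)
    # Hypothetical minimums, for illustration only.
    print(low_count_warnings(counts, {"ping": 5, "http": 1}))
```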


jonakarl commented 7 years ago

Maybe you can also list which nodes inserted the entries, so we get information on which nodes have entered data in the DB?
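A per-node breakdown could be sketched as follows, again assuming the rows have already been fetched and are available as `(node_id, data_type)` pairs. The function names and the list of expected nodes are hypothetical.

```python
from collections import defaultdict


def rows_per_node(rows):
    """Map node id -> {data_type: row count} for (node_id, data_type) rows."""
    per_node = defaultdict(lambda: defaultdict(int))
    for node_id, kind in rows:
        per_node[node_id][kind] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {node: dict(kinds) for node, kinds in per_node.items()}


def absent_nodes(expected_nodes, per_node):
    """Return the expected nodes that inserted no rows at all, sorted."""
    return sorted(set(expected_nodes) - set(per_node))


if __name__ == "__main__":
    rows = [(1, "ping"), (1, "ping"), (2, "http")]
    summary = rows_per_node(rows)
    print(summary)
    print(absent_nodes([1, 2, 3], summary))
```

Listing the absent nodes is often the more useful half, since a node that inserts nothing never shows up in a plain count.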


jonakarl commented 7 years ago

I am getting more and more fond of your idea, @mikepeon-imdea, of extracting the data out of the (cassandra) database. I think the biggest gain of doing that, instead of having support on the nodes or in the importer, is that it is the data in the database that we really care about. It does not matter how well the nodes work or how slick the importer is unless we get the data into the database.

The node tests might still be interesting from a debugging viewpoint, but if we can get an alert system based on the data we import, that would be great.

mikepeon commented 7 years ago

I'll work on it a bit today and tomorrow...


jonakarl commented 7 years ago

As described on the mailing list: as a first step towards a more high-level overview of the system status, I developed a script for manually checking node status (from the experiment/database view): https://github.com/MONROE-PROJECT/Database/blob/master/node_checkup/check_nodes.py

The script parses the database, extracts the data inserted into the db per node/operator, and validates the timestamps against what they should be (GPS information, modem metadata, HTTP download and RTT results for 3 operators, etc.) for the specified timespan.
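For readers who want a feel for what validating timestamps can mean, here is a minimal sketch of one possible approach, gap detection. It is not taken from `check_nodes.py`; the function name, the tolerance factor and the use of epoch seconds are assumptions.

```python
def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (start, end) pairs where consecutive samples are too far apart.

    timestamps: epoch seconds for one node/operator pair, in any order.
    expected_interval: nominal number of seconds between samples.
    tolerance: a gap is reported when it exceeds
        expected_interval * tolerance.
    """
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > limit
    ]


if __name__ == "__main__":
    # One sample per second is expected; the jump from 2 to 10 is a gap.
    print(find_gaps([0, 1, 2, 10, 11], expected_interval=1))
```

Running this per node and operator over the chosen timespan gives a list of outages rather than a single pass/fail, which is easier to act on.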