Open jonakarl opened 7 years ago
For nodes, we have finished an alarm system that can send email when a node's modem stops working and/or a node is gone (alarms trigger at the same intervals as the node and link colors in the maintenance portal/inventory interface). This is probably only useful for stationary nodes, though. The alarms can be sent directly to both users of the maintenance portal and the contact for the node. We also have Slack integration, but I am not sure whether I broke it in the last round of changes. If someone is using Slack for monitoring I can have a look; if not, I will spend the time on other things. Changes will be pushed once I'm back in Norway.
I think it would also be good to have a human-readable overview (in the visualization?) of which nodes are running the base experiments and reporting results. If we have to check the database, no one will check as often as we need.
Can your system be extended to report the running Docker containers, how many files were last synced, etc.?
I do not think Slack integration is top priority right now. In my view, a minimal alert system that can inform me (monroe-devel?) that things have gone pear-shaped, without me having to monitor a surveillance system (which I will forget), would satisfy my needs (others may have other needs).
The thing with alert systems is that you have to train them what your pears look like. I'm thinking that if we define metrics (e.g. number of ping results/24h), we can either show them on a dashboard, send daily reports, or set alarm limits to send alerts based on the metric. It seems like the inventory can do some of this. The metrics are what we have to implement in any case.
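To make the "define metrics, then report or alert" idea concrete, here is a minimal sketch. The `Metric` class, the metric names and the alarm limits are all hypothetical placeholders, not anything that exists in the inventory or the database code; the observed values would have to come from whatever query we end up writing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str         # hypothetical metric label, e.g. "ping results/24h"
    alarm_below: int  # hypothetical alarm limit: alert if the value is lower


def evaluate(metrics, observed):
    """Return (report_lines, alerts) for a dict of observed metric values."""
    report, alerts = [], []
    for m in metrics:
        value = observed.get(m.name, 0)  # a missing metric counts as zero
        status = "OK" if value >= m.alarm_below else "ALARM"
        report.append(f"{m.name}: {value} (limit {m.alarm_below}) {status}")
        if status == "ALARM":
            alerts.append(m.name)
    return report, alerts


# Illustrative numbers only.
metrics = [Metric("ping results/24h", 1000), Metric("http results/24h", 50)]
report, alerts = evaluate(
    metrics, {"ping results/24h": 1200, "http results/24h": 3}
)
```

The same `report` list could feed a dashboard or a daily mail, while `alerts` is what would trigger an immediate notification.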
I understand the problem, but to start with I think we have some pretty easy-to-spot "pears".
Off the top of my head, I came up with these:
Jonas Karlsson Senior Research Engineer
Karlstad University SE-651 88 Karlstad, Sweden Telephone: +46 54 700 15 64 Mobile: +46 70 672 06 20 Skype: karlsson.karl.jonas Hangout: karlsson.karl.jonas
twitter.com/kau facebook.com/karlstadsuniversitet KAU.SE
None of these will be implemented in the inventory alert system, as it is only concerned with node state (i.e., node up, modems up + connectivity).
However, implementing your own alert system shouldn't be too hard. Doesn't Cassandra have all these nice triggers/events you can listen to?
Yes, I think you should be able to monitor all of these from the database side. The nodes can send SYSEVENT metadata, e.g. whenever they try to restart a container (when it is not up). If you get these events once a minute, the container is crashing.
Be careful about depending on timing ("once a minute"), since you might have a slow, congested, ... connection. It is better to say "not received within X".
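A rough sketch of the "not received within X" check, which tolerates slow or congested links because it looks only at how long a source has been silent and not at an exact cadence. The source names and the one-hour limit are made up for illustration; where the "last seen" timestamps come from is left open.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_seen, now, max_silence):
    """True if nothing has been received within `max_silence` before `now`."""
    return last_seen is None or now - last_seen > max_silence


def stale_sources(last_seen_by_source, now, max_silence):
    """Return the sorted names of sources that have been silent too long."""
    return sorted(
        name
        for name, seen in last_seen_by_source.items()
        if is_stale(seen, now, max_silence)
    )


# Hypothetical data: "X" is chosen as one hour here.
now = datetime(2016, 8, 30, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "node-a": now - timedelta(minutes=5),
    "node-b": now - timedelta(hours=2),
    "node-c": None,  # never reported
}
silent = stale_sources(last_seen, now, timedelta(hours=1))
```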
I can write a small script to count the number of pings/https/etc. that entered the DB in the last 30 minutes. If the numbers are way too low, we can generate a warning directly. We could even publish this on a page on the web server, for example by writing the number of events to a file...
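Something along these lines, as a sketch only: it assumes the rows have already been fetched from the DB as `(kind, timestamp)` pairs (the actual query is not shown), and the minimum counts and the output file format are placeholders I made up.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_recent(rows, now, window=timedelta(minutes=30)):
    """Count rows per kind whose timestamp falls inside the window."""
    cutoff = now - window
    return Counter(kind for kind, ts in rows if cutoff <= ts <= now)


def low_count_warnings(counts, minimums):
    """Return one warning line per kind that is below its minimum."""
    return [
        f"WARNING: only {counts.get(kind, 0)} {kind} entries (expected >= {low})"
        for kind, low in sorted(minimums.items())
        if counts.get(kind, 0) < low
    ]


def write_status(path, counts, warnings):
    """Write counts and warnings to a plain-text file a web server can serve."""
    with open(path, "w", encoding="utf-8") as fh:
        for kind in sorted(counts):
            fh.write(f"{kind} {counts[kind]}\n")
        for line in warnings:
            fh.write(line + "\n")


# Hypothetical rows and limits.
now = datetime(2016, 8, 30, 12, 0, tzinfo=timezone.utc)
rows = [
    ("ping", now - timedelta(minutes=1)),
    ("ping", now - timedelta(minutes=10)),
    ("ping", now - timedelta(minutes=29)),
    ("http", now - timedelta(minutes=45)),  # outside the 30-minute window
]
counts = count_recent(rows, now)
warnings = low_count_warnings(counts, {"ping": 2, "http": 1})
```

Run from cron, `write_status` would keep the file fresh without anyone having to query the database by hand.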
Miguel Peón-Quirós IMDEA Networks Institute miguel.peon@imdea.org +34914816930 +34615607843
Maybe you can also list which nodes inserted the entries, so we get information on which nodes have entered data in the DB?
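For illustration, the per-node listing could be as simple as the sketch below. It assumes rows arrive as `(node_id, kind)` pairs and that we have some list of nodes we expect to hear from; both the row shape and the node ids are hypothetical.

```python
from collections import defaultdict


def entries_per_node(rows):
    """Map node id -> {kind: number of entries inserted by that node}."""
    summary = defaultdict(lambda: defaultdict(int))
    for node_id, kind in rows:
        summary[node_id][kind] += 1
    return {node: dict(kinds) for node, kinds in summary.items()}


def silent_nodes(expected_nodes, summary):
    """Return the expected nodes that inserted nothing at all."""
    return sorted(set(expected_nodes) - set(summary))


# Hypothetical node ids and rows.
rows = [(41, "ping"), (41, "ping"), (41, "http"), (57, "ping")]
summary = entries_per_node(rows)
missing = silent_nodes([41, 57, 63], summary)
```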
I am getting more and more fond of your idea, @mikepeon-imdea, of extracting the data out of the (Cassandra) database. I think the biggest gain of doing that, instead of having support on the nodes or in the importer, is that it is the data in the database that we really care about. It does not matter how well the nodes work or how slick the importer is unless we get the data into the database.
The node tests might still be interesting from a debugging viewpoint, but if we can get an alert system based on the data we import, that would be great.
I'll work on it a bit today and tomorrow...
As described on the mailing list: as a first step towards a more high-level overview of the system status, I developed a script for manually checking node status (from the experiment/database view): https://github.com/MONROE-PROJECT/Database/blob/master/node_checkup/check_nodes.py.
The script parses the database, extracts the data inserted into the DB by node/operator, and validates the timestamps against what they should be (GPS information, modem metadata, HTTP download and RTT results for 3 operators, etc.) for the specified timespan.
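To illustrate what "validating timestamps against what they should be" can mean, here is a small gap-detection sketch. It is not taken from `check_nodes.py`; the function name, the tolerance and the 10-second interval are assumptions made only for this example, and timestamps are plain epoch seconds.

```python
def find_gaps(timestamps, expected_interval, tolerance=0.5):
    """Return (start, end) pairs where consecutive samples are too far apart.

    A gap is reported when two neighbouring timestamps differ by more than
    expected_interval * (1 + tolerance).
    """
    limit = expected_interval * (1 + tolerance)
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical series that should have one sample every 10 seconds.
gaps = find_gaps([0, 10, 20, 50, 60], expected_interval=10)
```

Here `gaps` contains the single interval between 20 and 50, where samples are missing.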
I think a mail or similar should go out if, e.g.:
1. We receive or import no "jsons" in an hour
Let's discuss options.
For point 1, I can easily implement this in the importer, and I am sure the other checks are just as easily implemented (although on the nodes we might have trouble sending emails if we have no internet connectivity; or maybe we can get some of this info from the inventory?).
The other way is to implement a more generic option that parses the database and searches for the relevant info, if it exists in the DB(s).