Open RamRamez opened 4 months ago
@aminlatifi @mohammadranjbarz do you have any comments on this issue? @jainkrati assigned me this issue and I'm interested in your comments.
It's very general; it could be seen as an epic.
@aminlatifi Can you please create issues and connect them to this epic? Or who is the best person for this? I don't know the scope of this issue.
Let's meet tomorrow to sync on the approach.
We have a full suite of monitoring, logging, and alerting set up, and it is constantly being expanded. If you need anything in this category, please inquire with DevOps.
I invited DevOps to the call but no one showed up.
Anyway, I will create a chat to coordinate on this async.
Yes, thank you! That is why I saw it, but I usually don't pay attention to these invites as the Giveth calendar changes all the time. I am all for this kind of session, however, so let's try again.
As we discussed in the call, the initial approach to this task is to create levels of priority (or call it urgency) for logs and to address each category according to its severity. The levels would be like this:
The three-level scheme is not mandatory; we can define more or fewer groups, as long as we agree on the tools and disciplines. Classifying the logs within each service will be a separate task per service and will always be open for refinement.
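As an illustration of what such a classification could look like in code, here is a minimal sketch. The level names and numeric values are assumptions (loosely following pino's defaults), not something the team has agreed on, and `actionFor` is a hypothetical helper:

```typescript
// Hypothetical severity levels, loosely following pino's numeric defaults.
// The names and values are illustrative; the team still needs to agree on them.
const LOG_LEVELS = {
  info: 30,   // routine, informative logs: no action needed
  warn: 40,   // suspicious but recoverable: reviewed periodically
  error: 50,  // a failed operation: visible in dashboards
  fatal: 60,  // service-threatening: should alert someone immediately
} as const;

type LevelName = keyof typeof LOG_LEVELS;

// Decide what should happen to a log entry based on its severity.
function actionFor(level: LevelName): 'store' | 'dashboard' | 'alert' {
  const value = LOG_LEVELS[level];
  if (value >= LOG_LEVELS.fatal) return 'alert';     // e.g. Grafana -> Discord/email
  if (value >= LOG_LEVELS.error) return 'dashboard'; // surfaced for review
  return 'store';                                    // kept in the log store only
}

console.log(actionFor('fatal')); // alert
console.log(actionFor('warn'));  // store
```

The point is only that each group maps to a distinct, agreed-upon handling discipline; the actual tooling behind each action is decided later in this thread.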
@RamRamez please guide @ae2079 on this.
@ae2079 please ask Kay or MoeShehab to grant you access to our logs dashboard (if you don't have it already); then we can have a short call to discuss it.
Let me follow up with @ae2079 on Discord.
@aminlatifi I started changing log levels in the impact-graph project and changed some of the error logs to fatal and warn, but I don't know exactly which errors should change. Can you provide me a list of errors for each type, or a guide on how to select the log level for each case? And for the emergency logs, you mentioned for example sending email; can you provide more details about that?
Impact-graph is an ever-changing project; I assume @mohammadranjbarz is its tech lead. He can pinpoint the critical parts better than me.
@jainkrati I checked our logs, and it seems we already have reasonably good logs in all the catch statements, and lots of functions contain informative logs. I don't know what we want in this issue: do we want to add more informative logs, or add a new tool for logging/informing us about emergencies? If we want to integrate a new tool, we should first talk with @mhmdksh and @geleeroyale to find the best solution, then implement it. WDYT?
FYI @ae2079
@mohammadranjbarz While you haven't mentioned me, I would draw attention again to this comment so we don't leave this technical debt incomplete: https://github.com/Giveth/impact-graph/issues/1375#issuecomment-1991865403 Suppose the requests to the "balance aggregator" fail; we won't be informed until the issue is reported by users, which is bad for us! It's the same for other things like verification issues, etc.
So here is what needs to be done @ae2079 @mohammadranjbarz
Thanks @aminlatifi for the information.
@jainkrati So we need these logging improvements for all backend projects, not only impact-graph.
We should schedule a call with @jainkrati @mhmdksh @geleeroyale to figure out the best way to notify the DevOps team whenever a fatal error happens (I assume just logging to a file would not help us much here).
@ae2079 You can clone these repositories and make sure we have logs in all the catch statements, until we decide what to do about fatal errors.
@mohammadranjbarz For fatal errors I suggest using the Ortto API and its dashboard, which we already have. The logger system would log, and Ortto would send emails to specified people on the development team.
This issue seems very general, but from what I understood, we need a way to improve alerting on the logs based on the URGENCY, IMPORTANCE, or CRITICALITY of the errors we sometimes receive.
Thanks for the suggestion @RamRamez, but sending logs to another third party is something we've wanted to get away from for some time now, partly to decrease our app's dependency on outside entities and partly for cost optimisation.
Like @geleeroyale said earlier, we can expand our current logging system so that the alerts are more advanced. However, I'm open to suggestions once we meet over this.
WDYT @jainkrati @geleeroyale
@mhmdksh I mean using Ortto just for sending emails on fatal errors, something like dapp-mailer.
I think we can use Grafana's alerting system for this purpose and send emails via that, or send alerts to a Discord channel via Grafana; this may resolve the problem you mentioned.
yes! this is how it currently works.
The channel is called #prometheus-alerts ... not everyone might have access. If something is posted in this channel, DevOps people will actually get a DM from a bot, so we make sure not to miss outages, for example. These do trigger! We then usually check the logs and inform devs in #all-dev.
So it's settled then: we continue with Grafana triggering urgent issue logs in Discord. Now we only need to review the log message for each scenario.
Yes. To be honest, identifying the cases in which we need to throw alerts is the hardest part; setting up the actual alerts is mostly trivial.
@ae2079 I think these places need to throw an alert (I don't know how you should throw it):

- **When donation verification fails**: you can search `syncDonationStatusWithBlockchainNetwork() error` to find the place; you can add a condition that if a donation was created more than 20 minutes ago and we failed to verify it, we should throw an alert
- **When creating a donation fails**: you can search `Error on creating donation for draftDonation with ID`
- **When activating cronJobs fails**: you can search `continueDbSetup) error`
- **When bootstrap() fails**: you can search `bootstrap() error`

If I missed something, please add items yourself.
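The 20-minute condition for donation verification could be sketched roughly as below. This is illustrative only: the `Donation` shape, the field names (`createdAt`, `verified`), and the helper name are assumptions for the sketch, not the real impact-graph schema.

```typescript
// Hypothetical shape of a donation record; the real impact-graph entity differs.
interface Donation {
  id: number;
  createdAt: Date;
  verified: boolean;
}

const TWENTY_MINUTES_MS = 20 * 60 * 1000;

// Returns true when a donation is old enough that a failed verification
// should be escalated (e.g. logged at fatal level so Grafana can alert on it).
function shouldAlertOnVerificationFailure(
  donation: Donation,
  now: Date = new Date(),
): boolean {
  const ageMs = now.getTime() - donation.createdAt.getTime();
  return !donation.verified && ageMs > TWENTY_MINUTES_MS;
}

// Example: a donation created 30 minutes ago that is still unverified.
const now = new Date();
const stale: Donation = {
  id: 1,
  createdAt: new Date(now.getTime() - 30 * 60 * 1000),
  verified: false,
};
console.log(shouldAlertOnVerificationFailure(stale, now)); // true
```

The key design point is that the check gates on age, so transient verification delays right after creation do not trigger alerts.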
If the code throws an error that we can see in the logs, I can try setting an alert for when that item appears in the logs. We need to modify our instance slightly; I can work on that.
So maybe @ae2079 can check whether the items described by @mohammadranjbarz throw errors when failing.
Great, so we don't need to make any changes in the code, am I right? (except item 1)
We can now evaluate rules on incoming logs to create alerts (per instance, i.e. staging, production, ...).
Now we need to collect the queries in Grafana that we want to raise alerts on.
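As a sketch of what one such Grafana query could look like in Loki's LogQL, assuming the error strings listed earlier in the thread appear verbatim in the log lines. The label names (`app`, `env`) are assumptions and would need to match the real Loki labels on our instances:

```logql
count_over_time({app="impact-graph", env="production"} |= "bootstrap() error" [5m]) > 0
```

This counts log lines containing the exact string over the last 5 minutes and fires when any are found; one query per error string from the list would be the simplest starting point.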
We have changed the level of these errors to fatal (it is now merged to staging) so that we can set an alert in Grafana based on the log level. @geleeroyale with this approach, we don't need to change Grafana queries to add new alerts; simply by adding fatal logs, we can add new alerts whenever needed.
Oh sorry, I missed this. Amazing stuff! So we should query for the exact strings above?
Edit: I am not sure I can add conditions like the one for donation verification; we need to think in terms of the limits of Grafana Loki queries.
I added some new alerts on errors, following @ae2079's suggestions above, to staging as well as production. If you have any ideas for fleshing out the descriptions or making the alerts as informative as possible, don't hesitate to comment.
I implemented the condition for that one, and its error message is now "donation verification failed"; the others are correct.
Can we define a general alert based on the log level? I suggest adding a query that alerts on all of the fatal logs.
@geleeroyale for example, as you can see in this image, the log level of `syncDonationStatusWithBlockchainNetwork() error` is 50, which means it is an error log; I suggest alerting on all logs with level 60 (i.e., fatal logs).
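Assuming the JSON log lines carry a numeric `level` field with these values (50 = error, 60 = fatal, which match pino's defaults), the general alert reduces to a simple level threshold. The `LogLine` shape below is an assumption for illustration:

```typescript
// Assumed pino-style numeric levels: 50 = error, 60 = fatal.
const FATAL = 60;

// Hypothetical shape of a parsed JSON log line.
interface LogLine {
  level: number;
  msg: string;
}

// Keep only the lines that should raise the general alert.
function fatalLines(lines: LogLine[]): LogLine[] {
  return lines.filter((line) => line.level >= FATAL);
}

const sample: LogLine[] = [
  { level: 50, msg: 'syncDonationStatusWithBlockchainNetwork() error' },
  { level: 60, msg: 'bootstrap() error' },
];
console.log(fatalLines(sample).map((l) => l.msg)); // only 'bootstrap() error'
```

The same threshold, expressed as a Loki query on the parsed `level` field, would let a single Grafana alert cover every current and future fatal log without per-message queries.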
Logging and monitoring - handle failure of services in impact-graph, the RPC node, and Etherscan-like services (block explorers) for fetching chain information.