Giveth / impact-graph


[Tech Debt] Improve log message to indicate the priority #1375

Open RamRamez opened 4 months ago

RamRamez commented 4 months ago

Logging and monitoring: handle failures of services in impact-graph, RPC nodes, and Etherscan-like services (block explorers) used for fetching chain information.

RamRamez commented 4 months ago

@aminlatifi @mohammadranjbarz do you have any comments on this issue? @jainkrati assigned me this issue and I'm interested in your comments.

aminlatifi commented 4 months ago

It's very general; it could be seen as an epic.

RamRamez commented 4 months ago

@aminlatifi Can you please create issues and connect them to this epic? Or who is the best person for this? I don't know the scope of this issue.

jainkrati commented 3 months ago

Let's meet tomorrow to sync on the approach.

geleeroyale commented 3 months ago

We have a full suite of monitoring, logging, and alerting set up, and it is constantly being expanded. If you need anything in this category, please inquire with DevOps.

jainkrati commented 3 months ago

Invited DevOps to the call but no one showed up.

[Screenshot 2024-03-12 at 8:13:32 PM]

Anyway, I will create a chat to coordinate on this async.

geleeroyale commented 3 months ago

> Invited DevOps to the call but no one showed up. [Screenshot 2024-03-12 at 8:13:32 PM]
>
> Anyway, I will create a chat to coordinate on this async.

Yes, thank you! That is why I saw it, but I usually don't pay any attention to these invites as the Giveth calendar changes all the time. I am all for this kind of session, however, so let's try again.

aminlatifi commented 3 months ago

As we discussed in the call, the initial approach to this task is to create priority (or emergency) levels for logs and to handle each category according to its severity. The levels would be something like this:

  1. Emergency or real-time issues: issues like "Givpower bot ran out of gas" or "impact-graph has an issue fetching the Givpower balance from the balance aggregator". These should produce logs that get attention immediately, e.g. by sending emails, SMS, etc.
  2. Less important issues that don't need real-time attention but must be checked every 24-48 hours: issues like "Etherscan API or Infura outage which stops donation validation" or "subgraph out of sync". Responsibility for checking these logs can be rotated around the team.
  3. Regular logs, like what we produce today; it's nice to keep them for future investigation.

The three-level scheme is not mandatory and we can define more or fewer groups; we just must agree on the tools and disciplines. Classifying the logs of each service will be a separate task per service and will always be open for refinement.
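To make the levels concrete, here is a minimal sketch in TypeScript, assuming a pino-style logger; the `notify()` helper is a hypothetical stand-in for whatever email/SMS integration we end up choosing:

```ts
import pino from 'pino';

const logger = pino({ level: 'info' });

// Hypothetical notifier: wire this up to email/SMS/Discord later.
async function notify(message: string): Promise<void> {
  // e.g. POST to an alerting webhook
}

// Level 1: emergency or real-time issue, gets immediate attention.
export async function logEmergency(message: string, err?: Error): Promise<void> {
  logger.fatal({ err }, message);
  await notify(message);
}

// Level 2: important, reviewed from the dashboard every 24-48 hours.
export function logImportant(message: string, err?: Error): void {
  logger.error({ err }, message);
}

// Level 3: regular logs, kept for future investigation.
export function logRegular(message: string): void {
  logger.info(message);
}
```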

jainkrati commented 1 month ago

@RamRamez please guide @ae2079 on this.

RamRamez commented 1 month ago

@ae2079 please ask Kay or MoeShehab to grant you access to our logs dashboard (if you don't already have it); then we can have a short call to discuss it.

geleeroyale commented 1 month ago

Let me follow up with @ae2079 on Discord.

ae2079 commented 1 month ago

@aminlatifi I started changing log levels in the impact-graph project and changed some of the error logs to fatal and warn, but I don't know exactly which errors should change. Can you provide a list of errors for each type, or a guide on how to select the log level for each case? And for the emergency logs, you mentioned sending emails as an example; can you provide more details about that?

aminlatifi commented 1 month ago

> @aminlatifi I started changing log levels in the impact-graph project and changed some of the error logs to fatal and warn, but I don't know exactly which errors should change. Can you provide a list of errors for each type, or a guide on how to select the log level for each case? And for the emergency logs, you mentioned sending emails as an example; can you provide more details about that?

Impact-graph is an ever-changing project; I assume @mohammadranjbarz is its tech lead. He can pinpoint the critical parts better than I can.

mohammadranjbarz commented 3 weeks ago

@jainkrati I checked our logs, and it seems we already have reasonably good logs in all the catch statements, plus informative logs inside lots of functions. I'm not sure what we want from this issue: do we want to add more informative logs, or to add a new tool for logging/informing us about emergencies? If we want to integrate a new tool, we should first talk with @mhmdksh and @geleeroyale to find the best solution and then implement it. WDYT?

FYI @ae2079

aminlatifi commented 3 weeks ago

@mohammadranjbarz While you haven't mentioned me, I want to draw attention to this comment again so we don't leave this technical debt incomplete: https://github.com/Giveth/impact-graph/issues/1375#issuecomment-1991865403. Suppose the requests to the balance aggregator fail: we won't be informed until the issue is reported by users, which is bad for us! It's the same for other things like verification issues, etc.

jainkrati commented 3 weeks ago

So here is what needs to be done @ae2079 @mohammadranjbarz:

  1. Review the current logs. Prefix "[urgent]" / "[important]" / "[notice]" to the log message to denote the importance of the error and how quickly someone in DevOps needs to respond after seeing it (see the sketch after this list).
  2. If there are unnecessary logs, we should remove them.
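A minimal sketch of the prefix idea (the `tagged` helper and the logger import path are hypothetical; impact-graph's actual logger setup may differ):

```ts
import { logger } from './utils/logger'; // hypothetical import path

// The agreed severity tags, prepended so DevOps can filter on them
// in the log dashboard.
type Severity = 'urgent' | 'important' | 'notice';

function tagged(severity: Severity, message: string): string {
  return `[${severity}] ${message}`;
}

// Usage:
logger.error(tagged('urgent', 'Givpower bot ran out of gas'));
logger.warn(tagged('notice', 'subgraph is slightly out of sync'));
```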

mohammadranjbarz commented 3 weeks ago

Thanks @aminlatifi for the information. @jainkrati So we need this logging work for all backend projects, not only impact-graph.

We should schedule a call with @jainkrati @mhmdksh @geleeroyale to figure out the best way to notify the DevOps folks whenever a fatal error happens (I assume just logging to a file would not help us much on this matter).

mohammadranjbarz commented 3 weeks ago

@ae2079 You can clone these repositories and make sure we have logs in all the catch statements, until we decide what to do about fatal errors.

RamRamez commented 3 weeks ago

@mohammadranjbarz For fatal errors I suggest using the Ortto API and its dashboard, which we already have. The logger system would log, and Ortto would send emails to specified people on the development team.

mhmdksh commented 3 weeks ago

This issue seems very general. But from what I understand, we need a way to improve alerting on the logs based on the URGENCY, IMPORTANCE, or CRITICALITY of the errors we sometimes receive.

mhmdksh commented 3 weeks ago

> @mohammadranjbarz For fatal errors I suggest using the Ortto API and its dashboard, which we already have. The logger system would log, and Ortto would send emails to specified people on the development team.

Thanks for the suggestion @RamRamez, but sending logs to another third party is something we've wanted to get away from for some time now, partly to decrease our app's dependency on outside entities and partly for cost optimisation.

Like @geleeroyale said earlier, we can expand our current logging system so that the alerts are more advanced. However, I'm open to suggestions once we meet about this.

WDYT @jainkrati @geleeroyale

RamRamez commented 3 weeks ago

@mhmdksh I mean using Ortto just for sending emails about fatal errors, something like dapp-mailer.
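If we went that route, the shape would be something like the sketch below. To be clear, the endpoint, payload, and recipient list here are purely hypothetical placeholders, not Ortto's real API:

```ts
import { logger } from './utils/logger'; // hypothetical import path

// Purely illustrative: the URL and payload shape are hypothetical, NOT Ortto's
// real API. The idea is just "on fatal, also fire an email to the dev team".
// Uses the global fetch available in Node 18+.
async function emailOnFatal(message: string): Promise<void> {
  await fetch('https://example.invalid/hypothetical-email-endpoint', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      to: ['dev-team@example.org'], // hypothetical recipient list
      subject: '[FATAL] impact-graph',
      text: message,
    }),
  });
}

export function logFatal(message: string, err?: Error): void {
  logger.fatal({ err }, message);
  void emailOnFatal(message); // fire and forget; don't block the caller on email
}
```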

ae2079 commented 3 weeks ago

> Thanks for the suggestion @RamRamez, but sending logs to another third party is something we've wanted to get away from for some time now, partly to decrease our app's dependency on outside entities and partly for cost optimisation.
>
> Like @geleeroyale said earlier, we can expand our current logging system so that the alerts are more advanced. However, I'm open to suggestions once we meet about this.

I think we can use Grafana's alerting system for this purpose and send emails via that, or send alerts to a Discord channel via Grafana; this may resolve the problem you mentioned.

geleeroyale commented 3 weeks ago

> I think we can use Grafana's alerting system for this purpose and send emails via that, or send alerts to a Discord channel via Grafana; this may resolve the problem you mentioned.

Yes! This is how it currently works.

The channel is called #prometheus-alerts ... not everyone might have access. If something is posted in this channel, DevOps people will actually get a DM from a bot, so we make sure not to miss outages, for example. These do trigger! We then usually check the logs and inform devs in #all-dev.

jainkrati commented 2 weeks ago

So it's settled then: we continue with Grafana triggering alerts for urgent issue logs in Discord. Now we only need to review the log message for each scenario.

geleeroyale commented 2 weeks ago

> So it's settled then: we continue with Grafana triggering alerts for urgent issue logs in Discord. Now we only need to review the log message for each scenario.

Yes. To be honest, identifying the cases where we need to throw alerts is the hardest part; setting up the actual alerts is mostly trivial.

mohammadranjbarz commented 2 weeks ago

@ae2079
I think these are the places where we need to throw an alert (I don't know how you should throw it):

  • When donation verification fails: search for `syncDonationStatusWithBlockchainNetwork() error` to find the place. You can add a condition: if the donation was created more than 20 minutes ago and we still failed to verify it, we should throw an alert (sketched below).
  • When creating a donation fails: search for `Error on creating donation for draftDonation with ID`.
  • When activating cron jobs fails: search for `continueDbSetup) error`.
  • `bootstrap() error`: search for `bootstrap() error`.

If I missed something, please add items yourself.
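A rough sketch of the condition in the first bullet (assuming a pino-style logger; the donation fields, helper name, and import path are hypothetical):

```ts
import { logger } from './utils/logger'; // hypothetical import path

const TWENTY_MINUTES_MS = 20 * 60 * 1000;

// Escalate to fatal only when the donation is older than 20 minutes and still
// unverified, so transient failures right after creation don't page anyone.
function handleVerificationFailure(
  donation: { id: number; createdAt: Date },
  err: Error,
): void {
  const ageMs = Date.now() - donation.createdAt.getTime();
  if (ageMs > TWENTY_MINUTES_MS) {
    logger.fatal({ err, donationId: donation.id }, 'donation verification failed');
  } else {
    logger.error(
      { err, donationId: donation.id },
      'syncDonationStatusWithBlockchainNetwork() error',
    );
  }
}
```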

geleeroyale commented 2 weeks ago

> I think these are the places where we need to throw an alert […]

If the code throws an error that we can see in the logs, I can try setting an alert for when that string appears. We need to modify our instance slightly; I can work on that.

So maybe @ae2079 can check whether the items described by @mohammadranjbarz throw errors when failing.

mohammadranjbarz commented 2 weeks ago

> If the code throws an error that we can see in the logs, I can try setting an alert for when that string appears. We need to modify our instance slightly; I can work on that.
>
> So maybe @ae2079 can check whether the items described by @mohammadranjbarz throw errors when failing.

Great, so we don't need to make any changes in the code, am I right? (except item 1)

geleeroyale commented 2 weeks ago
[screenshot]

We can now evaluate rules on incoming logs to create alerts (per instance, i.e. staging, production, ...).

Now we need to collect the queries in Grafana that we want to raise alerts on.
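For example, an alert rule could count occurrences of one of the exact strings above; a sketch in LogQL, where the `app` label and the five-minute window are assumptions about our Loki setup:

```logql
count_over_time({app="impact-graph"} |= `syncDonationStatusWithBlockchainNetwork() error` [5m]) > 0
```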

ae2079 commented 1 week ago

> I think these are the places where we need to throw an alert […]

We have changed the level of these errors to fatal (now merged into staging) so that we can set an alert in Grafana based on the log level. @geleeroyale with this approach we don't need to change the Grafana queries to add new alerts; simply by adding new fatal logs, we can add new alerts whenever needed.

geleeroyale commented 4 days ago

Oh sorry, I missed this. Amazing stuff! So we should query for the exact strings above?

Edit: I am not sure I can add conditions like the one for donation verification; we need to think in terms of the limits of Grafana Loki queries.

geleeroyale commented 4 days ago
[screenshot]

I added some new alerts on errors, following @ae2079's suggestions above, to staging as well as production. If you have any ideas for fleshing out the descriptions or making the alerts as informative as possible, don't hesitate to comment.

ae2079 commented 4 days ago

> Oh sorry, I missed this. Amazing stuff! So we should query for the exact strings above?
>
> Edit: I am not sure I can add conditions like the one for donation verification; we need to think in terms of the limits of Grafana Loki queries.

I implemented the condition for that one, and the error message for it is now "donation verification failed"; the others are correct.

ae2079 commented 4 days ago
> [screenshot]
>
> I added some new alerts on errors, following @ae2079's suggestions above, to staging as well as production. If you have any ideas for fleshing out the descriptions or making the alerts as informative as possible, don't hesitate to comment.

Can we define a general alert based on the log level? I suggest adding a query that alerts on all fatal logs.

ae2079 commented 4 days ago
[screenshot]

@geleeroyale for example, as you can see in this image, the log level of `syncDonationStatusWithBlockchainNetwork() error` is 50, which means it is an error log. I suggest alerting on all logs with level 60 (i.e. fatal logs).
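A sketch of such a level-based rule, assuming the logs are shipped to Loki as JSON with a numeric pino-style `level` field and queried under an `app` label (both assumptions about our setup):

```logql
count_over_time({app="impact-graph"} | json | level = 60 [5m]) > 0
```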