Context
This document relates to the epic ARCH-260 Next Generation Logging and Tracing.
The overall aim is to improve logging and debugging.
We came up with the following steps:
- Standardise logs
- Introduce a tool to visualize logs
- Distributed tracing
The sections below go through each step briefly.
Standardise logs
The first step we identified is to standardise logs.
Solution
1. Updating logging object
An object along these lines could be helpful:
Response Error Object
```ts
type ResponseErrorLog = {
  application: string;
  statusCode: number;
  message: string;
  level: 'info' | 'error' | 'warn';
  environment: 'prod';
  request: {
    url: string;
    headers: Record<string, unknown>;
  };
  response: {
    statusCode: number;
    message: string;
    headers: Record<string, unknown>;
  };
  responseTime: string;
};
```
Info log
```ts
type InfoLog = {
  application: string;
  message: string;
  level: 'info';
  environment: 'prod';
};
```
Error log
```ts
type ErrorLog = {
  application: string;
  message: string;
  level: 'error';
  environment: 'prod';
};
```
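One way to keep the three shapes above in sync is to bind the shared fields once per service. The following is only a minimal sketch under that assumption: `BaseLog`, `createLogger` and the `write` callback are hypothetical names and not part of any existing library, and `ct-backend` is used purely as a sample application name.

```typescript
// All names here (BaseLog, createLogger, write) are hypothetical, for illustration only.
type Level = 'info' | 'warn' | 'error';

interface BaseLog {
  application: string;
  message: string;
  level: Level;
  environment: string;
}

// Binds the fields that never change per service, so call sites only supply
// the level and message and every entry ends up with the same shape.
function createLogger(
  application: string,
  environment: string,
  write: (log: BaseLog) => void,
) {
  const emit = (level: Level) => (message: string): void =>
    write({ application, message, level, environment });
  return { info: emit('info'), warn: emit('warn'), error: emit('error') };
}

// Sample usage: print each entry as one JSON line.
const logger = createLogger('ct-backend', 'prod', (log) => console.log(JSON.stringify(log)));
logger.info('booking created');
logger.error('payment provider timed out');
```

In a real setup the `write` callback would presumably be replaced by the chosen logging library; the point is only that `application` and `environment` are set in one place.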
Here is an example of what the change could look like (before vs. after).
We tried replacing `bunyan` with `pinojs` for this. You can check out this PR: https://github.com/comtravo/ct-backend/pull/11487
We chose pinojs because it is faster and has an ecosystem of logging tools around it, so we would be able to keep the logging format similar.
2. Adding environment to logs
Currently we index logs by environment, which means we have a different source for each environment. This makes it difficult to switch between sources. If we add the environment to each log and index them together, we could switch environments from within the logs themselves.
![image](https://user-images.githubusercontent.com/75316673/127350174-3a631dd1-2a8b-4eb4-8145-3ff96adf70b4.png)
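With a shared index, switching environments would come down to a filter on the `environment` field. The sketch below builds such an Elasticsearch query body; `buildEnvironmentQuery` is a hypothetical helper, and it assumes `environment` and `level` are mapped as keyword fields so that `term` filters match exactly.

```typescript
// Hypothetical helper: builds an Elasticsearch query body that narrows a
// shared log index to one environment, optionally to one level as well.
// Assumes `environment` and `level` are keyword fields in the index mapping.
function buildEnvironmentQuery(environment: string, level?: string) {
  const filter: Array<{ term: Record<string, string> }> = [{ term: { environment } }];
  if (level) {
    filter.push({ term: { level } });
  }
  return { query: { bool: { filter } } };
}

// Sample usage: all error logs from prod.
console.log(JSON.stringify(buildEnvironmentQuery('prod', 'error'), null, 2));
```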
3. Source for logs
For alignment and ease, we could keep using Elasticsearch as the source for the different tools.
4. Getting rid of debug lib
Currently we use a package called `debug`. It lets us set an environment variable `DEBUG: *`, and once this is set the resulting logs are used for debugging lambdas and services.
We could instead make use of [pino-debug](https://github.com/pinojs/pino-debug). This would allow us to keep a single logging library and call it in the same way as the rest of our logging, e.g. `logger.debug()`.
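If debug output moves to `logger.debug()`, the existing `DEBUG` variable could still act as the switch by deciding the logger's level. This is only a sketch of that idea: `resolveLogLevel` is a hypothetical helper, and treating any non-empty `DEBUG` value as "debug on" is an assumption rather than how the `debug` package matches namespaces.

```typescript
type Level = 'debug' | 'info';

// Hypothetical helper: maps the existing DEBUG variable onto a logger level.
// Assumption: any non-empty value enables debug logs; namespace patterns
// such as "app:*" are not interpreted here.
function resolveLogLevel(env: Record<string, string | undefined>): Level {
  const debug = env.DEBUG;
  return debug !== undefined && debug.trim() !== '' ? 'debug' : 'info';
}

// Sample usage with literal environments.
console.log(resolveLogLevel({ DEBUG: '*' })); // debug logs enabled
console.log(resolveLogLevel({}));             // debug logs suppressed
```

The result could then be passed to the logger's `level` option, with `process.env` as the argument.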
5. Getting rid of ct_inspector
Currently we use `ct_inspector` to log and also to track the time taken by third-party calls.
It is also used in this [flight search dashboard](https://grafana.prod.comtravo.com/d/4xPM7fGr8/flight-search-api-details?orgId=1&refresh=5m)
![image](https://user-images.githubusercontent.com/75316673/127357499-f42e2e51-f196-4834-8627-223428d40635.png)
An idea from Puneeth was to use a time-series database for this.
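Whatever ends up storing the timings, the measurement itself can be a small wrapper around each third-party call. The following is a rough sketch and not how `ct_inspector` works: `timeCall`, `TimingLog` and the `emit` callback are hypothetical names, and the injectable `now` clock exists only to make the wrapper easy to test.

```typescript
interface TimingLog {
  message: string;
  level: 'info' | 'error';
  responseTime: string;
}

// Hypothetical wrapper: times an async third-party call and reports the
// duration through `emit`, for both successful and failed calls.
async function timeCall<T>(
  name: string,
  call: () => Promise<T>,
  emit: (log: TimingLog) => void,
  now: () => number = Date.now,
): Promise<T> {
  const start = now();
  try {
    const result = await call();
    emit({ message: `${name} succeeded`, level: 'info', responseTime: `${now() - start}ms` });
    return result;
  } catch (err) {
    emit({ message: `${name} failed`, level: 'error', responseTime: `${now() - start}ms` });
    throw err; // the caller still sees the original failure
  }
}

// Sample usage with a stand-in for a supplier request.
timeCall('supplier-search', async () => ['flight-1'], (log) => console.log(JSON.stringify(log)));
```

The emitted `responseTime` could be written to the logs, to a time-series database, or to both.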
6. Redacting customer info before using a 3rd party tool
Currently, when swagger validation fails, the entire object that failed is logged. This includes fields like `booking.guest_travelers` and `booking.booker`, which contain personal information such as email addresses and phone numbers.
So we should redact these before using a 3rd party tool for visualization.
This becomes really easy with pinojs:
```ts
redact: {
paths: [
'req.headers.authorization',
'request_data.body.booker',
'request_data.body.guest_travelers'
],
remove: true
},
```
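To make the effect of `remove: true` concrete, here is a small stand-alone sketch that deletes the same dotted paths from a log object. It is an illustrative stand-in, not pino's implementation: `removePaths` is a hypothetical function, it only supports plain dotted paths, and the sample payload is made up.

```typescript
// Illustrative stand-in for what `remove: true` does; not pino's implementation.
// Only plain dotted paths are supported (no wildcards or bracket syntax).
function removePaths<T>(input: T, paths: string[]): T {
  // Deep copy so the caller's object is left untouched.
  const copy = JSON.parse(JSON.stringify(input));
  for (const path of paths) {
    const keys = path.split('.');
    const last = keys.pop() as string;
    let node: any = copy;
    for (const key of keys) {
      if (node === null || typeof node !== 'object') {
        node = undefined; // path does not exist in this object
        break;
      }
      node = node[key];
    }
    if (node !== null && typeof node === 'object') {
      delete node[last];
    }
  }
  return copy;
}

// Sample usage with a made-up failed-validation payload.
const failed = {
  request_data: { body: { booker: { email: 'jane@example.com' }, trip_id: 't-1' } },
};
console.log(JSON.stringify(removePaths(failed, [
  'req.headers.authorization',
  'request_data.body.booker',
  'request_data.body.guest_travelers',
])));
```

Paths that are absent from a given log entry are simply skipped, so the same list can be applied to every entry.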
Introduce a tool to visualise logs
Tools currently under consideration
- Sentry
- Jaeger Tracing
Things to consider when choosing a tool
- Should we keep using the existing ELK stack or use S3?
- We also want a different approach to alerting; sending all messages to Slack is not really scalable
- Distributed tracing, although it is the last step, is still important to consider here. This is where we would be able to see the flow of a request through the system, using some kind of correlation id
- A dashboard like this may give an idea of the final outcome: https://grafana.infra.comtravo.com/d/BcyIAPz7k/test-dashboard-playground?orgId=1&refresh=10s
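The correlation id mentioned in the list above could be carried from service to service in a request header. The sketch below shows one possible shape of that, under stated assumptions: the header name `x-correlation-id` and the helper `withCorrelationId` are hypothetical, and header keys are assumed to be lowercased already.

```typescript
import { randomUUID } from 'crypto';

// Hypothetical header name; nothing in our services defines one yet.
const CORRELATION_HEADER = 'x-correlation-id';

// Reuses the caller's correlation id if present, otherwise starts a new one.
// Assumes header keys have already been lowercased.
function withCorrelationId(headers: Record<string, string>): Record<string, string> {
  const existing = headers[CORRELATION_HEADER];
  return existing ? headers : { ...headers, [CORRELATION_HEADER]: randomUUID() };
}

// Sample usage: the first service creates the id, downstream calls keep it.
const incoming = withCorrelationId({ accept: 'application/json' });
const downstream = withCorrelationId(incoming);
console.log(incoming[CORRELATION_HEADER] === downstream[CORRELATION_HEADER]);
```

Adding the same id to every log entry would then let a tool group all logs that belong to one request.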