aws-powertools / powertools-lambda-python

A developer toolkit to implement Serverless best practices and increase developer velocity.
https://docs.powertools.aws.dev/lambda/python/latest/
MIT No Attribution

Leveraging Lambda Powertools Logger for querying in Athena or other AWS services #460

Closed: rickyaws closed this issue 3 years ago

rickyaws commented 3 years ago

Runtime: Python

Is your feature request related to a problem? Please describe. IHAC (I have a customer) who was looking into the benefits of the logger formatting, but was having a hard time getting Athena to ingest the logs from CloudWatch and create a schema out of them. They were wondering whether any consideration had been given to making the logger format friendlier to data ingestion by services outside of CloudWatch. For example, as it stands, this is how it outputs to CloudWatch with the Powertools Logger on:

2021-05-10T15:26:57.772Z {"level":"INFO","location":"checkOrderInput:50","message":"Checking required fields ['contact', 'address', 'id', 'expedite', 'installDate', 'addressValidationOverride', 'product', 'asset']","timestamp":"2021-05-10 15:26:57,771+0000","service":"new-order","xray_trace_id":"1-609950be-18655ee7e321f53ab8b4f629"}

With this format you could extract two columns:

  1. the timestamp, and
  2. the entire message as a JSON struct column,

or use the Grok SerDe in Athena with a regex to try to grab a pattern. In either case you would still need to run this through some type of data processing, no different than if you used the built-in Python logger with custom configuration, but slightly more challenging due to the JSON structure.
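
For reference, the kind of handler that produces the log line above looks roughly like this (a minimal sketch, not our actual code; the service name and message are taken from the example output):

from aws_lambda_powertools import Logger

logger = Logger(service="new-order")

@logger.inject_lambda_context
def lambda_handler(event, context):
    # Powertools emits each log call as a single JSON object per line
    required = ["contact", "address", "id", "expedite", "installDate",
                "addressValidationOverride", "product", "asset"]
    logger.info(f"Checking required fields {required}")
    return {"statusCode": 200}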

Describe the solution you'd like

Either for the Powertools Logger to output each line as a JSON object (removing the timestamp at the beginning of the log so Athena/Glue can use the built-in JSON SerDe to parse it and create columns), or some documentation or examples on how to leverage this logger format in queries and create metric filters from it.

Describe alternatives you've considered

Using a messy DDL statement in Athena with the Grok SerDe:

CREATE EXTERNAL TABLE ugi(
  loglevel string COMMENT 'from deserializer',
  timestamp string COMMENT 'from deserializer',
  service string COMMENT 'from deserializer',
  traceid string COMMENT 'from deserializer')
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format'='(?\"level\":\"(.{4,10})),([^.]+)(?\"timestamp\":\"(.{10,28}))\"([^.]+)(?\"service\":\"(.{3,10}))\",([^.]+)(?xray_trace_id\":\"(.{30,40}))\"')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://ugi/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='powertoollogs',
  'averageRecordSize'='191',
  'classification'='powertoollogs',
  'compressionType'='none',
  'grokPattern'='(?\"level\":\"(.{4,10})),([^.]+)(?\"timestamp\":\"(.{10,28}))\"([^.]+)(?\"service\":\"(.{3,10}))\",([^.]+)(?xray_trace_id\":\"(.{30,40}))\"',
  'objectCount'='1',
  'recordCount'='1',
  'sizeKey'='191',
  'typeOfData'='file')

Just using the built-in Python logger and the traditional CloudWatch-Firehose-S3 architecture for streaming the logs into an S3 bucket for Athena, to avoid parsing the JSON structure.

If you provide guidance, is this something you'd like to contribute? I am not the best developer, but sure! I could help.

Additional context

Providing some examples of how others have leveraged the Powertools Logger, or any use cases where this logger has made operational tasks easier, would be very valuable and easier to sell to customers. Right now, outside of the nice structured, uniform formatting it creates in CloudWatch Logs, I do not see another way to use this data efficiently.

heitorlessa commented 3 years ago

Hey @rickyaws - thanks for creating this detailed feature request, much appreciated.

Reading from mobile, I don't think I fully understand the main issue -- is it the timestamp field value? Or is it that plus the known limitation of Kinesis Firehose not adding a newline for JSON objects, which breaks Glue/Athena?

If it's the former, have you experimented with a different date format using "Logger(datefmt=)"?
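
For example, something like this (a minimal sketch; the exact strftime string here is just illustrative):

from aws_lambda_powertools import Logger

# datefmt accepts time.strftime directives and replaces the default timestamp format
logger = Logger(service="new-order", datefmt="%Y-%m-%dT%H:%M:%S%z")
logger.info("Checking required fields")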

There's an open discussion in the Python repo about creating a tutorial to educate customers on structured and canonical logging, and what common logging analytics tools are used across the board, e.g. ElasticSearch, CloudWatch Log Analytics (cross-account limitation, I know), Honeycomb, Datadog, Loggly, etc.

That's something we've been discussing with the Well-Architected team since that spans multiple technologies as it's an industry practice.

For now, I'm interested in understanding if the timestamp is the issue so I can work with you and your customer on fitting this in Athena.

We might move this to the Python repo, as this seems more like docs than a feature to be added/enhanced.

Thanks again!

heitorlessa commented 3 years ago

Cc @pcolazurdo for a discussion later

mwarkentin commented 3 years ago

Is the timestamp at the start of the message actually coming from lambda-powertools, or is that something that CloudWatch is adding during the export to S3?

mwarkentin commented 3 years ago

Also, not sure if you've tried CloudWatch Logs Insights, but it's quite easy to search and operate on the structured logs from Powertools in there - maybe you could get away with that without having to jump through the S3 / Athena hoops (unless this is driven by cost considerations, etc.).

heitorlessa commented 3 years ago

It’s CloudWatch - we don’t do that. That seems to be the direct integration with Athena.

In other projects I exported CloudWatch Logs -> Lambda for pre-processing -> Kinesis Firehose -> S3 -> Glue -> Athena. The pre-processing Lambda would do two things: 1/ only use the payload, nothing else, and 2/ append a base64-encoded newline to the payload, because Kinesis Firehose doesn't and that breaks Glue and Athena.
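
The pre-processing function was roughly along these lines (a sketch from memory, not the original code; the delivery stream name and the JSON-only filter are assumptions):

import base64
import gzip
import json

import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "centralised-logs"  # hypothetical stream name


def lambda_handler(event, context):
    # CloudWatch Logs subscriptions deliver a gzip-compressed, base64-encoded payload
    payload = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    log_data = json.loads(payload)

    records = []
    for log_event in log_data.get("logEvents", []):
        message = log_event["message"]
        # keep only the structured JSON lines; drop START/REPORT/END and other noise
        if not message.startswith("{"):
            continue
        # trailing newline so Glue/Athena see one JSON object per line in S3
        records.append({"Data": (message.rstrip("\n") + "\n").encode("utf-8")})

    if records:
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM, Records=records
        )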


rickyaws commented 3 years ago

@heitorlessa Yeah, I think your proposal makes sense; it's more of a documentation matter than a feature per se.

I was essentially thinking that, since Powertools already structures the logs as JSON, it would reduce the pre-processing steps someone would have to do to create a centralized logging architecture at the least cost (our customer is cost conscious), using the steps you mention below. So in my mind I thought I would be able to use the JSON SerDe that Athena/Glue have built in to quickly create the schema based on the JSON structure, but that timestamp CW adds essentially messes it up. In short, I was looking for some type of documentation or use cases to show the customer how they can leverage this format to make their log aggregation and day-to-day operational analysis easier than just using the Python built-in logger and defining all the structure themselves, similar to how the Metrics module significantly reduces the overhead of creating custom metrics and then using those custom metrics for alarms and notifications.

"In other projects I exported CloudWatch Logs -> Lambda for pre-processing -> Kinesis Firehose -> S3 -> Glue -> Athena. Pre-processing Lambda would do two things: 1/ only use the payload nothing else, 2/ append a based64 newline to the payload because Kinesis Firehose doesn’t and it breaks Glue and Athena"

@mwarkentin -> I was trying to figure out how to query in CloudWatch Logs Insights but was not having great success; I will take your example and retry. The only limitation for our use case with Insights is that you would have to go into each individual account to look at the logs, as opposed to managing them all in a central Security account.

mwarkentin commented 3 years ago

@rickyaws that makes sense. It makes me think that cross-account CloudWatch Logs Insights (or sharing log groups with RAM?) could be pretty powerful too (hint hint). :)

pcolazurdo commented 3 years ago

@rickyaws querying with CloudWatch Logs Insights works great [1], and you can automate queries via APIs, create dashboards, etc. The way Logs Insights works with the CWL string format directly makes querying this very easy. Also check out Contributor Insights, as it is a very useful feature (watch out for the costs, though).

Now, the idea of using Athena directly to parse these logs is going to become very complex (if at all possible) due to two facts:

  1. the "schema" might change with each logged line. This will be both from a code perspective (as new structure may be logged, for example, if the user outputs a dictionary) and also because the user doesn't have complete control of what will be logged from the environment: i.e. the log lines that are created by the worker itself which aren't handled by the powertools logger (i.e. START, REPORT and END lines, exception errors, print copmmands, etc) Athena doesn't handle variable schemas very well as the table definition is pretty static - Using ORC or Parquet may help a bit, but again, the nature of flexible logging makes it very hard for a more static tool to parse this at will.

  2. How the transport from CWL to Firehose to S3 will wrap the messages to pass along the different layers.

Trying to use Grok or regex to parse the logs will make it very hard to maintain and process, for the same reasons.

As an alternative, you could use the JSON format on the subscription filter, which will properly wrap all the output from the Lambda logs (including the Powertools Logger) into a single field named "message" as an encoded string, but AFAIK Athena isn't powerful enough to turn this back into JSON.

i.e.

[ ...
{
      "id": "36168440376909863750225378866398863959406407651950067771",
      "timestamp": 1621848958630,
      "message": "REPORT RequestId: 4fae6371-c5ce-432d-812e-67cd621e2819\tDuration: 298.08 ms\tBilled Duration: 299 ms\tMemory Size: 128 MB\tMax Memory Used: 90 MB\t\nXRAY TraceId: 1-60ab737e-47c4b2fd43ec6df90c90c3c4\tSegmentId: 59a6cccc6f62a3d8\tSampled: true\t\n"
},
{
      "id": "36168440376040134687482684563878970946773121553216831544",
      "timestamp": 1621848958591,
      "message": "{\"level\":\"INFO\",\"location\":\"lambda_handler:76\",\"message\":\"Completed\",\"timestamp\":\"2021-05-24 09:35:58,591+0000\",\"service\":\"crawler_frontapi_save\",\"cold_start\":false,\"function_name\":\"ServerlessCrawlerStack-CrawlerFrontAPISaveFunction-1C5FWAYTV559J\",\"function_memory_size\":\"128\",\"function_arn\":\"arn:aws:lambda:us-east-2:123456789012:function:ServerlessCrawlerStack-CrawlerFrontAPISaveFunction-1C5FWAYTV559J\",\"function_request_id\":\"4fae6371-c5ce-432d-812e-67cd621e2819\",\"parse_id\":\"#2021-05-24T09:35:58.491645\",\"url_site\":\"\",\"xray_trace_id\":\"1-60ab737e-47c4b2fd43ec6df90c90c3c4\"}\n"
},
...]

[1] As an example of the power of CloudWatch Logs Insights, this is a very useful query:

stats count() as cnt by level, service
| filter not isempty(service) and not isblank(level)
| sort cnt desc
| limit 20
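
And if you want to automate it via the API, here is a minimal boto3 sketch (the log group name and time window are placeholders):

import time

import boto3

logs = boto3.client("logs")

QUERY = (
    "stats count() as cnt by level, service "
    "| filter not isempty(service) and not isblank(level) "
    "| sort cnt desc "
    "| limit 20"
)

start = logs.start_query(
    logGroupName="/aws/lambda/my-function",  # placeholder log group
    startTime=int(time.time()) - 3600,       # last hour
    endTime=int(time.time()),
    queryString=QUERY,
)

# poll until the query finishes, then print each row as a dict
results = logs.get_query_results(queryId=start["queryId"])
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=start["queryId"])

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})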
pcolazurdo commented 3 years ago

After doing some additional research, I don't think Athena will be a good alternative for querying structured logs, due to the variety of fields that can be produced on each output, and there is nothing in Lambda Powertools itself that could improve this situation. If anything, additional work should be done on Kinesis Firehose to export only specific records and, of those, only specific fields.

I suggest closing this request as it is my understanding this is not something that Lambda Powertools itself could or should handle.

heitorlessa commented 3 years ago

Agree, thanks a lot @pcolazurdo for taking this for a spin!

@rickyaws the suggestion moving forward is to centralise these logs into ElasticSearch, and to add a +1 for CloudWatch Logs Insights cross-account search.

If ElasticSearch/ELK isn't an option here due to management, you can use Kinesis to centralize CloudWatch Logs into another account - helps with management but increases costs too.

Before we close it, I'll transfer it to the Python repo and update the Logger FAQ to make this info more broadly available.

Thank you all!

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! We'll come back to you as soon as we can.

heitorlessa commented 3 years ago

Just added to docs and will be available in the next docs release (hopefully next week). Tagging to share once it's available.