awslabs / athena-glue-service-logs

Glue scripts for converting AWS Service Logs for use in Athena
Apache License 2.0
142 stars 46 forks source link

Replace INSIDE_QS with QS #4

Open dacort opened 5 years ago

dacort commented 5 years ago

There are several places where INSIDE_QS is used instead of QS particularly in ALB regex. As noted in #3 this can cause issues in certain cases.

However, we have to be careful in replacing this - there are some fields where if they are empty, they will be a simple hyphen (-) or a quoted hyphen ("-") depending on the log entry. This pops up specifically with S3 access logs, where some entries can be quoted strings, a quoted hyphen, or a non-quoted hyphen. If the grok expression doesn't match, data won't be returned so could go missing.

Further, these situations can be difficult to detect. In a simple test, I replaced INSIDE_QS with QS and did a COUNT(*) of the two tables with Athena. The same number was returned. However, when I did a COUNT(*) with a WHERE statement filtering by request_id prefix...different results were returned. I'm guessing this is due to a COUNT(*) not deserializing everything(?).