There are several places where INSIDE_QS is used instead of QS particularly in ALB regex. As noted in #3 this can cause issues in certain cases.
However, we have to be careful in replacing this - there are some fields where if they are empty, they will be a simple hyphen (-) or a quoted hyphen ("-") depending on the log entry. This pops up specifically with S3 access logs, where some entries can be quoted strings, a quoted hyphen, or a non-quoted hyphen. If the grok expression doesn't match, data won't be returned so could go missing.
Further, these situations can be difficult to detect. In a simple test, I replaced INSIDE_QS with QS and did a COUNT(*) of the two tables with Athena. The same number was returned. However, when I did a COUNT(*) with a WHERE statement filtering by request_id prefix...different results were returned. I'm guessing this is due to a COUNT(*) not deserializing everything(?).
There are several places where
INSIDE_QS
is used instead ofQS
particularly in ALB regex. As noted in #3 this can cause issues in certain cases.However, we have to be careful in replacing this - there are some fields where if they are empty, they will be a simple hyphen (
-
) or a quoted hyphen ("-"
) depending on the log entry. This pops up specifically with S3 access logs, where some entries can be quoted strings, a quoted hyphen, or a non-quoted hyphen. If the grok expression doesn't match, data won't be returned so could go missing.Further, these situations can be difficult to detect. In a simple test, I replaced
INSIDE_QS
withQS
and did aCOUNT(*)
of the two tables with Athena. The same number was returned. However, when I did aCOUNT(*)
with aWHERE
statement filtering by request_id prefix...different results were returned. I'm guessing this is due to aCOUNT(*)
not deserializing everything(?).