Open mdschmitt opened 3 years ago
That would be ideal, but per the documentation, certain fields (like bytes_sent
) can be -
if zero. So there needs to be some additional logic in the conversion script to handle that. :\
I think object_size
is safe from that, but I have yet to do the investigation to confirm it.
Any field can be set to - to indicate that the data was unknown or unavailable, or that the field was not applicable to this request.
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/LogFormat.html
Apparently not.
Wouldn't a translation from -
to 0
for a number field be reasonable though? Seems like this would be a relatively simple transform..
Not really because in each field the -
character can have a different meaning. So translating it to 0 can be a little misleading - no data is different than 0 bytes, for example.
It would probably be better to create new fields based off a case statement when we see hyphens. :/
Makes sense, and you're right. sigh. Especially annoyed by this datatype stuff b/c even in the official examples like this they just CAST a turnaroundtime to an INT in the sql query anyway_. 😡
I tried to create a grok-based glue crawler (based loosely on this post) and was moderately successful, but specifically with the s3 access logs since they're delivered as static flat files, I can't actually partition it without a little ETL magic to move the files themselves around. fwiw, my grok patt for the custom classifier originally looked like this
%{WORD:bucket_owner} %{NOTSPACE:bucket} \[%{TIMESPLIT:time:timestamp}\] %{IP:remote_ip} (?:-|%{NOTSPACE:requester}) %{NOTSPACE:request_id} %{NOTSPACE:operation} %{NOTSPACE:key} "%{WORD:request_method} %{URIPATH:request_uri}(?: HTTP/%{NUMBER:http_version})" %{NUMBER:http_status} (?:-|%{NOTSPACE:error_code}) (?:-|%{NUMBER:bytes_sent:double}) (?:-|%{NUMBER:object_size:double}) (?:-|%{NUMBER:total_time:int}) (?:-|%{NUMBER:turnaround_time:int}) "(?:-|%{NOTSPACE:referrer})" %{QUOTEDSTRING:user_agent} (?:-|%{NOTSPACE:version_id}) %{NOTSPACE:host_id} (?:-|%{NOTSPACE:signature_version}) %{NOTSPACE:cipher_suite} (?:-|%{NOTSPACE:authentication_type}) %{NOTSPACE:host_header} TLSv%{NUMBER:tls_version}
but it seems like Glue doesn't play nice with conditionals (or certain datatypes maybe?), so it ended up being this one that worked to create the schema correctly:
%{WORD:bucket_owner} %{NOTSPACE:bucket} \[%{TIMESPLIT:time}\] %{IP:remote_ip} %{NOTSPACE:requester} %{NOTSPACE:request_id} %{NOTSPACE:operation} %{NOTSPACE:key} "%{WORD:request_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}" %{NUMBER:http_status} %{NOTSPACE:error_code} %{DATA:bytes_sent} %{DATA:object_size} %{DATA:total_time} %{NOTSPACE:turnaround_time} "%{NOTSPACE:referrer}" %{QUOTEDSTRING:user_agent} %{NOTSPACE:version_id} %{NOTSPACE:host_id} %{NOTSPACE:signature_version} %{NOTSPACE:cipher_suite} %{NOTSPACE:authentication_type} %{NOTSPACE:host_header} TLSv%{NUMBER:tls_version}
custom Grok patt def (same as HTTPDATE, just keeps the values):
TIMESPLIT %{MONTHDAY:day:short}/%{MONTH:month:string}/%{YEAR:year:int}:%{TIME} %{INT}
Does everything it needs to. ....except make partitions.
Yea, S3 access logs (given their age) are particularly challenging.
re: conditionals, you can see something I did back in 2019 when there was an extra field briefly in the middle by searching the grok pattern for 2019_OPTION
.
Just wondering if bytes_sent and object_size could be switched from type string to int or bigint for the optimized table. Is there a reason these are set the way they are?
https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/s3_access.py#L112