awslabs / athena-glue-service-logs

Glue scripts for converting AWS Service Logs for use in Athena
Apache License 2.0

Better datatypes for S3 access logs? #28

Open mdschmitt opened 3 years ago

mdschmitt commented 3 years ago

Just wondering if bytes_sent and object_size could be switched from type string to int or bigint for the optimized table. Is there a reason these are set the way they are?

https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/s3_access.py#L112

dacort commented 3 years ago

That would be ideal, but per the documentation, certain fields (like bytes_sent) can be - if zero. So there needs to be some additional logic in the conversion script to handle that. :\

I think object_size is safe from that, but I have yet to do the investigation to confirm it.

mdschmitt commented 2 years ago

"Any field can be set to - to indicate that the data was unknown or unavailable, or that the field was not applicable to this request." (https://docs.aws.amazon.com/AmazonS3/latest/userguide/LogFormat.html)

Apparently not.

Wouldn't translating - to 0 for a number field be reasonable, though? Seems like it would be a relatively simple transform.
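For what it's worth, here's a minimal sketch of what I mean, assuming the raw table is already loaded into a Spark DataFrame in the Glue job (the helper name and column list are just placeholders):

```python
from pyspark.sql import DataFrame, functions as F

# Hypothetical helper: replace "-" with 0 in the numeric-looking columns
# and cast the result to bigint.
def dashes_to_zero(df: DataFrame, cols=("bytes_sent", "object_size")) -> DataFrame:
    for c in cols:
        df = df.withColumn(
            c,
            F.when(F.col(c) == "-", F.lit(0)).otherwise(F.col(c).cast("bigint")),
        )
    return df
```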

dacort commented 2 years ago

Not really, because the - character can have a different meaning in each field. So translating it to 0 could be a little misleading: no data is different from 0 bytes, for example.

It would probably be better to create new fields based on a case statement when we see hyphens. :/
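Something along these lines, as a rough sketch (again assuming a Spark DataFrame; the _int suffix and column names are placeholders, not anything this repo defines):

```python
from pyspark.sql import DataFrame, functions as F

# Sketch: keep the original string column and add a typed sibling that is NULL
# when the raw value is "-", so "no data" stays distinguishable from 0 bytes.
def add_typed_fields(df: DataFrame, cols=("bytes_sent", "object_size")) -> DataFrame:
    for c in cols:
        df = df.withColumn(
            f"{c}_int",
            F.when(F.col(c) == "-", F.lit(None)).otherwise(F.col(c).cast("bigint")),
        )
    return df
```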

mdschmitt commented 2 years ago

Makes sense, and you're right. Sigh. I'm especially annoyed by this datatype stuff because even in the official examples like this, they just CAST turnaroundtime to an INT in the SQL query anyway. 😡

I tried to create a grok-based Glue crawler (based loosely on this post) and was moderately successful, but since S3 access logs are delivered as static flat files, I can't actually partition them without a little ETL magic to move the files themselves around. FWIW, my grok pattern for the custom classifier originally looked like this:

```
%{WORD:bucket_owner} %{NOTSPACE:bucket} \[%{TIMESPLIT:time:timestamp}\] %{IP:remote_ip} (?:-|%{NOTSPACE:requester}) %{NOTSPACE:request_id} %{NOTSPACE:operation} %{NOTSPACE:key} "%{WORD:request_method} %{URIPATH:request_uri}(?: HTTP/%{NUMBER:http_version})" %{NUMBER:http_status} (?:-|%{NOTSPACE:error_code}) (?:-|%{NUMBER:bytes_sent:double}) (?:-|%{NUMBER:object_size:double}) (?:-|%{NUMBER:total_time:int}) (?:-|%{NUMBER:turnaround_time:int}) "(?:-|%{NOTSPACE:referrer})" %{QUOTEDSTRING:user_agent} (?:-|%{NOTSPACE:version_id}) %{NOTSPACE:host_id} (?:-|%{NOTSPACE:signature_version}) %{NOTSPACE:cipher_suite} (?:-|%{NOTSPACE:authentication_type}) %{NOTSPACE:host_header} TLSv%{NUMBER:tls_version}
```

but it seems like Glue doesn't play nicely with conditionals (or maybe certain datatypes?), so this is the one that ended up creating the schema correctly:

```
%{WORD:bucket_owner} %{NOTSPACE:bucket} \[%{TIMESPLIT:time}\] %{IP:remote_ip} %{NOTSPACE:requester} %{NOTSPACE:request_id} %{NOTSPACE:operation} %{NOTSPACE:key} "%{WORD:request_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}" %{NUMBER:http_status} %{NOTSPACE:error_code} %{DATA:bytes_sent} %{DATA:object_size} %{DATA:total_time} %{NOTSPACE:turnaround_time} "%{NOTSPACE:referrer}" %{QUOTEDSTRING:user_agent} %{NOTSPACE:version_id} %{NOTSPACE:host_id} %{NOTSPACE:signature_version} %{NOTSPACE:cipher_suite} %{NOTSPACE:authentication_type} %{NOTSPACE:host_header} TLSv%{NUMBER:tls_version}
```

Custom grok pattern definition (same as HTTPDATE, but it keeps the values):

```
TIMESPLIT %{MONTHDAY:day:short}/%{MONTH:month:string}/%{YEAR:year:int}:%{TIME} %{INT}
```
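For reference, registering that classifier with boto3 looks roughly like this; the classifier name and classification string are placeholders I made up, not anything this repo defines:

```python
import boto3

# GROK_PATTERN should be the full working pattern shown above (truncated here for brevity).
GROK_PATTERN = r"%{WORD:bucket_owner} %{NOTSPACE:bucket} \[%{TIMESPLIT:time}\] ..."
CUSTOM_PATTERNS = (
    "TIMESPLIT %{MONTHDAY:day:short}/%{MONTH:month:string}/%{YEAR:year:int}:%{TIME} %{INT}"
)

glue = boto3.client("glue")
glue.create_classifier(
    GrokClassifier={
        "Name": "s3-access-grok",            # placeholder classifier name
        "Classification": "s3_access_logs",  # placeholder classification label
        "GrokPattern": GROK_PATTERN,
        "CustomPatterns": CUSTOM_PATTERNS,
    }
)
```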

It does everything it needs to... except make partitions.
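The "ETL magic" I'd need is basically this, as a sketch: the TIMESPLIT captures above already surface year/month/day, so the converted data could be written back out partitioned by those columns (the output path here is a placeholder):

```python
from pyspark.sql import DataFrame

# Sketch: write the converted data back out partitioned by the year/month/day
# columns produced by the TIMESPLIT captures, so Athena can prune by date.
def write_partitioned(df: DataFrame, output_path: str) -> None:
    (
        df.write.mode("append")
        .partitionBy("year", "month", "day")
        .parquet(output_path)
    )
```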

dacort commented 2 years ago

Yea, S3 access logs (given their age) are particularly challenging.

Re: conditionals, you can see something I did back in 2019, when an extra field briefly appeared in the middle, by searching the grok pattern for 2019_OPTION.