awslabs / athena-glue-service-logs

Glue scripts for converting AWS Service Logs for use in Athena
Apache License 2.0

Duplicate S3 access log request IDs and glue deduping #32

Open mikeplem opened 1 year ago

mikeplem commented 1 year ago

We are testing version 6.0.0 of the tool using Glue 3.0 and have noticed that some access log entries are being deduplicated away when the files are converted into Hive/Parquet format.

Here is an example from our access logs:

9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY BATCH.DELETE.OBJECT f1683055189142x766494105173435800/IMG_1138.jpeg - 204 - - - - - - - - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE - "POST /BUCKET_NAME/?delete HTTP/1.1" 200 - 305 - 29 - "-" "-" - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -

The Athena query output is as follows:

#   bucket_owner    bucket  time    remote_ip   requester   request_id  operation   key request_uri http_status error_code  bytes_sent  object_size total_time  turnaround_time referrer    user_agent  version_id  host_id signature_version   cipher_suite    authentication_type host_header tls_version year    month   day
1   9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8    BUCKET_NAME 2023-05-02 19:19:55.000 52.12.241.113   IAM_ARN_HERE    1C25EZNCB2HBMQQY    REST.POST.MULTI_OBJECT_DELETE       POST /BUCKET_NAME/?delete HTTP/1.1  200     305     29                  Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg==    SigV2   ECDHE-RSA-AES128-GCM-SHA256         TLSv1.2 - - 2023    05  02

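This is concerning because the Athena query output does not show which file was deleted. Both log entries share the request ID 1C25EZNCB2HBMQQY, but only the REST.POST.MULTI_OBJECT_DELETE row appears in the output; the BATCH.DELETE.OBJECT row, which carries the object key, is missing. It appears the second S3 access log entry is overwriting the first when the file is converted.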
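To illustrate what we suspect is happening, here is a minimal sketch in plain Python. We have not confirmed how the conversion job actually deduplicates; the `dedupe` helper and the choice of key fields below are assumptions made only for illustration, and the records are trimmed to the three relevant fields from the log lines above.

```python
# Two simplified records modeled on the example log lines (relevant fields only).
records = [
    {
        "request_id": "1C25EZNCB2HBMQQY",
        "operation": "BATCH.DELETE.OBJECT",
        "key": "f1683055189142x766494105173435800/IMG_1138.jpeg",
    },
    {
        "request_id": "1C25EZNCB2HBMQQY",
        "operation": "REST.POST.MULTI_OBJECT_DELETE",
        "key": "-",
    },
]


def dedupe(rows, key_fields):
    """Hypothetical helper: keep the last row seen for each combination of key_fields."""
    seen = {}
    for row in rows:
        seen[tuple(row[f] for f in key_fields)] = row
    return list(seen.values())


# Assumption: deduplicating on request_id alone collapses both entries into one.
by_request_id = dedupe(records, ["request_id"])

# A wider key keeps both entries, because operation and key differ.
by_composite = dedupe(records, ["request_id", "operation", "key"])

print(len(by_request_id), by_request_id[0]["operation"])
print(len(by_composite))
```

If the job behaves like the first call, the per-object BATCH.DELETE.OBJECT entry would be lost whenever it shares a request ID with the parent multi-object delete request, which matches what we see in Athena.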

Thank you for taking the time to look into this.