DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

UnicodeDecodeError raised by log forwarder #5582

Open dsotirho-ucsc opened 1 year ago

dsotirho-ucsc commented 1 year ago

Occurred 7 times within 25 minutes on 2023/09/29 Link to CloudWatch logs)

[ERROR] UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 420: ordinal not in range(128)
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1752, in __call__
    return self.handler(event_obj)
  File "/var/task/chalice/app.py", line 1708, in __call__
    return self.handler(request, self.next_handler)
  File "/var/task/azul/chalice.py", line 191, in _lambda_context_middleware
    return get_response(event)
  File "/var/task/app.py", line 293, in forward_s3_logs
    app.log_controller.forward_s3_access_logs(event)
  File "/var/task/azul/indexer/log_forwarding_controller.py", line 47, in forward_s3_access_logs
    self._forward_logs(event, self.s3)
  File "/var/task/azul/indexer/log_forwarding_controller.py", line 38, in _forward_logs
    for message in service.read_logs(event.bucket, event.key):
  File "/var/task/azul/indexer/log_forwarding_service.py", line 38, in read_logs
    for message in self._parse_log_lines(body):
  File "/var/task/azul/indexer/log_forwarding_service.py", line 130, in _parse_log_lines
    for message in super()._parse_log_lines(file_body):
  File "/var/task/azul/indexer/log_forwarding_service.py", line 54, in _parse_log_lines
    for row in csv.reader(file_body, delimiter=' ', quotechar='"'):
  File "/var/task/azul/indexer/log_forwarding_service.py", line 127, in _read_log
    yield line.decode('ascii')

dsotirho-ucsc commented 1 year ago

Spike to investigate and diagnose.

nadove-ucsc commented 1 year ago

It was very difficult to correlate the failures to specific log files. Perhaps we should consider finding a way to streamline that step.

The 7 log files that failed to forward are

2023-09-30-06-27-54-649044166FC4297A
2023-09-30-06-40-04-84E35196EF38F3BE
2023-09-30-06-28-15-19BBF71A613294D1
2023-09-30-06-20-12-DCA8C497A9E01293
2023-09-30-06-14-53-ADDCA5BBA34C010A
2023-09-30-06-13-59-A61E34742158854B
2023-09-30-06-26-12-8D2FC67E2A0AD99D

in folder s3://edu-ucsc-gi-platform-hca-prod-logs.us-east-1/s3/access/.data-browser/org-humancellatlas-data-portal-dcp2-prod/

Here is how the contents of the 1st such file would/should have been forwarded if the decoding error had not occurred:

{
    "bucket_owner": "37a70605c2e04b2d81c08054e4646a531acd4bff3e16d9f65b3cff2090399cdd",
    "bucket": "org-humancellatlas-data-portal-dcp2-prod",
    "remote_ip": "130.176.93.86",
    "requester": "-",
    "request_id": "21BDJ4WAGRPA7XN4",
    "operation": "REST.HEAD.OBJECT",
    "key": "%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F%253F.tar.gz",
    "request_uri": "HEAD /����.tar.gz HTTP/1.1",
    "http_status": "403",
    "error_code": "SignatureDoesNotMatch",
    "bytes_sent": "7350",
    "object_size": "-",
    "total_time": "7",
    "turn_around_time": "-",
    "referer": "-",
    "user_agent": "Amazon CloudFront",
    "version_id": "-",
    "host_id": "zHQ0T9q8qgZStZrzQQtCGKTKlHfjmHe0S3/ZLdjvovIQmMX3m9MXKC/F+G7F5mglKhNVK9efV2Qew+EVcZ2YJw==",
    "signature_version": "SigV4",
    "cipher_suite": "ECDHE-RSA-AES128-GCM-SHA256",
    "authentication_type": "AuthHeader",
    "host_header": "org-humancellatlas-data-portal-dcp2-prod.s3.us-east-1.amazonaws.com",
    "tls_version": "TLSv1.2",
    "access_point_arn": "-",
    "acl_required": "-",
    "time": "30/Sep/2023:05:54:35 +0000"
}

So, it appears that the S3 logs apply URL-encoding to the object key in the key field, but not in the request_uri field, resulting in the presence of non-ASCII characters.

nadove-ucsc commented 1 year ago

Switch to latin1 codec for both S3 access and ALB log forwarders.

bvizzier-ucsc commented 3 days ago

@hannes-ucsc: "De-prioritizing due to reduced prevalence. Probably because WAF is catching the requests that trigger this."