aws-solutions / aws-centralized-logging

Apache License 2.0
254 stars 104 forks source link

Ingestion will fail for VPC flowlogs #38

Closed t-jones closed 2 years ago

t-jones commented 2 years ago

Describe the bug

The account ID field in VPC flowlogs can sometime contain the word "unknown" when the log entry comes from a service with an ENI in your VPC. This causes ingestion for that line to fail due to a typing error from opensearch.

To Reproduce

  1. Deploy the solution.
  2. Deploy the demo.
  3. Check the destination error logs in kinesis data firehose.

Expected behavior

Any VPC flowlog can be ingested.

Please complete the following information about the solution:

Screenshots

{"type":"mapper_parsing_exception","reason":"failed to parse field [account_id] of type [long] in document with id \u002749623321316493998612182857965780928591523098155414454274.26\u0027. Preview of field\u0027s value: \u0027unknown\u0027",
   "caused_by": {
      "type":"illegal_argument_exception","reason":"For input string: \"unknown\""
   }
}

Subscription filter:

[version, account_id, interface_id, srcaddr != "-", dstaddr != "-", srcport != "-", dstport != "-", protocol, packets, bytes, start, end, action, log_status]

Additional context

I believe the reason this is happening is b/c the code in the transformer is setting the type of the accout_id field as a long for ingestion into OS/ES. Then, after the type has been set as long in the index, a string comes along and ingestion fails. Along with this problem, an AWS account ID should be ingested as a string anyway so it is available as a keyword in term queries (you'd never do a range query on account IDs in kibana as they are identifiers, not values). This would fix the issue in kibana as well where account IDs are rendered with commas as numbers.

Not sure the best way to fix this. Add an ENV var for the transformer lambda to whitelist field names which should not be typed as numerics regardless of their value? This would cause the lambda to get more complex as it would have to look as field names and not just values.

If you give me a suggestion, I'll send you a PR.

gsingh04 commented 2 years ago

@t-jones thanks for reporting the issue. Yes, you are correct, it certainly makes senses to ingest account_id as a string. I believe we can update the following transformer code where the log event is getting transformed so source["account_id"] is indexed as a string. https://github.com/aws-solutions/aws-centralized-logging/blob/a44dcfe5ac8f32e991e85a637a02dcabd7a68b80/source/services/transformer/index.ts#L72

if ("account_id" in source) source["account_id"] = "" + source["account_id"];

Please feel free to open the PR.

gsingh04 commented 2 years ago

This has been addressed with release v4.0.1