falcosecurity / falcosidekick

Connect Falco to your ecosystem
Apache License 2.0

AWS Security Lake Parquet File Schema Format Issues upon AWS Opensearch Ingestion & AWS Athena Querying #728

Open m00lav opened 9 months ago

m00lav commented 9 months ago

Background:

We are using AWS Security Lake to ingest various log sources into OCSF, make this data queryable via AWS Athena, and ingest it into AWS OpenSearch. We are attempting to ingest Falco data by following this article: falcosidekick integration documentation.

Describe the bug:

After following the instructions in the article linked above, we receive Falco data in our Security Lake S3 bucket, and this data is queryable via S3 Select. However, querying the Lake Formation table generated by Security Lake via Athena returns the generic error "Unable to Read Parquet File".

Additionally, we use the AWS OpenSearch Ingestion pipeline with the Security Lake S3 Parquet OCSF pipeline template. Native Security Lake sources are ingested without error, but ingestion fails for Falco data. The error from the OpenSearch Ingestion pipeline (via CloudWatch) is as follows:

java.lang.UnsupportedOperationException: REPEATED not supported outside LIST or MAP. Type: repeated binary types (STRING) = 0

AWS support was contacted regarding this error. The following was their response:

"REPEATED" is a keyword in protobuf. It seems the files are being written from protobufs and the generated schema is not supported by the Avro parquet library used by OS ingestion. The source of this error is https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L303

There's some useful information in this stackoverflow post:  
https://stackoverflow.com/questions/72634350/parquetprotowriters-creates-an-unreadable-parquet-file
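To make the error message concrete, here is a minimal Go sketch of the kind of rule it describes: a field with repetition REPEATED is only accepted when it sits directly inside a group annotated as LIST or MAP. The Node type and the bareRepeated, flatSchema, and listSchema helpers are hypothetical names made up for this illustration. This is a simplified model, not the parquet-mr code, and the field names finding and types are only assumed to resemble the Falco output.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a Parquet schema node.
type Node struct {
	Name       string
	Repetition string // "required", "optional", or "repeated"
	Annotation string // "LIST", "MAP", or "" when the node is not annotated
	Children   []*Node
}

// bareRepeated returns the dotted paths of repeated fields that are not the
// direct child of a LIST- or MAP-annotated group.
func bareRepeated(n *Node, path string, parentWraps bool) []string {
	var out []string
	if n.Repetition == "repeated" && !parentWraps {
		out = append(out, path)
	}
	wraps := n.Annotation == "LIST" || n.Annotation == "MAP"
	for _, c := range n.Children {
		childPath := c.Name
		if path != "" {
			childPath = path + "." + c.Name
		}
		out = append(out, bareRepeated(c, childPath, wraps)...)
	}
	return out
}

// flatSchema models a struct field written as a bare repeated binary.
func flatSchema() *Node {
	types := &Node{Name: "types", Repetition: "repeated"}
	finding := &Node{Name: "finding", Repetition: "required", Children: []*Node{types}}
	return &Node{Repetition: "required", Children: []*Node{finding}}
}

// listSchema models the same field using the three-level LIST layout
// described in the Parquet LogicalTypes document.
func listSchema() *Node {
	element := &Node{Name: "element", Repetition: "required"}
	list := &Node{Name: "list", Repetition: "repeated", Children: []*Node{element}}
	types := &Node{Name: "types", Repetition: "optional", Annotation: "LIST", Children: []*Node{list}}
	finding := &Node{Name: "finding", Repetition: "required", Children: []*Node{types}}
	return &Node{Repetition: "required", Children: []*Node{finding}}
}

func main() {
	fmt.Println("bare repeated:", bareRepeated(flatSchema(), "", false))
	fmt.Println("LIST-wrapped: ", bareRepeated(listSchema(), "", false))
}
```

In this model the flat layout is flagged at finding.types, while the LIST-wrapped layout passes, which matches the wording of the exception above.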

How to reproduce it:

Expected behaviour:

Environment:

Falco version

0.36.1 (x86_64) - from docker.io/falcosecurity/falco-no-driver:0.36.1

System info

{
  "machine": "x86_64",
  "nodename": "falco-6sck4",
  "release": "5.10.197-186.748.amzn2.x86_64",
  "sysname": "Linux",
  "version": "#1 SMP Tue Oct 10 00:30:07 UTC 2023"
}

Cloud provider or hardware configuration

AWS EKS - managed nodegroups

OS

FALCO CONTAINER:
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Kernel:

Linux falco-6sck4 5.10.197-186.748.amzn2.x86_64 #1 SMP Tue Oct 10 00:30:07 UTC 2023 x86_64 GNU/Linux

Installation method:

Kubernetes

Additional context:

N/A

Issif commented 9 months ago

Thanks for this report, I'll work on it asap.

asuresh8 commented 9 months ago

Note that I was the one who suggested this was an issue with converting from proto to parquet. After going through the parquet library this repo uses to generate the files, it looks like REPEATED is a valid keyword in parquet. The issue is that REPEATED is used incorrectly. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md for a detailed description of how REPEATED should be used. I see an issue in these places:

If this field is repeated then OCSFSecurityFinding needs to be in a list or a map. I'm not sure if the top level of a parquet file counts as a list.

If types is repeated then OCSFFIndingDetails needs to be in a list or a map. It is not.

If tags is repeated then OCSFFIndingDetails needs to be in a list or a map. It is not.

See this tip in the parquet-go library.
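A rough sketch of what the change could look like at the struct-tag level, assuming the files are written with parquet-go struct tags. The two struct types and both helper functions are hypothetical and are not the repo's actual definitions. The tag strings follow the forms I recall from the parquet-go examples and are an assumption; the exact LIST tag syntax has differed between library versions, so check it against the version in use.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// FindingBare is a hypothetical struct whose slice is tagged as a bare
// REPEATED field, the shape described in the comment above.
type FindingBare struct {
	Types []string `parquet:"name=types, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REPEATED"`
}

// FindingList tags the same slice as a LIST instead.
// Assumption: this tag form matches the parquet-go version in use.
type FindingList struct {
	Types []string `parquet:"name=types, type=MAP, convertedtype=LIST, valuetype=BYTE_ARRAY, valueconvertedtype=UTF8"`
}

// parquetTag returns the parquet struct tag of the named field,
// or an empty string if the field does not exist.
func parquetTag(v interface{}, field string) string {
	f, ok := reflect.TypeOf(v).FieldByName(field)
	if !ok {
		return ""
	}
	return f.Tag.Get("parquet")
}

// usesBareRepeated reports whether a tag asks for a bare REPEATED field.
func usesBareRepeated(tag string) bool {
	return strings.Contains(tag, "repetitiontype=REPEATED")
}

func main() {
	for _, v := range []interface{}{FindingBare{}, FindingList{}} {
		tag := parquetTag(v, "Types")
		fmt.Printf("%T bare repeated: %v\n", v, usesBareRepeated(tag))
	}
}
```

Both fields are plain Go slices, so the difference only shows up in the tag and in the schema written to the file.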

poiana commented 6 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

m00lav commented 6 months ago

/remove-lifecycle stale

poiana commented 1 week ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale