Standardize BigQuery backend table formatting for multiple tables

jacks-reid commented 2 years ago

The queries for the BigQuery backend do not have a standardized format for dataset tables with multiple days of data. For example, BigQuery stores multiple days of Cloud Audit Logs in a dataset like the following:

[MY_PROJECT_ID].[MY_DATASET_ID].cloudaudit_googleapis_com_activity_YYYYMMDD
[MY_PROJECT_ID].[MY_DATASET_ID].cloudaudit_googleapis_com_activity_20220916
[MY_PROJECT_ID].[MY_DATASET_ID].cloudaudit_googleapis_com_activity_20220915
…

The queries provided in a majority of the samples only query on a single table, such as 3_01_logging_settings_modified.sql. I believe all the sample queries should accommodate both single day tables and multiple day formats to improve their ease of use. Some queries, like 4_01_unusually_high_api_usage_by_user_identity.sql, accommodate multiple tables already. However the _* suffix formatting used in 4_01 would not accommodate a single day table.

I propose that all queries should pull from a wildcard table to alleviate this issue like the following:

[MY_PROJECT_ID].[MY_DATASET_ID].cloudaudit_googleapis_com_activity*

To prevent the queries from running over massive amounts of data for tables spanning months of logs, we could also include the following delimiter to the queries themselves to allow users to granularly select the time slot for the queries to run.

WHERE 
  _TABLE_SUFFIX = FORMAT("_%s", FORMAT_DATE("%Y%m%d", CURRENT_DATE()))
  AND …

References: https://cloud.google.com/bigquery/docs/querying-wildcard-tables

rarsan commented 2 years ago

@jacks-reid - thanks for reporting this and sorry for the delay in responding.

Yes, BigQuery log sink can be configured to use either date-sharded tables (what you are referring to) or partitioned tables. See BigQuery routing schema.

CSA assumes the log export setup is made according to this Security and access analytics guide which sets up log sink with partitioned tables.

As detailed in the user doc and guide above, there are several reasons why partitioned tabled are preferred in the case of log analytics:

Simpler querying: no need to specify table date suffix to filter by. You can specify exact time interval (down to seconds) by using log timestamp field based on which the table is already partitioned.
Faster and cheaper querying: that automatic timestamp partitioning reduces the number of bytes read which increase speed and reduce costs for your queries.
More granularity for data retention control by setting a partition expiration

Hope this helps. I'll update the README to make it more clear that CSA queries assume partitioned tables.

Going forward, we do plan to templatize these queries and enable users to generate SQL replacing variable instances (e.g. MY_PROJECT_ID and MY_DATASET_ID to simplify usage). We could add a variable for optional TABLE_SUFFIX to address your use case. No planned date for that, but that might be an alternative for you if you still want to use date-sharded tables.

Note: CSA 4.01 SQL query has the wildcard _* as you noted, because it covers all audit logs including Admin Activity (_activity) and Data Access (_data_access)

Feel free to re-open if you got more questions or comments.

jacks-reid commented 2 years ago

Thanks for the detailed reply! That all makes sense.

GoogleCloudPlatform / security-analytics

Standardize BigQuery backend table formatting for multiple tables #10