databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace

Module 1005 - API Custom Batching - Tests #90

Open GeekSheikh opened 3 years ago

GeekSheikh commented 3 years ago

API batching for Module 1005, Bronze_ClusterEventLogs. The reason this is such an important component is that the API is the ONLY place to get this information (which drives cost, cluster utilization, everything). Some customers generate hundreds of thousands of cluster event states per day. That is too much data to pull to the driver all at once, so I batch it. This isn't about API pagination (that's working); I built a custom batcher that queries the number of cluster events between x and y dates and then batches them at 500K: pull 500K events, write them to a table, repeat. This data expires after 60 days (so I'm told), and once the data expires in the API, it's not accessible programmatically anywhere else.
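For illustration, a minimal sketch of that batching loop, assuming hypothetical stand-ins (`countEvents`, `pullEvents`, `writeToTable`, `ClusterEvent`) rather than the actual Overwatch code:

```scala
import java.time.LocalDate

// Hypothetical stand-ins for the real API/IO calls -- illustration only.
case class ClusterEvent(clusterId: String, timestamp: Long, eventType: String)
def countEvents(ids: Seq[String], from: LocalDate, to: LocalDate): Long = ???
def pullEvents(ids: Seq[String], from: LocalDate, to: LocalDate,
               offset: Long, limit: Long): Seq[ClusterEvent] = ???
def writeToTable(batch: Seq[ClusterEvent]): Unit = ???

val batchSize = 500000L

def batchClusterEvents(ids: Seq[String], from: LocalDate, to: LocalDate): Unit = {
  // Size the work up front so the driver never holds more than one batch.
  val total = countEvents(ids, from, to)
  var pulled = 0L
  while (pulled < total) {
    val batch = pullEvents(ids, from, to, pulled, batchSize)
    if (batch.isEmpty) return   // defensive: stop if the API under-delivers
    writeToTable(batch)         // persist each batch before pulling the next
    pulled += batch.size
  }
}
```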

Historical loads and large customers will hit this upper bound, so it's critical to ensure the process is working. I envision several tests. I believe unit tests can simulate a small upper bound (set 20 as the upper bound to batch for testing) and verify that the data is pulled and written, and that the next batch is then pulled.
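A sketch of what that unit test could look like in ScalaTest; the batcher shape here (slice-by-offset over a fake event source) is an assumption for illustration, not Overwatch's actual implementation:

```scala
import org.scalatest.funsuite.AnyFunSuite
import scala.collection.mutable.ArrayBuffer

class BatcherSpec extends AnyFunSuite {
  test("batcher drains all events in bounded chunks") {
    val events = (1 to 45).toSeq   // 45 fake events
    val bound  = 20                // small upper bound for testing
    val written = ArrayBuffer.empty[Seq[Int]]

    var offset = 0
    while (offset < events.size) {
      val batch = events.slice(offset, offset + bound) // pull one bounded batch
      written += batch                                 // "write to table"
      offset += batch.size
    }

    assert(written.map(_.size) == ArrayBuffer(20, 20, 5)) // three bounded pulls
    assert(written.flatten.toSeq == events)               // nothing lost or duplicated
  }
}
```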

This test isn't as important as the higher-level tests you're working on now, since errors here would likely present themselves later anyway, but in some cases they may not. I wanted to put this on the radar as a required bronze test set.

Relevant code is below. The clusterEvents API call requires the clusterID, so a list of all clusterIDs with new events must be built and sent to the API calls as part of the API query. The section below is responsible for ensuring all clusters with new events are accounted for; this is a VERY CRITICAL component of Overwatch. https://github.com/databrickslabs/overwatch/blob/4e3daa9e7e54722f3b8692b0cd2e39703e41a4df/src/main/scala/com/databricks/labs/overwatch/pipeline/BronzeTransforms.scala#L297-L330

Batching Code: https://github.com/databrickslabs/overwatch/blob/4e3daa9e7e54722f3b8692b0cd2e39703e41a4df/src/main/scala/com/databricks/labs/overwatch/pipeline/BronzeTransforms.scala#L339-L382

GeekSheikh commented 3 years ago

Also, while you're deep in the weeds, if you have time, it'd be AWESOME to figure out why I'm having issues parallelizing the API calls. I don't want to serialize the calls and send them to the workers; that's unnecessary. The API concurrency limit is 32, so I'd like to parallelize this to 12 or so (and add configs later). When I try to parallelize it on the driver, I get errors (as I recall). It'd be great to understand what the issue is if you have time.
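For reference, one common way to get bounded driver-side parallelism in Scala 2.12 is a parallel collection with a dedicated ForkJoinPool capped at the desired concurrency; `callClusterEventsApi` is a hypothetical placeholder, and sharing a non-thread-safe HTTP client across the pool's threads is one plausible source of errors (an assumption, not a diagnosis):

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Hypothetical placeholder for the real per-cluster API call.
def callClusterEventsApi(clusterId: String): Unit = ???

val clusterIds: Seq[String] = Seq("cluster-a", "cluster-b") // example IDs only
val parIds = clusterIds.par                                 // Scala 2.12: .par is built in
parIds.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(12)) // cap concurrency at 12
parIds.foreach(callClusterEventsApi)                        // concurrent calls on the driver
```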

If you don't get time, please break this out into another issue and we'll take it later.

GeekSheikh commented 3 years ago

UPDATE: I found the issue that started this thought process.

The issue was that I made a change a while back to simplify the "cluster IDs with new states" logic, and my logic was wrong. I forgot to account for clusters that may have changed states but were not edited, started, restarted, or terminated. So while I get most of the cluster events, I miss cluster events for long-running clusters; a big miss for customers with large, long-running clusters used for streaming or shared interactive work. Oops.
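A minimal sketch of the corrected selection, assuming hypothetical DataFrame inputs and a `cluster_id` column (not Overwatch's actual schema): union the IDs from lifecycle audit events with the IDs of clusters observed running during the window, so long-running clusters that only changed state aren't dropped:

```scala
import org.apache.spark.sql.DataFrame

// Union lifecycle-event cluster IDs with clusters seen running in the window,
// so clusters that changed state without an edit/start/restart/terminate event
// are still queried for new events.
def clusterIdsWithNewStates(
    auditEvents: DataFrame,      // clusters edited/started/restarted/terminated
    runningSnapshots: DataFrame  // clusters observed running during the window
): DataFrame = {
  auditEvents.select("cluster_id")
    .union(runningSnapshots.select("cluster_id"))
    .distinct()
}
```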

GeekSheikh commented 2 years ago

@Sriram-databricks -- looks like most of this will be removed and so won't be needed. Please review and close this with the new APICall implementation for clusterEvents.