censoredplanet / censoredplanet-analysis

Analysis of the CensoredPlanet data.
Apache License 2.0

Scaling issue with backfilling data #220

ohnorobo closed 1 year ago

ohnorobo commented 1 year ago

We're currently having an issue where a full backfill (over all data from 2018-2023) runs a job that succeeds but ends up dropping 2/3 of the expected rows. The dropped rows are not spread out evenly: 2/3 of the server IPs are dropped entirely.

Here's an example of what the missing data looks like: image
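To tell "rows dropped evenly" apart from "whole server IPs dropped", one option is to compare per-IP row counts between a known-good run and the suspect run. This is only a rough sketch: `summarize_drops` is a hypothetical helper, not something from this repo, and the IPs below are made-up toy values from the documentation range.

```python
from collections import Counter
from typing import Dict, Iterable, List


def summarize_drops(expected_ips: Iterable[str],
                    written_ips: Iterable[str]) -> Dict[str, List[str]]:
    """Classify server IPs by how many of their rows survived.

    Each input has one entry per row (the row's server IP), so
    counting entries gives the number of rows per IP.
    """
    expected = Counter(expected_ips)
    written = Counter(written_ips)
    return {
        # IPs with rows in the good run but none in the suspect run.
        "fully_dropped": sorted(
            ip for ip in expected if written[ip] == 0),
        # IPs that lost some rows but not all of them.
        "partially_dropped": sorted(
            ip for ip in expected if 0 < written[ip] < expected[ip]),
        # IPs whose row count is unchanged (or higher).
        "intact": sorted(
            ip for ip in expected if written[ip] >= expected[ip]),
    }


# Toy data: three server IPs with two rows each in the good run.
expected_rows = ["192.0.2.1", "192.0.2.1",
                 "192.0.2.2", "192.0.2.2",
                 "192.0.2.3", "192.0.2.3"]
# In the suspect run only one IP's rows were written.
written_rows = ["192.0.2.1", "192.0.2.1"]

result = summarize_drops(expected_rows, written_rows)
print(result)
```

If the drops were spread evenly, most IPs would land in `partially_dropped`; the pattern described above would instead show up as a large `fully_dropped` list with the remaining IPs intact.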

Running over smaller amounts of data causes the jobs to write all the data correctly. In particular, writing the data out one year per job works correctly.
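As a rough illustration of the one-year-per-job workaround, the full 2018-2023 range can be split into per-year date ranges, each of which would then go to its own job. `yearly_ranges` is a hypothetical helper for this sketch, and actually launching the pipeline for each range is deliberately left out.

```python
import datetime
from typing import Iterator, Tuple

DateRange = Tuple[datetime.date, datetime.date]


def yearly_ranges(start: datetime.date,
                  end: datetime.date) -> Iterator[DateRange]:
    """Yield inclusive (first_day, last_day) ranges, one per calendar year."""
    current = start
    while current <= end:
        # Stop at Dec 31 of the current year, or at `end` if that is sooner.
        chunk_end = min(datetime.date(current.year, 12, 31), end)
        yield current, chunk_end
        current = chunk_end + datetime.timedelta(days=1)


ranges = list(yearly_ranges(datetime.date(2018, 1, 1),
                            datetime.date(2023, 12, 31)))

for first_day, last_day in ranges:
    # Each range would be handed to a separate backfill job (not shown).
    print(first_day.isoformat(), last_day.isoformat())
```

This gives six ranges for 2018 through 2023. A start or end date in the middle of a year just produces a shorter first or last range.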

Example that succeeded:

Example that dropped data:

One thing we're seeing in the jobs with issues is the scaling message:

Autoscaling: Unable to reach resize target in zone us-east1-c. QUOTA_EXCEEDED: Instance 'abc' creation failed: Quota 'IN_USE_ADDRESSES' exceeded. Limit: 575.0 in region us-east1.

We're also seeing the error:

Autoscaling: Unable to reach resize target in zone us-east1-c. ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: Instance 'abc' creation failed: The zone 'projects/censoredplanet-analysisv1/zones/us-east1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'.

The job is also not scaling to as many workers as it wants: image

ohnorobo commented 1 year ago

Here's the current shape of the production data: image

ohnorobo commented 1 year ago

I've gone through and backfilled the dev data. Here's the current state: image

ohnorobo commented 1 year ago

@agiix also updated prod to the same data as dev