developmentseed / bioacoustics-api

Google Bioacoustics API that runs the backend for A2O Search
https://devseed.com/api-docs/?url=https://api.search.acousticobservatory.org/api/v1/openapi
MIT License

Milvus resource assessment #5

Open geohacker opened 1 year ago

geohacker commented 1 year ago

For the MVP we will work with 1% of the overall audio samples from A2O as embeddings. We should figure out what resources Milvus needs. This will determine:

  1. Cluster scaling and baseline resources
  2. Index strategy
  3. Partitioning and other configuration choices

@leothomas I had a poke around the Milvus Sizing tool https://milvus.io/tools/sizing/ and I really like that it gives you a recommended Helm config (cc @sunu @batpad). I'm struggling to arrive at the 30TB estimate you and @sdenton4 were discussing, so perhaps we should outline that as well. Thank you!

leothomas commented 1 year ago

Thanks for getting this ticket started! There are two estimates I'd like to put together/clarify:

Number of total vectors / memory required for the raw embeddings:

Assumptions:

import pandas as pd  # `data` is assumed to hold one record per embedding, with `filename` and `offset` fields
df = pd.DataFrame(data)

total number of embeddings:

len(df)

872994

number of unique files in the 0.1% sample:

len(df["filename"].unique())

914

max number of embeddings per file:

df.groupby("filename")[["offset"]].nunique().max()

offset 1440

min number of embeddings per file:

df.groupby("filename")[["offset"]].nunique().min()

offset 12

mean number of embeddings per file:

print(df.groupby("filename")[["offset"]].nunique().mean())

offset 926.745223
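The per-statistic calls above can be reproduced in one snippet; this is a minimal sketch assuming the same `df` with `filename` and `offset` columns:

per_file = df.groupby("filename")["offset"].nunique()  # embeddings per file
print("total embeddings:        ", len(df))
print("unique files:            ", df["filename"].nunique())
print("max embeddings per file: ", per_file.max())
print("min embeddings per file: ", per_file.min())
print("mean embeddings per file:", per_file.mean())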

Assuming the 0.1% sample is representative of the overall dataset, there should be roughly 872_994 embeddings * 1000 (scaling the 0.1% sample up to 100%) = 872_994_000 embeddings in total.

At 1280 dimensions per embedding and 32-bit floats (4 bytes) per dimension, that makes: 872_994_000 embeddings * 1280 dimensions per embedding * 4 bytes per 32-bit float = 4.4697293e+12 bytes ≈ 4.47 TB (1e12 bytes per TB).

So my estimated dataset sizes would be:

4.47 TB (872_994_000 vectors) for single-channel embeddings
22.35 TB (4_364_970_000 vectors) for separated embeddings (4 channels + original audio)
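(A quick check of the arithmetic behind both figures; the 5x factor, 4 separated channels plus the original, follows the breakdown above.)

sample_embeddings = 872_994                   # embeddings in the 0.1% sample
total_embeddings = sample_embeddings * 1000   # extrapolate 0.1% -> 100%
single_channel = total_embeddings * 1280 * 4  # 1280 dims, 4 bytes per float32
separated = single_channel * 5                # 4 separated channels + original
print(f"single channel: {single_channel / 1e12:.2f} TB")  # 4.47 TB
print(f"separated:      {separated / 1e12:.2f} TB")       # 22.35 TB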

@sdenton4 does that track with your estimates?

sdenton4 commented 1 year ago

Looks good to me - I double checked the embedding count and shapes to be sure. We can get a similar number by taking the TFRecord file size (4.2 GB) and multiplying by 1000.
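(The same cross-check in one line, assuming the 4.2 GB figure is the TFRecord size for the 0.1% sample:)

print(4.2e9 * 1000 / 1e12, "TB")  # 4.2 TB, in the same ballpark as the 4.47 TB estimate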

(We're going to end up with lots of ways to subsample if need be, so let's keep rolling with the maximalist estimates for now, and see how the algorithmic changes help/hurt exactly.)

sunu commented 1 year ago

Based on @leothomas's recommendations, here's the Helm configuration for resource allocation generated by the Milvus sizing tool: [screenshot of the sizing tool's recommended Helm config]

In total we require ~41 vCPUs, ~168 GB of memory, and ~280 GB of SSD storage to deploy the resources recommended by the Milvus sizing tool. We are using 6 n1-standard-8 VMs, which come with 8 vCPUs and 30 GB of memory each (48 vCPUs and 180 GB in total).
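(A quick check that the cluster covers the sizing tool's recommendation, using the node counts and sizes above:)

nodes, vcpu_per_node, mem_gb_per_node = 6, 8, 30   # 6x n1-standard-8
required_vcpu, required_mem_gb = 41, 168           # sizing tool recommendation
print(nodes * vcpu_per_node >= required_vcpu)      # True: 48 >= 41
print(nodes * mem_gb_per_node >= required_mem_gb)  # True: 180 >= 168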

sdenton4 commented 1 year ago

I added embeddings of a one-percent slice of the data to the Cloud bucket:
a20_dropbox/one_percent_embeddings

Currently uploading the associated audio:
a20_dropbox/one_percent

Should be done well before this evening.

geohacker commented 1 year ago

@sdenton4 our current Google Cloud estimate for the 0.1% sample is about $1800/month. @leothomas is working on a PCA dimensionality reduction and that should reduce the overall compute footprint, but we think it will still be more than the initial estimates.

How should we handle this change? Is it ok to assume that we can bill the GCP charges outside the billable hours?

The estimate @sunu put together is here: google_cloud_pricing_calculator.pdf

sdenton4 commented 1 year ago

Lemme check that I understand correctly - is it $1800/mo for working with 100% of the data, as estimated from the 0.1% slice? (As opposed to $1800/mo for the development work using the 0.1% slice, which would be super-yikes.)

sdenton4 commented 1 year ago

(confirmed that upload of 1% audio has completed.)

geohacker commented 1 year ago

@sdenton4 Yeah the $1800/mo is for 0.1% 😭 Once @leothomas makes progress on the PCA optimisation we'll figure out if that number comes down significantly.

sunu commented 1 year ago

We turned off replication on all the services that Milvus uses and switched from using Pulsar to Kafka. The changes bring down the estimated cost of our infrastructure to around $650/month.

sdenton4 commented 1 year ago

Ha, OK - thanks! That's in-bounds for our development budget.

It might be helpful to understand what's fixed cost (pulsar/kafka coordinator nodes?), and what scales with traffic (query nodes?) vs what scales with increased data (data nodes?).

geohacker commented 1 year ago

I'm adding @atruskie to this ticket so he can see some context on how we arrived at the Milvus resource assessment.

The current state of resources for the 1% unseparated sample is:

Use          GCP VM Type      vCPU   Memory   Disk size   Number of VMs
Django API   n1-standard-2    2      7.5 GB   30 GB       1
Milvus       n1-standard-8    8      30 GB    30 GB       2

Current Milvus resource allocation is defined here

For the full dataset (~100 million vectors):

I'd suggest starting with two VMs similar to the n1-standard-16 and then adding a third one depending on request volume. It would be best to horizontally scale the queryNodes (we currently run 3; for the full embedding set we'd probably need 4 to start). It may also be possible to get better results by running multiple dataNodes (a rough memory sketch follows the table below).

Use          GCP VM Type   vCPU   Memory   Disk size   Number of VMs
Django API   -             2      7.5 GB   30 GB       1
Milvus       -             16     60 GB    150 GB      2 -> 3
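(A rough sketch of where the memory suggestion comes from, assuming ~100 million vectors and the PCA + 8-bit quantization pipeline described further down the thread; index overhead and the other Milvus services come on top of this.)

n_vectors = 100_000_000
raw = n_vectors * 1280 * 4       # raw 1280-dim float32 embeddings: ~512 GB
pca = n_vectors * 256 * 4        # after PCA to 256 dims: ~102 GB
pca_sq8 = n_vectors * 256 * 1    # PCA + 8-bit scalar quantization: ~25.6 GB
for label, size in [("raw float32", raw), ("PCA-256", pca), ("PCA-256 + SQ8", pca_sq8)]:
    print(f"{label}: {size / 1e9:.1f} GB")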

@sunu @leothomas what are your thoughts? As an aside, I'd try to bring down our current staging size further once the current development sprint is done.

atruskie commented 1 year ago

@geohacker this is great - thank you.

atruskie commented 1 year ago

Actually, a further question, from above:

4.47 TB (872_994_000 vectors) for single-channel embeddings
22.35 TB (4_364_970_000 vectors) for separated embeddings (4 channels + original audio)

Your estimate doesn't include bulk storage for the full embedding set? Is the 22.35 TB upper bound still accurate?

geohacker commented 1 year ago

@atruskie I'm not entirely sure about the storage size of the raw embeddings. @sdenton4 @leothomas are you able to advise here? The raw embeddings are stored on Google Cloud Storage, which I think will continue to be the case.

For the embedding memory footprint, those numbers are outdated. We have made lots of optimisations since then, so I think the memory footprint should be roughly what I outlined in the table above.

sdenton4 commented 1 year ago

Yes, setting aside around 25TB bulk storage for separated embeddings sounds reasonable. I'm also happy to take a bug to look into cutting this down; we should be able to ID 'blank' channels reliably and skip adding them to the DB.

The index itself should be much smaller, as Sajjad mentioned.

leothomas commented 1 year ago

The index is indeed quite a bit smaller, since we're running the raw embeddings through a PCA dimensionality reduction step before adding them to the Milvus database. The stored vectors are reduced from 1280 to 256 dimensions (a 5x reduction). Additionally, we use a Milvus index that performs scalar quantization to 8 bits (from 32-bit floats) for each dimension of the ingested vectors. With both of these optimizations, the disk usage for the vectors in Milvus should be ~1 TB.
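(For reference, the arithmetic behind the ~1 TB figure, assuming all ~4.36 billion separated-channel vectors are ingested; the index structure itself adds some overhead on top.)

n_vectors = 4_364_970_000      # separated embeddings (4 channels + original)
dims, bytes_per_dim = 256, 1   # after PCA, with 8-bit scalar quantization
print(f"{n_vectors * dims * bytes_per_dim / 1e12:.2f} TB")  # ~1.12 TB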

See this issue for an overview of the different indexing strategies and their tradeoffs in terms of memory footprint vs search recall vs search speed.