geohacker opened this issue 1 year ago
Thanks for getting this ticket started! There are two estimates I'd like to put together/clarify: the overall dataset size, and the index parameters:

- `m`, the number of subvectors to quantize the original vector into
- `nlist`, the number of centroids to cluster the subvectors
- `nbits`, the number of bits used to represent the centroids

I'll open an issue with a more in-depth discussion of this in the Milvus repo.
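For intuition, here's a back-of-the-envelope sketch of how these parameters drive the per-vector footprint of a product-quantized (IVF_PQ) index. The parameter values below are hypothetical placeholders, not recommendations:

```python
# Hypothetical IVF_PQ parameters -- placeholders, not recommendations.
d = 1280      # embedding dimensionality
m = 64        # number of subvectors (must divide d)
nbits = 8     # bits per subvector code -> 2**nbits centroids per codebook

raw_bytes = d * 4                               # float32 vector: 5120 bytes
pq_code_bytes = m * nbits // 8                  # PQ-compressed code: 64 bytes
codebook_bytes = m * (2**nbits) * (d // m) * 4  # one-off codebook overhead

print(f"raw: {raw_bytes} B/vector, PQ: {pq_code_bytes} B/vector "
      f"(~{raw_bytes // pq_code_bytes}x smaller)")
```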
Assumptions:

- Audio (`.flac`) files are 2 hrs long. This puts the upper limit of embeddings per file at 1440 (the number of 5-second "windows" in 2 hrs). This seems corroborated by the number of embeddings per file in the `point_one_percent` sample:
```python
import pandas as pd

# `data` holds the embedding metadata records from the point_one_percent
# sample (one record per embedding, with `filename` and `offset` fields).
df = pd.DataFrame(data)

len(df)
# 872994

len(df["filename"].unique())
# 914

df.groupby("filename")[["offset"]].nunique().max()
# offset    1440

df.groupby("filename")[["offset"]].nunique().min()
# offset    12

df.groupby("filename")[["offset"]].nunique().mean()
# offset    926.745223
```
Assuming the 0.1% sample is representative of the overall dataset, scaling up by a factor of 1000 gives 872_994 embeddings × 1000 = 872_994_000 embeddings total.

At 1280 dimensions per embedding and 32-bit floats (4 bytes) per dimension, that makes: 872_994_000 embeddings × 1280 dimensions × 4 bytes = 4.4697293e+12 bytes ≈ 4.47 TB (1e12 bytes per TB).
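The same arithmetic as a quick script, for anyone who wants to tweak the assumptions:

```python
# Scale the 0.1% sample up to the full dataset and estimate raw size.
sample_embeddings = 872_994
total_embeddings = sample_embeddings * 1000     # 0.1% -> 100%
dims, bytes_per_value = 1280, 4                 # float32

total_bytes = total_embeddings * dims * bytes_per_value
print(f"{total_embeddings:_} embeddings -> {total_bytes / 1e12:.2f} TB")
# 872_994_000 embeddings -> 4.47 TB
```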
So my estimated dataset sizes would be:

- 4.47 TB (872_994_000 embeddings) for single-channel embeddings
- 22.35 TB (4_364_970_000 embeddings) for separated embeddings (4 channels + original audio)
@sdenton4 does that track with your estimates?
Looks good to me - I double checked the embeddings count and shapes to be sure. We can get a similar number by taking the TFRecord file size (4.2 GB) and multiplying by 1000.
(We're going to end up with lots of ways to subsample if need be, so let's keep rolling with the maximalist estimates for now, and see how the algorithmic changes help/hurt exactly.)
Based on @leothomas's recommendations, here's the Helm configuration for resource allocation generated by the Milvus sizing tool:

In total we require ~41 vCPUs, ~168 GB memory and ~280 GB SSD storage to deploy the resources recommended by the Milvus sizing tool. We are using 6 n1-standard-8 VMs, which come with 8 vCPUs and 30 GB memory each.
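As a quick sanity check that the node pool covers those totals (plain arithmetic, using GCP's published n1-standard-8 shape):

```python
# Does 6x n1-standard-8 cover the Milvus sizing-tool recommendation?
vms, vcpus_per_vm, mem_gb_per_vm = 6, 8, 30    # n1-standard-8 specs
need_vcpus, need_mem_gb = 41, 168              # sizing-tool totals

print(vms * vcpus_per_vm, ">=", need_vcpus)    # 48 >= 41
print(vms * mem_gb_per_vm, ">=", need_mem_gb)  # 180 >= 168
```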
I added embeddings of a one-percent slice of the data to the Cloud bucket:
a20_dropbox/one_percent_embeddings
Currently uploading the associated audio:
a20_dropbox/one_percent
Should be done well before this evening.
@sdenton4 our current Google Cloud estimate for the 0.1% slice is about $1800/month. @leothomas is working on a PCA dimensionality reduction that should reduce the overall compute footprint, but we think it will still be more than the initial estimates.
How should we handle this change? Is it ok to assume that we can bill the GCP charges outside the billable hours?
The estimate @sunu put together is here: google_cloud_pricing_calculator.pdf
Lemme check that I understand correctly - is it $1800/mo for working with 100% of the data, as estimated from the 0.1% slice? (As opposed to $1800/mo for the development work using the 0.1% slice, which would be super-yikes.)
(confirmed that upload of 1% audio has completed.)
@sdenton4 Yeah the $1800/mo is for 0.1% 😭 Once @leothomas makes progress on the PCA optimisation we'll figure out if that number comes down significantly.
We turned off replication on all the services that Milvus uses and switched from using Pulsar to Kafka. The changes bring down the estimated cost of our infrastructure to around $650/month.
Ha, OK - thanks! That's in-bounds for our development budget.
It might be helpful to understand what's fixed cost (pulsar/kafka coordinator nodes?), and what scales with traffic (query nodes?) vs what scales with increased data (data nodes?).
I'm adding @atruskie to this ticket so he can see some context on how we arrived at the Milvus resource assessment.
| Use | GCP VM Type | vCPU | Memory | Disk size | Number of VMs |
|---|---|---|---|---|---|
| Django API | n1-standard-2 | 2 | 7.5 GB | 30 GB | 1 |
| Milvus | n1-standard-8 | 8 | 30 GB | 30 GB | 2 |
Current Milvus resource allocation is defined here
I'd suggest starting with two VMs similar to the n1-standard-16 and then adding a third one depending on request load. It would be best to horizontally scale queryNodes (we currently run 3; for the full embedding set we'll probably need 4 to start). It may also be possible to get better results by running multiple dataNodes.
| Use | GCP VM Type | vCPU | Memory | Disk size | Number of VMs |
|---|---|---|---|---|---|
| Django API | - | 2 | 7.5 GB | 30 GB | 1 |
| Milvus | - | 16 | 60 GB | 150 GB | 2 -> 3 |
@sunu @leothomas what are your thoughts? As an aside, I'd try to bring down our current staging size further once the current development sprint is done.
@geohacker this is great - thank you.
Actually, a further question, from above:
> - 4.47 TB (872_994_000 embeddings) for single-channel embeddings
> - 22.35 TB (4_364_970_000 embeddings) for separated embeddings (4 channels + original audio)
Your estimate doesn't include bulk storage for the full embedding set? Is the 22.35 TB upper bound still accurate?
@atruskie I'm not entirely sure about the storage size of the raw embeddings. @sdenton4 @leothomas, are you able to advise here? The raw embeddings are stored on Google Cloud Storage, which I think will continue to be the case.
For the embedding memory footprint, those numbers are outdated. We have made lots of optimisations since then. So I think the memory footprint should be roughly what I outlined in the table above.
Yes, setting aside around 25TB bulk storage for separated embeddings sounds reasonable. I'm also happy to take a bug to look into cutting this down; we should be able to ID 'blank' channels reliably and skip adding them to the DB.
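Purely as an illustration of that idea (the real criterion for "blank" is still to be worked out), a naive energy check on each separated channel before ingesting its embeddings could look like:

```python
import numpy as np

def is_blank(channel_audio: np.ndarray, rms_threshold: float = 1e-4) -> bool:
    """Flag a separated channel as 'blank' if its RMS energy is negligible.

    The threshold is a made-up placeholder; a real check might instead use
    the embedding model's own outputs or a learned classifier.
    """
    rms = np.sqrt(np.mean(np.square(channel_audio)))
    return rms < rms_threshold
```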
The index itself should be much smaller, as Sajjad mentioned.
The index is indeed quite a bit smaller, since we're running the raw embeddings through a PCA dimensionality reduction step before adding them to the Milvus database. Stored vectors are reduced from 1280 to 256 dimensions (a 5x reduction). Additionally, we use a Milvus index that performs scalar quantization to 8 bits (from 32-bit floats) for each dimension of the ingested vectors. With both of these optimizations, the disk usage for the vectors in Milvus should be ~1 TB.
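To make the arithmetic explicit: ~4.36e9 vectors × 256 dimensions × 1 byte ≈ 1.1 TB, which lines up with the ~1 TB figure. For reference, a minimal sketch of creating such an index via pymilvus — the connection details, collection/field names, and `nlist` value are all placeholders, not our actual configuration:

```python
from pymilvus import Collection, connections

# Placeholder connection + collection of PCA-reduced (256-d) embeddings.
connections.connect(host="localhost", port="19530")
collection = Collection("a2o_embeddings")

# IVF_SQ8 scalar-quantizes each dimension from a 32-bit float to 8 bits,
# a ~4x reduction on top of the 5x from the PCA step (1280 -> 256 dims).
collection.create_index(
    field_name="embedding",            # placeholder field name
    index_params={
        "index_type": "IVF_SQ8",
        "metric_type": "L2",
        "params": {"nlist": 4096},     # illustrative value only
    },
)
```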
See this issue for an overview of the different indexing strategies and their tradeoffs in terms of memory footprint vs search recall vs search speed
For the MVP we will work with 1% of the overall audio samples from A2O as embeddings. We should figure out what resources Milvus needs, since that will determine how we provision and budget the deployment.
@leothomas I had a poke around the Milvus sizing tool https://milvus.io/tools/sizing/ — I really like that they give you a recommended Helm config (cc @sunu @batpad). I'm struggling to arrive at the 30 TB estimate you and @sdenton4 were discussing, so perhaps we should outline that as well. Thank you!