geohacker opened this issue 1 year ago
Thanks for getting this ticket started! There are two estimates I'd like to put together/clarify: the overall dataset size, and the index parameters:

- `m`, the number of subvectors to quantize the original vector into
- `nlist`, the number of centroids to cluster the subvectors
- `nbits`, the number of bits used to represent the centroids

I'll open an issue with a more in-depth discussion of this in the Milvus repo.
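For intuition, here's a back-of-the-envelope sketch of how these parameters drive the per-vector footprint of a product-quantized (IVF_PQ) index. The parameter values below are hypothetical placeholders, not recommendations:

```python
# Hypothetical IVF_PQ parameters -- placeholders, not recommendations.
d = 1280      # embedding dimensionality
m = 64        # number of subvectors (must divide d)
nbits = 8     # bits per subvector code -> 2**nbits centroids per codebook

raw_bytes = d * 4                               # float32 vector: 5120 bytes
pq_code_bytes = m * nbits // 8                  # PQ-compressed code: 64 bytes
codebook_bytes = m * (2**nbits) * (d // m) * 4  # one-off codebook overhead

print(f"raw: {raw_bytes} B/vector, PQ: {pq_code_bytes} B/vector "
      f"(~{raw_bytes // pq_code_bytes}x smaller)")
```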
Assumptions:

- Audio (`.flac`) files are 2 hrs long. This puts the upper limit of embeddings per file at 1440 (the number of 5-second "windows" in 2 hrs). This seems corroborated by the number of embeddings per file in the `point_one_percent` sample:
```python
import pandas as pd

# `data` holds the embedding metadata records from the point_one_percent
# sample (one record per embedding, with `filename` and `offset` fields).
df = pd.DataFrame(data)

len(df)
# 872994

len(df["filename"].unique())
# 914

df.groupby("filename")[["offset"]].nunique().max()
# offset    1440

df.groupby("filename")[["offset"]].nunique().min()
# offset    12

df.groupby("filename")[["offset"]].nunique().mean()
# offset    926.745223
```
Assuming the 0.1% sample is representative of the overall dataset, scaling up by a factor of 1000 gives 872_994 embeddings × 1000 = 872_994_000 embeddings total.

At 1280 dimensions per embedding and 32-bit floats (4 bytes) per dimension, that makes: 872_994_000 embeddings × 1280 dimensions × 4 bytes = 4.4697293e+12 bytes ≈ 4.47 TB (1e12 bytes per TB).
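The same arithmetic as a quick script, for anyone who wants to tweak the assumptions:

```python
# Scale the 0.1% sample up to the full dataset and estimate raw size.
sample_embeddings = 872_994
total_embeddings = sample_embeddings * 1000     # 0.1% -> 100%
dims, bytes_per_value = 1280, 4                 # float32

total_bytes = total_embeddings * dims * bytes_per_value
print(f"{total_embeddings:_} embeddings -> {total_bytes / 1e12:.2f} TB")
# 872_994_000 embeddings -> 4.47 TB
```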
So my estimated dataset sizes would be:

- 4.47 TB (872_994_000 embeddings) for single-channel embeddings
- 22.35 TB (4_364_970_000 embeddings) for separated embeddings (4 channels + original audio)
@sdenton4 does that track with your estimates?
Looks good to me - I double checked the embeddings count and shapes to be sure. We can get a similar number by taking the TFRecord file size (4.2 GB) and multiplying by 1000.
(We're going to end up with lots of ways to subsample if need be, so let's keep rolling with the maximalist estimates for now, and see how the algorithmic changes help/hurt exactly.)
Based on @leothomas's recommendations, here's the Helm configuration for resource allocation generated by the Milvus sizing tool:

In total we require ~41 vCPUs, ~168 GB memory and ~280 GB SSD storage to deploy the resources recommended by the Milvus sizing tool. We are using 6 n1-standard-8 VMs, which come with 8 vCPUs and 30 GB memory each.
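As a quick sanity check that the node pool covers those totals (plain arithmetic, using GCP's published n1-standard-8 shape):

```python
# Does 6x n1-standard-8 cover the Milvus sizing-tool recommendation?
vms, vcpus_per_vm, mem_gb_per_vm = 6, 8, 30    # n1-standard-8 specs
need_vcpus, need_mem_gb = 41, 168              # sizing-tool totals

print(vms * vcpus_per_vm, ">=", need_vcpus)    # 48 >= 41
print(vms * mem_gb_per_vm, ">=", need_mem_gb)  # 180 >= 168
```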
I added embeddings of a one-percent slice of the data to the Cloud bucket:
a20_dropbox/one_percent_embeddings
Currently uploading the associated audio:
a20_dropbox/one_percent
Should be done well before this evening.
@sdenton4 our current Google Cloud estimate for the 0.1% slice is about $1800/month. @leothomas is working on a PCA dimensionality reduction that should reduce the overall compute footprint, but we think it will still be more than the initial estimates.
How should we handle this change? Is it ok to assume that we can bill the GCP charges outside the billable hours?
The estimate @sunu put together is here: google_cloud_pricing_calculator.pdf
Lemme check that I understand correctly - is it $1800/mo for working with 100% of the data, as estimated from the 0.1% slice? (As opposed to $1800/mo for the development work using the 0.1% slice, which would be super-yikes.)
(confirmed that upload of 1% audio has completed.)
@sdenton4 Yeah the $1800/mo is for 0.1% 😭 Once @leothomas makes progress on the PCA optimisation we'll figure out if that number comes down significantly.
We turned off replication on all the services that Milvus uses and switched from using Pulsar to Kafka. The changes bring down the estimated cost of our infrastructure to around $650/month.
Ha, OK - thanks! That's in-bounds for our development budget.
It might be helpful to understand what's fixed cost (pulsar/kafka coordinator nodes?), and what scales with traffic (query nodes?) vs what scales with increased data (data nodes?).
I'm adding @atruskie to this ticket so he can see some context on how we arrived at the Milvus resource assessment.
| Use | GCP VM Type | vCPU | Memory | Disk size | Number of VMs |
|---|---|---|---|---|---|
| Django API | n1-standard-2 | 2 | 7.5 GB | 30 GB | 1 |
| Milvus | n1-standard-8 | 8 | 30 GB | 30 GB | 2 |
Current Milvus resource allocation is defined here
I'd suggest starting with two VMs similar to the n1-standard-16 and then adding a third one depending on request load. It would be best to horizontally scale queryNodes (we currently run 3; for the full embedding set we'll probably need 4 to start). It may also be possible to get better results by running multiple dataNodes.
| Use | GCP VM Type | vCPU | Memory | Disk size | Number of VMs |
|---|---|---|---|---|---|
| Django API | - | 2 | 7.5 GB | 30 GB | 1 |
| Milvus | - | 16 | 60 GB | 150 GB | 2 -> 3 |
@sunu @leothomas what are your thoughts? As an aside, I'd try to bring down our current staging size further once the current development sprint is done.
@geohacker this is great - thank you.
Actually, a further question, from above:
> - 4.47 TB (872_994_000 embeddings) for single-channel embeddings
> - 22.35 TB (4_364_970_000 embeddings) for separated embeddings (4 channels + original audio)
Your estimate doesn't include bulk storage for the full embedding set? Is the 22.35 TB upper bound still accurate?
@atruskie I'm not entirely sure about the storage size of the raw embeddings. @sdenton4 @leothomas, are you able to advise here? The raw embeddings are stored on Google Cloud Storage, which I think will continue to be the case.
For the embedding memory footprint, those numbers are outdated. We have made lots of optimisations since then. So I think the memory footprint should be roughly what I outlined in the table above.
Yes, setting aside around 25TB bulk storage for separated embeddings sounds reasonable. I'm also happy to take a bug to look into cutting this down; we should be able to ID 'blank' channels reliably and skip adding them to the DB.
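Purely as an illustration of that idea (the real criterion for "blank" is still to be worked out), a naive energy check on each separated channel before ingesting its embeddings could look like:

```python
import numpy as np

def is_blank(channel_audio: np.ndarray, rms_threshold: float = 1e-4) -> bool:
    """Flag a separated channel as 'blank' if its RMS energy is negligible.

    The threshold is a made-up placeholder; a real check might instead use
    the embedding model's own outputs or a learned classifier.
    """
    rms = np.sqrt(np.mean(np.square(channel_audio)))
    return rms < rms_threshold
```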
The index itself should be much smaller, as Sajjad mentioned.
The index is indeed quite a bit smaller, since we're running the raw embeddings through a PCA dimensionality reduction step before adding them to the Milvus database. Stored vectors are reduced from 1280 to 256 dimensions (a 5x reduction). Additionally, we use a Milvus index that performs scalar quantization to 8 bits (from 32-bit floats) for each dimension of the ingested vectors. With both of these optimizations, the disk usage for the vectors in Milvus should be ~1 TB.
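To make the arithmetic explicit: ~4.36e9 vectors × 256 dimensions × 1 byte ≈ 1.1 TB, which lines up with the ~1 TB figure. For reference, a minimal sketch of creating such an index via pymilvus — the connection details, collection/field names, and `nlist` value are all placeholders, not our actual configuration:

```python
from pymilvus import Collection, connections

# Placeholder connection + collection of PCA-reduced (256-d) embeddings.
connections.connect(host="localhost", port="19530")
collection = Collection("a2o_embeddings")

# IVF_SQ8 scalar-quantizes each dimension from a 32-bit float to 8 bits,
# a ~4x reduction on top of the 5x from the PCA step (1280 -> 256 dims).
collection.create_index(
    field_name="embedding",            # placeholder field name
    index_params={
        "index_type": "IVF_SQ8",
        "metric_type": "L2",
        "params": {"nlist": 4096},     # illustrative value only
    },
)
```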
See this issue for an overview of the different indexing strategies and their tradeoffs in terms of memory footprint vs search recall vs search speed
For the MVP we will work with 1% of the overall audio samples from A2O as embeddings. We should figure out what resources Milvus needs, since that will determine how we provision and budget the deployment.
@leothomas I had a poke around the Milvus sizing tool https://milvus.io/tools/sizing/ — I really like that they give you a recommended Helm config (cc @sunu @batpad). I'm struggling to arrive at the 30 TB estimate you and @sdenton4 were discussing, so perhaps we should outline that as well. Thank you!