Ingest separated a2o embeddings

geohacker commented 1 year ago

@sdenton4 has uploaded separated a2o embeddings to GCS (gs://a20_dropbox/one_percent_sep_embeddings). These are pretty much same as the earlier embeddings but they now have 5 more channels (and 5x increase in volume). The data is generated by running through the audio separation model (part of chirp). We want to ingest a small sample and see the search results we get.

@leothomas could you outline next steps to ingest this? I think it may be a good moment to invest some time into building a more efficient ingestion setup that doesn't rely on notebooks that you have to run locally. If that stays within a reasonable amount of time, we should have budget to do it.

@leothomas I'm also wondering if we should look into taking a dump of the current index so in case this doesn't workout we can swap relatively quickly. Let me know if you want me to look into that.

cc @oliverroick @willemarcel

leothomas commented 1 year ago

Sorry for the delay.

I've added 3 scripts + a requirements file to an ingest folder in this new branch.

The idea is that these scripts should be runable from any compute instance with python>=3.8 and a kubectl connection to the milvus instance.

The only thing the scripts are missing, is the ability to read/write to and from GCP, in order to discover data files and write the processed embeddings and metadata files. I've marked all the places where a GCP storage operation is required with a # TODO: comment.

The first part of the process (01_process_embeddings.py) is highly parallelise-able and could be run in the equivalent of a lambda function, with small modification to have each run only pick up a single file, rather than list the contents of the directory and process all the files.

I'm happy to pair program with @sunu or @geohacker to get this up and running.

Important note:

There are 2 deviations between this dataset and then original (un-separated) data sample:

The filename no longer contains the site_id (it does still contain the site name). Eg: 20200229T060000+0800_Uunguu-Indigenous-Protected-Area-Wunambal-Gaambera-Wet-B_194100.flac (note the absence of site_XXXX/ at the beginning of the filepath.
The metadata for each record now contains a channel index, since each 5 second clip now contains an embedding for each of the 5 audio channels (I think it might be 4 separated channels + 1 original audio channel, but unsure how to figure out which is which within the embedding. It might be interested to add that as a metadata field, if we can figure out how to identify them.

leothomas commented 1 year ago

As for backing up the milvus instance, I think that's a good idea, however, I don't think it's possible to update a collection in place with the new metadata fields, so I think we will have to drop and re-create the collection anyways. (Still, I think it would be quicker to restore from a backup, if we do want to, rather than re-ingest the originally processed data)

geohacker commented 1 year ago

@leothomas this is looking great to me, thank you! I'd suggest we wrap this in a Dockerfile with the milvus connection and gcs url as environment. Is that something you'd be able to get to?

I'm inclined to then put this up as a k8s job because we have the infra already and that would be really easy to setup.

sunu commented 1 year ago

Once we have a Dockerfile, I can help wrap that up into a k8s job

sdenton4 commented 1 year ago

BTW, I'm regenerating the embeddings with the site id included, and will send a ping once I've got them in GCP.

The raw-audio embedding in on channel 0, and separated data is on channels 1-4.

sunu commented 1 year ago

@sdenton4 Can you please provide us with a service-account key that we can use to give the ingestion job access to the embeddings bucket?

sdenton4 commented 1 year ago

Hi, Sunu - I shared a service account with Sajjad on slack last week - feel free to ping me if you haven"t got it yet or something else is needed.

sunu commented 1 year ago

Thanks @sdenton4! Sajjad has shared the service account with me already

sunu commented 1 year ago

@sdenton4 During the ingestion of separated embeddings, we found some services required extra storage and memory due to the increased volume. To stay within cost limit, we ingested around 25% of the embeddings. There was distinct peak resource usage between data ingestion and normal search in Milvus services. To accommodate that, we adjusted resource allocations based on the higher demands during ingestion. Also, Milvus query nodes now require ~16GB of memory to load the index containing 25% of the embeddings. So we have scaled them down to 1 replica only. cc @geohacker

sdenton4 commented 1 year ago

Thanks for the update! If the separated embeddings are what's live now, it seems like the search quality has greatly decreased.

Was the index recomputed for the new embeddings set or is it using the same index as before?

leothomas commented 1 year ago

We've used the same index definition (PCA dimensionality reduction from 1280 --> 256 dimensions, Inverted File Index (IVF) with 4096 centroids), however the actual PCA matrix used the reduce the embeddings and IVF region clustering was recomputed for the new data. (ie: same ANN algorithm, trained on the new data).

With the increased number of vectors, 2 things we should try tweaking:

increasing the number of IVF regions we're clustering the data to. There are no official guidelines for this, but a common recommendation I've seen is between sqrt(N) and 16*sqrt(N) where N is the number of vectors. This would put us between ~7k regions and ~125k regions, knowing that the max number of regions supported by Milvus is ~65k. I think we can start by bumping up ~16k to try
increasing the nprobe parameters when searching. When searching against an index, Milvus will only search the nprobe closest regions to the input vector. More regions search == higher recall, but also slower search performance.

Lastly, I can re-run the index evaluation notebook with some of the embeddings from the separated audio channels to see if we can get a better memory vs. recall vs. search time tradeoff

geohacker commented 1 year ago

@sdenton4 the index is fully recomputed during every ingestion job at the moment. I'm noticing that the search quality has dropped too. One key thing I notice when using a search result in the search (for example: https://ecoecho.ds.io/search/index.html?q=https://api.bioacoustics.ds.io/api/v1/a2o/audio_recordings/download/flac/115841?start_offset%3D35%26end_offset%3D40) is that we seem to get mostly the background channel as search results.

sdenton4 commented 1 year ago

Here's what I got from a brute-force search over the separated embeddings using the non-separated audio from the 'I'm Feeling Lucky' sample.

source file: 20200728T120000+1000_Eungella-Dry-B_501900.flac
offset:        0.15
distance:      1.76

In the brute-force search, the distances vary between ~2 and ~4 over the entire set of results. (In the EcoEcho results, the minimum distance is ~4.6, so seems like something's gone spicy.)

The brute-force search is also successfully surfacing some interesting (mixture / low SNR) results in the top-100:

source file: 20200908T160000+1000_Scottsdale-Wet-B_663100.flac
offset:        0.11
distance: 2.25

source file: 20200623T140000+1000_Little-Llangothlin-Reserve-Warra-National-Park-Dry-B_19000.flac
offset:        0.10
distance:      2.25

sdenton4 commented 1 year ago

I looked over the code a bit and had two thoughts:

a) Seeing lots of random noise could happen if the data alignment is off - eg, the wrong metadata is getting matched with the wrong vectors. I looked pretty closely at the code and don't /think/ this is the case, but it might not hurt to explicitly check that things are lining up as expected.

b) A more nebulous fear: All of the extra background audio in isolated channels could be skewing the PCA computation, by getting it to focus more on the (suddenly much more prevalent) background noise. One way to hotfix this might be training the PCA matrix on just the combined audio (ie, either re-using the old PCA matrix, or training only on channel 0 with the new embeddings.)

c) The difference in distance distributions does stand out as a potentially important clue. I seem to remember lower distances with the single-channel embeddings, as well.

sdenton4 commented 1 year ago

ooooone more thought: All of the top results for the 'I'm Feeling Lucky' query contain a transient in the first second of audio:

This suggests we are getting actually-similar audio in the results. Hypothesis (a) (poor data alignment) seems unlikely to produce such consistent wrong results, so we can discount it.

geohacker commented 1 year ago

@sdenton4 thank you for digging into this a bit. My fear is also about the background noise channel getting boosted during PCA. i think @leothomas is going to try using the PCA based on the original audio instead of separated and see what that gives us.

developmentseed / bioacoustics-api

Ingest separated a2o embeddings #36

Important note: