Add GCP Vertex search index

isaac-chung commented 4 months ago

Add GCP Vertex index support.

Key files

run_vertex_index.py: Example script to call retrieval using gcp index. Can also be used to trigger an index build.
retrieval/gcp_index.py: Stores the VertexIndex class. After instantiating the class, calling search() should load the endpoint if it exists or build from scratch if not.

Checklist

A GCP project in REGION_A
A GCS bucket in REGION_A
Update the attributes in the VertexIndex class (this can be updated to read from repo vars)

Setup FULL index and endpoint with deployment by manual trigger (Caution! will take a long time)

run gcloud auth application-default login. Follow the instructions in the terminal
Download corpus.json via wget https://huggingface.co/datasets/mteb/nq/resolve/main/corpus.jsonl at the root of the repo
in run_vertex_index.py, update MODEL_META_PATH = "model_meta.yml". This removes the limit on passages.
in the root of this repo, run run_vertex_index.py

Notes

Creating an index directly from the GCS bucket is quite fast. So is creating an endpoint. Deploying an endpoint takes the longest (llama-index quotes ~30min).
Any updates to an index would NOT require redeploying an endpoint. See here.

Limitations

Currently does not support incremental updating the index (though easy to implement). This index uses batch updates, which takes quite a long time to perform.
the VertexIndex class uses self.doc_map to get raw passage texts as the index does not support storing them as metadata.
tmp.json file that is written to the GCS bucket is ~6x the size of embeddings.0.pt on the same passages. The cause could be that 1) use of string to repr each embedding float and 2) added overhead like "id" and "embedding" (smaller)
```
In [1]: import sys
```

In [2]: sys.getsizeof("0.07250326871871948") Out[2]: 68

In [3]: sys.getsizeof(0.07250326871871948) Out[3]: 24

Muennighoff commented 4 months ago

This is amazing!! Do you know if it produces the same results as our local index implementation?

isaac-chung commented 4 months ago

@Muennighoff yep, they are! I've randomly sampled 10 queries from mteb/nq in 1ed393b and dumped the results from both indices into a yaml file.

e.g.

what is the name of the lymphatic vessels located in the small intestine:
  gcp: 'Title: Can''t Help Falling in Love

    Passage: European 2-track CD single'
  local: 'Title: Can''t Help Falling in Love

    Passage: European 2-track CD single'

Muennighoff commented 4 months ago

Amazing, should I follow the instructions in your setup message to create it? // I just send you the GCP keys?

isaac-chung commented 4 months ago

For me, there was already a project set up via the UI, and I created the GCS bucket via the UI as well. When the script was running, I was also checking the UI to confirm that the index/buckets/endpoint were created, and I found that very helpful.

If you're able to access the GCP UI, could you try the instructions first? I could try the keys if you run out of time.

orionw commented 4 months ago

This is awesome, thanks @isaac-chung!

scalability

I think the largest we're considering for now is Wikipedia. Do you think their setup won't scale to Wikipedia sized collections?

Setup FULL index and endpoint with deployment by manual trigger (Caution! will take a long time)

At some point do you think you could you dump these instructions into a indexing README so that others could create one later, if need be?

support incremental updating the index

I don't think we will need this, so no sweat.

isaac-chung commented 4 months ago

Do you think their setup won't scale to Wikipedia sized collections?

Rather, I've yet to find out how to use their setup work with Wikipedia sized collections. The Vertex Index returns the doc ID, and the source passage is read via an in-memory dict self.doc_map (also used in the local index, except it's sharded in the local version). That in-memory dict will eventually hit the limit of whatever machine that serves this app.

Muennighoff commented 4 months ago

The current Wikipedia texts are only ~1GB (https://huggingface.co/datasets/mteb/nq/tree/main) so it should be fine to have them in RAM no?

But agreed that it'd be better if not - Surprised there's no option to just have it return the actual text instead? Maybe if we make the doc ids the passage instead? (though a bit hacky) Or maybe we can shard the files & then read from disk on the fly?

isaac-chung commented 4 months ago

1GB is most likely fine. Then we can use this as is for the time being.

I think there's a limit to doc id lengths. Usually in other vector DBs we can stuff the passage in the metadata. Maybe we could try doing that in the "vector attributes"?

Or maybe we can shard the files & then read from disk on the fly?

[edit] This is actually a great idea!

isaac-chung commented 4 months ago

At some point do you think you could you dump these instructions into a indexing README so that others could create one later, if need be?

Yeah! For sure. Could include suggested steps to debug if things go sideways too, e.g.

gcloud ai operations describe OPERATION_ID

Muennighoff commented 4 months ago

Sounds great! I guess we still need to update this PR a bit or should we already start creating indices for the models https://github.com/embeddings-benchmark/arena/issues/6?

Another point is maybe we want to update the Wikipedia index? The one I took is from NQ, but maybe we should take the latest Wikipedia dump? also cc @orionw as you know better than me I think!

orionw commented 4 months ago

I like the idea of having a Wikipedia index that's newer for the Arena. On the other hand, most models/datasets are on older indexes like NQ.

This is what BEIR said about the NQ dataset:

We filtered out queries without an answer, or having a table as an answer, or with conflicting Wikipedia pages. We retain 2,681,468 passages as our corpus T and 3452 test queries Q from the original dataset.

Not sure what this means exactly, but IIRC Wikipedia usually has many more passages than 2.6 million (I know the DPR one with 100 word chunks has 21 million). I think BEIR subsampled it to be faster.

If we do want one that is newer, they have an easy format in JSON for indexing here: https://dumps.wikimedia.org/other/cirrussearch/ I can download it and chunk it if we want a newer one. The only question would be what size chunks? 100 words feels small these days but if we probably don't want to go too long? And do we want a Wikipedia index with ~15 million passages?

Muennighoff commented 4 months ago

Are we still limited by RAM @isaac-chung? If not, then I think we should go with the latest Wikipedia dump?

RE: Chunk size: For the NQ one we currently use, it seems to split on newlines/passages? Maybe there's also a word limit, but I think always splitting after the end of a sentence / passage is great such that users don't vote for one result just because it finishes in a cleaner way. Below an image of the title/passage and how it looks like on Wikipedia (page: https://en.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki).

I think it would be great to also have the subtitles (i.e. Nagasaki > Bombing of Nagasaki) not only the title of the entire page (Atomic bombings of Hiroshima and Nagasaki), but otherwise this format seems good to me, what do you think?

isaac-chung commented 4 months ago

We should be fine with the current implementation if the space can handle 1.3GB * 15M/2.6M = 7.5GB for loading all of the passages/chunks from the latest wikipedia dump.

[Maybe for the next iteration/PR] Sharding the passages and reading on the fly after pinging the gcp index (? hope I didn't misunderstand) could present a different bottleneck as reading takes quite a long time. I tried reading a 500MB json file, and on average over 5 runs it took 5.7s. Something we could try later:

try to see what attributes we can fit the passage string into within the MatchNeighbor class. If that fails,
see if another GCP solution can be helpful, e.g. GBQ. Hoping the latency won't longer than loading a json file.

Muennighoff commented 4 months ago

Okay let's see how much it will be @orionw and then decide if we need to change the implementation? I will set up the index on GCP once the dump is ready and then we can test it live.

orionw commented 4 months ago

We should be fine with the current implementation if the space can handle 1.3GB * 15M/2.6M = 7.5GB for loading all of the passages/chunks from the latest wikipedia dump.

I am a little confused, where are these numbers coming from? @isaac-chung

RE: Chunk size: For the NQ one we currently use, it seems to split on newlines/passages?

Yes, we can definitely split on newlines and keep the title in every instance. Getting the subtitles is actually much harder, since we'd have to get the MediaWiki version of Wikipedia and parse it in a way that preserves the hierarchy. Someone in my lab is doing that for MegaWika v2, so should be done sometime this summer-ish if we want to update it later.

Maybe there's also a word limit

Are we not so worried about this? I suspect some Wikipedia paragraphs will be much longer than 512 tokens. However, there may only be one or two models that have that limit these days (E5/BGE type models).

If so, I will write some chunking code that greedily takes at least one paragraph or N paragraphs up to a max word limit. Perhaps 500 ish words, so ~800 tokens?

Muennighoff commented 4 months ago

so should be done sometime this summer-ish if we want to update it later.

Cool; def not high prior but would be a nice-to-have later!

If so, I will write some chunking code that greedily takes at least one paragraph or N paragraphs up to a max word limit. Perhaps 500 ish words, so ~800 tokens?

Sounds good to me; amazing 🚀

Let's get ready for launching the arena soooon!! 🙌🚀

isaac-chung commented 4 months ago

And do we want a Wikipedia index with ~15 million passages?

@orionw the NQ wiki corpus had 2.6M passages at 1.3GB, and you mentioned the newest wiki had 15M passages. This is assuming the passages are chunked the same way but that might not be true.

orionw commented 4 months ago

wiki_extracted.json

How does this size seem (a sample of 100)? They are chunked by paragraphs and can contain multiple paragraphs if they are less than 500 words. I don't include any overlap since we're chunking by paragraph.

I wasn't able to use the Cirrus text, which is a shame. They are meant for ElasticSearch and don't have newlines, but the text doesn't have the scraping problems that other Wikipedia extractors have. We could use it and split on sentences, but that isn't ideal either.

the NQ wiki corpus had 2.6M passages at 1.3GB, and you mentioned the newest wiki had 15M passages. This is assuming the passages are chunked the same way but that might not be true.

Thanks for explaining @isaac-chung! Makes sense.

orionw commented 4 months ago

If we like that sample, the remainder of the docs are uploaded here: https://huggingface.co/datasets/orionweller/wikipedia-2024-06-24-docs

Roughly 5GB and only 2.67M passages (although much longer than DPR style ones)

isaac-chung commented 4 months ago

Any idea if the footer/reference are cleaned for NQ? e.g.

References\nExternal links\nPaul Davids' official website\nPaul Davids UFO (link broken \u2013 July 8, 2010)\nReview of The Sci-Fi Boys documentary at The Thunder Child\nAn exchange with Cecil Adams, of \"The Straight Dope\", on the Roswell Incident\nCategory:20th-century American novelists\nCategory:20th-century American male writers\nCategory:American male novelists\nCategory:American science fiction writers\nCategory:Living people\nCategory:Place of birth missing (living people)\nCategory:Year of birth missing (living people)
...
...
Category:Triple J announcers\nCategory:Australian women radio presenters\nCategory:Year of birth missing (living people)\nCategory:Living people"

orionw commented 4 months ago

Looking at the NQ HuggingFace dataset viewer it has a lot of one liner docs and some other quirks, but no references.

The Cirrus file has no references already, but again -- no newlines.

Good catch though @isaac-chung! There's no good functionality that says "this is a references section" from Wikipedia. I can take every header and if it has the word "references" or "reference" or "notes" skip it? This will probably remove some valid sections but at least it would remove all references.

Wikipedia is always deceptively hard to parse, which is why I was hoping the Cirrus file would work since Wikimedia has already parsed it.

orionw commented 4 months ago

Okay new sample here: wiki_extracted.json

It's unfortunately very "whack-a-mole" style. I remove all the errors I could find, but it's probably when I generate it all that there will be some. And perhaps we can remove most of those. I'll generate the full set and upload it to HF when it's done. I also made a PR about the extraction code.

If we don't want to deal with this, I suggest we take someone's existing Wikipedia (like NQs) which has parsing errors also but at least it's not our parsing errors.

isaac-chung commented 4 months ago

Thanks @orionw ! This looks good to me. If @Muennighoff agrees, then I can update the code to read from your HF dataset instead of a local file.

orionw commented 4 months ago

I've gone through a couple rounds of creating the full version and grep'd for different types of errors in the full set of docs. I was able to fix a few more in each attempt. I think this last set is looking like it covers all the major problems, so I'll push that when it finishes tonight.

The dataset has the most recent previous version if you want to look at the viewer, the new fix should be the same except for fixing ending "}}" phrases which are relatively rare.

Muennighoff commented 4 months ago

Sounds good to me! I would have thought that the creation script for the NQ Wikipedia dump would be available somewhere but also couldn't find it :/

isaac-chung commented 4 months ago

Thanks @orionw ! This looks good to me. If @Muennighoff agrees, then I can update the code to read from your HF dataset instead of a local file.

Updated the PR to read from the HF dataset.

Muennighoff commented 4 months ago

We now have three corpora thanks to @orionw : Wikipedia, StackExchange, arXiv (see https://github.com/embeddings-benchmark/arena/issues/5)

Will the in-memory approach from this PR still work then or do we need to adapt it @isaac-chung ? I can also just go ahead create the indices and we can try if you want? Are the steps for that still up-to-date?

isaac-chung commented 4 months ago

Great! ~Wikipedia and arxiv should be fine but StackExchange at 35GB might be challenging.~ It should be fine with a big enough machine (fine on an A10).

Steps are still up-to-date, though it currently defaults to the wikipedia repo. I'll add a param to specify the corpus at the class __init__ level. That way, we can specify which corpus to load when creating/loading the index:

gcp_index = VertexIndex(
+   corpus="wikipedia",
    dim=dim, 
    model_name=model_name, 
    model=model, 
    limit=limit,
)

Muennighoff commented 4 months ago

Do you mean if we get an A10G large?

isaac-chung commented 4 months ago

Yeah something like that. I've got a small running. Also

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          196Gi       7.0Gi        94Gi       5.0Mi        94Gi       187Gi
Swap:            0B          0B          0B

Muennighoff commented 4 months ago

Okay sounds good; I will go through the instructions and create the indices!

Muennighoff commented 3 months ago

Merging this preliminarily so I can directly debug with the arena space

embeddings-benchmark / arena