This is amazing!! Do you know if it produces the same results as our local index implementation?
@Muennighoff yep, they are! I've randomly sampled 10 queries from mteb/nq
in 1ed393b and dumped the results from both indices into a yaml file.
e.g.
```yaml
what is the name of the lymphatic vessels located in the small intestine:
  gcp: 'Title: Can''t Help Falling in Love
    Passage: European 2-track CD single'
  local: 'Title: Can''t Help Falling in Love
    Passage: European 2-track CD single'
```
Amazing! Should I follow the instructions in your setup message to create it? Or should I just send you the GCP keys?
For me, there was already a project set up via the UI, and I created the GCS bucket via the UI as well. When the script was running, I was also checking the UI to confirm that the index/buckets/endpoint were created, and I found that very helpful.
If you're able to access the GCP UI, could you try the instructions first? I could try the keys if you run out of time.
This is awesome, thanks @isaac-chung!
> scalability
I think the largest we're considering for now is Wikipedia. Do you think their setup won't scale to Wikipedia sized collections?
> Setup FULL index and endpoint with deployment by manual trigger (Caution! will take a long time)
At some point, do you think you could dump these instructions into an indexing README so that others could create one later, if need be?
> support incremental updating the index
I don't think we will need this, so no sweat.
> Do you think their setup won't scale to Wikipedia sized collections?
Rather, I've yet to figure out how to make their setup work with Wikipedia-sized collections. The Vertex Index returns the doc ID, and the source passage is read via an in-memory dict `self.doc_map` (also used in the local index, except it's sharded in the local version). That in-memory dict will eventually hit the limit of whatever machine serves this app.
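Roughly, the retrieval path looks like this (a minimal sketch, not the actual `retrieval/gcp_index.py` code; the endpoint/ID names are placeholders):

```python
from google.cloud import aiplatform

# Placeholder resource names; the real values are configured in the VertexIndex class.
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/PROJECT/locations/REGION/indexEndpoints/ENDPOINT_ID"
)

def search(query_embedding, doc_map, k=10):
    # Vertex only returns neighbor IDs and distances...
    neighbors = endpoint.find_neighbors(
        deployed_index_id="DEPLOYED_INDEX_ID",
        queries=[query_embedding],
        num_neighbors=k,
    )
    # ...so the raw passage text has to come from the in-memory doc_map, keyed by doc ID.
    return [(n.id, n.distance, doc_map[n.id]) for n in neighbors[0]]
```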
The current Wikipedia texts are only ~1GB (https://huggingface.co/datasets/mteb/nq/tree/main) so it should be fine to have them in RAM no?
But agreed that it'd be better if not - Surprised there's no option to just have it return the actual text instead? Maybe if we make the doc ids the passage instead? (though a bit hacky) Or maybe we can shard the files & then read from disk on the fly?
1GB is most likely fine. Then we can use this as is for the time being.
I think there's a limit to doc id lengths. Usually in other vector DBs we can stuff the passage in the metadata. Maybe we could try doing that in the "vector attributes"?
> Or maybe we can shard the files & then read from disk on the fly?
[edit] This is actually a great idea!
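If we do go the sharded-on-disk route later, a minimal sketch of what the lookup could look like (shard count, layout, and file names are all hypothetical):

```python
import json
import zlib
from pathlib import Path

NUM_SHARDS = 64                   # hypothetical shard count
SHARD_DIR = Path("passages")      # hypothetical layout: passages/shard_00.json ...

def shard_for(doc_id: str) -> Path:
    # Stable doc_id -> shard mapping so each lookup only opens one small file.
    return SHARD_DIR / f"shard_{zlib.crc32(doc_id.encode()) % NUM_SHARDS:02d}.json"

def get_passage(doc_id: str) -> str:
    # Each shard is a {doc_id: passage} dict small enough to read quickly.
    with open(shard_for(doc_id)) as f:
        return json.load(f)[doc_id]
```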
> At some point, do you think you could dump these instructions into an indexing README so that others could create one later, if need be?
Yeah! For sure. Could include suggested steps to debug if things go sideways too, e.g.
`gcloud ai operations describe OPERATION_ID`
Sounds great! I guess we still need to update this PR a bit or should we already start creating indices for the models https://github.com/embeddings-benchmark/arena/issues/6?
Another point is maybe we want to update the Wikipedia index? The one I took is from NQ, but maybe we should take the latest Wikipedia dump? also cc @orionw as you know better than me I think!
I like the idea of having a Wikipedia index that's newer for the Arena. On the other hand, most models/datasets are on older indexes like NQ.
This is what BEIR said about the NQ dataset:
> We filtered out queries without an answer, or having a table as an answer, or with conflicting Wikipedia pages. We retain 2,681,468 passages as our corpus T and 3452 test queries Q from the original dataset.
Not sure what this means exactly, but IIRC Wikipedia usually has many more passages than 2.6 million (I know the DPR one with 100 word chunks has 21 million). I think BEIR subsampled it to be faster.
If we do want one that is newer, they have an easy format in JSON for indexing here: https://dumps.wikimedia.org/other/cirrussearch/. I can download it and chunk it if we want a newer one. The only question would be what size chunks? 100 words feels small these days, but we probably don't want to go too long either? And do we want a Wikipedia index with ~15 million passages?
Are we still limited by RAM @isaac-chung? If not, then I think we should go with the latest Wikipedia dump?
RE: Chunk size: For the NQ one we currently use, it seems to split on newlines/passages? Maybe there's also a word limit, but I think always splitting after the end of a sentence/passage is great, such that users don't vote for one result just because it finishes in a cleaner way. Below is an image of the title/passage and how it looks on Wikipedia (page: https://en.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki).
I think it would be great to also have the subtitles (i.e. Nagasaki > Bombing of Nagasaki), not only the title of the entire page (Atomic bombings of Hiroshima and Nagasaki), but otherwise this format seems good to me. What do you think?
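For concreteness, one chunk could then look something like this (field names and ID format purely illustrative):

```python
# Hypothetical corpus record keeping both the page title and the section path.
passage = {
    "id": "enwiki-<page_id>-<chunk_idx>",
    "title": "Atomic bombings of Hiroshima and Nagasaki",
    "section": "Nagasaki > Bombing of Nagasaki",
    "text": "First paragraph(s) of the section ...",
}
```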
We should be fine with the current implementation if the space can handle 1.3GB * 15M/2.6M = 7.5GB for loading all of the passages/chunks from the latest wikipedia dump.
[Maybe for the next iteration/PR] Sharding the passages and reading on the fly after pinging the GCP index (? hope I didn't misunderstand) could present a different bottleneck, as reading takes quite a long time. I tried reading a 500MB JSON file, and on average over 5 runs it took 5.7s. Something we could try later: the `MatchNeighbor` class. If that fails, ...

Okay, let's see how much it will be @orionw and then decide if we need to change the implementation? I will set up the index on GCP once the dump is ready and then we can test it live.
> We should be fine with the current implementation if the space can handle 1.3GB * 15M/2.6M = 7.5GB for loading all of the passages/chunks from the latest wikipedia dump.
I am a little confused, where are these numbers coming from? @isaac-chung
> RE: Chunk size: For the NQ one we currently use, it seems to split on newlines/passages?
Yes, we can definitely split on newlines and keep the title in every instance. Getting the subtitles is actually much harder, since we'd have to get the MediaWiki version of Wikipedia and parse it in a way that preserves the hierarchy. Someone in my lab is doing that for MegaWika v2, so should be done sometime this summer-ish if we want to update it later.
> Maybe there's also a word limit
Are we not so worried about this? I suspect some Wikipedia paragraphs will be much longer than 512 tokens. However, there may only be one or two models that have that limit these days (E5/BGE type models).
If so, I will write some chunking code that greedily takes at least one paragraph or N paragraphs up to a max word limit. Perhaps 500 ish words, so ~800 tokens?
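A rough sketch of that greedy chunker (function name and the exact word cap are placeholders, not the final code):

```python
def chunk_page(paragraphs: list[str], max_words: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words still becomes its own chunk,
    so we always take at least one paragraph and never split mid-paragraph.
    """
    chunks, current, current_words = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_words + n > max_words:
            chunks.append("\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```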
> so should be done sometime this summer-ish if we want to update it later.
Cool; def not high priority, but it would be a nice-to-have later!
> If so, I will write some chunking code that greedily takes at least one paragraph or N paragraphs up to a max word limit. Perhaps 500 ish words, so ~800 tokens?
Sounds good to me; amazing!
Let's get ready for launching the arena soooon!!
> And do we want a Wikipedia index with ~15 million passages?
@orionw the NQ wiki corpus had 2.6M passages at 1.3GB, and you mentioned the newest wiki had 15M passages. This is assuming the passages are chunked the same way but that might not be true.
How does this size seem (a sample of 100)? They are chunked by paragraphs and can contain multiple paragraphs if they are less than 500 words. I don't include any overlap since we're chunking by paragraph.
I wasn't able to use the Cirrus text, which is a shame. They are meant for ElasticSearch and don't have newlines, but the text doesn't have the scraping problems that other Wikipedia extractors have. We could use it and split on sentences, but that isn't ideal either.
> the NQ wiki corpus had 2.6M passages at 1.3GB, and you mentioned the newest wiki had 15M passages. This is assuming the passages are chunked the same way but that might not be true.
Thanks for explaining @isaac-chung! Makes sense.
If we like that sample, the remainder of the docs are uploaded here: https://huggingface.co/datasets/orionweller/wikipedia-2024-06-24-docs
Roughly 5GB and only 2.67M passages (although much longer than DPR style ones)
Any idea if the footer/references are cleaned for NQ? e.g.

```
References\nExternal links\nPaul Davids' official website\nPaul Davids UFO (link broken \u2013 July 8, 2010)\nReview of The Sci-Fi Boys documentary at The Thunder Child\nAn exchange with Cecil Adams, of \"The Straight Dope\", on the Roswell Incident\nCategory:20th-century American novelists\nCategory:20th-century American male writers\nCategory:American male novelists\nCategory:American science fiction writers\nCategory:Living people\nCategory:Place of birth missing (living people)\nCategory:Year of birth missing (living people)
...
...
Category:Triple J announcers\nCategory:Australian women radio presenters\nCategory:Year of birth missing (living people)\nCategory:Living people"
```
Looking at the NQ HuggingFace dataset viewer, it has a lot of one-liner docs and some other quirks, but no references.
The Cirrus file has no references already, but again -- no newlines.
Good catch though @isaac-chung! There's no good functionality from Wikipedia that says "this is a references section". I can take every header and skip it if it has the word "references", "reference", or "notes"? This will probably remove some valid sections, but at least it would remove all references.
Wikipedia is always deceptively hard to parse, which is why I was hoping the Cirrus file would work since Wikimedia has already parsed it.
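Something like this sketch for the filter (names are made up; the real extraction code may expose headers differently):

```python
# Only the three words named above are confirmed; the set could be extended
# (e.g. "external links") if more junk sections show up.
SKIP_WORDS = ("references", "reference", "notes")

def keep_section(header: str) -> bool:
    # Drop any section whose header contains one of the skip words.
    lowered = header.lower()
    return not any(word in lowered for word in SKIP_WORDS)
```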
Okay new sample here: wiki_extracted.json
It's unfortunately very "whack-a-mole" style. I removed all the errors I could find, but there will probably be some more once I generate the full thing. And perhaps we can remove most of those. I'll generate the full set and upload it to HF when it's done. I also made a PR about the extraction code.
If we don't want to deal with this, I suggest we take someone's existing Wikipedia (like NQ's), which has parsing errors too, but at least they're not our parsing errors.
Thanks @orionw ! This looks good to me. If @Muennighoff agrees, then I can update the code to read from your HF dataset instead of a local file.
I've gone through a couple of rounds of creating the full version and `grep`'d for different types of errors in the full set of docs. I was able to fix a few more in each attempt. I think this last set covers all the major problems, so I'll push that when it finishes tonight.
The dataset has the most recent previous version if you want to look at the viewer; the new fix should be the same, except for fixing trailing "}}" phrases, which are relatively rare.
Sounds good to me! I would have thought that the creation script for the NQ Wikipedia dump would be available somewhere but also couldn't find it :/
> Thanks @orionw ! This looks good to me. If @Muennighoff agrees, then I can update the code to read from your HF dataset instead of a local file.
Updated the PR to read from the HF dataset.
We now have three corpora thanks to @orionw : Wikipedia, StackExchange, arXiv (see https://github.com/embeddings-benchmark/arena/issues/5)
Will the in-memory approach from this PR still work then, or do we need to adapt it @isaac-chung? I can also just go ahead and create the indices and we can try, if you want? Are the steps for that still up-to-date?
Great! ~Wikipedia and arxiv should be fine but StackExchange at 35GB might be challenging.~ It should be fine with a big enough machine (fine on an A10).
Steps are still up-to-date, though it currently defaults to the wikipedia repo. I'll add a param to specify the corpus at the class `__init__` level. That way, we can specify which corpus to load when creating/loading the index:

```diff
 gcp_index = VertexIndex(
+    corpus="wikipedia",
     dim=dim,
     model_name=model_name,
     model=model,
     limit=limit,
 )
```
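Not the final code, but roughly the idea behind the `corpus` param (the mapping and column names below are assumptions; only the Wikipedia repo is the one linked above):

```python
from datasets import load_dataset

# Hypothetical corpus-name -> HF dataset repo mapping; only the Wikipedia entry
# comes from this thread, the others are placeholders.
CORPUS_REPOS = {
    "wikipedia": "orionweller/wikipedia-2024-06-24-docs",
    # "stackexchange": "...",
    # "arxiv": "...",
}

def load_doc_map(corpus: str) -> dict:
    # Read the corpus from the Hub and build the in-memory id -> text map
    # used for lookups (assumes "id" and "text" columns; adjust to the real schema).
    ds = load_dataset(CORPUS_REPOS[corpus], split="train")
    return {row["id"]: row["text"] for row in ds}
```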
Do you mean if we get an A10G large?
Yeah something like that. I've got a small running. Also
```
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           196Gi       7.0Gi        94Gi       5.0Mi        94Gi       187Gi
Swap:             0B          0B          0B
```
Okay sounds good; I will go through the instructions and create the indices!
Merging this preliminarily so I can directly debug with the arena space
Add GCP Vertex index support.

Key files
- `run_vertex_index.py`: Example script to call retrieval using the GCP index. Can also be used to trigger an index build.
- `retrieval/gcp_index.py`: Stores the `VertexIndex` class. After instantiating the class, calling `search()` should load the endpoint if it exists or build it from scratch if not.

Checklist
- `VertexIndex` class (this can be updated to read from repo vars)
- Setup FULL index and endpoint with deployment by manual trigger (Caution! will take a long time)
  - `gcloud auth application-default login`. Follow the instructions in the terminal.
  - `corpus.json` via `wget https://huggingface.co/datasets/mteb/nq/resolve/main/corpus.jsonl` at the root of the repo.
  - In `run_vertex_index.py`, update `MODEL_META_PATH = "model_meta.yml"`. This removes the limit on passages.
  - Run `run_vertex_index.py`.

Notes

Limitations
- The `VertexIndex` class uses `self.doc_map` to get raw passage texts, as the index does not support storing them as metadata.
- The `tmp.json` file that is written to the GCS bucket is ~6x the size of `embeddings.0.pt` on the same passages. The cause could be 1) the use of a string to repr each embedding float and 2) added overhead like "id" and "embedding" (smaller):

```python
In [2]: sys.getsizeof("0.07250326871871948")
Out[2]: 68

In [3]: sys.getsizeof(0.07250326871871948)
Out[3]: 24
```
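A rough sanity check of that ~6x, comparing one JSON line of stringified floats against float32 storage (the 768-dim vector and exact record format are assumptions, not measurements on the real corpus):

```python
import json

dim = 768  # assumed embedding dimension
emb = [0.07250326871871948] * dim

# One JSON record as described above: floats written as strings plus "id"/"embedding" keys.
json_line = json.dumps({"id": "doc-0", "embedding": [str(x) for x in emb]})

# Rough cost of the same vector stored as float32 in a tensor file.
binary_bytes = dim * 4

print(len(json_line) / binary_bytes)  # ~5-6x, in the same ballpark as the observed ~6x
```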