allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Direct access to all doc_ids #184

Open ArthurCamara opened 2 years ago

ArthurCamara commented 2 years ago

This is something I was expecting to be quite straightforward (or at least better documented in the API), but it doesn't seem to be. Say I want to gather all doc_ids from a given corpus (for instance, to use a random negative sampler at runtime). Currently, this is what I do:

import ir_datasets

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())

which works, but from what I can tell this triggers an iteration over all docs in the collection (and is also not very intuitive).

Is there a better way to achieve this?

seanmacavaney commented 2 years ago

The easiest way to load all doc_ids is:

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = [d.doc_id for d in data.docs]

But, as you say, this iterates over all documents.

I think it would be straightforward enough to add a new API for iterating over just the document IDs, if you think it would be valuable. Maybe exposed as something like: data.docs.doc_ids.
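
If it were added, usage might look something like this (purely hypothetical, data.docs.doc_ids is not currently part of ir_datasets):

all_doc_ids = list(data.docs.doc_ids)  # hypothetical proposed API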


But for your particular use case, I think you may not actually need the doc_ids themselves. You can just sample by index instead of by doc_id, eliminating the need to load doc_ids at all. For instance, you could do:

import random

num_docs = len(data.docs)          # total number of documents in the corpus
idx = random.randrange(num_docs)   # pick a random index
data.docs[idx]                     # fetch that document by index

Lookups by index are fast (especially on SSD) and do not load the corpus into memory once a docstore is built (which happens automatically, and is needed anyway for lookups by doc_id).
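
To make that concrete, here's a minimal sketch of a random negative sampler built on those index lookups (the exclude_doc_ids set of known positives is a hypothetical input, not anything from ir_datasets):

import random
import ir_datasets

data = ir_datasets.load("msmarco-document/train")
num_docs = len(data.docs)

def sample_negative(exclude_doc_ids, rng=random):
    # draw random documents by index until one is not a known positive
    while True:
        doc = data.docs[rng.randrange(num_docs)]
        if doc.doc_id not in exclude_doc_ids:
            return doc

negative = sample_negative({"D123456"})  # "D123456" is a made-up positive doc_id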

ArthurCamara commented 2 years ago

OK, this also works, looking up documents by index!

As for

all_doc_ids = [d.doc_id for d in data.docs]

versus

all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())

The second one seems slightly faster (though that could be because when I tried the first one it was wrapped in a tqdm for loop, which may have added some overhead).
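
For reference, a minimal way to compare the two without the tqdm wrapper (timings will of course vary by machine and dataset):

import time
import ir_datasets

data = ir_datasets.load("msmarco-document/train")

t0 = time.perf_counter()
ids_iter = [d.doc_id for d in data.docs]
t1 = time.perf_counter()
ids_idx = list(data.docs._handler.docs_store().lookup.idx())
t2 = time.perf_counter()

print(f"iterating docs: {t1 - t0:.1f}s, reading lookup idx: {t2 - t1:.1f}s")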


As for adding the API: yes, that shouldn't be very hard. I can do it early next week, if that's OK.

seanmacavaney commented 2 years ago

Great, glad the lookups by index work for what you need!

I think the risk of doing:

all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())

is that the idx is sorted lexically by doc_id (that's what enables the lookups), so its order will not necessarily align with the corpus's index order, which users may expect. Maybe this is alright, but we probably want to think a bit more about the design here before pushing this through.
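
To illustrate the mismatch with some made-up doc_ids:

corpus_order = ["D10", "D2", "D1"]   # order the documents appear in the corpus (index order)
lookup_order = sorted(corpus_order)  # lexical order used by the docstore's idx
print(lookup_order)                  # ['D1', 'D10', 'D2'] -- positions no longer match corpus_order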

Also, not all datasets use an lz4docstore (e.g., the ClueWebs), because we don't want to make a copy of huge corpora, so some consideration of those cases should be made as well.

seanmacavaney commented 1 year ago

Hey @ArthurCamara -- quick update on this. Over the past few months I've been working on an alternative file format to facilitate doc_id->idx and idx->doc_id lookups, iteration over doc_ids, etc. It also aims to ditch the searchsorted approach for doc_id->idx lookups in favor of an on-disk hash table, since the former requires doc_ids to be padded to the same length (adding considerable size to some lookup structures) and has an unfavourable access pattern on disk, which makes it a bit slow until everything is loaded into the cache.
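
As a rough illustration of the padded, searchsorted-style lookup being replaced (made-up doc_ids, with numpy just to sketch the idea -- this is not ir_datasets' actual internals):

import numpy as np

doc_ids = ["D3", "D10", "D100", "D2"]
width = max(len(d) for d in doc_ids)                   # every id padded to one fixed width
sorted_ids = np.sort(np.array([d.ljust(width) for d in doc_ids]))
idx = np.searchsorted(sorted_ids, "D10".ljust(width))  # binary search over the padded ids
assert sorted_ids[idx] == "D10".ljust(width)

An on-disk hash table avoids both the padding and the scattered binary-search reads.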

Not sure when it'll be ready for primetime, but just letting you know that a solution to this is in the works.

ArthurCamara commented 1 year ago

That sounds awesome, @seanmacavaney. Thanks for letting me know!