allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

CLIRMatrix does not provide document counts #123

Closed: eugene-yang closed this issue 2 years ago

eugene-yang commented 3 years ago

Describe the bug: The document count is not available (None is returned) in both the original and beta Python interfaces, even though the document iterators do yield documents.

Affected dataset(s): CLIRMatrix

To Reproduce Steps to reproduce the behavior:

In [1]: import ir_datasets

In [2]: dataset = ir_datasets.load('clirmatrix/zh/bi139-full/en/dev')

In [3]: len(dataset.docs)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-791a85976ef8> in <module>
----> 1 len(dataset.docs)

TypeError: 'NoneType' object cannot be interpreted as an integer

In [4]: dataset.docs_count()

In [5]: len(list(dataset.docs_iter()))
Out[5]: 1089043

Expected behavior: The document count should be returned.

seanmacavaney commented 3 years ago

Ahh, looking back, this was an intentional (albeit poor) design decision -- I intended the docs_count() function to be lazy, i.e., to not trigger the creation of a docstore just to get the count if it didn't already have it. The idea was that if, e.g., you wanted to add a progress bar to your iterator, you'd probably rather not go through the entire corpus to create a docstore before iterating over the collection again.

In hindsight, a way to override this behaviour would have been better. E.g., docs_count(force=True) or similar.

With the metadata project (#66), counts will always be available, even without needing to download the content. But in the meantime, a workaround would be to do this: dataset.docs_store().count() -- will always give the total number of docs. And after you do it once, dataset.docs_count() and len(dataset.docs) will return the count as well.
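For reference, the workaround above can be wrapped in a small helper. This is only a sketch: the function name `docs_count_with_fallback` and its `force` flag are made up for illustration and are not part of ir_datasets; the only library calls it relies on are `docs_count()` and `docs_store().count()` as described in this thread.

```python
def docs_count_with_fallback(dataset, force=False):
    """Return the document count for an ir_datasets-style dataset.

    `force` is a hypothetical flag (not an ir_datasets parameter). When it is
    False, this mirrors the lazy behaviour of docs_count() and may return None.
    """
    # Lazy lookup: may be None if the count is not known yet.
    count = dataset.docs_count()
    if count is None and force:
        # Building the docstore may require a full pass over the corpus,
        # so only do it when the caller explicitly asks for it.
        count = dataset.docs_store().count()
    return count
```

With the dataset from the report, `docs_count_with_fallback(ir_datasets.load('clirmatrix/zh/bi139-full/en/dev'), force=True)` would be expected to build the docstore once and then return the total.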

eugene-yang commented 3 years ago

Thanks @seanmacavaney for looking into this! This is useful. I guess it would make more sense to give a warning instead of returning None, or at least to provide identical behavior in both interfaces.
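A rough sketch of what I mean by a warning, written as a user-side wrapper. The name `docs_count_or_warn` and the message text are my own assumptions, not anything ir_datasets provides:

```python
import warnings


def docs_count_or_warn(dataset):
    """Call dataset.docs_count() and warn when the count is not available.

    Hypothetical wrapper for illustration; not part of ir_datasets.
    """
    count = dataset.docs_count()
    if count is None:
        # Make the missing count visible instead of silently returning None.
        warnings.warn(
            "Document count is not available yet; "
            "dataset.docs_store().count() can be used to compute it.",
            RuntimeWarning,
            stacklevel=2,
        )
    return count
```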

seanmacavaney commented 2 years ago

In the metadata branch, counts are now provided by the metadata if they are not yet available from the provider directly! (This applies to both versions of the Python API.)