Closed: eugene-yang closed this issue 2 years ago
Ahh, looking back this was an intentional (albeit poor) design decision -- I intended the docs_count() function to be lazy, i.e., not trigger the creation of a docstore to get the count if it didn't already have it. The idea was that, e.g., if you wanted to add a progress bar to your iterator, you'd probably rather not go through the entire corpus to create a docstore before iterating over the collection again.
In hindsight, a way to override this behaviour would have been better, e.g., docs_count(force=True) or similar.
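To illustrate the lazy behaviour and the suggested override, here is a minimal sketch using a toy class. It is not the ir_datasets implementation: the class LazyCountCorpus, its build_docstore() method, and the force parameter are all hypothetical and exist only to show the idea.

```python
from typing import Iterator, List, Optional


class LazyCountCorpus:
    """Toy stand-in for a dataset; hypothetical, not ir_datasets code."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = docs
        self._count: Optional[int] = None  # unknown until a docstore is built

    def docs_iter(self) -> Iterator[str]:
        return iter(self._docs)

    def build_docstore(self) -> None:
        # Stands in for the expensive full pass over the corpus.
        self._count = sum(1 for _ in self.docs_iter())

    def docs_count(self, force: bool = False) -> Optional[int]:
        # Lazy by default: only report a count that is already known.
        # With force=True (hypothetical flag), pay for the full pass.
        if self._count is None and force:
            self.build_docstore()
        return self._count


corpus = LazyCountCorpus(["d1", "d2", "d3"])
lazy_count = corpus.docs_count()              # count not known yet
forced_count = corpus.docs_count(force=True)  # triggers the full pass
cached_count = corpus.docs_count()            # now known without force
```

The default keeps a progress-bar use case cheap, while the flag lets a caller opt in to the expensive pass when an exact count matters more.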
With the metadata project (#66), counts will always be available, even without needing to download the content. But in the meantime, a workaround would be to do this: dataset.docs_store().count(), which will always give the total number of docs. And after you do it once, dataset.docs_count() and len(dataset.docs) will return the count as well.
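The workaround can be wrapped in a small helper, sketched below. It relies only on the two calls mentioned above, docs_count() and docs_store().count(). The helper name docs_count_or_build and the _StubDataset and _StubDocstore classes are hypothetical: the stubs merely imitate the described behaviour so the example runs without downloading a corpus.

```python
from typing import Any, Optional


def docs_count_or_build(dataset: Any) -> int:
    """Return the document count, building the docstore only if needed."""
    count: Optional[int] = dataset.docs_count()
    if count is None:
        # Fall back to the docstore, which always knows the total.
        count = dataset.docs_store().count()
    return count


class _StubDocstore:
    """Hypothetical docstore exposing only count()."""

    def __init__(self, n: int) -> None:
        self._n = n

    def count(self) -> int:
        return self._n


class _StubDataset:
    """Hypothetical dataset: docs_count() is None until a docstore exists."""

    def __init__(self, n: int) -> None:
        self._n = n
        self._store: Optional[_StubDocstore] = None

    def docs_count(self) -> Optional[int]:
        return None if self._store is None else self._store.count()

    def docs_store(self) -> _StubDocstore:
        if self._store is None:
            self._store = _StubDocstore(self._n)
        return self._store


stub = _StubDataset(5)
before = stub.docs_count()          # no docstore yet
total = docs_count_or_build(stub)   # builds the docstore via the fallback
after = stub.docs_count()           # count is now available directly
```

With a real dataset object, the same helper should apply as long as it exposes these two methods, though building the docstore may require a full pass over the corpus.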
Thanks @seanmacavaney for looking into this! This is useful. I guess it would make more sense to give a warning instead of returning None, or at least to provide identical behavior on both interfaces.
In the metadata branch, counts are now provided by metadata if they are not yet available from the provider directly! (In both versions of the Python API.)
Describe the bug The document count is not available (returning None) in both the original and beta Python interfaces, even though the document iterators yield documents.
Affected dataset(s) CLIRMatrix
To Reproduce Steps to reproduce the behavior:
Expected behavior Document count should be returned.