Is your feature request related to a problem? Please describe.
The `MetadataComponent`, as I understand it, speeds up metadata queries in production by using a pre-computed metadata index.
But when developing doc classes for new datasets, this component often intercepts calls to `docs_count` etc., making it difficult to track down issues in the development code (because the `MetadataComponent` effectively monkey-patches the corresponding functions).
Describe alternatives you've considered
Alternatively, the metadata component should at least raise clear error messages that tell developers which metadata is missing and how to add it.
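
A rough sketch of the kind of error meant here, not the actual `MetadataComponent` code. `MissingMetadataError`, `lookup_metadata`, and the `how_to_fix` hint are hypothetical names used only for illustration; the real hint would point to whatever the project's metadata generation step is.

```python
class MissingMetadataError(LookupError):
    """Raised when a pre-computed metadata entry is not available (hypothetical)."""


def lookup_metadata(dataset_id, metadata, key, how_to_fix):
    """Return metadata[key], or fail with a message that names the fix.

    `metadata` is a plain dict standing in for the pre-computed index;
    `how_to_fix` is a placeholder for project-specific instructions.
    """
    try:
        return metadata[key]
    except KeyError:
        raise MissingMetadataError(
            f"No pre-computed metadata '{key}' for dataset '{dataset_id}'. "
            f"To add it: {how_to_fix}"
        ) from None
```

The point is only that the message names the dataset, the missing key, and the next step, so a developer does not have to read the wrapper code to find out why `docs_count` failed.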
Additional context
When developing integrations for large datasets, e.g., #213, having to generate metadata while developing the parsers is also inconvenient, since one would often test the parsers on a smaller sample before computing metadata on the whole database.
Describe the solution you'd like
I'd be happy to have an opt-out environment variable that effectively switches off the monkey patching in these lines: https://github.com/allenai/ir_datasets/blob/546baf8c40394c77de177060e3c58fef2ab6fdd7/ir_datasets/util/registry.py#L41-L42
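
To make the opt-out idea concrete, here is a minimal sketch under stated assumptions. It does not reproduce the code in `registry.py`: the variable name `IR_DATASETS_SKIP_METADATA`, the `MetadataWrapper` class, and the `register` function are all hypothetical stand-ins.

```python
import os

# Hypothetical variable name; not an existing ir_datasets setting.
OPT_OUT_VAR = "IR_DATASETS_SKIP_METADATA"


def metadata_disabled(environ=None):
    """Return True if the opt-out variable is set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get(OPT_OUT_VAR, "")
    return value.strip().lower() in {"1", "true", "yes", "on"}


class MetadataWrapper:
    """Stand-in for a component that answers docs_count() from a pre-computed index."""

    def __init__(self, dataset, precomputed):
        self._dataset = dataset
        self._precomputed = precomputed

    def docs_count(self):
        # Served from the index; the wrapped dataset's own method is never called.
        return self._precomputed["docs_count"]

    def __getattr__(self, name):
        # Everything not defined on the wrapper is delegated to the dataset.
        return getattr(self._dataset, name)


def register(registry, name, dataset, precomputed, environ=None):
    """Register a dataset, wrapping it only when the opt-out is not set."""
    if not metadata_disabled(environ):
        dataset = MetadataWrapper(dataset, precomputed)
    registry[name] = dataset
    return dataset
```

With the variable set, `register` stores the raw dataset object, so calls such as `docs_count()` go straight to the development code and any exception surfaces with its original traceback.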