allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Disable MetadataComponent for local development #216

Open heinrichreimer opened 1 year ago

heinrichreimer commented 1 year ago

Is your feature request related to a problem? Please describe. The MetadataComponent, as I understand it, speeds up metadata queries in production by serving them from a pre-computed metadata index. But when developing doc classes for new datasets, this component often intercepts calls to docs_count etc., which makes it difficult to track down issues in the development code (because the MetadataComponent effectively monkey-patches the corresponding functions).

Describe the solution you'd like I'd be happy to have an opt-out environment variable that effectively switches off the monkey-patching in these lines: https://github.com/allenai/ir_datasets/blob/546baf8c40394c77de177060e3c58fef2ab6fdd7/ir_datasets/util/registry.py#L41-L42
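
To make the request concrete, here is a minimal, self-contained sketch of what such an opt-out could look like. It does not use or reproduce the actual ir_datasets code: the variable name `IR_DATASETS_DISABLE_METADATA`, the `register` helper, and the `PrecomputedCounts` and `ToyDataset` classes are all hypothetical stand-ins, chosen only to illustrate the idea of skipping the wrapping step when an environment variable is set.

```python
import os

# Hypothetical variable name; not an existing ir_datasets setting.
ENV_VAR = "IR_DATASETS_DISABLE_METADATA"


def metadata_disabled(environ=None):
    """Return True if the opt-out variable is set to a truthy value."""
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR, "").strip().lower() in {"1", "true", "yes", "on"}


class ToyDataset:
    """Stand-in for a dataset under development."""

    def __init__(self, docs):
        self._docs = list(docs)

    def docs_iter(self):
        return iter(self._docs)

    def docs_count(self):
        # Counts by iterating, i.e. this exercises the development code.
        return sum(1 for _ in self.docs_iter())


class PrecomputedCounts:
    """Stand-in for a metadata wrapper: answers docs_count from a lookup table."""

    def __init__(self, dataset, metadata):
        self._dataset = dataset
        self._metadata = metadata

    def docs_count(self):
        # Intercepts the call; the wrapped dataset is never consulted.
        return self._metadata["docs"]["count"]

    def __getattr__(self, name):
        # Everything else is delegated to the wrapped dataset.
        return getattr(self._dataset, name)


def register(dataset, metadata, environ=None):
    """Wrap the dataset with precomputed metadata unless the opt-out is set."""
    if metadata_disabled(environ):
        return dataset
    return PrecomputedCounts(dataset, metadata)
```

With the variable unset, `register(...)` returns the wrapper and `docs_count()` is answered from the (possibly stale) lookup table; with it set to `1`, the unwrapped dataset is returned and the call reaches the development code directly.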

Describe alternatives you've considered Alternatively, the metadata component should at least raise clear error messages that tell developers which metadata is missing and how to add it.
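
As a rough illustration of this alternative, the sketch below shows a lookup that fails with a descriptive message instead of a bare `KeyError`. The `MissingMetadataError` class, the `lookup_count` function, and the nested dictionary layout are assumptions made for this example only and are not taken from ir_datasets.

```python
class MissingMetadataError(LookupError):
    """Hypothetical error raised when no precomputed metadata is available."""


def lookup_count(metadata, dataset_id, entity):
    """Return the precomputed count, or explain exactly what is missing."""
    try:
        return metadata[dataset_id][entity]["count"]
    except KeyError:
        raise MissingMetadataError(
            f"No precomputed metadata for '{entity}' of dataset '{dataset_id}'. "
            "Generate the metadata for this dataset, or bypass the metadata "
            "lookup while the dataset is still under development."
        ) from None
```

A message of this kind names both the dataset and the missing entity, so a developer can tell at a glance that the failure comes from the metadata layer and not from the parser being written.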

Additional context For developing integrations for large datasets, e.g., #213, having to generate metadata while developing the parsers is also inconvenient, since one would often test the parsers on a smaller sample before computing metadata for the whole dataset.