Accept directories extracted from original compressed files

andrewyates commented 3 years ago

Currently irds requires the original files from a dataset, such as a tar.gz file for the NYT corpus. It would be nice if the extracted directory could also be provided as input instead. This would be useful in cases where the original compressed file wasn't kept (as happened to me with NYT). Taking the directory as input is also closer to what most other tools do, so it potentially removes the need to have both the compressed file and its contents available.

seanmacavaney commented 3 years ago

A tricky bit about this is keeping the iterator order consistent across the two modes. Right now, directory inputs are processed in sorted order (per globs), whereas tar files are processed in the order in which they are encountered during decompression. It seems there are 3 practical options:

1. Eliminate the iteration order constraint, and have the two possible orderings. In theory, this shouldn't matter as the document content remains the same, but it could have (minor) effects downstream, e.g., different internal docids assigned when indexing, or floating-point non-determinism as documents are batched with other documents on GPUs. Tests would also need to be updated to support both formats somehow? (Or just somehow specify which format to use for the tests.) I'd rather not go this route because I'm a fan of trying to control as much as possible for reproducibility.
2. When a tarfile is provided, first extract the contents, then operate over it as a directory in sorted order. This would store extra data on disk, though I suppose it could be removed right after if we enforced that a docstore be constructed before iteration. (We already always first construct a docstore for some datasets. It has the advantage that iteration becomes much faster + other features, at the expense of extra storage.)
3. Include the order of the files in the tarfile in irds itself, and iterate over the directory in the same order as the tarfile. This could be annoying for large archives and those where each document gets its own file.

I think I'm leaning towards option 2.
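
For illustration, option 2 could look roughly like the sketch below. It uses only the Python standard library and is not irds code: `iter_sorted_files`, `_walk_sorted`, and `scratch_dir` are hypothetical names, and it assumes the archive is trusted and small enough to extract into a temporary directory.

```python
import tarfile
import tempfile
from pathlib import Path


def _walk_sorted(root):
    """Yield (relative_path, content) for every file under root, sorted by relative path."""
    names = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    for name in names:
        yield name, (root / name).read_bytes()


def iter_sorted_files(source, scratch_dir=None):
    """Hypothetical helper: accept a tar archive OR an extracted directory.

    Both inputs are iterated in the same sorted order, so downstream
    consumers see identical sequences regardless of which one was given.
    """
    source = Path(source)
    if source.is_dir():
        yield from _walk_sorted(source)
        return
    # Archive input: extract everything first, then treat it as a directory.
    # The temporary copy is deleted once iteration finishes.
    with tempfile.TemporaryDirectory(dir=scratch_dir) as tmp:
        with tarfile.open(source) as tar:
            tar.extractall(tmp)  # assumes a trusted archive
        yield from _walk_sorted(Path(tmp))
```

The extra disk usage mentioned above shows up here as the temporary directory, which only lives for the duration of the iteration.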

seanmacavaney commented 3 years ago

Partially addressing this in #103 by introducing a new Alternatives component. Will allow things along the lines of:

Alternatives(
  LocalFile("/path/to/compressed/source.tar.gz").un_tar_all(),
  LocalFile("/path/to/uncompressed/source")
)

Which then chooses which one to use based on what's found.
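
The "use whichever is found" idea can be sketched in isolation as below. This is not the Alternatives component from #103, whose interface isn't shown here; `FirstAvailable` and `resolve` are hypothetical names, and the sketch assumes that "found" simply means the path exists on disk.

```python
from pathlib import Path


class FirstAvailable:
    """Hypothetical stand-in: pick the first candidate path that exists."""

    def __init__(self, *candidates):
        self.candidates = [Path(c) for c in candidates]

    def resolve(self):
        # Candidates are checked in the order given, so list the preferred
        # source (e.g. the original archive) first.
        for candidate in self.candidates:
            if candidate.exists():
                return candidate
        tried = ", ".join(str(c) for c in self.candidates)
        raise FileNotFoundError(f"none of the alternatives exist: {tried}")
```

With the archive listed first and the extracted directory second, a user who only kept the directory would still get a usable source.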

allenai / ir_datasets

Accept directories extracted from original compressed files #60