Open andrewyates opened 3 years ago
A tricky bit about this is keeping the iterator order consistent across the two modes. Right now, directory inputs are processed in sorted order (per globs), whereas tar files are processed in the order in which they are encountered during decompression. It seems there are 3 practical options:
I think I'm leaning towards option 2.
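To make the ordering mismatch concrete, here is a minimal sketch using only the Python standard library, not irds code. The helper names (`build_demo`, `dir_order`, `tar_order`, `tar_order_sorted`) and the file names are made up for illustration. It builds a small directory and a tar.gz whose members were added in non-sorted order, then lists both. Sorting the tar member names is shown as one possible way to make the two agree; it is not necessarily one of the options being weighed here.

```python
import os
import tarfile
import tempfile


def build_demo(root):
    """Create a directory and a tar.gz with the same three files.

    The files are added to the archive in a deliberately unsorted order.
    """
    src = os.path.join(root, "source")
    os.makedirs(src)
    names = ["b.txt", "a.txt", "c.txt"]
    for name in names:
        with open(os.path.join(src, name), "w") as f:
            f.write(name)
    tar_path = os.path.join(root, "source.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        for name in names:
            tar.add(os.path.join(src, name), arcname=name)
    return src, tar_path


def dir_order(src):
    """Directory mode: file names in sorted order."""
    return sorted(os.listdir(src))


def tar_order(tar_path):
    """Tar mode: member names in the order they appear in the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return [member.name for member in tar if member.isfile()]


def tar_order_sorted(tar_path):
    """Tar mode with names sorted, at the cost of reading the whole archive first."""
    return sorted(tar_order(tar_path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src, tar_path = build_demo(root)
        print("directory:", dir_order(src))
        print("tar as stored:", tar_order(tar_path))
        print("tar sorted:", tar_order_sorted(tar_path))
```

The trade-off in the sorted variant is that the member list has to be collected in a full pass over the compressed stream before any document can be yielded, which matters for large archives.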
Partially addressing this in #103 by introducing a new Alternatives component. It will allow things along the lines of:
Alternatives(
    LocalFile("/path/to/compressed/source.tar.gz").un_tar_all(),
    LocalFile("/path/to/uncompressed/source")
)
It then chooses which one to use based on what's found.
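As a rough illustration of the "choose based on what's found" idea, the selection could look something like the sketch below. This is not the implementation from #103: `LocalPath` and `FirstAvailable` are hypothetical stand-ins, and the rule used here (take the first candidate whose path exists) is an assumption about how the choice might be made.

```python
import os


class LocalPath:
    """Hypothetical stand-in for a local file or directory source."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class FirstAvailable:
    """Hypothetical selector: resolves to the first candidate that exists."""

    def __init__(self, *candidates):
        if not candidates:
            raise ValueError("at least one candidate is required")
        self.candidates = candidates

    def resolve(self):
        # Candidates are checked in the order given, so earlier entries
        # take priority when more than one is present.
        for candidate in self.candidates:
            if candidate.exists():
                return candidate
        tried = ", ".join(c.path for c in self.candidates)
        raise FileNotFoundError(f"none of the alternatives were found: {tried}")
```

With this shape, listing the compressed file first and the extracted directory second means the directory is only used when the archive is missing, which covers the case where the original tar.gz wasn't kept.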
Currently irds requires the original files from a dataset, such as a tar.gz file for the NYT corpus. It would be nice if the extracted directory could also be provided as input instead. This would be useful in cases where the original compressed file wasn't kept (as happened to me with NYT). Taking the directory as input is also closer to what most other tools do, so it potentially removes the need to have both the compressed file and its contents available.