allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

utility for cleaning up data #89

Closed seanmacavaney closed 3 years ago

seanmacavaney commented 3 years ago

As suggested by Benjamin Piwowarski

Is your feature request related to a problem? Please describe.

Datasets can end up taking up a lot of space, and it's easy to run out of storage.

Describe the solution you'd like

It would be nice if there were a utility to help clean up datasets that you no longer need. This could probably be a command-line utility.

In many cases, cleanup is easy: just delete the dataset's entire directory. This is suitable when the dataset is fully downloadable from public sources; it can simply be downloaded again later, no problem.

For non-downloadable items, though:

  1. We certainly do not want to delete them, as they may be the only copy.
  2. There's no storage benefit from cleaning up symlinks.

so it'd make sense to explicitly avoid removing anything that matches the cache_path of a non-downloadable resource, regardless of whether it's a file, symlink, or directory. A sketch of such a guard is below.
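
A minimal sketch of that guard (helper names here are hypothetical, not ir_datasets' actual internals):

import os

def removable_files(dataset_dir, non_downloadable_cache_paths):
    # Resolve the protected cache_paths once; each may be a file,
    # symlink, or directory.
    protected = [os.path.realpath(p) for p in non_downloadable_cache_paths]

    def is_protected(path):
        rp = os.path.realpath(path)
        return any(rp == p or rp.startswith(p + os.sep) for p in protected)

    # Yield only files that are safe to delete; anything backing a
    # non-downloadable resource (possibly the only copy) is skipped.
    for root, _dirs, files in os.walk(dataset_dir):
        for name in files:
            path = os.path.join(root, name)
            if not is_protected(path):
                yield path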

Describe alternatives you've considered

None

Additional context

None

bpiwowar commented 3 years ago

Thanks for writing the feature request. A possibility for non-freely-downloadable objects would be to:

  1. Ask the user to put them in a specific folder
  2. Have a repository of user managed datasets indexed by key

This is how I proceed in datamaestro (not really documented): a repository where specific keys can be associated with folders located wherever the user wants them to be, e.g.

$ datamaestro datafolders list
gov.nist.trec.tipster   /local/bpiwowar/datasets/trec/TIPSTER
edu.upenn.ldc.aquaint   /local/bpiwowar/datasets/trec/AQUAINT

and this can be set easily by the user (when moving or creating the resource)

$ datamaestro datafolders set gov.nist.trec.tipster /local/bpiwowar/datasets/trec/TIPSTER
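
For illustration, such a key-to-folder registry can be tiny; the sketch below is hypothetical (not datamaestro's actual code) and assumes the mapping is persisted as JSON:

import json
from pathlib import Path

# Hypothetical registry location, for illustration only.
REGISTRY = Path.home() / '.ir_datasets' / 'datafolders.json'

def _load():
    return json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}

def set_folder(key, folder):
    # Associate a dataset key (e.g. gov.nist.trec.tipster) with a
    # user-managed folder, as in `datamaestro datafolders set`.
    mapping = _load()
    mapping[key] = str(Path(folder).expanduser().resolve())
    REGISTRY.parent.mkdir(parents=True, exist_ok=True)
    REGISTRY.write_text(json.dumps(mapping, indent=2))

def list_folders():
    # Mirrors `datamaestro datafolders list`.
    for key, folder in sorted(_load().items()):
        print(f'{key}\t{folder}')
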
seanmacavaney commented 3 years ago

Hmm, right. That's not far off from the current implementation, but I see the benefit of separating them out of the dataset's directory. Migration and backwards compatibility could be a bit annoying, but should be manageable.

I also like the idea of giving the user a simpler way to link the resource. At the very least, showing them the ln command they'd need to run.
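
For example (a hypothetical helper, not part of ir_datasets), the tool could print the exact command rather than leaving the user to work out the paths:

def print_link_hint(source_path, cache_path):
    # Print the symlink command the user needs to run to register a
    # manually obtained resource at the expected cache_path.
    # Nothing is created; paths are whatever the caller supplies.
    print(f'ln -s {source_path} {cache_path}')

print_link_hint('/local/data/TIPSTER',             # hypothetical source
                '/home/sean/.ir_datasets/tipster') # hypothetical target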

seanmacavaney commented 3 years ago

I'm getting cold feet on managing a migration of non-downloadable files. It seems easy enough just to skip those files during the cleanup. I'll write that up as a separate issue.

I built a prototype of a cleanup utility (ir_datasets clean). See below for an example. Not visible here, but sizes >= 1GB are shown in red, which helps them stand out. I think this does what I'd want it to do. Does it satisfy your use cases? Thanks!

$ ir_datasets clean --list
datasets available for cleanup:
883.2MB  antique           20 files
2.2GB    aquaint           8 files
45.0GB   beir              250 files
7.1GB    highwire          70 files
43.5GB   medline           31 files
5.7GB    clinicaltrials    32 files
127.6KB  clueweb09         4 files
1.0MB    clueweb12         4 files
270.4MB  cord19            5 files
3.4MB    cranfield         8 files
22.5GB   dpr-w100          19 files
25.0GB   msmarco-passage   41 files
23.1GB   msmarco-document  26 files
5.8GB    msmarco-qna       19 files
486.8KB  nyt               3 files
6.6MB    trec-robust04     2 files
681.1MB  tripclick         18 files
8.7MB    vaswani           10 files
2.9MB    wapo              8 files
1.5GB    wikir             22 files
46.4GB   trec-fair-2021    9 files

$ ir_datasets clean cranfield # you can list multiple datasets here and/or include -y to automatically say yes
clean up 3.4MB from cranfield (8 files)?
[y(es) / n(o) / l(ist files)] l
1.6MB    /home/sean/.ir_datasets/cranfield/docs.txt
507.0KB  /home/sean/.ir_datasets/cranfield/cran.tar.gz
11.2KB   /home/sean/.ir_datasets/cranfield/docs.pklz4/idx.doc_id.pos
5.6KB    /home/sean/.ir_datasets/cranfield/docs.pklz4/idx.doc_id.key
28B      /home/sean/.ir_datasets/cranfield/docs.pklz4/bin.meta
11.2KB   /home/sean/.ir_datasets/cranfield/docs.pklz4/bin.pos
6B       /home/sean/.ir_datasets/cranfield/docs.pklz4/idx.doc_id.meta
1.2MB    /home/sean/.ir_datasets/cranfield/docs.pklz4/bin
clean up 3.4MB from cranfield (8 files)?
[y(es) / n(o) / l(ist files)] y
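
For reference, the size rendering above (including the red highlighting of sizes >= 1GB mentioned earlier) could be done along these lines; this is an assumed sketch, not the prototype's actual code, and it assumes decimal (1000-based) units:

def format_size(n_bytes):
    value, unit = float(n_bytes), 'B'
    for next_unit in ('KB', 'MB', 'GB', 'TB'):
        if value < 1000:
            break
        value /= 1000
        unit = next_unit
    text = f'{n_bytes}B' if unit == 'B' else f'{value:.1f}{unit}'
    if n_bytes >= 1000**3:  # >= 1GB: wrap in ANSI red so it stands out
        text = '\033[31m' + text + '\033[0m'
    return text

print(format_size(3_400_000))       # 3.4MB
print(format_size(45_000_000_000))  # 45.0GB, shown in red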
bpiwowar commented 3 years ago

Looks good - maybe add a flag to fix the size unit, so the output is parsable and sortable.
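
Purely illustrative (not an existing ir_datasets option), a flag along those lines could look like:

import argparse

parser = argparse.ArgumentParser(prog='ir_datasets clean')
# Hypothetical flag implementing the suggestion: raw byte counts are
# trivially parsable and sortable (e.g. `... --list --bytes | sort -n`).
parser.add_argument('--bytes', action='store_true',
                    help='print sizes as raw byte counts')

args = parser.parse_args(['--bytes'])
size_bytes = 46_400_000_000  # trec-fair-2021, from the listing above
print(size_bytes if args.bytes else '46.4GB')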