Thanks for writing the feature request. A possibility for non-freely-downloadable objects would be:
This is how I proceed in datamaestro (not really documented): a repository where specific keys can be associated with folders located wherever the user wants them to be, e.g.
$ datamaestro datafolders list
gov.nist.trec.tipster /local/bpiwowar/datasets/trec/TIPSTER
edu.upenn.ldc.aquaint /local/bpiwowar/datasets/trec/AQUAINT
and this can be set easily by the user (when moving or creating the resource):
$ datamaestro datafolders set gov.nist.trec.tipster /local/bpiwowar/datasets/trec/TIPSTER
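For illustration, here is a minimal sketch of this key-to-folder idea in Python, assuming the mapping is persisted as a JSON file. The names (REGISTRY_PATH, set_folder, list_folders) and the file location are hypothetical, not datamaestro's actual implementation:

import json
from pathlib import Path

REGISTRY_PATH = Path.home() / ".datamaestro" / "datafolders.json"  # hypothetical location

def _load():
    if REGISTRY_PATH.exists():
        return json.loads(REGISTRY_PATH.read_text())
    return {}

def set_folder(key, folder):
    """Associate a dataset key (e.g. gov.nist.trec.tipster) with a local folder."""
    mapping = _load()
    mapping[key] = str(Path(folder).expanduser().resolve())
    REGISTRY_PATH.parent.mkdir(parents=True, exist_ok=True)
    REGISTRY_PATH.write_text(json.dumps(mapping, indent=2))

def list_folders():
    """Return all key -> folder associations."""
    return _load()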
Hmm, right. That's not far off from the current implementation, but I see the benefit of separating them out of the dataset's directory. Migration and backwards compatibility could be a bit annoying, but should be manageable.
I also like the idea of giving the user a simpler way to link the resource. At the very least, showing the ln command they'll need to run.
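Something along these lines could work; this is just a sketch of the idea, where both paths are supplied by the caller and nothing here reflects ir_datasets' actual API:

from pathlib import Path

def suggest_link(user_path, expected_cache_path):
    # Build the symlink command the user would run to point the expected
    # cache location at their existing copy of the resource.
    src = Path(user_path).expanduser().resolve()
    dst = Path(expected_cache_path).expanduser()
    return f"ln -s {src} {dst}"

# Hypothetical example paths:
print(suggest_link("/local/bpiwowar/datasets/trec/TIPSTER",
                   "~/.ir_datasets/corpus/tipster"))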
I'm getting cold feet on managing a migration of non-downloadable files. It seems easy enough just to skip those files during the cleanup. I'll write that up as a separate issue.
I built a prototype of a cleanup utility (ir_datasets clean). See below for an example. Not visible here, but sizes >= 1GB are shown in red, which helps them stand out. I think this does what I'd want it to do -- does it satisfy your use cases? Thanks!
$ ir_datasets clean --list
datasets available for cleanup:
883.2MB antique 20 files
2.2GB aquaint 8 files
45.0GB beir 250 files
7.1GB highwire 70 files
43.5GB medline 31 files
5.7GB clinicaltrials 32 files
127.6KB clueweb09 4 files
1.0MB clueweb12 4 files
270.4MB cord19 5 files
3.4MB cranfield 8 files
22.5GB dpr-w100 19 files
25.0GB msmarco-passage 41 files
23.1GB msmarco-document 26 files
5.8GB msmarco-qna 19 files
486.8KB nyt 3 files
6.6MB trec-robust04 2 files
681.1MB tripclick 18 files
8.7MB vaswani 10 files
2.9MB wapo 8 files
1.5GB wikir 22 files
46.4GB trec-fair-2021 9 files
$ ir_datasets clean cranfield # you can list multiple datasets here and/or include -y to automatically say yes
clean up 3.4MB from cranfield (8 files)?
[y(es) / n(o) / l(ist files)] l
1.6MB /home/sean/.ir_datasets/cranfield/docs.txt
507.0KB /home/sean/.ir_datasets/cranfield/cran.tar.gz
11.2KB /home/sean/.ir_datasets/cranfield/docs.pklz4/idx.doc_id.pos
5.6KB /home/sean/.ir_datasets/cranfield/docs.pklz4/idx.doc_id.key
28B /home/sean/.ir_datasets/cranfield/docs.pklz4/bin.meta
11.2KB /home/sean/.ir_datasets/cranfield/docs.pklz4/bin.pos
6B /home/sean/.ir_datasets/cranfield/docs.pklz4/idx.doc_id.meta
1.2MB /home/sean/.ir_datasets/cranfield/docs.pklz4/bin
clean up 1.2MB from cranfield (8 files)?
[y(es) / n(o) / l(ist files)] y
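For reference, the gist of a listing like the one above is just a directory walk that totals file sizes per dataset. This is a rough sketch of the approach, not the actual ir_datasets clean implementation:

from pathlib import Path

BASE = Path.home() / ".ir_datasets"  # default base path, as seen in the listing above

def cleanup_candidates():
    # One candidate per top-level dataset directory: (name, total bytes, files).
    for dataset_dir in sorted(p for p in BASE.iterdir() if p.is_dir()):
        files = [f for f in dataset_dir.rglob("*") if f.is_file()]
        yield dataset_dir.name, sum(f.stat().st_size for f in files), files

for name, total, files in cleanup_candidates():
    print(f"{total / 2**20:10.1f}MB {name} {len(files)} files")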
Looks good - maybe add a flag for the size unit (so the output can be parsed and sorted).
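E.g., something like a --size-unit flag (a suggested name, not an existing option): a fixed unit makes the listing machine-sortable (e.g. with sort -n), while "auto" keeps the human-friendly scaling shown above. A sketch:

UNITS = {"B": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30}

def format_size(n_bytes, unit="auto"):
    # Fixed unit: always the same suffix, so output is parsable and sortable.
    if unit != "auto":
        return f"{n_bytes / UNITS[unit]:.1f}{unit}"
    # Auto: pick the largest unit that keeps the value >= 1.
    for u, factor in reversed(list(UNITS.items())):
        if n_bytes >= factor:
            return f"{n_bytes / factor:.1f}{u}"
    return f"{n_bytes}B"

print(format_size(507 * 2**10))        # -> 507.0KB
print(format_size(507 * 2**10, "MB"))  # -> 0.5MB (fixed unit)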
As suggested by Benjamin Piwowarski.
Is your feature request related to a problem? Please describe.
Datasets can end up taking up a lot of space, and it's easy to run out of storage.
Describe the solution you'd like
It would be nice if there was a utility to help clean up datasets that you do not need anymore. This could probably be a command line utility.
In many cases, the cleanup can be easy: just delete the dataset's entire directory. This is suitable when the dataset is fully downloadable from public sources; it can simply be downloaded again, no problem.
For non-downloadable items, though, deletion is permanent: the files cannot simply be re-acquired. So it'd make sense to explicitly avoid removing files that match the cache_path of sources that are not downloadable (regardless of whether it's a file/link/directory/etc.). A sketch of this guard follows.
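A minimal sketch of that guard in Python, assuming the set of protected cache_paths is collected from the dataset's download metadata; deletable_files and protected_cache_paths are illustrative names, not ir_datasets functions:

from pathlib import Path

def deletable_files(dataset_dir, protected_cache_paths):
    # Yield files that are safe to delete, skipping anything at or under a
    # protected cache_path (works whether the cache_path is a file, link,
    # or directory, since symlinks resolve to the same target on both sides).
    protected = {Path(p).resolve() for p in protected_cache_paths}
    for f in Path(dataset_dir).rglob("*"):
        if not f.is_file():
            continue
        resolved = f.resolve()
        if resolved in protected or any(parent in protected for parent in resolved.parents):
            continue
        yield f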
Describe alternatives you've considered
None
Additional context
None