discoproject / disco

a Map/Reduce framework for distributed computing
http://discoproject.org
BSD 3-Clause "New" or "Revised" License

Add ddfs du support #620

Closed oldmantaiter closed 7 years ago

oldmantaiter commented 9 years ago

This feature runs a job on the cluster against the specified tags to find the unreplicated size of their data on the filesystem. The result may not be 100% accurate if a replica is missing, because the job tries to avoid downloading data from another node, where it would not be able to find the size locally.
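To illustrate what "unreplicated size" means here, the following is a minimal sketch and not the code in this PR. It assumes each blob of a tag is given as a list of replica URLs, and that a hypothetical `size_of` callable returns the size in bytes of a replica, or `None` when that replica is unavailable. Each blob is counted once, however many replicas it has.

```python
def unreplicated_size(blobs, size_of):
    """Sum the size of one replica per blob.

    blobs   -- list of blobs, each a list of replica URLs (assumed layout)
    size_of -- hypothetical callable: URL -> size in bytes, or None if missing
    """
    total = 0
    for replicas in blobs:
        for url in replicas:
            size = size_of(url)
            if size is not None:
                # Count the first replica whose size is known, then skip the rest.
                total += size
                break
        # A blob with no available replica contributes nothing, which is
        # one way the reported total can end up lower than the real size.
    return total
```

With two blobs of two replicas each, where one replica is missing, the total is still the sum of the two blob sizes rather than the sum over all replicas.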

Usage: ddfs du [-H/-P/-n]

For larger tags you will want to increase the number of partitions and the number of cores available to the job (-P and -n respectively). If you would like human-readable output (or just hate doing math), you can use -H and the output will look similar to the following:

$ ddfs du chekov -H
chekov: 7.82 MB
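The conversion that -H saves you from doing by hand can be sketched roughly as follows. This is an assumption about the formatting, not the code in this PR: the function name `human_readable`, the 1024-based units, and the two-decimal rounding are all illustrative.

```python
def human_readable(num_bytes, units=("B", "KB", "MB", "GB", "TB", "PB")):
    """Format a byte count with a unit suffix (assumes 1024-based units)."""
    size = float(num_bytes)
    for unit in units[:-1]:
        if abs(size) < 1024.0:
            return "%.2f %s" % (size, unit)
        size /= 1024.0
    # Anything larger than the second-to-last unit is reported in the last one.
    return "%.2f %s" % (size, units[-1])
```

Under these assumptions, a tag of 8199865 bytes would be reported as 7.82 MB.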

This will cause extra load on the cluster, and large tags might take a while to come back with a result.