baidu / bigflow

Baidu Bigflow is an interface that allows you to write distributed computing programs and provides many simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4PB+ of data inside Baidu and runs about 10k jobs every day.
http://baidu.github.io/bigflow
Apache License 2.0

manage byproducts #47

Open · ziyenano opened this issue 6 years ago

ziyenano commented 6 years ago

While running a Bigflow program, I find that it outputs some byproducts, e.g., entity-*, .flume ... After several runs, the folder becomes a mess. Could you please put those byproducts into a pre-specified subfolder so they can be managed conveniently?

acmol commented 6 years ago

Yeah.. this is an issue we have wanted to fix for a long time but haven't done yet. We are hoping that some of our users can help us improve it in the future. PS: Be careful when you want to remove all the byproducts, because deleting a file that is currently in use will cause the running job to fail.

ziyenano commented 6 years ago

An alternative method: save the following bash script somewhere, e.g., path/bigflow_cleanup.sh

#!/bin/bash
set -x
set -e
# If a directory is given as the first argument, clean it up;
# otherwise clean the current directory.
if [ "$#" -ne 0 ]; then
    cd "$1"
fi
pwd

# Remove the byproducts Bigflow leaves behind in the working directory.
rm -rf ./entity-*
rm -rf ./.flume-resource-*
rm -rf ./.flume-app-*.tar.gz
rm -rf ./.empty-*.tar.gz
rm -rf ./hs_err_*   # JVM crash logs
rm -rf ./.tmp

Then add an alias in your .bashrc or .zshrc, etc.:

alias bigflow_cleanup='sh path/bigflow_cleanup.sh'

Source the .bashrc or .zshrc file, and then you can easily clean up those byproducts in any folder by executing the bigflow_cleanup command.
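
For example (assuming the script was saved to path/bigflow_cleanup.sh as above):

bigflow_cleanup             # clean up the current directory
bigflow_cleanup ~/jobs/foo  # clean up a specific job directory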

acmol commented 6 years ago

Yes, you could do it that way, but it's very easy to make a running job fail if it is still using these tmp files. So I don't think it's a good idea to make this command built-in.

A proper way would be to put all the paths under the same tmp folder, such as .tmp/<uuid>/.

Then the user could run bigflow cleanup 3days to clean up the folders that are older than 3 days.
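
A minimal sketch of what such an age-based cleanup could look like, assuming the byproducts all lived under a per-job .tmp/<uuid>/ folder as proposed (the .tmp/<uuid>/ layout, the script name, and the day threshold are all hypothetical; bigflow cleanup is not an existing subcommand):

#!/bin/bash
# bigflow_cleanup_old.sh <days> -- hypothetical sketch: remove per-job
# .tmp/<uuid>/ folders not modified for more than <days> days (default 3).
set -e
DAYS="${1:-3}"
[ -d .tmp ] || exit 0
# -mindepth/-maxdepth 1 limits the search to the per-job <uuid> directories;
# -mtime +"$DAYS" matches directories last modified more than $DAYS days ago.
find .tmp -mindepth 1 -maxdepth 1 -type d -mtime +"$DAYS" -exec rm -rf {} +

Skipping folders younger than the threshold keeps the race with running jobs small, though a job that runs longer than the threshold could still lose its files.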

chunyang-wen commented 6 years ago

Normally, those files are cleaned up after a successful run. If they are not, something has gone wrong.