aodn / data-services

Scripts which are used to process incoming data in the data ingestion pipeline
GNU General Public License v3.0

Need regular clean-up of `/mnt/ebs/tmp` on 10-aws-syd #395

Open mhidas opened 8 years ago

mhidas commented 8 years ago

This directory is used for temporary storage during processing (see https://github.com/aodn/chef-private/pull/1776), and it looks like files are not always cleaned up. There is currently 29 GB of data in there. For the moment this is not a problem, as there is still 156 GB free on /mnt/ebs, but because this filesystem is also used for incoming, error, logging and various other things, processing could start failing if it fills up.

So, it would be good to periodically clean out the oldest files in /mnt/ebs/tmp. Could easily add a cron job here, but maybe this should be set up in chef?

@jonescc @julian1 @danfruehauf any thoughts?

julian1 commented 8 years ago

Currently 93 GB:

$ du -hs /mnt/ebs/tmp/
93G     /mnt/ebs/tmp/
$ find /mnt/ebs/tmp/ -type f | wc -l
7614

Not really sure why a tmp directory is needed.

Ideally all file handling steps (reception, validity checking, talend harvest) should complete as atomic actions.

Easy to implement using stack unwinding with exception handlers doing cleanup, or else using a queue of command objects with do() and restore() methods.
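As a rough illustration of the second idea in shell terms (a sketch only, not code from data-services): each step records the command that reverses it, and on failure the recorded commands run in reverse order. All names here (`do_step`, `rollback`, `handle_file`) are hypothetical.

```shell
#!/bin/bash
# Sketch only: every function and variable name here is hypothetical.
undo_stack=()

# do_step UNDO_CMD CMD [ARGS...]
# Run CMD; if it succeeds, remember the command string that undoes it.
do_step() {
    local undo="$1"; shift
    "$@" || return 1
    undo_stack+=("$undo")
}

# Run the recorded undo commands, most recent first, then clear the stack.
rollback() {
    local i
    for (( i = ${#undo_stack[@]} - 1; i >= 0; i-- )); do
        eval "${undo_stack[i]}"
    done
    undo_stack=()
}

# Example: stage a file in a work dir, then check it.
# If the check fails, the staged copy is removed again.
handle_file() {
    local src="$1" work_dir="$2" check_cmd="$3"
    local staged="$work_dir/$(basename -- "$src")"
    undo_stack=()
    do_step "rm -f -- $(printf '%q' "$staged")" cp -- "$src" "$staged" &&
        do_step ":" "$check_cmd" "$staged" ||
        { rollback; return 1; }
    undo_stack=()   # success: nothing left to undo
}
```

Here `do_step` plays the role of do() and the recorded string the role of restore(); a real handler would register one undo command per temp file or directory it creates.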

lbesnard commented 8 years ago

Lots of logs from harvesters, and lots of glider data. See the fix here: https://github.com/aodn/data-services/pull/417

I will also clean up the tmp dirs containing ANFOG data:

find . -type d | grep -E glider$ | du -sh
107 GB

I think cleaning 107GB will make everyone happy

lbesnard commented 8 years ago

There are also many soop_ba files in the tmp folder, and temporary subfolders called Raw:

find . -type d | grep -E Raw$ | du -sh
63 GB

and

./tmp.AFEnxbUCXX/0256_Cairns20151130
./tmp.iyPybsNrmm/0016_AIMS20151127
./tmp.RBoYOpOrGJ/0015_AIMS20151021
./tmp.RnuUWkAiYq/0016_AIMS20151127
./tmp.FRlyQ7ukvJ/0017_CharlotteBay20151124
./tmp.5R5ZJxhNA5/0255_Yamba20151110

lbesnard commented 8 years ago

@mhidas There are also some IMOS_ANMN-NRS and QLD files. No big deal, but it might be nice to add some clean-up code to your pipeline functions.

mhidas commented 8 years ago

@lbesnard I do have code to clean up temp files at the end of the incoming handler. The problem is when there is an error, the whole thing exits and the clean-up code doesn't get executed. I could add a bunch of rm $tmp_file etc... statements before every file_error, but that's a lot of extra lines to do the same thing.

It would make more sense for the clean-up to be more generic. I did suggest one solution (#334) a while ago, but that was vetoed. A better solution would probably be for each incoming_handler process to be given its own temp directory, which is removed with all its contents after execution, no matter what. This could be done in https://github.com/aodn/chef/blob/master/cookbooks/imos_po/templates/default/watch-exec-wrapper.sh.erb
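A minimal sketch of that per-process temp directory idea, assuming bash and GNU mktemp. The function name `run_with_private_tmp` and the `handler.` prefix are made up for illustration; this is not what watch-exec-wrapper.sh.erb currently does.

```shell
#!/bin/bash
# Hypothetical wrapper: run a handler command with its own temp directory,
# which is removed when the run ends, whether the handler succeeded or failed.
run_with_private_tmp() {
    local tmp_base="$1"; shift          # e.g. /mnt/ebs/tmp
    (
        # Subshell, so the EXIT trap fires as soon as this one handler run ends.
        dir=$(mktemp -d -- "$tmp_base/handler.XXXXXXXXXX") || exit 1
        trap 'rm -rf -- "$dir"' EXIT
        export TMPDIR="$dir"            # handlers calling mktemp pick this up
        "$@"
        status=$?
        exit "$status"                  # pass the handler's exit code through
    )
}
```

With this in place a handler can exit at any file_error without its own rm statements, as long as it only writes temp files under `$TMPDIR`. The trap does not run if the process is killed with SIGKILL, so it reduces orphaned files rather than ruling them out.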

mhidas commented 8 years ago

Or the other alternative is to have a separate process that runs every day and removes everything from /mnt/ebs/tmp that is more than X days old, which is what I was suggesting with this issue. (Granted, that wouldn't guarantee that the temp dir won't fill up, but if X is small enough it should be very unlikely).
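A minimal sketch of such an age-based clean-up, assuming GNU find. The function name `clean_old_tmp` and the 7-day default are placeholders, not anything that exists in the pipeline.

```shell
#!/bin/bash
# Hypothetical helper: delete files under a temp dir that have not been
# modified for more than N days, plus old directories that are empty.
clean_old_tmp() {
    local tmp_root="$1"              # e.g. /mnt/ebs/tmp
    local max_age_days="${2:-7}"     # the "X" above; 7 is an arbitrary default

    [ -d "$tmp_root" ] || { echo "no such directory: $tmp_root" >&2; return 1; }

    # -mtime +N matches files whose age in whole days is greater than N.
    find "$tmp_root" -mindepth 1 -type f -mtime +"$max_age_days" -delete

    # Only remove empty directories that are themselves old, so a directory
    # just created by a running handler is left alone. A directory emptied by
    # the step above gets a fresh mtime and is picked up by a later run.
    find "$tmp_root" -mindepth 1 -type d -empty -mtime +"$max_age_days" -delete
}
```

This could be called once a day from cron or set up through chef, for example as `clean_old_tmp /mnt/ebs/tmp 7`.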

mhidas commented 8 years ago

Not sure why this was closed. We still don't have a proper solution for cleaning up these files.

julian1 commented 8 years ago

If pipeline files are being orphaned in any of the /tmp dirs then we should find out why - and whether that's a hint of other more serious issues.

After the last round of fixes to data-services - all tmp files were being cleaned up correctly. I spent quite a bit of time verifying this.

I would like to know if this is a new issue, or whether it's a pre-existing/intermittent issue that for some reason wasn't seen or triggered before.

Currently, I feel it's a problem that we cannot actually trace the progress of a file through the "pipeline" or generate any kind of audit log of what happened to it.

mhidas commented 8 years ago

@julian1 I mentioned above one basic reason those files are being left in the temp directory, and that's going to keep happening quite a lot.

This is not a new issue, it is the reason this issue was created in the first place. My initial suggestion, to have a separate job doing a regular clean-up is just the simplest thing I could think of at the time. I'd be happy for us to come up with a better solution.

This is really part of https://github.com/aodn/backlog/issues/326 , which we need to re-open.

julian1 commented 8 years ago

> I mentioned above one basic reason those files are being left in the temp directory, and that's going to keep happening quite a lot.

Being able to specify resource management on an error condition and making it an invariant enforced by the type-system is something that was solved in modern programming languages about 20 years ago (https://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization).

I agree we should re-open the original issue.

mhidas commented 8 years ago

aodn/backlog#371

ggalibert commented 7 years ago

Recently (2017-03-09 16:07:41) faced a similar problem in ACORN with IMOS_ACORN_RV_20160413T014000Z_TAN_FV01_radial.nc .

IMOS_ACORN_RV_20160413T014000Z_TAN_FV01_radial.nc.20170309-160741.log file says:

Going to process a total of '1' files
Processing slice with '1' files
No space left on device - /tmp/d20170309-21516-13ae91v

Might need to have a look at what's going on with /tmp in the incoming handler code.

mhidas commented 7 years ago

@ggalibert I think that's a Talend issue. As far as I know all incoming handlers now use /mnt/ebs/tmp