aodn / content

Tracks AODN Portal content and configuration issues

Temporary rsync files lying in filesystem #53

Closed danfruehauf closed 8 years ago

danfruehauf commented 9 years ago

I've sent an email about it, but I'll open an issue as well.

How to reproduce:

What to expect:

What happens:

xhoenner commented 9 years ago

@danfruehauf, not sure it's relevant to this particular issue, but the SOOP_TMV_NRT data files (.log) get into the public directory via an lftp command run from the SOOP_TMV_NRT harvester (i.e. I don't copy those manually from my machine onto public).

danfruehauf commented 9 years ago

Perhaps it deserves a different ticket. This ticket is more about a large number of files lying around in production rather than a logic error specific to a data collection. What do you think?

ggalibert commented 9 years ago

rsync processes are triggered by cron jobs and fail from time to time (NSP?), but should be resumed/reprocessed when the next cron job runs.

Those files are not visible to the public since they are hidden dotfiles, but I understand they take up disk space unnecessarily.

Can you recommend a best practice so that these files are not left behind and growing forever?

danfruehauf commented 9 years ago

Those files are potentially not visible to the public - correct.

Best practice would be transactional staging into production. But since we don't have any of that, I suggest periodic cleanups. It's close to 30,000 files - that's a lot.
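One possible shape for such a periodic cleanup (a hypothetical crontab sketch, with an assumed path and schedule; note the caveat later in this thread that scanning the whole filesystem is expensive):

    # Hypothetical crontab entry: every night at 03:00, remove rsync
    # temporaries (hidden files with a random 6-character suffix) that
    # are older than one day. This scans the whole tree, which is costly.
    0 3 * * * find /mnt/opendap/1/IMOS/opendap -type f -name '.*.??????' -mtime +1 -delete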

ggalibert commented 9 years ago

Could you please take care of this periodic cleaning in the meantime, so that we are sure it is applied consistently across facilities?

It would be interesting to see how much space you can free from this.

danfruehauf commented 9 years ago

I'm not responsible for cleaning up data in the production filesystems; that's the project officers' domain.

ggalibert commented 9 years ago

We're not programmers, so it would be good to hear suggestions on best practices to deal with that anyway...

danfruehauf commented 9 years ago

The best practice I've already presented is transactional staging of files into production. Until then, project officers are responsible for that.

I'm happy to help develop those periodic cleanup tools, but I'm not planning to maintain them.

lbesnard commented 9 years ago

FAIMMS and NRS .* rsync files are deleted.

ggalibert commented 9 years ago

Is it possible to know how you did it? That way Dan can approve your solution and, if it's good enough, it can be generalised to other facilities. These files will keep appearing everywhere rsync is used, so it would be good if a generic cron-job solution were applied to every facility...

danfruehauf commented 9 years ago

@ggalibert Generally speaking, I'd recommend using mv instead of rsync when performing those operations, as mv (a rename within the same filesystem) is atomic.

I can provide you with a list of files if you'd like. I think that scanning the whole filesystem to find them is a wasteful and expensive operation - I'd try to avoid that.
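A minimal sketch of that pattern, assuming hypothetical paths and a staging location on the same filesystem as production (mv is only atomic within a single filesystem):

    # Copy into a hidden staging file on the production filesystem first,
    # then rename into place - the rename is atomic, so readers never see
    # a partially written file.
    cp "$WIP/IMOS_file.nc" "$OPENDAP/.IMOS_file.nc.staging"
    mv "$OPENDAP/.IMOS_file.nc.staging" "$OPENDAP/IMOS_file.nc"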

lbesnard commented 9 years ago

@ggalibert, my workflow: in config.txt I have dataOpendapRsync.path, where all the data is first copied. It stays in the WIP directory; this is the main source of data:

## [opendap folder location]
destinationProductionData.path                  = /mnt/opendap/1/IMOS/opendap/FAIMMS
dataOpendapRsync.path                           = /mnt/imos-t4/project_officers/wip/FAIMMS/faimms_data_rss_download_temporary/data_opendap_folder_rsync

And then, in my bash script calling the main MATLAB function, I look up the values from config.txt:

    # name and value are parallel arrays holding the keys and values
    # parsed from config.txt
    for (( jj = 0 ; jj < ${#value[@]} ; jj++ )); do

        if [[ "${name[jj]}" =~ "dataOpendapRsync.path" ]]; then
            rsyncSourcePath=${value[jj]}
        fi

        if [[ "${name[jj]}" =~ "destinationProductionData.path" ]]; then
            rsyncDestinationPath=${value[jj]}
        fi

    done
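For context, a hypothetical sketch (not the actual FAIMMS.sh code) of how the name and value arrays could be populated from config.txt:

    # Hypothetical sketch: split each "key = value" line of config.txt
    # into the parallel name/value arrays used above, skipping comments.
    ii=0
    while IFS='=' read -r key val; do
        name[ii]=$(echo "$key" | xargs)     # xargs trims surrounding whitespace
        value[ii]=$(echo "$val" | xargs)
        ii=$((ii + 1))
    done < <(grep '=' config.txt | grep -v '^#')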

And then I rsync between WIP and OPENDAP, cleaning the destination directory with the --delete-before option. This way the destination directory is always clean:

    # rsync between rsyncSourcePath and rsyncDestinationPath
    rsync --size-only --itemize-changes --delete-before --stats -uhvrD --progress ${rsyncSourcePath}/opendap/ ${rsyncDestinationPath}/

See https://github.com/aodn/data-services/blob/master/FAIMMS/FAIMMS_data_rss_channels_process_matlab/FAIMMS.sh if interested.

danfruehauf commented 9 years ago

@lbesnard :+1: on the --delete-before

lbesnard commented 9 years ago

> @lbesnard :+1: on the --delete-before

Except that it can be really dangerous, and I really recommend doing a --dry-run first. This is also why we need to find a way to include environment variables, for scalability and to reduce potential disasters in the future.
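A sketch of that precaution, reusing the (assumed) rsyncSourcePath/rsyncDestinationPath variables from the script above:

    # Preview exactly which files a deleting rsync would remove, without
    # touching anything, before running the real transfer.
    rsync --dry-run --itemize-changes --delete-before -uhvrD ${rsyncSourcePath}/opendap/ ${rsyncDestinationPath}/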

ggalibert commented 9 years ago

@lbesnard why can --delete-before be really dangerous?

If I understood correctly: --delete-before deletes files in the destination directory before copying same-name files from the source directory.

danfruehauf commented 9 years ago

@ggalibert --delete-before builds the list of deletions and applies it before the transfer takes place: anything in the destination that is missing from the source gets deleted first. This is why I think @lbesnard believes it is dangerous. And it is.

ggalibert commented 9 years ago

ACORN is not a problem anymore; I've updated my calls to rsync so that the --delete-before option is added.

@danfruehauf I identified 3992 zero-byte files on OPENDAP, which I deleted.

danfruehauf commented 9 years ago

@ggalibert Thanks!

ggalibert commented 9 years ago

OK, for anyone following this discussion: --delete-before is dangerous if you perform your rsync with --remove-source-files on a directory:

rsync -vaR --remove-source-files $DATA/ACORN/WERA/radial_nonQC/output/datafabric/gridded_1havg_currentmap_nonQC/./ $OPENDAP/ACORN/gridded_1h-avg-current-map_non-QC/

The next call to this rsync command will see empty directories as the source, so the same-name directories in the target will be deleted...

If instead rsync is given a list of files as input:

cat /tmp/move_FV00_vector.checkedList | rsync -va --remove-source-files --files-from=- $STAGING/ACORN/acorn-migration-hierarchy/vector/ $OPENDAP/ACORN/vector/

then --delete-before should be safe.
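One way to guard against the empty-source case described above (a hypothetical sketch with assumed SRC/DST variables, not code from this repository):

    # Refuse to run a deleting rsync when the source directory is empty,
    # which would otherwise wipe the matching content in the destination.
    if [ -z "$(ls -A "$SRC")" ]; then
        echo "Source directory $SRC is empty, aborting rsync" >&2
        exit 1
    fi
    rsync -va --delete-before "$SRC/" "$DST/"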

danfruehauf commented 9 years ago

I don't think this can be closed.

danfruehauf commented 9 years ago

I'd like to shed some more light on this. I think the root of the problem is the crappy network we have on NSP, which results in bad NFS performance. Since migrating the ACORN processing to the event-driven infrastructure, I've already seen, more than once, files being blocked from reaching the production filesystem because of null-permission files such as:

$ ls -l /mnt/opendap/1/IMOS/opendap/ACORN/radial/GUI/2015/07/02/IMOS_ACORN_RV_20150702T085000Z_GUI_FV00_radial.nc
---------- 1 nobody users 0 Jul 10  1993 /mnt/opendap/1/IMOS/opendap/ACORN/radial/GUI/2015/07/02/IMOS_ACORN_RV_20150702T085000Z_GUI_FV00_radial.nc

Files that are blocked are put in the error directory, and it is pretty easy to move them back to the incoming directory after cleaning up the bogus file with the 000 permissions.

Before that, ACORN used rsync to move the files. rsync writes to temporary files and then renames them into place. This caused all sorts of stray files to remain on the production filesystem, such as:

./NNB/2015/06/06/.IMOS_ACORN_RV_20150606T143500Z_NNB_FV00_radial.nc.wsa4Ff
./NNB/2015/06/06/.IMOS_ACORN_RV_20150606T121500Z_NNB_FV00_radial.nc.uSvQAc
./NNB/2015/06/06/.IMOS_ACORN_RV_20150606T143500Z_NNB_FV00_radial.nc.Xtc4Ff
./NNB/2015/06/30/.IMOS_ACORN_RV_20150630T094500Z_NNB_FV00_radial.nc.rTYMO0
./NNB/2015/06/16/.IMOS_ACORN_RV_20150616T090500Z_NNB_FV00_radial.nc.qx0cp0
./CSP/2015/05/15/.IMOS_ACORN_RV_20150515T090500Z_CSP_FV00_radial.nc.iOcJkd
./CSP/2015/05/28/.IMOS_ACORN_RV_20150528T074500Z_CSP_FV00_radial.nc.rnmvDF
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T191500Z_CSP_FV00_radial.nc.Fqanqz
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T014500Z_CSP_FV00_radial.nc.p72Srn
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T152500Z_CSP_FV00_radial.nc.IBmq0M
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T230500Z_CSP_FV00_radial.nc.lFEfpY
./CSP/2015/05/30/.IMOS_ACORN_RV_20150530T121500Z_CSP_FV00_radial.nc.PKHHZ6

272 in total in the radial directory. All are zero-sized with null permissions:

---------- 1 9008 users 0 May  2  1974 ./LEI/2015/06/04/.IMOS_ACORN_RV_20150604T072500Z_LEI_FV00_radial.nc.NHJjdc
---------- 1 9008 users 0 Jul  1  1980 ./LEI/2015/06/12/.IMOS_ACORN_RV_20150612T221500Z_LEI_FV00_radial.nc.nTRuD4
---------- 1 9008 users 0 Jul 24  1975 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T023500Z_LEI_FV00_radial.nc.kUin9L
---------- 1 9008 users 0 Mar 31  1976 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T213500Z_LEI_FV00_radial.nc.kmvF6K
---------- 1 9008 users 0 Nov 26  1975 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T140500Z_LEI_FV00_radial.nc.IT2IpJ
---------- 1 9008 users 0 Mar 31  1976 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T193500Z_LEI_FV00_radial.nc.qTQTwd
---------- 1 9008 users 0 Mar 31  1976 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T220500Z_LEI_FV00_radial.nc.AVerIe

It seems like only half of the move transaction completes: the file is created but its permissions are never set. I suspect a timeout on either side of the NFS volume. We can probably mitigate this in code, but ideally I'd like to mitigate it by not relying on crappy NFS volumes for moving files around.
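A cleanup query covering the two signatures described above could look like this (assumed root path, hypothetical; adjust as needed):

    # List rsync temporaries (hidden files with a random 6-character
    # suffix) and zero-byte files with no permission bits set.
    find /mnt/opendap/1/IMOS/opendap/ACORN -type f \
        \( -name '.*.??????' -o \( -size 0 -perm 000 \) \) -ls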

danfruehauf commented 9 years ago

Cleaned up ACORN completely now. There shouldn't be any more in radial or vector, as they use the event-driven processing, which will not allow such files to appear.

ggalibert commented 9 years ago

My hourly averaged current-map product is created every hour and rsync'ed from WIP to OPENDAP. Hopefully there are no zero-byte files left by this process in gridded_1h-avg-current-map_non-QC and gridded_1h-avg-current-map_QC.

danfruehauf commented 8 years ago

I'm closing this because the solution is already being implemented, and those stray files will not make it to S3.