Closed danfruehauf closed 8 years ago
@danfruehauf, not sure it's of interest for that particular issue, but the SOOP_TMV_NRT data files (.log) get into the public directory via an lftp command running from the SOOP_TMV_NRT harvester (i.e. I don't copy those manually from my machine onto public).
Perhaps it deserves a different ticket. This ticket is more about a large number of files lying around in production than about a logic error specific to one data collection. What do you think?
RSYNC processes are triggered by cron jobs and fail from time to time (NSP?), but should be resumed/reprocessed when the next cron job runs.
Those files are not visible to the public since they are dotfiles, but I understand they take up disk space unnecessarily.
Do you recommend a best practice so that these files are not left behind and growing forever?
Those files are potentially not visible by the public - correct.
Best practice would be transactional staging into production. But since we don't have any of that, I suggest periodic cleanups. It's close to 30K files - it's a lot.
Could you please take care of this periodic cleaning in the meantime, so that we are sure this is applied consistently across facilities? It would be interesting to see how much space you can free from this.
I'm not responsible for cleaning up data in the production filesystems; that's the project officers' domain.
We're not programmers, so it would be good to hear suggestions on best practices to deal with that anyway...
The best practice I've already presented is transactional staging of files into production. Until then - project officers are responsible for that.
I'm happy to assist in developing those periodic tools, but I'm not planning to maintain them.
FAIMMS and NRS .* rsync files are deleted.
Is it possible to know how you did it? That way Dan can review your solution and, if it's good enough for him, it can be generalised to other facilities. These files will keep appearing everywhere rsync is used, so it would be good to apply a generic cron-job solution to every facility...
@ggalibert Generally speaking, I'd recommend using mv instead of rsync when performing those operations, as it is atomic (within the same filesystem).
I can provide you with a list of files if you'd like. I think that scanning the whole filesystem to find them is a wasteful and expensive operation - I'd try to avoid that.
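As a rough illustration of what the transactional staging suggested above could look like, here is a minimal sketch (the `/tmp` paths and `.incoming` directory name are made up for illustration, not the actual production layout): copy into a hidden staging area on the destination filesystem first, then publish with a single atomic mv.

```shell
# Hypothetical sketch of transactional staging: the file only appears in
# the production directory once it is complete, because mv is an atomic
# rename when source and destination are on the same filesystem.
src=/tmp/staging_demo_src
dst=/tmp/staging_demo_dst
mkdir -p "$src" "$dst/.incoming"
echo "data" > "$src/file.nc"

# Step 1: copy into a staging area on the SAME filesystem as the destination
cp "$src/file.nc" "$dst/.incoming/file.nc"

# Step 2: atomic rename into production; readers never see a partial file
mv "$dst/.incoming/file.nc" "$dst/file.nc"
```

If the copy in step 1 fails, the partial file stays hidden in the staging area instead of polluting the production tree.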
@ggalibert, my workflow: in config.txt I have dataOpendapRsync.path, where all the data is first copied. It stays in the wip directory; this is the main source of data:
## [opendap folder location]
destinationProductionData.path = /mnt/opendap/1/IMOS/opendap/FAIMMS
dataOpendapRsync.path = /mnt/imos-t4/project_officers/wip/FAIMMS/faimms_data_rss_download_temporary/data_opendap_folder_rsync
and then in my bash script calling the main matlab function, I look up the values from config.txt:
# iterate over the name/value arrays read from config.txt
for (( jj = 0 ; jj < ${#value[@]} ; jj++ )) ; do
    if [[ "${name[jj]}" =~ "dataOpendapRsync.path" ]] ; then
        rsyncSourcePath=${value[jj]}
    fi
    if [[ "${name[jj]}" =~ "destinationProductionData.path" ]] ; then
        rsyncDestinationPath=${value[jj]}
    fi
done
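The loop above assumes the name and value arrays have already been populated from config.txt. A minimal sketch of how that might be done, assuming lines of the form "key = value" with ## comment lines (the config_demo.txt filename is made up for illustration):

```shell
# Hypothetical sketch of populating the name/value arrays from a config
# file of "key = value" lines, skipping comment (#) and blank lines.
cat > /tmp/config_demo.txt <<'EOF'
## [opendap folder location]
destinationProductionData.path = /mnt/opendap/1/IMOS/opendap/FAIMMS
dataOpendapRsync.path = /mnt/imos-t4/project_officers/wip/FAIMMS/faimms_data_rss_download_temporary/data_opendap_folder_rsync
EOF

name=() ; value=()
while IFS='=' read -r k v ; do
    case "$k" in "#"*|"") continue ;; esac   # skip comments and blank lines
    name+=( "$(echo "$k" | xargs)" )         # xargs trims surrounding whitespace
    value+=( "$(echo "$v" | xargs)" )
done < /tmp/config_demo.txt
```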
And then I rsync between wip and opendap, cleaning the destination directory with the --delete-before option. This way the destination directory is always clean:
#rsync between rsyncSourcePath and rsyncDestinationPath
rsync --size-only --itemize-changes --delete-before --stats -uhvrD --progress ${rsyncSourcePath}/opendap/ ${rsyncDestinationPath}/ ;
see https://github.com/aodn/data-services/blob/master/FAIMMS/FAIMMS_data_rss_channels_process_matlab/FAIMMS.sh if interested
@lbesnard :+1: on the --delete-before
@lbesnard on the --delete-before: except that it can be really dangerous, and I really recommend doing a --dry-run first. And this is why we need to find a way to include env variables, for scalability and to reduce potential disasters in the future.
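To illustrate the --dry-run suggestion above, here is a minimal sketch with throwaway /tmp paths (made up for illustration): the dry run reports what a destructive sync would delete without touching anything on disk.

```shell
# Hypothetical sketch: preview a destructive sync before running it for real.
mkdir -p /tmp/dryrun_src /tmp/dryrun_dst
echo new > /tmp/dryrun_src/keep.nc
echo old > /tmp/dryrun_dst/stale.nc

# --dry-run lists "deleting stale.nc" but changes nothing on disk;
# drop the flag only once the output looks right
rsync -av --dry-run --delete-before /tmp/dryrun_src/ /tmp/dryrun_dst/
```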
@lbesnard why can --delete-before be really dangerous?
If I understand correctly: --delete-before deletes files in the destination directory before copying same-name files from the source directory.
@ggalibert --delete-before performs all its deletions on the destination before the transfer takes place. This is why I think @lbesnard believes it is dangerous. And it is.
ACORN is not a problem anymore, I've updated my calls to rsync so that option --delete-before is added.
@danfruehauf I identified 3992 zero-byte files on OPENDAP, which I deleted.
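For reference, stray zero-byte dotfiles like these could be located with find. A minimal sketch, using a throwaway /tmp directory (made up for illustration) rather than the real OPENDAP tree; in practice you would review the list before adding -delete:

```shell
# Hypothetical sketch of removing stray zero-byte rsync temp files
# (hidden names like .IMOS_..._radial.nc.wsa4Ff).
mkdir -p /tmp/cleanup_demo
: > /tmp/cleanup_demo/.IMOS_demo.nc.AbC123   # stray zero-byte temp file
echo data > /tmp/cleanup_demo/IMOS_demo.nc   # real file, must survive

# only hidden, zero-byte regular files are matched and deleted
find /tmp/cleanup_demo -type f -name '.*' -size 0 -delete
```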
@ggalibert Thanks!
Ok, to all who follow this discussion: --delete-before is dangerous if you perform your rsync with --remove-source-files on a directory:
rsync -vaR --remove-source-files $DATA/ACORN/WERA/radial_nonQC/output/datafabric/gridded_1havg_currentmap_nonQC/./ $OPENDAP/ACORN/gridded_1h-avg-current-map_non-QC/
The next call to this rsync command will see empty directories as the source, so the same-name directories in the target will be deleted...
If rsync is performed on a list of files as input :
cat /tmp/move_FV00_vector.checkedList | rsync -va --remove-source-files --files-from=- $STAGING/ACORN/acorn-migration-hierarchy/vector/ $OPENDAP/ACORN/vector/
then --delete-before should be safe.
I don't think this can be closed.
I'd like to shed some more light on that. I think the root of the problem is the crappy network that we have on NSP, which results in bad NFS performance. After migrating the ACORN processing to use the event driven infrastructure, more than once already I've seen files being denied from reaching the production filesystem because of null permission files such as:
$ ls -l /mnt/opendap/1/IMOS/opendap/ACORN/radial/GUI/2015/07/02/IMOS_ACORN_RV_20150702T085000Z_GUI_FV00_radial.nc
---------- 1 nobody users 0 Jul 10 1993 /mnt/opendap/1/IMOS/opendap/ACORN/radial/GUI/2015/07/02/IMOS_ACORN_RV_20150702T085000Z_GUI_FV00_radial.nc
Files that are denied will be put in the error directory and it is pretty easy to move them back to the incoming directory after cleaning up the bogus file with the 000 permissions.
Before that, ACORN used rsync to move the files. rsync uses temporary files and then renames them. This caused all sorts of random files to stay on the production filesystem, such as:
./NNB/2015/06/06/.IMOS_ACORN_RV_20150606T143500Z_NNB_FV00_radial.nc.wsa4Ff
./NNB/2015/06/06/.IMOS_ACORN_RV_20150606T121500Z_NNB_FV00_radial.nc.uSvQAc
./NNB/2015/06/06/.IMOS_ACORN_RV_20150606T143500Z_NNB_FV00_radial.nc.Xtc4Ff
./NNB/2015/06/30/.IMOS_ACORN_RV_20150630T094500Z_NNB_FV00_radial.nc.rTYMO0
./NNB/2015/06/16/.IMOS_ACORN_RV_20150616T090500Z_NNB_FV00_radial.nc.qx0cp0
./CSP/2015/05/15/.IMOS_ACORN_RV_20150515T090500Z_CSP_FV00_radial.nc.iOcJkd
./CSP/2015/05/28/.IMOS_ACORN_RV_20150528T074500Z_CSP_FV00_radial.nc.rnmvDF
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T191500Z_CSP_FV00_radial.nc.Fqanqz
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T014500Z_CSP_FV00_radial.nc.p72Srn
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T152500Z_CSP_FV00_radial.nc.IBmq0M
./CSP/2015/05/31/.IMOS_ACORN_RV_20150531T230500Z_CSP_FV00_radial.nc.lFEfpY
./CSP/2015/05/30/.IMOS_ACORN_RV_20150530T121500Z_CSP_FV00_radial.nc.PKHHZ6
272 in total in the radial directory. All are zero-sized with null permissions:
---------- 1 9008 users 0 May 2 1974 ./LEI/2015/06/04/.IMOS_ACORN_RV_20150604T072500Z_LEI_FV00_radial.nc.NHJjdc
---------- 1 9008 users 0 Jul 1 1980 ./LEI/2015/06/12/.IMOS_ACORN_RV_20150612T221500Z_LEI_FV00_radial.nc.nTRuD4
---------- 1 9008 users 0 Jul 24 1975 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T023500Z_LEI_FV00_radial.nc.kUin9L
---------- 1 9008 users 0 Mar 31 1976 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T213500Z_LEI_FV00_radial.nc.kmvF6K
---------- 1 9008 users 0 Nov 26 1975 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T140500Z_LEI_FV00_radial.nc.IT2IpJ
---------- 1 9008 users 0 Mar 31 1976 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T193500Z_LEI_FV00_radial.nc.qTQTwd
---------- 1 9008 users 0 Mar 31 1976 ./LEI/2015/06/06/.IMOS_ACORN_RV_20150606T220500Z_LEI_FV00_radial.nc.AVerIe
It seems like only half the move transaction is done (creating the file) without setting permissions. I suspect it's due to a timeout on either side of the NFS volume. We can probably mitigate it in code, but ideally I'd like to mitigate it by not relying on crappy NFS volumes for moving files around.
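The half-written files described above share a distinctive signature: zero bytes and 000 permissions. A minimal sketch of how such files could be detected, using a throwaway /tmp directory (made up for illustration) instead of the production tree:

```shell
# Hypothetical sketch of finding half-written files left by a failed NFS
# move: zero-byte regular files with null (000) permissions.
mkdir -p /tmp/perm_demo
: > /tmp/perm_demo/bogus.nc        # simulate a half-written file...
chmod 000 /tmp/perm_demo/bogus.nc  # ...with null permissions
echo ok > /tmp/perm_demo/good.nc   # a healthy file, must not be matched

# list candidates for cleanup; review the output before deleting anything
find /tmp/perm_demo -type f -perm 000 -size 0
```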
Cleaned up ACORN completely now. Shouldn't have more in radial or vector, as they are using the event driven processing, which will not allow such files to appear.
My hourly averaged product for current is created every hour and rsync'ed from WIP to OPENDAP. Hopefully there are no zero-byte files left by this process in gridded_1h-avg-current-map_non-QC and gridded_1h-avg-current-map_QC.
I'm closing that because the solution is being implemented already and those stray files will not make it to S3.
I've sent an email about it, but I'll open an issue as well.