Closed lbesnard closed 8 years ago
@danfruehauf if you're bored in NZ!
@lbesnard Would you expect that file to be published? I.e. does it pass the checker and does the dest_path get correctly determined?
Do you get the same behaviour if you do this in the PO box?
It's not surprising that the file isn't in $INCOMING_DIR, because it is moved to a temp directory before the incoming_handler is even called (see https://github.com/aodn/chef/blob/master/cookbooks/imos_po/templates/default/watch-exec-wrapper.sh.erb). As you can see, the error in the log refers to the file in /mnt/ebs/tmp/.
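To make that behaviour concrete, here is a simplified sketch of what the wrapper does before the handler ever runs (this is my assumption of the flow, not the actual chef template; function and variable names are made up):

```shell
# Sketch (assumed behaviour): the watcher moves the incoming file into a
# unique temp directory BEFORE the incoming_handler runs, so the file is
# no longer visible in $INCOMING_DIR from that point on.

stage_incoming_file() {
    local incoming_file=$1
    local tmp_dir
    # unique temp dir under $TMPDIR (e.g. /mnt/ebs/tmp/tmp.XXXXXXXXXX)
    tmp_dir=$(mktemp -d "${TMPDIR:-/tmp}/tmp.XXXXXXXXXX")
    mv "$incoming_file" "$tmp_dir/"
    # the handler is then given the temp path; if the handler fails, the
    # file is stranded here rather than in $INCOMING_DIR or $ERROR_DIR
    echo "$tmp_dir/$(basename "$incoming_file")"
}
```

So when the handler errors out, the place to look for the file is the temp directory, not the incoming dir.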
@lbesnard Would you expect that file to be published? I.e. does it pass the checker and does the dest_path get correctly determined?
yeap, and it works fine on nsp14.
It's not surprising that the file isn't in $INCOMING_DIR because it is moved to a temp directory before the incoming_handler is even called (see
yeap, that's the whole point. Where does this file go then?
I think the problem is running it as projectofficer. Maybe there's a file permissions issue that isn't handled well.
I just tried running your first example above and it worked fine:
Apr 7 10:47:03 10-aws-syd SOOP_ASF_MT: Indexing file 'IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc', source file '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 7 10:47:12 10-aws-syd SOOP_ASF_MT: Moving '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 7 10:47:12 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' [1 of 1]
Apr 7 10:47:12 10-aws-syd SOOP_ASF_MT: #015 4096 of 420168 0% in 0s 1842.47 kB/s#015 420168 of 420168 100% in 0s 5.31 MB/s done
Apr 7 10:47:12 10-aws-syd SOOP_ASF_MT: Done. Uploaded 420168 bytes in 1.0 seconds, 410.32 kB/s. Copied 0 files saving 0 bytes transfer.
hum, that's bizarre. How about the crons? Because they all run as projectofficer. And that's the issue: my SOOP python script pushes data to incoming.
@anguss00 , @julian1 ( @jonescc ) can help out if you need support.
Well I think we (I) definitely need support if it works with a normal user but not PO
Also, another similar bug, though maybe not as problematic: if a watch.d script exists which points to a non-existent incoming_handler script, the files landing in $INCOMING_DIR also disappear from the system. Happens on the PO box.
Apr 8 12:40:51 packer-virtualbox-iso-1449624312 AUV: /usr/local/bin/watch-exec-wrapper.sh: line 59: /vagrant/src/data-services/AUV/auv_viewer_processing/incoming_handler.sh: No such file or directory
Apr 8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Could not process file 'File not handled':
Apr 8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Moving 'File not handled' -> '/vagrant/src/error/AUV/.20160408-124051'
Apr 8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Could not move 'File not handled' -> '/vagrant/src/error/AUV/.20160408-124051'
but the file /vagrant/src/error/AUV/.20160408-124051 does not exist
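Looking at those log lines, the literal string 'File not handled' is being used where a filename should be. A hypothetical illustration of how that can happen in bash (the function names here are made up, not the actual wrapper code): if a helper prints its error message to stdout and the caller captures stdout as the "handled file" path, the message replaces the filename in every later step, so the move to $ERROR_DIR can never succeed.

```shell
# Hypothetical illustration (not the actual wrapper code): a failing
# helper whose stdout is captured as if it were a file path.

handle_file() {
    # handler script is missing -> print an error and fail
    echo "File not handled"
    return 1
}

# command substitution captures the error text, not a path
file=$(handle_file /incoming/AUV/some_file.nc) || true

# $file is now the string "File not handled", so any subsequent
# mv "$file" "$ERROR_DIR/..." is doomed to fail
echo "would try to move: '$file'"
```

If something like this is going on, the real temp file is left behind untouched (or already deleted), which would match what we're seeing.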
I won't weigh my words: it's just disastrous, and it happens too easily. Any small issue in an incoming handler would result in the same thing.
This bug should really be treated as a priority; we cannot guarantee we're not losing any data.
Could someone point me at the documentation?
Just saying, but this looks to me like a very important bug. It wouldn't be a bad idea to spend a little bit of time on this just to confirm it is actually real, as we have no idea how long this 'potential bug' has been around, and how many files have already completely disappeared from the FS.
Yes @lbesnard, I'm on it as of this arvo. Replicating and checking out the chef code atm
:+1: @anguss00 !
Let's hustle this bug :-)
Serious question: should all pipelines be disabled while this is being fixed? I'd say yes. @pblain @smancini
In process.7.log.gz, found a string containing:
Apr 6 12:22:44 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.bwrZwTYHpo/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' is not a valid file, aborting.
tried to reprocess the same file as the PO user:
cp $WIP_DIR/SOOP/SOOP_XBT_ASF_SST/data_unsorted/ASF_SST/ship/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc $INCOMING_DIR/SOOP/ASF/MT
Apr 13 16:18:01 10-aws-syd SOOP_ASF_MT: Indexing file 'IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc', source file '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 13 16:18:06 10-aws-syd SOOP_ASF_MT: Moving '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' [1 of 1]
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: #015 4096 of 420168 0% in 0s 1861.33 kB/s#015 420168 of 420168 100% in 0s 2.18 MB/s done
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: Done. Uploaded 420168 bytes in 1.0 seconds, 410.32 kB/s. Copied 0 files saving 0 bytes transfer.
OK, we have a suspicion that the size of /tmp is not sufficient at just over 8GB. If this fills up, we can't add any more files, and it's adios muchachos. I suggest increasing the max size of this directory. Investigating system logs further.
Are you sure /tmp is relevant? Note that the default temp directory ($TMP) is now /mnt/ebs/tmp, so maybe it's the size of that, and regular clearing out of old tmp data (#395), that we need to look at?
Ah yes, sorry @mhidas, I totally agree with you there: we need to look at /mnt/ebs/tmp. Firstly, nagios monitoring on the disk usage of this dir. Secondly, cleanup cronjobs as per #395.
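Something along these lines could work for the cleanup cronjob (just a sketch; the base path and the 7-day retention are my assumptions, not an agreed policy):

```shell
# Hypothetical cleanup job for stale temp directories left behind by the
# pipeline. Base path and retention period are assumptions.
cleanup_old_tmp_dirs() {
    local base=${1:-/mnt/ebs/tmp}
    local days=${2:-7}
    # remove tmp.* directories not modified within the last $days days
    find "$base" -maxdepth 1 -type d -name 'tmp.*' -mtime +"$days" \
        -exec rm -rf {} +
}
```

Run daily from cron it would keep /mnt/ebs/tmp from silently filling up, though we'd still want the nagios check as a backstop.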
List of recent 'not valid' files (some are correctly identified):
$ cd /mnt/ebs/log/data-services/
$ zgrep 'not a valid' $( ls -vr process.log* ) > not_valid.txt
:+1: @julian1 glad to see the problem is actually really serious
@lbesnard stop working!
Is that the real @atkinsn on github?
@julian1 an imposter..
0758315 was only a possible fix
Some files completely disappear from the system after going to $INCOMING_DIR
How to reproduce
On aws10, connect as the PO user, and do a
source ./env
at the base of the data-services dir (would be good to add this automatically via chef btw).
Ok, why not, the file is 'not valid'. Well, the original one is valid. But it looks like there is an issue here https://github.com/aodn/data-services/blob/master/lib/common/util.sh#L165 with the tmp file. The definition of 'valid' is a bash check to see if the file exists or not, so we're really at a low-level check.
The issue is that we should still be able to locate the 'bad' file, but it is neither in $ERROR_DIR nor in $INCOMING_DIR. Anyway, this is EXTREMELY concerning. Of course, Nagios won't warn us about anything because there are no files in $ERROR_DIR. So we have potentially lost many files.
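Whatever the root cause, the failure path should never lose the file. A sketch of what a safer failure path could look like (function and variable names are made up, not the actual util.sh code): try to quarantine the temp copy in $ERROR_DIR, and if even that fails, leave it in place and log the surviving path.

```shell
# Hypothetical sketch of a safer failure path: on error, move the temp
# copy to the error dir; if the move fails, keep the file where it is
# and report its location instead of silently losing it.
quarantine_or_keep() {
    local file=$1 error_dir=$2
    mkdir -p "$error_dir"
    if mv "$file" "$error_dir/" 2>/dev/null; then
        echo "moved '$file' -> '$error_dir/'"
    else
        echo "could not move '$file'; leaving it in place" >&2
        return 1
    fi
}
```

With something like this, Nagios would at least see the file in $ERROR_DIR (or a loud log line with a path that actually exists).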
Both these folders are empty, as is the thredds folder the file should go to.
I wish I was wrong, I've tried and checked many times.
@smancini @jonescc @pblain