aodn / data-services

Scripts which are used to process incoming data in the data ingestion pipeline
GNU General Public License v3.0
1 stars 4 forks source link

File disappears after going to $INCOMING_DIR #410

Closed lbesnard closed 8 years ago

lbesnard commented 8 years ago

Some files completely disappear from the system after going to $INCOMING_DIR

How to reproduce

on aws10, connect as PO user, and do a source ./env at the base of data-services dir (would be good to add this automatically via chef btw)

# copy a SOOP ASF MT file to its incoming dir
cp $WIP_DIR/SOOP/SOOP_XBT_ASF_SST/data_unsorted/ASF_SST/ship/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc $INCOMING_DIR/SOOP/ASF/MT   

input_logf SOOP_ASF_MT
Apr  6 14:44:36 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.oltFjHYb6U/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' is not a valid file, aborting.

### similar behaviour with 
cp $WIP_DIR/SOOP/SOOP_XBT_ASF_SST/data_unsorted/ASF_SST/ship/2016/IMOS_SOOP-SST_T_20160405T000300Z_VRDU8_FV01_C-20160406T001307Z.nc $INCOMING_DIR/SOOP/SST

input_logf SOOP_SST
Apr  6 15:16:27 10-aws-syd SOOP_SST: '/mnt/ebs/tmp/tmp.hr9huOeRry/IMOS_SOOP-SST_T_20160405T000300Z_VRDU8_FV01_C-20160406T001307Z.nc' is not a valid file, aborting.

Ok why not, the file is 'not valid'. Well the original one is. But it looks like there is an issue here https://github.com/aodn/data-services/blob/master/lib/common/util.sh#L165 with the tmp file. The definition of 'valid' is a bash check to see if the file exists or not. So we're really at a low level check.

The issue is that we should still be able to locate the 'bad' file. But it is nor in $ERROR_DIR nor in $INCOMING_DIR. Anyway, this is EXTREMELY concerning. Of course, Nagios won't warn us on anything because there is no files in the ERROR_DIR. So we have potentially lost many files.

Both these folders are empty, as well as the thredds folder where the file should go to.

ls -la $INCOMING_DIR/SOOP/ASF/MT 
ls -la $ERROR_DIR/SOOP_ASF_MT

I wish I was wrong, I've tried and checked many times.

@smancini @jonescc @pblain

lbesnard commented 8 years ago

@danfruehauf if you're bored in NZ !

mhidas commented 8 years ago

@lbesnard Would you expect that file to be published? I.e. does it pass the checker and does the dest_path get correctly determined?

Do you get the same behaviour if you do this in the PO box?

It's not surprising that the file isn't in $INCOMING_DIR because it is moved to a temp directory before the incoming_handler is even called (see https://github.com/aodn/chef/blob/master/cookbooks/imos_po/templates/default/watch-exec-wrapper.sh.erb). As you can see the error in the log is referring to the file in /mnt/ebs/tmp/.

lbesnard commented 8 years ago

@lbesnard Would you expect that file to be published? I.e. does it pass the checker and does the dest_path get correctly determined?

yeap, and it works fine on nsp14.

It's not surprising that the file isn't in $INCOMING_DIR because it is moved to a temp directory before the incoming_handler is even called (see

yeap, that's the all point. Where does this file go then ?

mhidas commented 8 years ago

I think the problem is running it as projectofficer. Maybe there's a file permissions issue that isn't handled well.

I just tried running your first example above and it worked fine:

Apr  7 10:47:03 10-aws-syd SOOP_ASF_MT: Indexing file 'IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc', source file '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: Moving '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'  [1 of 1]
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: #015   4096 of 420168     0% in    0s  1842.47 kB/s#015 420168 of 420168   100% in    0s     5.31 MB/s  done
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: Done. Uploaded 420168 bytes in 1.0 seconds, 410.32 kB/s. Copied 0 files saving 0 bytes transfer.
lbesnard commented 8 years ago

hum, that's bizarre. How about the crons ? because they all run as projectofficer . And that's the issue, my soop python script pushes data to incoming

pblain commented 8 years ago

@anguss00 , @julian1 ( @jonescc ) can help out if you need support.

lbesnard commented 8 years ago

Well I think we (I) definitely need support if it works with a normal user but not PO

lbesnard commented 8 years ago

Also another similar bug, but not as problematic maybe. If a watch.d script exists, which points to a non existing incoming_handler script, the files landing in the INCOMING _DIR also disappears from the system

Happens on the PO box.

Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: /usr/local/bin/watch-exec-wrapper.sh: line 59: /vagrant/src/data-services/AUV/auv_viewer_processing/incoming_handler.sh: No such file or directory
Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Could not process file 'File not handled':
Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Moving 'File not handled' -> '/vagrant/src/error/AUV/.20160408-124051'
Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Could not move 'File not handled' -> '/vagrant/src/error/AUV/.20160408-124051'

but the file /vagrant/src/error/AUV/.20160408-124051 does not exist

lbesnard commented 8 years ago

I won't weight my words, it's just disastrous and happens to easily. Any small issue in an incoming handler would result to the same thing.

This bug should really be seen as a priority, we can not guarantee we're not loosing any data.

julian1 commented 8 years ago

Could someone point me at the documentation?

lbesnard commented 8 years ago

Just saying but this looks to me as a very important bug. It wouldn't be a bad idea to spend a little bit of time on this just to see this is actually real as we have no idea how long this 'potential bug' has been on, and how many files have already completely disappeared from the FS.

anguss00 commented 8 years ago

Yes @lbesnard, I'm on it as of this arvo. Replicating and checking out the chef code atm

lbesnard commented 8 years ago

:+1: @anguss00 !

anguss00 commented 8 years ago

Let's hustle this bug :-)

lbesnard commented 8 years ago

serious question: Should all pipelines be disabled while this is being fixed ? I'd say yes. @pblain @smancini

anguss00 commented 8 years ago

in process.7.log.gz string found containing

Apr  6 12:22:44 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.bwrZwTYHpo/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' is not a valid file, aborting.

tried to reprocess same file as PO user

cp $WIP_DIR/SOOP/SOOP_XBT_ASF_SST/data_unsorted/ASF_SST/ship/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc  $INCOMING_DIR/SOOP/ASF/MT $INCOMING_DIR/SOOP/ASF/MT
Apr 13 16:18:01 10-aws-syd SOOP_ASF_MT: Indexing file 'IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc', source file '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 13 16:18:06 10-aws-syd SOOP_ASF_MT: Moving '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'  [1 of 1]
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: #015   4096 of 420168     0% in    0s  1861.33 kB/s#015 420168 of 420168   100% in    0s     2.18 MB/s  done
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: Done. Uploaded 420168 bytes in 1.0 seconds, 410.32 kB/s. Copied 0 files saving 0 bytes transfer.
anguss00 commented 8 years ago

OK, we have a suspicion that the size of /tmp is not sufficient at just over 8GB. If this fills up, we can't add any more files, and its adios muchachos. Suggest to increase the max size of this directory. Investigating system logs further.

mhidas commented 8 years ago

Are you sure \tmp is relevant? Note that the default temp directory ($TMP) is now \mnt\ebs\tmp, so maybe it's the size of that, and regular clearing out of old tmp data ( #395) that we need to look at?

anguss00 commented 8 years ago

Ah yes, sorry @mhidas. I totally agree with you there, we need to look at /mnt/ebs/tmp. Firstly, nagios monitoring on the du of this dir. Secondly, cleanup cronjobs as #395

julian1 commented 8 years ago

List of recent not valid files (some are correctly identified).

not_valid.txt

$ cd /mnt/ebs/log/data-services/
$ zgrep 'not a valid' $( ls -vr process.log* ) > not_valid.txt
lbesnard commented 8 years ago

:+1: @julian1 glad to see the problem is actually really serious

atkinsn commented 8 years ago

@lbesnard stop working!

julian1 commented 8 years ago

Is that the real @atkinsn on github?

atkinsn commented 8 years ago

@julian1 an imposter..

julian1 commented 8 years ago

https://github.com/aodn/data-services/pull/422

pmbohm commented 8 years ago

0758315 was only a possible fix