Closed: 7yl4r closed this issue 7 years ago
After looking through the most recent files in ftp://thing1.marine.usf.edu/data
and ftp://thing1.marine.usf.edu/gsfcdata,
I think I've narrowed the problem down to somewhere between PDS ingest and level0, which would be l0l1aqua.
Looking at the process history for l0l1aqua (and keeping in mind that the problem started on 08-12),
perhaps the failures on dune alone are to blame?
Dune's l0l1aqua gets killed because of repeated failures due to https://github.com/USF-IMARS/IPOPP-docs/issues/8
No changes in dune:/etc/ since 08-10 (when diskio & ping monitoring were added to telegraf.conf).
I've added a cronjob to dune that restarts l0l1aqua every 5 minutes... Hopefully that will let it get some useful work done through the mess of errors.
In the short term, this should resolve itself once the errored products age out of the 14-day processing window. In the long term, I need to implement some better error handling.
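For reference, a minimal sketch of that cronjob (the station path is an assumption; the real install directory on dune may differ):

```shell
# hypothetical crontab entry on dune: restart the l0l1aqua station every 5 minutes
# (assumes the station lives under /home/ipopp/drl/ncs/stations/l0l1aqua)
*/5 * * * * /home/ipopp/drl/ncs/stations/l0l1aqua/jsw/bin/wrapper.sh restart >/dev/null 2>&1
```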
I really think the temporal alignment with #6 was purely coincidental... since the nearly identical l0l1terra did not experience any issues over this same time period:
This is a good test of whether things are working (today is 2017228):
[root@reef01 ~]# find /srv/imars-data/ipopp/website-images/nrt -name "aqua.2017228.*.A05.sst.tiff"
[root@reef01 ~]#
# compare that with:
[root@reef01 ~]# find /srv/imars-data/ipopp/website-images/nrt -name "terra.2017228.*.A05.sst.tiff"
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.0320.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.1550.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.1410.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.0500.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.1730.A05.sst.tiff
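The two `find` checks above can be rolled into one loop for a quick side-by-side count (path and date hard-coded to match the session above; on a machine without /srv/imars-data this simply reports 0 for both):

```shell
# count today's A05 SST tiffs per satellite; aqua should roughly match terra
# when the pipeline is healthy
for sat in aqua terra; do
  n=$(find /srv/imars-data/ipopp/website-images/nrt \
        -name "$sat.2017228.*.A05.sst.tiff" 2>/dev/null | wc -l)
  echo "$sat: $n"
done
```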
Whatever is going wrong is generating no logs...
[ipopp@dune l0l1aqua]$ ll
total 216
-rw-r--r--. 1 ipopp ipopp 4096 Jun 22 19:00 ancillary_data.db
drwxrwxr-x. 2 ipopp ipopp 4096 Mar 7 20:08 data
drwxrwxr-x. 2 ipopp ipopp 4096 Aug 18 13:15 FAIL.170818-131502
drwxrwxr-x. 2 ipopp ipopp 4096 Aug 18 13:15 FAIL.170818-131517
drwxrwxr-x. 2 ipopp ipopp 4096 Aug 18 13:33 FAIL.170818-133322
drwxrwxr-x. 2 ipopp ipopp 4096 Aug 18 13:35 FAIL.170818-133532
drwxrwxr-x. 2 ipopp ipopp 4096 Aug 18 13:36 FAIL.170818-133607
drwxrwxr-x. 2 ipopp ipopp 4096 Aug 18 13:36 FAIL.170818-133612
drwxrwxr-x. 2 ipopp ipopp 4096 Aug 18 13:36 FAIL.170818-133617
drwxrwxr-x. 7 ipopp ipopp 4096 Aug 18 13:36 jsw
-rwxrwxr-x. 1 ipopp ipopp 11554 Mar 7 20:08 station.cfgfile
-rwxrwxr-x. 1 ipopp ipopp 1594 Mar 7 20:08 station.cfgfile.xml
-rwxrwxr-x. 1 ipopp ipopp 15 Mar 7 20:08 station.cmdfile
drwxrwxr-x. 2 ipopp ipopp 20480 Aug 18 06:01 station.nslsdir
-rw-rw-r--. 1 ipopp ipopp 127764 Aug 18 13:36 station.stationlog
[ipopp@dune l0l1aqua]$ grep -R "" FAIL.170818-13*
[ipopp@dune l0l1aqua]$ ll FAIL.170818-13*
FAIL.170818-131502:
total 0
FAIL.170818-131517:
total 0
FAIL.170818-133322:
total 0
FAIL.170818-133532:
total 0
FAIL.170818-133607:
total 0
FAIL.170818-133612:
total 0
FAIL.170818-133617:
total 0
even from the console...
[ipopp@dune l0l1aqua]$ ./jsw/bin/wrapper.sh status
NCS Station - l0l1aqua is not running.
[ipopp@dune l0l1aqua]$ ./jsw/bin/wrapper.sh console
Running NCS Station - l0l1aqua...
wrapper | --> Wrapper Started as Console
wrapper | Java Service Wrapper Community Edition 64-bit 3.5.24
wrapper | Copyright (C) 1999-2014 Tanuki Software, Ltd. All Rights Reserved.
wrapper | http://wrapper.tanukisoftware.com
wrapper |
wrapper | Launching a JVM...
jvm 1 | WrapperManager: Initializing...
jvm 1 | cfg_ncs_home is ../..
jvm 1 | Initializing algorithm Mod L1A Aqua
jvm 1 | InitAlgorithm: siteName - NISDS-dune-192.168.1.23, algoName - Mod L1A Aqua, gopherColony - Mod-L1A grp1
jvm 1 | Algorithm Mod L1A Aqua initialized
wrapper | <-- Wrapper Stopped
Oh no, wait... the console run got killed early because the FAIL* directories existed; apparently that is how IPOPP decides whether a station should be killed.
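If that's right, one workaround sketch is to sweep the FAIL.* directories aside before restarting. Demonstrated here against a scratch directory so nothing real is touched; for the real thing, STATION would be the actual station directory (its path is an assumption):

```shell
#!/bin/sh
# Demo of the FAIL.* sweep in a scratch directory (stand-in for the real
# station dir, e.g. .../ncs/stations/l0l1aqua -- path assumed).
STATION=$(mktemp -d)
mkdir "$STATION/FAIL.170818-131502" "$STATION/FAIL.170818-131517"  # stand-ins
mkdir -p "$STATION/FAIL-archive"
# the glob FAIL.* requires a dot, so it does not match FAIL-archive itself
mv "$STATION"/FAIL.* "$STATION/FAIL-archive/"
ls "$STATION"   # only FAIL-archive should remain
```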
running the command directly using the test.sh in the gist linked above works like a charm.
[ipopp@dune l0l1aqua]$ ./test.sh
L1A version: 5.0.5 built on Jun 1 2012 (11:51:27)
Scan Number: 0 Fri Aug 18 16:02:51 2017
Scan Number: 10 Fri Aug 18 16:02:51 2017
Scan Number: 20 Fri Aug 18 16:02:51 2017
Scan Number: 30 Fri Aug 18 16:02:51 2017
Scan Number: 40 Fri Aug 18 16:02:51 2017
Scan Number: 50 Fri Aug 18 16:02:51 2017
Scan Number: 60 Fri Aug 18 16:02:51 2017
Scan Number: 70 Fri Aug 18 16:02:52 2017
Scan Number: 80 Fri Aug 18 16:02:52 2017
Scan Number: 90 Fri Aug 18 16:02:52 2017
Scan Number: 100 Fri Aug 18 16:02:52 2017
Scan Number: 110 Fri Aug 18 16:02:52 2017
Scan Number: 120 Fri Aug 18 16:02:52 2017
Scan Number: 130 Fri Aug 18 16:02:52 2017
Scan Number: 140 Fri Aug 18 16:02:52 2017
Scan Number: 150 Fri Aug 18 16:02:52 2017
Scan Number: 160 Fri Aug 18 16:02:52 2017
Scan Number: 170 Fri Aug 18 16:02:52 2017
Scan Number: 180 Fri Aug 18 16:02:52 2017
Scan Number: 190 Fri Aug 18 16:02:53 2017
Scan Number: 200 Fri Aug 18 16:02:53 2017
[ipopp@dune l0l1aqua]$ ll RUN-TEST-1/
total 558008
-rw-rw-r--. 1 ipopp ipopp 571362365 Aug 18 16:02 L1A.hdf
-rw-rw-r--. 1 ipopp ipopp 20719 Aug 18 16:02 L1A.hdf.pcf
-rw-rw-r--. 1 ipopp ipopp 420 Aug 18 16:02 LogReport.L1A.hdf
-rw-rw-r--. 1 ipopp ipopp 1346 Aug 18 16:02 LogStatus.L1A.hdf
-rw-rw-r--. 1 ipopp ipopp 614 Aug 18 16:02 LogUser.L1A.hdf
So I guess it's just IPOPP "throwing a fit" and tossing the data.
No, wait... that one file I tested just happened to work. Here's one that doesn't:
[ipopp@dune l0l1aqua]$ ./test.sh
ERROR: File: /home/ipopp/temp/P1540064AAAAAAAAAAAAAA17210052000001.PDS does not exist.
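A quick pre-flight existence check for the input the station wants (path taken from the error above) saves a full station run when the PDS never landed:

```shell
# check whether the PDS file the station is about to ingest exists on disk
f=/home/ipopp/temp/P1540064AAAAAAAAAAAAAA17210052000001.PDS
if [ -e "$f" ]; then
  echo "present: $f"
else
  echo "MISSING: $f"
fi
```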
Let's try to think through the process here:
1. `product_id`s from thing1 (NCS or IS?)
2. (?)
3. `product_id`s are not in thing1-db

So the solution is going to be in figuring out the details of what is going wrong at step 2 above.
Given how intractable the mess of code is... let's look through the entire database for anything potentially relevant:
use DSM;
show tables;
SET @bad_id = 133025; # the id printed in the "TransferCommand" error
# Algorithms has no potentially relevant data
select * from Ancestors where product=@bad_id; # * NULL
select * from Ancestors where ancestor=@bad_id; # * NULL
select * from Contributors where id=@bad_id; # * NULL
select * from Directories where id=@bad_id; # * NULL
select * from MappedProducts where product=@bad_id; # NONE ???
select * from Markers where product=@bad_id; # * NULL
# Mutex has no potentially relevant data
# NisgsProperties is completely empty?
select * from Passes where id=@bad_id; # * NULL
select * from ProductContributors where product=@bad_id; # * NULL
select * from ProductContributors where contributor=@bad_id; # * NULL
# ProductThumbnails is completely empty?
# ProductTypes has no potentially relevant data
select * from ProductionDataSets where product=@bad_id; # * NULL
select * from Products where id=@bad_id; # * NULL
select * from Products where pass=@bad_id; # * NULL
select * from ResourceSites where resource=@bad_id; # TWO RESULTS !!!!!
# !!!
# '133025', 'IS', '12013', '2017-07-25 02:22:38'
# '133025', 'NISDS-hydra1', '12376', '2017-07-25 02:18:37'
# !!!
select * from Resources where id=@bad_id;
# !!!
# '133025', '128611', 'DATA', 'VODmcrefl_TrueColor.17190164742.seacoos.750m.tiff', NULL, '1'
# !!!
select * from Resources where product=@bad_id; # * NULL
select * from SatTimeAncillaries where id=@bad_id; # * NULL
# StaticAncillaries is completely empty?
# StaticAncillarySites is completely empty?
# Stations is completely empty?
# Thumbnails is completely empty?
select * from TimeAncillaries where id=@bad_id; # * NULL
select * from TimeAncillarySites where aid=@bad_id; # * NULL
select * from TimeAncillarySites where directory=@bad_id; # * NULL
select * from TransferCommands where id=@bad_id;
# !!!
# '133025', 'Products', '138791', 'NISDS-dune-192.168.1.23', '2017-08-14 05:16:08', '0'
# !!!
select * from TransferCommands where tableId=@bad_id; # * NULL
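The three tables that actually matched can be checked in one pass. A sketch using only the table and column names visible in the queries above; whether TransferCommands.id truly keys against Resources.id, rather than coincidentally sharing the value 133025, is an assumption:

```sql
USE DSM;
SET @bad_id = 133025;
-- pull the stranded resource, its site registrations, and its pending
-- transfer command together (joins inferred from the per-table queries above)
SELECT *
FROM Resources r
LEFT JOIN ResourceSites rs ON rs.resource = r.id
LEFT JOIN TransferCommands tc ON tc.id = r.id
WHERE r.id = @bad_id;
```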
Ah. This fixed itself at 6am Sunday morning...
Soooooo... :man_facepalming:
I believe this was ultimately caused by the family of DSM failure issues (#2 #8 #9 #10 #11).
Matt: