USF-IMARS / IPOPP-docs

Documentation related to IMaRS's use of NASA's IPOPP software.

no modis aqua images (l0l1aqua failing) #7

Closed 7yl4r closed 7 years ago

7yl4r commented 7 years ago

Matt:

So the MODIS terra seems to be coming in and web site back up and working, but I don’t see any MODIS aqua?

7yl4r commented 7 years ago

After looking through the most recent files in ftp://thing1.marine.usf.edu/data and ftp://thing1.marine.usf.edu/gsfcdata, I think I've narrowed down the problem area to between PDS ingest and level0, which would be l0l1aqua.

Looking at the process history for l0l1aqua (and keeping in mind that the problem started on 08-12...

[image: process history for l0l1aqua]

Perhaps the failures on dune alone are to blame?

7yl4r commented 7 years ago

Dune's l0l1aqua gets killed because of repeated failures due to https://github.com/USF-IMARS/IPOPP-docs/issues/8

7yl4r commented 7 years ago

No changes in dune:/etc/ since 08-10 (which was diskio & ping monitoring being added to telegraf.conf)

7yl4r commented 7 years ago

I've added a restart-l0l1aqua cron job to dune that runs every 5 minutes. Hopefully that will allow it to get some useful work done in between the errors.

In the short-term, this should resolve once the errored products get pushed out of the 14-day processing window. In the long term, I need to implement some better error handling.
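
A minimal sketch of what such a periodic job could run. It assumes the station's Java Service Wrapper script (`jsw/bin/wrapper.sh`) accepts `status` and `start`, and that `status` prints "is not running" when the station is down, as in the console output later in this thread. The function name `start_if_stopped` is hypothetical, and this is not the actual job installed on dune.

```shell
#!/usr/bin/env bash
# Start a station only when its wrapper script reports it as stopped.
# Usage: start_if_stopped /path/to/station/jsw/bin/wrapper.sh   (path is a placeholder)
start_if_stopped() {
    local wrapper="$1" status_out
    # "status" may exit non-zero when the station is down, so ignore its exit code
    status_out="$("$wrapper" status 2>&1 || true)"
    case "$status_out" in
        *"is not running"*) "$wrapper" start ;;
        *) echo "station already running; nothing to do" ;;
    esac
}
```

Checking the status first avoids interrupting a station that is in the middle of a job, which an unconditional restart every 5 minutes could do.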

7yl4r commented 7 years ago

I really think the temporal alignment with #6 was purely coincidental... since the nearly identical l0l1terra did not experience any issues over this same time period:

[image: process history for l0l1terra]

7yl4r commented 7 years ago

This is a good test of whether things are working (today is 2017228):

[root@reef01 ~]# find /srv/imars-data/ipopp/website-images/nrt -name "aqua.2017228.*.A05.sst.tiff"
[root@reef01 ~]# 

# compare that with:

[root@reef01 ~]# find /srv/imars-data/ipopp/website-images/nrt -name "terra.2017228.*.A05.sst.tiff"
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.0320.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.1550.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.1410.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.0500.A05.sst.tiff
/srv/imars-data/ipopp/website-images/nrt/data/terra/modis/level3/terra.2017228.1730.A05.sst.tiff
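
The same check can be wrapped in a small function so it does not need the date typed in by hand. This is only a sketch: the function name `count_sst_images` is made up here, and it assumes the day stamp in the file names is the UTC year plus day-of-year (`%Y%j`), which matches 2017228 but is not confirmed anywhere in this thread.

```shell
#!/usr/bin/env bash
# Count level-3 SST images for one satellite on one day.
# Usage: count_sst_images <root-dir> <satellite> [YYYYDDD]
# The day defaults to today's UTC date as year + day-of-year (an assumption).
count_sst_images() {
    local root="$1" sat="$2" day="${3:-$(date -u +%Y%j)}"
    # Same name pattern as the manual find commands above
    find "$root" -type f -name "${sat}.${day}.*.A05.sst.tiff" | wc -l | tr -d ' '
}
```

For example, `count_sst_images /srv/imars-data/ipopp/website-images/nrt aqua 2017228` prints the number of matching aqua files for that day, which can be compared against the same call with `terra`.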

7yl4r commented 7 years ago

Whatever is going wrong is generating no logs...

[ipopp@dune l0l1aqua]$ ll
total 216
-rw-r--r--. 1 ipopp ipopp   4096 Jun 22 19:00 ancillary_data.db
drwxrwxr-x. 2 ipopp ipopp   4096 Mar  7 20:08 data
drwxrwxr-x. 2 ipopp ipopp   4096 Aug 18 13:15 FAIL.170818-131502
drwxrwxr-x. 2 ipopp ipopp   4096 Aug 18 13:15 FAIL.170818-131517
drwxrwxr-x. 2 ipopp ipopp   4096 Aug 18 13:33 FAIL.170818-133322
drwxrwxr-x. 2 ipopp ipopp   4096 Aug 18 13:35 FAIL.170818-133532
drwxrwxr-x. 2 ipopp ipopp   4096 Aug 18 13:36 FAIL.170818-133607
drwxrwxr-x. 2 ipopp ipopp   4096 Aug 18 13:36 FAIL.170818-133612
drwxrwxr-x. 2 ipopp ipopp   4096 Aug 18 13:36 FAIL.170818-133617
drwxrwxr-x. 7 ipopp ipopp   4096 Aug 18 13:36 jsw
-rwxrwxr-x. 1 ipopp ipopp  11554 Mar  7 20:08 station.cfgfile
-rwxrwxr-x. 1 ipopp ipopp   1594 Mar  7 20:08 station.cfgfile.xml
-rwxrwxr-x. 1 ipopp ipopp     15 Mar  7 20:08 station.cmdfile
drwxrwxr-x. 2 ipopp ipopp  20480 Aug 18 06:01 station.nslsdir
-rw-rw-r--. 1 ipopp ipopp 127764 Aug 18 13:36 station.stationlog
[ipopp@dune l0l1aqua]$ grep -R "" FAIL.170818-13*
[ipopp@dune l0l1aqua]$ ll FAIL.170818-13*
FAIL.170818-131502:
total 0

FAIL.170818-131517:
total 0

FAIL.170818-133322:
total 0

FAIL.170818-133532:
total 0

FAIL.170818-133607:
total 0

FAIL.170818-133612:
total 0

FAIL.170818-133617:
total 0

even from the console...

[ipopp@dune l0l1aqua]$ ./jsw/bin/wrapper.sh status
NCS Station - l0l1aqua is not running.
[ipopp@dune l0l1aqua]$ ./jsw/bin/wrapper.sh console
Running NCS Station - l0l1aqua...
wrapper  | --> Wrapper Started as Console
wrapper  | Java Service Wrapper Community Edition 64-bit 3.5.24
wrapper  |   Copyright (C) 1999-2014 Tanuki Software, Ltd. All Rights Reserved.
wrapper  |     http://wrapper.tanukisoftware.com
wrapper  | 
wrapper  | Launching a JVM...
jvm 1    | WrapperManager: Initializing...
jvm 1    | cfg_ncs_home is ../..
jvm 1    | Initializing algorithm Mod L1A Aqua
jvm 1    | InitAlgorithm: siteName - NISDS-dune-192.168.1.23, algoName - Mod L1A Aqua, gopherColony - Mod-L1A grp1
jvm 1    | Algorithm Mod L1A Aqua initialized
wrapper  | <-- Wrapper Stopped

7yl4r commented 7 years ago

Oh, wait: the console got killed early because the FAIL* directories existed. Apparently that is how IPOPP decides whether a station should be killed.
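
If that is right, the FAIL* directories need to be moved out of the station directory before a console run will stay up long enough to show anything. A minimal sketch, assuming that moving them aside is enough to stop the station from being killed (not verified here); `archive_fail_dirs` and the archive location are hypothetical names.

```shell
#!/usr/bin/env bash
# Move a station's FAIL.* directories into an archive directory.
# Usage: archive_fail_dirs <station-dir> <archive-dir>
archive_fail_dirs() {
    local station_dir="$1" archive_dir="$2"
    mkdir -p "$archive_dir"
    # Only look at direct children of the station directory
    find "$station_dir" -mindepth 1 -maxdepth 1 -type d -name 'FAIL.*' \
        -exec mv {} "$archive_dir/" \;
}
```

Moving rather than deleting keeps the failure timestamps around in case they are useful later.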

7yl4r commented 7 years ago

console log for 5 failed runs from startup to getting killed

7yl4r commented 7 years ago

Running the command directly using the test.sh in the gist linked above works like a charm.

[ipopp@dune l0l1aqua]$ ./test.sh 
L1A version: 5.0.5  built on Jun  1 2012 (11:51:27)
Scan Number: 0  Fri Aug 18 16:02:51 2017
Scan Number: 10  Fri Aug 18 16:02:51 2017
Scan Number: 20  Fri Aug 18 16:02:51 2017
Scan Number: 30  Fri Aug 18 16:02:51 2017
Scan Number: 40  Fri Aug 18 16:02:51 2017
Scan Number: 50  Fri Aug 18 16:02:51 2017
Scan Number: 60  Fri Aug 18 16:02:51 2017
Scan Number: 70  Fri Aug 18 16:02:52 2017
Scan Number: 80  Fri Aug 18 16:02:52 2017
Scan Number: 90  Fri Aug 18 16:02:52 2017
Scan Number: 100  Fri Aug 18 16:02:52 2017
Scan Number: 110  Fri Aug 18 16:02:52 2017
Scan Number: 120  Fri Aug 18 16:02:52 2017
Scan Number: 130  Fri Aug 18 16:02:52 2017
Scan Number: 140  Fri Aug 18 16:02:52 2017
Scan Number: 150  Fri Aug 18 16:02:52 2017
Scan Number: 160  Fri Aug 18 16:02:52 2017
Scan Number: 170  Fri Aug 18 16:02:52 2017
Scan Number: 180  Fri Aug 18 16:02:52 2017
Scan Number: 190  Fri Aug 18 16:02:53 2017
Scan Number: 200  Fri Aug 18 16:02:53 2017

[ipopp@dune l0l1aqua]$ ll RUN-TEST-1/
total 558008
-rw-rw-r--. 1 ipopp ipopp 571362365 Aug 18 16:02 L1A.hdf
-rw-rw-r--. 1 ipopp ipopp     20719 Aug 18 16:02 L1A.hdf.pcf
-rw-rw-r--. 1 ipopp ipopp       420 Aug 18 16:02 LogReport.L1A.hdf
-rw-rw-r--. 1 ipopp ipopp      1346 Aug 18 16:02 LogStatus.L1A.hdf
-rw-rw-r--. 1 ipopp ipopp       614 Aug 18 16:02 LogUser.L1A.hdf

So I guess it's just IPOPP "throwing a fit" and tossing the data.

7yl4r commented 7 years ago

No, wait... that one file I tested just happened to work. Here's one that doesn't:

[ipopp@dune l0l1aqua]$ ./test.sh 
ERROR: File: /home/ipopp/temp/P1540064AAAAAAAAAAAAAA17210052000001.PDS does not exist.

7yl4r commented 7 years ago

Let's try to think through the process here:

  1. l0l1aqua accepts a job from thing1
  2. dune gets the product_id(s) of the input file(s) from thing1 (NCS or IS?) (?)
  3. dune-DSM(?) queries thing1-db for the file paths of the given product_ids
  4. the product_ids are not in thing1-db
  5. dune cannot copy the file from thing1-FTP to /home/ipopp/temp/ because it has no path
  6. dune.modisl1db.l0l1aqua cannot find the input file

So the solution is going to be in figuring out the details of what is going wrong at step 2 above.

7yl4r commented 7 years ago

Given how intractable the mess of code is... let's look through the entire database for anything potentially relevant:

use DSM;
show tables;

SET @bad_id = 133025;  # the id printed in the "TransferCommand" error
# Algorithms has no potentially relevant data
select * from Ancestors where product=@bad_id;                # * NULL
select * from Ancestors where ancestor=@bad_id;               # * NULL
select * from Contributors where id=@bad_id;                  # * NULL
select * from Directories where id=@bad_id;                   # * NULL
select * from MappedProducts where product=@bad_id;           # NONE ???
select * from Markers where product=@bad_id;                  # * NULL
# Mutex has no potentially relevant data
# NisgsProperties is completely empty?
select * from Passes where id=@bad_id;                        # * NULL
select * from ProductContributors where product=@bad_id;      # * NULL
select * from ProductContributors where contributor=@bad_id;  # * NULL
# ProductThumbnails is completely empty?
# ProductTypes has no potentially relevant data
select * from ProductionDataSets where product=@bad_id;       # * NULL
select * from Products where id=@bad_id;                      # * NULL
select * from Products where pass=@bad_id;                    # * NULL
select * from ResourceSites where resource=@bad_id;  # TWO RESULTS !!!!!
# !!!
# '133025', 'IS',           '12013', '2017-07-25 02:22:38'
# '133025', 'NISDS-hydra1', '12376', '2017-07-25 02:18:37'
# !!!
select * from Resources where id=@bad_id;
# !!!
# '133025', '128611', 'DATA', 'VODmcrefl_TrueColor.17190164742.seacoos.750m.tiff', NULL, '1'
# !!!
select * from Resources where product=@bad_id;                 # * NULL
select * from SatTimeAncillaries where id=@bad_id;             # * NULL
# StaticAncillaries is completely empty?
# StaticAncillarySites is completely empty?
# Stations is completely empty?
# Thumbnails is completely empty?
select * from TimeAncillaries where id=@bad_id;                # * NULL
select * from TimeAncillarySites where aid=@bad_id;            # * NULL
select * from TimeAncillarySites where directory=@bad_id;      # * NULL
select * from TransferCommands where id=@bad_id;
# !!!
# '133025', 'Products', '138791', 'NISDS-dune-192.168.1.23', '2017-08-14 05:16:08', '0'
# !!!
select * from TransferCommands where tableId=@bad_id;          # * NULL

7yl4r commented 7 years ago

Ah. This fixed itself at 6am Sunday morning...

Soooooo... :man_facepalming:

7yl4r commented 7 years ago

I believe this was ultimately caused by the family of DSM failure issues (#2 #8 #9 #10 #11).