USF-IMARS / IPOPP-docs

Documentation related to IMaRS's use of NASA's IPOPP software.
MIT License

"Product ######does not exist in the database" #2

Open 7yl4r opened 7 years ago

7yl4r commented 7 years ago

Observed in the NSLS console, this is a database corruption error I have encountered a few times, and I don't have a good solution.

The DSM throws this error when it tries to move a product from a slave to the master, based on a command in the "TransferCommands" database table (I think), but the product is not in the database.

I guess this would be caused by a slave successfully inserting a row into the TransferCommands table after the preceding insert into the Products table has failed. I guess DSM should check for this.
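
If that guess is right, one way to rule it out would be to make the two inserts a single transaction, so a failed Products insert can never leave a TransferCommands row behind. A minimal sketch, assuming SQLite as a stand-in for the real MySQL database; only `id`, `tableId`, and `complete` are column names seen in this thread, while `tableName`, `name`, and `register_product` are hypothetical:

```python
import sqlite3

# Assumed minimal schema; `tableName` and `name` are hypothetical columns.
SCHEMA = """
CREATE TABLE Products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE TransferCommands (
    id INTEGER PRIMARY KEY,
    tableName TEXT NOT NULL,
    tableId INTEGER NOT NULL,
    complete INTEGER NOT NULL DEFAULT 0
);
"""


def register_product(conn, product_id, name):
    """Insert a product and its transfer command as one atomic unit."""
    with conn:  # commits if both inserts succeed, rolls back if either raises
        conn.execute(
            "INSERT INTO Products (id, name) VALUES (?, ?)", (product_id, name)
        )
        conn.execute(
            "INSERT INTO TransferCommands (tableName, tableId) "
            "VALUES ('Products', ?)",
            (product_id,),
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    register_product(conn, 30569, "demo product")
    try:
        register_product(conn, 30570, None)  # violates NOT NULL on Products.name
    except sqlite3.IntegrityError as err:
        print("insert rejected, no transfer command written:", err)
```

Whether the real DSM writes these two rows from the same connection is not something this thread establishes, so this only illustrates the idea.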

One workaround I used was to delete the offending entry from TransferCommands. This either worked or irreversibly broke the database.

7yl4r commented 7 years ago

example:

NISDS-seastar-192.168.1.216 DSMR: Error in test method TransferCommand id=48036 table=Products tableId=30569 state=0, java.lang.Throwable: (really java.lang.Exception) Product 30569does not exist in the database.
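
To collect the ids from many of these lines at once, something like the following could pull them out of an NSLS export. This is a sketch only: the pattern is inferred from the single example line above, other DSMR messages may be formatted differently, and `parse_transfer_error` is a hypothetical helper name.

```python
import re

# Inferred from one example line; not taken from the DSM source.
TRANSFER_ERROR = re.compile(
    r"TransferCommand id=(?P<command_id>\d+) "
    r"table=(?P<table>\w+) "
    r"tableId=(?P<table_id>\d+) "
    r"state=(?P<state>\d+)"
)


def parse_transfer_error(line):
    """Return the ids found in a DSMR transfer error line, or None if no match."""
    match = TRANSFER_ERROR.search(line)
    if match is None:
        return None
    return {
        "command_id": int(match["command_id"]),
        "table": match["table"],
        "table_id": int(match["table_id"]),
        "state": int(match["state"]),
    }


EXAMPLE = (
    "NISDS-seastar-192.168.1.216 DSMR: Error in test method TransferCommand "
    "id=48036 table=Products tableId=30569 state=0, java.lang.Throwable: "
    "(really java.lang.Exception) Product 30569does not exist in the database."
)

if __name__ == "__main__":
    print(parse_transfer_error(EXAMPLE))
```

Note that `command_id` (the TransferCommands row) and `table_id` (the missing product) are different numbers, which matters for the cleanup queries later in this thread.
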

7yl4r commented 7 years ago

Wow... I'm seeing A LOT of these right now from hydra1, reef02, and dune... Perhaps seastar is the offender?

7yl4r commented 7 years ago

Alright, let's trace this...

  1. this error message comes from DSMR when TransferCommand.test() fails
  2. TransferCommand is an abstract class that does not implement test()
  3. the TransferCommand subclasses actually used in DSMR are ProductTransferCommand and AncillaryTransferCommand
  4. since the error we're talking about is for products, let's make the assumption we're dealing with ProductTransferCommand
  5. (assumption confirmed) the other half of the error prints from ProductTransferCommand when (DSMAdministrator) dsm.getProduct(tableId) returns null
  6. DSMAdministrator inherits getProduct from DSM
  7. DSM.getProduct() calls queryProduct
  8. queryProduct throws some kind of exception that is later being masked

So... Hmm. In general let's not throw and catch exceptions and then hide the error info... so I'd like to fix that, but since I'm not rewriting the whole DSM today... some guesses as to the cause:

  1. can't initiate connection to the database via dsmProperties.getConnection()
  2. can't build the statement... (seems unlikely bc productId looks ok)
  3. Utility.executeQuery() fails to execute because of db connection issue (similar to 1)
  4. ProductFactory.makeProduct doesn't like what it gets back in the resultSet
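
To illustrate the "don't hide the error info" point: the lookup could return nothing only when the row is genuinely absent, and re-raise everything else with the original cause attached, which would separate guesses 1 to 3 from guess 4. The DSM itself is Java; this is a Python sketch of the pattern, not the DSM code, and `get_product` and `ProductLookupError` are hypothetical names.

```python
import sqlite3


class ProductLookupError(Exception):
    """The lookup itself failed (connection, SQL, or row-parsing problem)."""


def get_product(conn, product_id):
    """Return the product row, or None only when the id is genuinely absent."""
    try:
        row = conn.execute(
            "SELECT * FROM Products WHERE id = ?", (product_id,)
        ).fetchone()
    except sqlite3.Error as err:
        # Chain the original error instead of swallowing it and returning None.
        raise ProductLookupError(
            f"lookup of product {product_id} failed"
        ) from err
    return row


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    try:
        get_product(conn, 30569)  # no Products table exists yet
    except ProductLookupError as err:
        print(err, "| caused by:", repr(err.__cause__))
```

With this shape, a "does not exist" message would only ever mean the row is missing, and a connection or query failure would show up under its own name.
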

7yl4r commented 7 years ago

Testing out the query manually with some Ids from the NSLS:

SELECT * FROM Products WHERE id=81347;
NULL

SELECT * FROM Products WHERE id=100485;
NULL

SELECT * FROM Products WHERE id=103402;
NULL

So... I'm thinking we have case 4 above. The Id is actually not in the database! Two remaining questions:

  1. what series of events leads to the insertion of a TransferCommand into the database that references a product id that is not in the database?
  2. how do we stop this issue from spamming the NSLS?
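
Rather than checking ids one at a time, a LEFT JOIN could list every TransferCommand that points at a missing product. A minimal sketch against an assumed schema, using SQLite in place of the real MySQL database; `tableName` and `name` are hypothetical column names, and only `id`, `tableId`, and `complete` appear in this thread:

```python
import sqlite3

# Assumed minimal schema; `tableName` and `name` are hypothetical columns.
SCHEMA = """
CREATE TABLE Products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE TransferCommands (
    id INTEGER PRIMARY KEY,
    tableName TEXT NOT NULL,
    tableId INTEGER NOT NULL,
    complete INTEGER NOT NULL DEFAULT 0
);
"""

# Commands that reference the Products table but have no matching product row.
ORPHAN_QUERY = """
SELECT tc.id, tc.tableId
FROM TransferCommands AS tc
LEFT JOIN Products AS p ON p.id = tc.tableId
WHERE tc.tableName = 'Products' AND p.id IS NULL
ORDER BY tc.id
"""


def find_orphaned_commands(conn):
    """Return (command id, tableId) pairs whose product row is missing."""
    return conn.execute(ORPHAN_QUERY).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Products (id, name) VALUES (30569, 'demo')")
    conn.executemany(
        "INSERT INTO TransferCommands (tableName, tableId) VALUES ('Products', ?)",
        [(30569,), (81347,), (100485,)],
    )
    print(find_orphaned_commands(conn))
```

The same SELECT should translate to MySQL once the real column names are substituted.
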

7yl4r commented 7 years ago

RE 2 (spam stopping)

we could set complete=1 (or any value > 0) on the rows returned by select * from TransferCommands where tableId=81347 as a workaround (I've tried this before and ended up resetting the database)
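
Given that this has gone badly before, any such edit is probably best done in one transaction that also records which rows were touched, so it can be reverted. A sketch under assumptions: SQLite stands in for MySQL, `tableName` is a hypothetical column name, `complete = 0` is assumed to mean "pending", and `mark_commands_complete` is a hypothetical helper.

```python
import sqlite3


def mark_commands_complete(conn, table_id):
    """Mark pending Products commands for table_id as complete.

    Returns the ids of the rows that were changed so the edit can be undone.
    """
    with conn:  # one transaction: commit on success, roll back on error
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM TransferCommands "
                "WHERE tableName = 'Products' AND tableId = ? AND complete = 0",
                (table_id,),
            )
        ]
        conn.executemany(
            "UPDATE TransferCommands SET complete = 1 WHERE id = ?",
            [(command_id,) for command_id in ids],
        )
    return ids


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE TransferCommands (id INTEGER PRIMARY KEY, "
        "tableName TEXT, tableId INTEGER, complete INTEGER DEFAULT 0)"
    )
    conn.execute(
        "INSERT INTO TransferCommands (tableName, tableId) VALUES ('Products', 81347)"
    )
    print("changed command ids:", mark_commands_complete(conn, 81347))
```

Whether the DSM tolerates rows being completed behind its back is exactly the open question here, so this is no safer than the manual UPDATE; it only makes the change easier to trace.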

7yl4r commented 7 years ago

Seeing this issue and family again.

I may be approaching a workaround in mysql-workbench:

# === fix for https://github.com/USF-IMARS/IPOPP-docs/issues/2 :
# 1. get id from NSLS and put it here
SET @bad_tableId = 178113; #120143;
# 2a. set the TransferCommand to completed (change last column from 0 to > 0): 
select * from TransferCommands where tableId=@bad_tableId;
# '163536', 'Products', '178113', 'NISDS-seastar-192.168.1.216', '2017-09-03 11:52:53', '7'
# make note of the server that should have the file
# source_server=seastar

# 2b. ..no wait... delete the row
delete from TransferCommands where tableId=@bad_tableId;
# NOTE: you'll probably need to use the id of the transferCommand itself (which is different than the tableId)
#DELETE FROM `DSM`.`TransferCommands` WHERE `id`='163536';

# 3. wait (~5min?) for a "* does not exist" error in NSLS, then copy that filepath here along with the host that is missing it
# missing_file=/corals/temp/hydra1/aqua/2017239.0745.gcoos.sst.filtered.h5
# whiny_server=hydra1

# 4. find the file somewhere... (probably on the server listed in the original transferCommand)
# ssh ipopp@source_server
# filepath=/...

# 5. manually copy the file to the whiny_server
# scp $filepath ipopp@$whiny_server:$missing_file

# 4b. if you can't find it... try to delete the relevant pass & let it re-process from scratch

7yl4r commented 7 years ago

The above attempted "fix" created several more of these errors, but on different files.

7yl4r commented 7 years ago

Okay... finally something to help make this issue less serious: if the markers are left as failed (instead of auto-reprocessing via cronjob), then the system can return to normal functioning in a few days rather than a couple of weeks.