Open 7yl4r opened 7 years ago
example:
NISDS-seastar-192.168.1.216 DSMR: Error in test method TransferCommand id=48036 table=Products tableId=30569 state=0, java.lang.Throwable: (really java.lang.Exception) Product 30569does not exist in the database.
Wow... I'm seeing A LOT of these right now from hydra1, reef02, and dune... Perhaps seastar is the offender?
Alright, let's trace this...
So... Hmm. In general let's not throw and catch exceptions and then hide the error info... so I'd like to fix that, but since I'm not rewriting the whole DSM today... some guesses as to the cause:
1. dsmProperties.getConnection() fails (db connection issue)
2. Utility.executeQuery() fails to execute because of db connection issue (similar to 1)
3. ProductFactory.makeProduct doesn't like what it gets back in the resultSet
4. the product Id is actually not in the database

Testing out the query manually with some Ids from the NSLS:
SELECT * FROM Products WHERE id=81347;
NULL
SELECT * FROM Products WHERE id=100485;
NULL
SELECT * FROM Products WHERE id=103402;
NULL
So... I'm thinking we have case 4 above. The Id is actually not in the database! Two remaining questions:
RE 2 (spam stopping): we could set complete=1 (or any value > 0) on the rows returned by select * from TransferCommands where tableId=81347 as a workaround (I've tried this before and ended up resetting the database).
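Rather than checking Ids one at a time, the orphaned commands could probably be found in a single query. Below is a minimal sketch using an in-memory SQLite database as a stand-in for the real DSM database. The column names (tableName, tableId, complete) are guesses based on the log line and the row shown in this thread, not the actual schema, and product 30001 is made up for the example.

```python
import sqlite3

# Toy stand-in for the DSM database. Column names are assumptions,
# not the real DSM schema; product 30001 is invented for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (id INTEGER PRIMARY KEY);
    CREATE TABLE TransferCommands (
        id INTEGER PRIMARY KEY,
        tableName TEXT,
        tableId INTEGER,
        complete INTEGER
    );
    INSERT INTO Products (id) VALUES (30001);
    INSERT INTO TransferCommands VALUES (1, 'Products', 30001, 0);
    INSERT INTO TransferCommands VALUES (2, 'Products', 81347, 0);
    INSERT INTO TransferCommands VALUES (3, 'Products', 100485, 0);
""")


def find_orphaned_transfers(conn):
    """Return (command id, tableId) for pending commands whose product row is missing."""
    return conn.execute("""
        SELECT tc.id, tc.tableId
        FROM TransferCommands AS tc
        LEFT JOIN Products AS p ON p.id = tc.tableId
        WHERE tc.tableName = 'Products'
          AND tc.complete = 0
          AND p.id IS NULL
        ORDER BY tc.id
    """).fetchall()


orphans = find_orphaned_transfers(conn)
print(orphans)
```

The same LEFT JOIN ... IS NULL pattern should carry over to MySQL once the real column names are substituted.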
Seeing this issue and family again.
I may be approaching a workaround in mysql-workbench:
# === fix for https://github.com/USF-IMARS/IPOPP-docs/issues/2 :
# 1. get id from NSLS and put it here
SET @bad_tableId = 178113; #120143;
# 2a. set the TransferCommand to completed (change last column from 0 to > 0):
select * from TransferCommands where tableId=@bad_tableId;
# '163536', 'Products', '178113', 'NISDS-seastar-192.168.1.216', '2017-09-03 11:52:53', '7'
# make note of the server that should have the file
# source_server=seastar
# 2b. ..no wait... delete the row
delete from TransferCommands where tableId=@bad_tableId;
# NOTE: you'll probably need to use the id of the transferCommand itself (which is different than the tableId)
#DELETE FROM `DSM`.`TransferCommands` WHERE `id`='163536';
# 3. wait (~5min?) for a "* does not exist" error in NSLS, then copy that filepath here along with the host that is missing it
# missing_file=/corals/temp/hydra1/aqua/2017239.0745.gcoos.sst.filtered.h5
# whiny_server=hydra1
# 4. find the file somewhere... (probably on the server listed in the original transferCommand)
# ssh ipopp@source_server
# filepath=/...
# 5. manually copy the file to the whiny_server
# scp $filepath ipopp@$whiny_server:$missing_file
# 4b. if you can't find it... try to delete the relevant pass & let it re-process from scratch
The above attempted "fix" created several more of these errors, but on different files.
Okay... finally something to help make this issue less serious: if the markers are left as failed (instead of auto-reprocessing via cronjob) then the system can return to normal functioning in a few days rather than a couple of weeks.
Observed in the NSLS console, this is a database corruption error I have encountered a few times, and I don't have a good solution.
The error is thrown by the DSM because it is attempting to move a product from a slave to the master based on a command in the database table "TransferCommands" (I think) when the product is not in the database.
I guess this would be caused by a slave successfully inserting a row into the TransferCommands table after the preceding insert into the Products table has failed. I guess DSM should check for this.
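One way such a check might look is to write both rows in a single transaction and let any failure propagate instead of hiding it. This is only a sketch of the idea in Python with SQLite, not how the DSM (which is Java) actually does its inserts. The schema, the register_product helper, and the path values are hypothetical, and it assumes the real storage engine supports transactions.

```python
import sqlite3


def make_db():
    """Build a toy database; table layouts are assumptions, not the DSM schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Products (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
        CREATE TABLE TransferCommands (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            tableName TEXT,
            tableId INTEGER,
            complete INTEGER DEFAULT 0
        );
    """)
    return conn


def register_product(conn, product_id, path):
    """Insert the product and its transfer command together, or neither."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO Products (id, path) VALUES (?, ?)", (product_id, path)
        )
        conn.execute(
            "INSERT INTO TransferCommands (tableName, tableId) VALUES ('Products', ?)",
            (product_id,),
        )


conn = make_db()
register_product(conn, 30569, "example.h5")  # placeholder path

try:
    # path=None violates NOT NULL, so the Products insert fails
    register_product(conn, 30570, None)
except sqlite3.IntegrityError as err:
    print("product insert failed, no transfer command written:", err)

product_count = conn.execute("SELECT COUNT(*) FROM Products").fetchone()[0]
command_count = conn.execute("SELECT COUNT(*) FROM TransferCommands").fetchone()[0]
print(product_count, command_count)
```

Because the exception is not swallowed, the TransferCommands insert is never reached when the Products insert fails, and anything already written in that transaction is rolled back.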
One workaround I used was to delete the offending entry from TransferCommands. This either worked or irreversibly broke the database.
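If the delete has to be done again, it may be less risky to copy the row somewhere first and delete by the TransferCommand's own id inside a transaction. A rough sketch with SQLite follows. The TransferCommandsBackup table and the remove_transfer_command helper are hypothetical, the column names are guesses, and row 163537 is made up. This only preserves the deleted row; it would not undo whatever the DSM does in response to the delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names are assumptions, not the real DSM schema.
    CREATE TABLE TransferCommands (
        id INTEGER PRIMARY KEY, tableName TEXT, tableId INTEGER, complete INTEGER
    );
    -- Hypothetical backup table with the same columns, used only in this sketch.
    CREATE TABLE TransferCommandsBackup (
        id INTEGER PRIMARY KEY, tableName TEXT, tableId INTEGER, complete INTEGER
    );
    INSERT INTO TransferCommands VALUES (163536, 'Products', 178113, 0);
    INSERT INTO TransferCommands VALUES (163537, 'Products', 178114, 0);
""")


def remove_transfer_command(conn, command_id):
    """Copy one TransferCommands row to the backup table, then delete it."""
    with conn:  # both statements commit together, or roll back together
        conn.execute(
            "INSERT INTO TransferCommandsBackup "
            "SELECT * FROM TransferCommands WHERE id = ?",
            (command_id,),
        )
        deleted = conn.execute(
            "DELETE FROM TransferCommands WHERE id = ?", (command_id,)
        ).rowcount
        if deleted != 1:
            # Raising inside the with-block rolls the transaction back.
            raise LookupError(f"no TransferCommand with id {command_id}")


remove_transfer_command(conn, 163536)

remaining = conn.execute("SELECT id FROM TransferCommands").fetchall()
backup = conn.execute("SELECT * FROM TransferCommandsBackup").fetchall()
print(remaining, backup)
```

Restoring would then be an INSERT ... SELECT in the other direction, from the backup table into TransferCommands.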