Closed StevenCTimm closed 4 months ago
file_did,allocations,state,last_allocation_time,rse_name,site_name,jobsub_id,last_jobscript_exit
"hd-protodune:np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027283_0087_dataflow3_datawriter_0_20240619T071534.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027296_0152_dataflow2_datawriter_0_20240619T113858.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027296_0172_dataflow1_datawriter_0_20240619T114426.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027298_0154_dataflow3_datawriter_0_20240619T144858.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027298_0215_dataflow3_datawriter_0_20240619T150312.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027308_0024_dataflow2_datawriter_0_20240619T184506.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0096_dataflow3_datawriter_0_20240619T192925.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0091_dataflow3_datawriter_0_20240619T192813.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0096_dataflow1_datawriter_0_20240619T192920.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0096_dataflow0_datawriter_0_20240619T192923.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0301_dataflow2_datawriter_0_20240619T201853.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0342_dataflow0_datawriter_0_20240619T202815.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0386_dataflow2_datawriter_0_20240619T203824.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0002_dataflow3_datawriter_0_20240619T234713.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1137_dataflow2_datawriter_0_20240619T233556.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1130_dataflow0_datawriter_0_20240619T233407.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1144_dataflow2_datawriter_0_20240619T233738.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1125_dataflow2_datawriter_0_20240619T233251.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1118_dataflow0_datawriter_0_20240619T233108.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1128_dataflow1_datawriter_0_20240619T233334.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1121_dataflow3_datawriter_0_20240619T233153.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1122_dataflow2_datawriter_0_20240619T233208.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1129_dataflow1_datawriter_0_20240619T233353.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1080_dataflow1_datawriter_0_20240619T232145.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1080_dataflow3_datawriter_0_20240619T232145.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0985_dataflow2_datawriter_0_20240619T225747.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0079_dataflow0_datawriter_0_20240620T000623.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0078_dataflow0_datawriter_0_20240620T000609.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0189_dataflow2_datawriter_0_20240620T012845.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0373_dataflow0_datawriter_0_20240620T022214.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0672_dataflow2_datawriter_0_20240620T034907.hdf5",0,"notfound",,,,,
Above is the CSV list of all such files. Now looking at the declad log on protodune-declad-np02.
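For anyone wanting to cross-check the count or see which runs are affected, here is a minimal Python sketch that tallies the `notfound` rows per run. It assumes the CSV has exactly the header shown above and that the run number can be pulled from the `_runNNNNNN_` part of the file DID; the function name `summarize_notfound` is just made up for illustration.

```python
import csv
import io
import re
from collections import Counter

# Assumption: the run number is the digit string following "_run" in the DID.
RUN_RE = re.compile(r"_run(\d+)_")


def summarize_notfound(csv_text):
    """Return (count of 'notfound' rows, Counter mapping run number -> file count)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    runs = Counter()
    total = 0
    for row in reader:
        if row["state"] != "notfound":
            continue
        total += 1
        match = RUN_RE.search(row["file_did"])
        if match:
            runs[int(match.group(1))] += 1
    return total, runs
```

Calling it on the text of the CSV returns the total and a per-run breakdown, which makes it easy to see which run dates to look for in the declad logs.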
This is what declad.log had to say for the first such file we looked at:
, 'checksums': {'adler32': 'cd78a623'}, 'fid': '83298971'}
declad.log.4.gz:06/19/2024 07:03:35.914: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: file declared to MetaCat
declad.log.4.gz:06/19/2024 07:03:35.914: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: ----- declaring to Rucio
declad.log.4.gz:06/19/2024 07:13:43.120: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: [DEBUG] runCommand: xrdfs eospublic.cern.ch mv /eos/experiment/neutplatform/protodune/dune/dropbox/np02/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5.json /eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5.json
declad.log.4.gz:06/19/2024 07:13:44.182: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: [DEBUG] runCommand: xrdfs eospublic.cern.ch mv /eos/experiment/neutplatform/protodune/dune/dropbox/np02/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5 /eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5
declad.log.4.gz:06/19/2024 07:13:45.270: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: ----- quarantined
declad.log.4.gz:Mover failed: np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5 status: quarantined error: Error in creating Rucio replication rule -> FNAL_DCACHE: HTTPSConnectionPool(host='dune-rucio.fnal.gov', port=443): Read timed out. (read timeout=600)
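To check the remaining files the same way, a rough Python sketch like the one below could pull the file name, status, and error out of each "Mover failed" line. The pattern is inferred only from the single failure line quoted above, so the assumption that every failure is logged in that exact shape may not hold; `parse_mover_failures` is a hypothetical helper, not part of declad.

```python
import re

# Assumption: failures are logged as
#   "Mover failed: <filename> status: <status> error: <message>"
# which is inferred from the one example line quoted above.
FAILED_RE = re.compile(
    r"Mover failed:\s+(?P<name>\S+)\s+status:\s+(?P<status>\S+)\s+error:\s+(?P<error>.*)"
)


def parse_mover_failures(lines):
    """Return one dict (name, status, error) per 'Mover failed' line found."""
    failures = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failures.append(match.groupdict())
    return failures
```

Feeding it the grep output for the quarantined file names would show at a glance whether they all hit the same Rucio read timeout.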
This solves part of the mystery: all the quarantined files were going to the old quarantine location, /eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine
I've moved one of these files back to the main dropbox; will see what declad does with it the second time around.
It appears declad declared it to Rucio this time. So the likely explanation is that we hit an unrecoverable error in Rucio the first time around, as shown in the log message above.
The first 2 quarantined files did the right thing when copied back to the original dropbox; I presume the rest of them will too.
They did, but they caused tape rules to be made on whole, very big datasets that were not previously on tape. That's mostly sorted out too.
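For reference, moving the rest back could be scripted along these lines. This is only a sketch that builds the `xrdfs ... mv` command lines (the reverse of the ones in the declad log above) without running them. The dropbox and quarantine paths are copied from that log; moving the data file before its `.json` is my assumption, and `restore_commands` is a made-up name. Given the tape-rule side effect noted above, it is probably worth reviewing the list before executing anything.

```python
import posixpath

# Host and paths copied from the xrdfs commands in the declad log above.
EOS_HOST = "eospublic.cern.ch"
DROPBOX = "/eos/experiment/neutplatform/protodune/dune/dropbox/np02"
QUARANTINE = "/eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine"


def restore_commands(filename):
    """Build argv lists that move a file and its .json from quarantine to the dropbox.

    Nothing is executed here. Assumption: the data file is moved before the
    .json metadata; the thread does not say which order is required.
    """
    commands = []
    for name in (filename, filename + ".json"):
        commands.append([
            "xrdfs", EOS_HOST, "mv",
            posixpath.join(QUARANTINE, name),
            posixpath.join(DROPBOX, name),
        ])
    return commands
```

Each returned list can be printed for review, or handed to `subprocess.run` once the destination has been confirmed.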
Current count is 32. The first one I spot-checked was indeed not in Rucio.
This would be evidence of a quarantine event; need to check the declad logs from 06/20, which is when this run was taken.