DUNE / data-mgmt-ops


Jake's workflow 2382 comes up with files "not found" #647

Closed StevenCTimm closed 3 months ago

StevenCTimm commented 3 months ago

The current count is 32. The first one I spot-checked was indeed not in Rucio. This would be evidence of a quarantine event; we need to check the declad logs from 06/20, which is when this run was taken.

StevenCTimm commented 3 months ago

file_did,allocations,state,last_allocation_time,rse_name,site_name,jobsub_id,last_jobscript_exit
"hd-protodune:np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027283_0087_dataflow3_datawriter_0_20240619T071534.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027296_0152_dataflow2_datawriter_0_20240619T113858.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027296_0172_dataflow1_datawriter_0_20240619T114426.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027298_0154_dataflow3_datawriter_0_20240619T144858.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027298_0215_dataflow3_datawriter_0_20240619T150312.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027308_0024_dataflow2_datawriter_0_20240619T184506.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0096_dataflow3_datawriter_0_20240619T192925.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0091_dataflow3_datawriter_0_20240619T192813.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0096_dataflow1_datawriter_0_20240619T192920.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0096_dataflow0_datawriter_0_20240619T192923.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0301_dataflow2_datawriter_0_20240619T201853.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0342_dataflow0_datawriter_0_20240619T202815.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0386_dataflow2_datawriter_0_20240619T203824.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0002_dataflow3_datawriter_0_20240619T234713.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1137_dataflow2_datawriter_0_20240619T233556.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1130_dataflow0_datawriter_0_20240619T233407.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1144_dataflow2_datawriter_0_20240619T233738.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1125_dataflow2_datawriter_0_20240619T233251.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1118_dataflow0_datawriter_0_20240619T233108.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1128_dataflow1_datawriter_0_20240619T233334.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1121_dataflow3_datawriter_0_20240619T233153.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1122_dataflow2_datawriter_0_20240619T233208.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1129_dataflow1_datawriter_0_20240619T233353.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1080_dataflow1_datawriter_0_20240619T232145.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_1080_dataflow3_datawriter_0_20240619T232145.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0985_dataflow2_datawriter_0_20240619T225747.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0079_dataflow0_datawriter_0_20240620T000623.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0078_dataflow0_datawriter_0_20240620T000609.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0189_dataflow2_datawriter_0_20240620T012845.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0373_dataflow0_datawriter_0_20240620T022214.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027310_0672_dataflow2_datawriter_0_20240620T034907.hdf5",0,"notfound",,,,,
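A quick way to see which runs are affected is to group the "notfound" rows by run number. The following is only a sketch: the three sample rows are copied from the list above, and the function name `notfound_by_run` and the `_run(\d+)_` pattern for pulling the run number out of the DID are illustrative assumptions, not part of any DUNE tool.

```python
import csv
import io
import re
from collections import Counter

# Three rows copied from the list above; a real check would read the full CSV export.
SAMPLE = '''file_did,allocations,state,last_allocation_time,rse_name,site_name,jobsub_id,last_jobscript_exit
"hd-protodune:np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0096_dataflow3_datawriter_0_20240619T192925.hdf5",0,"notfound",,,,,
"hd-protodune:np04hd_raw_run027309_0091_dataflow3_datawriter_0_20240619T192813.hdf5",0,"notfound",,,,,
'''

# Assumption: the run number is the digit group following "_run" in the file name.
RUN_RE = re.compile(r"_run(\d+)_")


def notfound_by_run(csv_text):
    """Count rows whose state is 'notfound', keyed by integer run number."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["state"] != "notfound":
            continue
        match = RUN_RE.search(row["file_did"])
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for run, n in sorted(notfound_by_run(SAMPLE).items()):
        print(run, n)
```

Run over the full list, this would show how the 32 files are spread across the runs taken on 06/19 and 06/20.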

StevenCTimm commented 3 months ago

Above is the CSV list of all such files. Now looking at the declad log on protodune-declad-np02.

StevenCTimm commented 3 months ago

This is what declad.log had to say for the first such file we looked at:

, 'checksums': {'adler32': 'cd78a623'}, 'fid': '83298971'}
declad.log.4.gz:06/19/2024 07:03:35.914: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: file declared to MetaCat
declad.log.4.gz:06/19/2024 07:03:35.914: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: ----- declaring to Rucio
declad.log.4.gz:06/19/2024 07:13:43.120: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: [DEBUG] runCommand: xrdfs eospublic.cern.ch mv /eos/experiment/neutplatform/protodune/dune/dropbox/np02/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5.json /eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5.json
declad.log.4.gz:06/19/2024 07:13:44.182: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: [DEBUG] runCommand: xrdfs eospublic.cern.ch mv /eos/experiment/neutplatform/protodune/dune/dropbox/np02/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5 /eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine/np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5
declad.log.4.gz:06/19/2024 07:13:45.270: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: ----- quarantined
declad.log.4.gz:Mover failed: np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5 status: quarantined error: Error in creating Rucio replication rule -> FNAL_DCACHE: HTTPSConnectionPool(host='dune-rucio.fnal.gov', port=443): Read timed out. (read timeout=600)
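To check whether the other files failed for the same reason, the "Mover failed" lines can be pulled out of the log and mapped to their error text. This is a minimal sketch that assumes every failure line follows the `Mover failed: <name> status: <status> error: <message>` layout seen above; the sample lines are taken from that excerpt, and `quarantine_errors` is a hypothetical helper, not part of declad.

```python
import re

# Lines in the format of the excerpt above (grep file prefix, timestamp, MoverTask[...]).
SAMPLE_LOG = """\
declad.log.4.gz:06/19/2024 07:03:35.914: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: ----- declaring to Rucio
declad.log.4.gz:06/19/2024 07:13:45.270: MoverTask[np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5]: ----- quarantined
declad.log.4.gz:Mover failed: np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5 status: quarantined error: Error in creating Rucio replication rule -> FNAL_DCACHE: HTTPSConnectionPool(host='dune-rucio.fnal.gov', port=443): Read timed out. (read timeout=600)
"""

# Assumption: all failure summaries use this exact "Mover failed: ..." layout.
FAILED_RE = re.compile(
    r"Mover failed: (?P<name>\S+) status: (?P<status>\S+) error: (?P<error>.*)$"
)


def quarantine_errors(log_text):
    """Map each quarantined file name to the error message logged for it."""
    errors = {}
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match and match.group("status") == "quarantined":
            errors[match.group("name")] = match.group("error").strip()
    return errors


if __name__ == "__main__":
    for name, error in quarantine_errors(SAMPLE_LOG).items():
        print(name, "->", error)
```

If all 32 files report the same Rucio read timeout, that points to a single outage window rather than a problem with the individual files.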

StevenCTimm commented 3 months ago

This solves part of a mystery: all the quarantined files were going to the old quarantine location, /eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine

I've moved one of these files back to the main dropbox to see what declad does with it the second time around. It appears it declared it to Rucio this time, so most likely we hit an unrecoverable error in Rucio the first time around, as shown in the log message above.
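For the remaining files, the move back can be scripted by reversing the `xrdfs ... mv` commands that appear in the declad log. The sketch below only builds and prints the commands; it does not run them. The host and both directories are taken from the log excerpt above, while the function name `requeue_commands` and the order of the two moves (data file first, then the `.json` metadata) are assumptions that should be checked against how declad scans the dropbox.

```python
import shlex

# Host and paths as they appear in the declad log excerpt above.
EOS_HOST = "eospublic.cern.ch"
QUARANTINE_DIR = "/eos/experiment/neutplatform/protodune/dune/test/dropbox/quarantine"
DROPBOX_DIR = "/eos/experiment/neutplatform/protodune/dune/dropbox/np02"


def requeue_commands(filename):
    """Build xrdfs mv commands that return a data file and its .json to the dropbox.

    Assumption: the data file is moved first and the metadata second; the
    order declad actually needs is not confirmed here.
    """
    commands = []
    for name in (filename, filename + ".json"):
        commands.append(
            ["xrdfs", EOS_HOST, "mv", f"{QUARANTINE_DIR}/{name}", f"{DROPBOX_DIR}/{name}"]
        )
    return commands


if __name__ == "__main__":
    example = "np04hd_raw_run027279_1228_dataflow3_datawriter_0_20240619T045929.hdf5"
    for cmd in requeue_commands(example):
        # Print for review; pass cmd to subprocess.run only after checking it.
        print(shlex.join(cmd))
```

Printing first makes it easy to review the full list before touching anything on EOS.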

StevenCTimm commented 3 months ago

The first 2 quarantined files did the right thing when copied back to the original dropbox; I presume the rest of them will too. They did, but they caused tape rules to be made on whole, very big datasets that were not previously on tape. That's mostly sorted out too.