DUNE / data-mgmt-ops


repeating file transfer fails in FTS #627

Closed StevenCTimm closed 2 months ago

StevenCTimm commented 3 months ago

davs://fndca1.fnal.gov:2880/dune/persistent/staging/fardet-vd/15/29/anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg_1285_848_20230806T135111Z_gen_g4_detsim_hitreco.root

srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/fardet-vd/hit-reconstructed/2023/mc/out1/fd_mc_2023a/00/00/12/85/anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg_1285_848_20230806T135111Z_gen_g4_detsim_hitreco.root_1718758448

(file not found)

davs://fndca1.fnal.gov:2880/dune/persistent/staging/fardet-vd/15/29/anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg_1285_848_20230806T135111Z_gen_g4_detsim_hitreco.root

srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/fardet-vd/hit-reconstructed/2023/mc/out1/fd_mc_2023a/00/00/12/85/anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg_1285_848_20230806T135111Z_gen_g4_detsim_hitreco.root_1718758968


The first one above is also failing with gridftp-srm. That must be a very old rule, because we have already disabled gridftp. I am trying to find which rule it is a member of, so far without success.
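A minimal sketch for comparing the two failing destinations listed above. It assumes that the trailing number on each destination name is a Unix timestamp added per transfer attempt; that is a guess from the format, not something confirmed here. The helper name `split_attempt_suffix` is made up for illustration.

```python
from datetime import datetime, timezone

# Destination file names copied from the two failing transfers above.
destinations = [
    "anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg_1285_848_20230806T135111Z_gen_g4_detsim_hitreco.root_1718758448",
    "anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg_1285_848_20230806T135111Z_gen_g4_detsim_hitreco.root_1718758968",
]


def split_attempt_suffix(name):
    """Split 'file_<digits>' into (file, UTC datetime).

    ASSUMPTION: the numeric suffix is a Unix epoch timestamp.
    """
    base, _, suffix = name.rpartition("_")
    if not base or not suffix.isdigit():
        raise ValueError(f"no numeric suffix in {name!r}")
    return base, datetime.fromtimestamp(int(suffix), tz=timezone.utc)


first_base, first_time = split_attempt_suffix(destinations[0])
second_base, second_time = split_attempt_suffix(destinations[1])

# Same base name on both lines means the same file is being retried.
same_file = first_base == second_base
gap_seconds = (second_time - first_time).total_seconds()
print(same_file, first_time.isoformat(), gap_seconds)
```

Under that assumption, the two entries are the same file attempted 520 seconds apart, which matches the "repeating" behaviour in the title.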

StevenCTimm commented 3 months ago

Different files are also failing in transfers from eospublic to dCache, for example:

davs://eospublic.cern.ch:443/eos/experiment/neutplatform/protodune/dune/hd-protodune/b5/73/np04hd_raw_run026960_4070_dataflow0_datawriter_0_20240613T130156.hdf5

davs://fndca1.fnal.gov:2880/dune/tape_backed/dunepro/hd-protodune/raw/2024/detector/calibration/None/00/02/69/60/np04hd_raw_run026960_4070_dataflow0_datawriter_0_20240613T130156.hdf5_1718743696

This one is reported as a checksum problem.

StevenCTimm commented 3 months ago

The first file above is also regularly failing to transfer to ccin2p3.fr for the same reason: it isn't there.

StevenCTimm commented 3 months ago

I have gone through the original hitreco rucio container, fardet-vd:mc.fardet-vd.fd_mc_2023a.v09_75_03d00.hit-reconstructed.prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg.fcl.prod_v4, which is built out of fardet-vd:fardet-vd_1285, which is the run in which this file would have been made. The file is not in that rucio container or in the sub-dataset. The file is also not known to metacat. So what rule could possibly be generating this?

StevenCTimm commented 3 months ago

It is a member of higuera:fardet-vdfd_mc_2023amchit-reconstructedprodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg.fclv09_75_03d00preliminary data set.

StevenCTimm commented 3 months ago

(in metacat).

StevenCTimm commented 3 months ago

rucio-admin -a root replicas set-tombstone --rse DUNE_CERN_EOS hd-protodune:np04hd_raw_run026794_0125_dataflow0_datawriter_0_20240608T025748.hdf5

Replica is locked
Details: Replica hd-protodune:np04hd_raw_run026794_0125_dataflow0_datawriter_0_20240608T025748.hdf5 on RSE DUNE_CERN_EOS is locked. This means that the replica has a lock and is therefore protected.

I retired the above file in metacat just now and detached it from its dataset and rule, but somehow the replica is still locked. I have to investigate further.

StevenCTimm commented 3 months ago

The checksum errors are due to the file being a zero-byte file that nothing had detected. We already know we need to add the --cksum option to the xrdcp command in declad that sends the file to the rucio area.
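For illustration, a minimal sketch of the kind of check that would catch a zero-byte copy before it is declared. This is not declad code and not a replacement for the xrdcp --cksum fix; it only works on a locally readable file, and the function names and the `expected_*` parameters are hypothetical. It uses Adler-32, on the assumption that this is the checksum type recorded for the file.

```python
import os
import zlib


def adler32_and_size(path, chunk_size=1024 * 1024):
    """Return (size in bytes, Adler-32 as 8-digit lowercase hex) for a local file."""
    checksum = 1  # Adler-32 of empty input is 1, so start from 1
    size = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
            size += len(chunk)
    return size, format(checksum & 0xFFFFFFFF, "08x")


def check_before_declaring(path, expected_size=None, expected_adler32=None):
    """Hypothetical pre-declaration check: raise ValueError on a bad copy."""
    size, checksum = adler32_and_size(path)
    if size == 0:
        raise ValueError(f"{path}: zero-byte file")
    if expected_size is not None and size != expected_size:
        raise ValueError(f"{path}: size {size}, expected {expected_size}")
    if expected_adler32 is not None and checksum != expected_adler32.lower():
        raise ValueError(f"{path}: adler32 {checksum}, expected {expected_adler32}")
    return size, checksum
```

A caller would run `check_before_declaring(local_copy, expected_size, expected_adler32)` with the values recorded for the source file and skip declaration if it raises.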

StevenCTimm commented 2 months ago

I believe all the old instances of these have now been cleaned up. We will watch for another week or so to see whether declad makes any more of them.

StevenCTimm commented 2 months ago

Closing this.