DUNE / data-mgmt-ops

DUNE_US_FNAL_DISK_STAGE (disk) to FNAL_DCACHE (Tape) not getting processed by the conveyor-finisher. #506

Closed StevenCTimm closed 9 months ago

StevenCTimm commented 9 months ago

I am seeing tens of thousands of files being copied to tape-backed dCache with timestamps appended to their file names.

For example:

[dunepro@dunesl7gpvm01 mcruciosam]$ ls -lrt /pnfs/dune/tape_backed/dunepro/fardet-hd/full-reconstructed/2024/mc/out1/fd_mc_2023a_reco2/00/00/11/05/nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco20240220T113535Z_reco2.root*
-rw-r--r-- 1 dunepro dune 1654814967 Feb 28 08:55 /pnfs/dune/tape_backed/dunepro/fardet-hd/full-reconstructed/2024/mc/out1/fd_mc_2023a_reco2/00/00/11/05/nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco20240220T113535Z_reco2.root_1709131005

[dunepro@dunesl7gpvm01 mcruciosam]$ rucio list-parent-dids fardet-hd:nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root
+--------------------------------------------------------------------+--------------+
| SCOPE:NAME                                                         | [DID TYPE]   |
|--------------------------------------------------------------------+--------------|
| fardet-hd:fardet-hd-reco2_ritm1780305_nutau_fhc_skip15000_end_1594 | DATASET      |
+--------------------------------------------------------------------+--------------+

[dunepro@dunesl7gpvm01 mcruciosam]$ rucio list-file-replicas fardet-hd:nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root
fardet-hd  nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root  1.655 GB  a732bc06  DUNE_US_BNL_SDCC: root://dcdndoor.sdcc.bnl.gov:1094/pnfs/sdcc.bnl.gov/data/dune/RSE/fardet-hd/d8/3d/nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root
fardet-hd  nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root  1.655 GB  a732bc06  DUNE_US_FNAL_DISK_STAGE: root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/persistent/staging/fardet-hd/d8/3d/nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root

[dunepro@dunesl7gpvm01 mcruciosam]$ rucio list-rules fardet-hd:fardet-hd-reco2_ritm1780305_nutau_fhc_skip15000_end_1594
ID  ACCOUNT  SCOPE:NAME  STATE[OK/REPL/STUCK]  RSE_EXPRESSION  COPIES  SIZE  EXPIRES (UTC)  CREATED (UTC)
498496f5073848f2a1e8f4d3970e33d4  dunepro  fardet-hd:fardet-hd-reco2_ritm1780305_nutau_fhc_skip15000_end_1594  STUCK[6167/490/1]  DUNE_US_FNAL_DISK_STAGE  1  N/A  2024-02-19 18:41:19
ad1daaa1c4f24763816e153d52fbff55  dunepro  fardet-hd:fardet-hd-reco2_ritm1780305_nutau_fhc_skip15000_end_1594  STUCK[0/6656/2]  FNAL_DCACHE  1  N/A  2024-02-22 21:47:47

So: Rucio directs the transfer of the file to nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root_1709131005.

The transfer is successful.

But something is supposed to rename it. What?

We are using the rucio.rse.protocols.gfal.Default implementation and always have been, as far as I know.

As it stands the transfer is successful and the file is going to tape, but as yet it is not recognized as a replica by the rucio system, nor has it been renamed, nor has the rule been updated with a replicated file.

We have now been doing these types of copies since 08:27.

Thus far there's only one copy per file; it is not making multiples.

Do we have a very slow conveyor-poller (always a possibility), or is something borked with the config so that we are dumping a bunch of stuff to tape that we won't be able to find?

StevenCTimm commented 9 months ago

Have opened RITM2021289, tracking it there.

StevenCTimm commented 9 months ago

The first problem in the conveyor-finisher was that it was confused by the davs prefix, which differed between FNAL_DCACHE_STAGING and FNAL_DCACHE (why that matters I have no idea, but it did).

StevenCTimm commented 9 months ago

So Dimitrios Christidis of the Rucio team says that the file not being renamed is a feature: whenever there has been a previous copy failure, and there have been many in this case, Rucio writes subsequent copies of the file with a _timestamp appended.
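As a rough illustration of the suffix described above, the sketch below splits a trailing _timestamp off a file name and converts it to a UTC time. It assumes the suffix is Unix epoch seconds written as ten digits, which matches the example in this issue but is not confirmed by the Rucio documentation here; the helper name split_timestamp_suffix is made up for this example.

```python
import re
from datetime import datetime, timezone

# Assumption: the appended suffix is "_" followed by Unix epoch seconds
# (ten digits in the example from this issue).
_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<epoch>\d{10})$")


def split_timestamp_suffix(filename):
    """Return (base_name, utc_datetime), or (filename, None) if no suffix."""
    match = _SUFFIX.match(filename)
    if match is None:
        return filename, None
    when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
    return match.group("base"), when


name = (
    "nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco"
    "__20240220T113535Z_reco2.root_1709131005"
)
base, when = split_timestamp_suffix(name)
print(base)              # the name without the trailing _1709131005
print(when.isoformat())  # 2024-02-28T14:36:45+00:00
```

Read this way, the suffix 1709131005 corresponds to 14:36:45 UTC on 2024-02-28, which is 08:36:45 Central time and consistent with the 08:55 modification time in the ls listing.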

The problem, as mentioned above, is the confusion over the "prefix" for davs transfers. Apparently it somehow picked that up during the hour or so I had FNAL_DCACHE_STAGING configured that way and is hanging on to it.

StevenCTimm commented 9 months ago

Now 228404 files are in place on tape, 32263 are still to transfer, and 3427 are stuck.

For DUNE_US_FNAL_DISK_STAGE (which has a higher total number of files), the counts are 762236 OK, 1941 REPL, and 6470 STUCK.
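For tracking progress, the STATE[OK/REPL/STUCK] field shown by rucio list-rules can be pulled apart with a few lines of Python. This is only a sketch: the names RuleCounts and parse_state are made up for this example, and it assumes the field always has the STATE[n/n/n] shape seen in this issue.

```python
import re
from typing import NamedTuple


class RuleCounts(NamedTuple):
    state: str
    ok: int
    replicating: int
    stuck: int

    @property
    def total(self):
        # Total number of file locks covered by the rule.
        return self.ok + self.replicating + self.stuck


# Assumption: the field looks like "STUCK[6167/490/1]".
_STATE = re.compile(
    r"^(?P<state>[A-Z]+)\[(?P<ok>\d+)/(?P<repl>\d+)/(?P<stuck>\d+)\]$"
)


def parse_state(field):
    """Parse a STATE[OK/REPL/STUCK] field into a RuleCounts tuple."""
    match = _STATE.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognised state field: {field!r}")
    return RuleCounts(
        match.group("state"),
        int(match.group("ok")),
        int(match.group("repl")),
        int(match.group("stuck")),
    )


disk_rule = parse_state("STUCK[6167/490/1]")
tape_rule = parse_state("STUCK[0/6656/2]")
print(disk_rule.total, tape_rule.total)  # 6658 6658
```

Both rules on the example dataset add up to the same 6658 files, which is a quick sanity check that the two rules cover the same content.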

StevenCTimm commented 9 months ago

On further examination:

(1) Brandon fixed the prefix on the davs protocol for FNAL_DCACHE.
(2) It looks like the FNAL_DCACHE_STAGING thing was a red herring.
(3) Since the davs protocol had never been used for inbound third-party transfers before Tuesday, it's likely the prefix was always wrong there. He changed it from dune/tape_backed/dunepro to /dune/tape_backed/dunepro.

FTS3 doesn't care (it adds or removes slashes as necessary), but the gfal.py plugin, which is called by the conveyor-finisher, does.
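To illustrate why a missing leading slash can matter to one component and not another, here is a small sketch contrasting plain string concatenation with slash-normalising concatenation. It is not the actual code from gfal.py or FTS3; the functions naive_join and normalized_join, the endpoint davs://door.example.org:2880, and the file path are all hypothetical.

```python
def naive_join(scheme_host_port, prefix, lfn_path):
    """Plain concatenation: sensitive to a missing leading slash in prefix."""
    return scheme_host_port + prefix + "/" + lfn_path


def normalized_join(scheme_host_port, prefix, lfn_path):
    """Force exactly one slash between components, whatever the input has."""
    parts = [prefix.strip("/"), lfn_path.strip("/")]
    return scheme_host_port.rstrip("/") + "/" + "/".join(p for p in parts if p)


HOST = "davs://door.example.org:2880"  # hypothetical endpoint
LFN = "fardet-hd/d8/3d/file.root"      # hypothetical file path
OLD_PREFIX = "dune/tape_backed/dunepro"   # value before the fix
NEW_PREFIX = "/dune/tape_backed/dunepro"  # value after the fix

broken = naive_join(HOST, OLD_PREFIX, LFN)
fixed = naive_join(HOST, NEW_PREFIX, LFN)
tolerant_old = normalized_join(HOST, OLD_PREFIX, LFN)
tolerant_new = normalized_join(HOST, NEW_PREFIX, LFN)

print(broken)                        # port and path run together
print(fixed)                         # well-formed URL
print(tolerant_old == tolerant_new)  # True: normalisation hides the difference
```

A component that normalises slashes produces the same URL for either prefix, while one that concatenates strictly only works with the corrected value.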

Almost all caught up, as you see above.

StevenCTimm commented 9 months ago

OK, all is caught up, so I am closing this. The files that are still STUCK in these rules are probably stuck for the same reason that the DUNE_US_FNAL_DISK_STAGE stuff is stuck, and will be covered in that investigation.