Closed StevenCTimm closed 9 months ago
Have opened RITM2021289, tracking it there.
The first problem in the conveyor-finisher was that it was confused by the davs prefix, which was different between FNAL_DCACHE_STAGING and FNAL_DCACHE (why that matters I have no idea, but it did).
So Dimitrios Christidis of the Rucio team says that the file not being renamed is a feature: whenever there has been a previous copy failure, as there have been many in this case, Rucio writes the file with a _timestamp appended on subsequent attempts.
The problem, as mentioned above, is the confusion over the "prefix" for davs transfers. Apparently it somehow picked it up during the hour or so I had FNAL_DCACHE_STAGING configured that way and is hanging on to it.
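For anyone who needs to map the suffixed names back to the original file names, here is a minimal sketch. It assumes the appended value is a 10-digit Unix epoch in seconds, which matches the `_1709131005` suffix seen in this thread but is not confirmed by the Rucio team. The function name `split_timestamp_suffix` and the example file name are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumption: the suffix is "_" followed by a 10-digit Unix epoch (seconds).
SUFFIX_RE = re.compile(r"^(?P<base>.+)_(?P<epoch>\d{10})$")


def split_timestamp_suffix(name):
    """Return (original_name, utc_datetime), or (name, None) if no suffix."""
    match = SUFFIX_RE.match(name)
    if match is None:
        return name, None
    when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
    return match.group("base"), when


if __name__ == "__main__":
    # Hypothetical short name; only the suffix value is taken from this thread.
    base, when = split_timestamp_suffix("example_reco2.root_1709131005")
    print(base, when.isoformat() if when else None)
```

A name without such a suffix is returned unchanged, so the helper can be run over a whole directory listing.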
Now 228404 files are in place on tape, 32263 are still to transfer, and 3427 are stuck.
For DUNE_US_FNAL_DISK_STAGE (which has a higher total number of files) the counts are 762236, 1941, 6470 for OK, REPL, STUCK.
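To track progress without doing the arithmetic by hand, a small sketch along these lines may help. It parses the `STATE[OK/REPL/STUCK]` field as it appears in the rule listing in this thread and computes the fraction of locks that are OK. The helper names `parse_rule_state` and `fraction_ok` are hypothetical and are not part of the Rucio client.

```python
import re

# Matches a field such as "STUCK[6167/490/1]" (state, then OK/REPL/STUCK counts).
STATE_RE = re.compile(
    r"^(?P<state>[A-Z]+)\[(?P<ok>\d+)/(?P<repl>\d+)/(?P<stuck>\d+)\]$"
)


def parse_rule_state(field):
    """Return (state, ok, replicating, stuck) parsed from a rule state field."""
    match = STATE_RE.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognized rule state field: {field!r}")
    return (
        match.group("state"),
        int(match.group("ok")),
        int(match.group("repl")),
        int(match.group("stuck")),
    )


def fraction_ok(ok, replicating, stuck):
    """Fraction of locks in the OK state; 0.0 when there are no locks."""
    total = ok + replicating + stuck
    return ok / total if total else 0.0


if __name__ == "__main__":
    # Counts quoted in this thread for the two sets of rules.
    print(f"tape: {fraction_ok(228404, 32263, 3427):.1%}")
    print(f"disk stage: {fraction_ok(762236, 1941, 6470):.1%}")
```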
In further examination: (1) Brandon fixed the prefix on the davs protocol for FNAL_DCACHE. (2) It looks like the FNAL_DCACHE_STAGING thing was a red herring. (3) Since the davs protocol had never been used for inbound third-party transfers before Tuesday, it's likely the prefix was always wrong there. He changed it from dune/tape_backed/dunepro to /dune/tape_backed/dunepro. FTS3 doesn't care (it adds or removes slashes as necessary), but the gfal.py plugin, which is called by the conveyor-finisher, does.
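To illustrate why a missing leading slash can matter to one component and not another, here is a rough sketch. It is not the actual gfal.py or FTS3 code. The functions `naive_pfn` and `normalized_pfn`, the host `dcache.example.org`, and the file path are all hypothetical. The first function concatenates the prefix as configured and the second normalizes slashes first.

```python
def naive_pfn(scheme, host, prefix, path):
    """Concatenate the pieces exactly as configured (slash-sensitive)."""
    return f"{scheme}://{host}{prefix}/{path}"


def normalized_pfn(scheme, host, prefix, path):
    """Strip and re-add slashes so the prefix form does not matter."""
    parts = [p.strip("/") for p in (prefix, path) if p.strip("/")]
    return f"{scheme}://{host}/" + "/".join(parts)


if __name__ == "__main__":
    host = "dcache.example.org"      # hypothetical host
    path = "fardet-hd/example.root"  # hypothetical file path
    for prefix in ("dune/tape_backed/dunepro", "/dune/tape_backed/dunepro"):
        print(naive_pfn("davs", host, prefix, path))
        print(normalized_pfn("davs", host, prefix, path))
```

With the slash-less prefix, the naive version glues the prefix onto the host name, while the normalized version yields the same URL for both prefix forms. That is consistent with a slash-tolerant component succeeding while a slash-sensitive one fails on the same configuration.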
Almost all caught up, as you see above.
OK, all is caught up; closing this. The files that are still STUCK in these rules are probably stuck for the same reason that the DUNE_US_FNAL_DISK_STAGE stuff is stuck and will be covered in that investigation.
I am seeing tens of thousands of files being copied to tape-backed dCache with timestamps appended to their file names.
For example:
ls -lrt /pnfs/dune/tape_backed/dunepro/fardet-hd/full-reconstructed/2024/mc/out1/fd_mc_2023a_reco2/00/00/11/05/nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco20240220T113535Z_reco2.root*
-rw-r--r-- 1 dunepro dune 1654814967 Feb 28 08:55 /pnfs/dune/tape_backed/dunepro/fardet-hd/full-reconstructed/2024/mc/out1/fd_mc_2023a_reco2/00/00/11/05/nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco20240220T113535Z_reco2.root_1709131005
498496f5073848f2a1e8f4d3970e33d4 dunepro fardet-hd:fardet-hd-reco2_ritm1780305_nutau_fhc_skip15000_end_1594 STUCK[6167/490/1] DUNE_US_FNAL_DISK_STAGE 1 N/A 2024-02-19 18:41:19
ad1daaa1c4f24763816e153d52fbff55 dunepro fardet-hd:fardet-hd-reco2_ritm1780305_nutau_fhc_skip15000_end_1594 STUCK[0/6656/2] FNAL_DCACHE 1 N/A 2024-02-22 21:47:47
SO: Rucio directs the transfer of the file to nutau_dune10kt_1x2x6_1105_400_20230826T143345Z_gen_g4_detsim_hitreco__20240220T113535Z_reco2.root_1709131005
The transfer is successful
But something is supposed to rename it. What?
We are using the rucio.rse.protocols.gfal.Default implementation, and always have been as far as I know.
As it stands the transfer is successful and the file is going to tape, but as yet it is not recognized as a replica by the rucio system, nor has it been renamed, nor has the rule been updated with a replicated file.
We have now been doing these types of copies since 08:27.
Thus far there's only one copy per file; it is not making multiples.
Do we have a very slow conveyor-poller (always a possibility), or is something borked with the config and we are dumping a bunch of stuff to tape that we won't be able to find?