dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

Truncated files #6474

Open DmitryLitvintsev opened 2 years ago

DmitryLitvintsev commented 2 years ago

We continue to see issues with bad files.

User report:

Error in <TFile::Init>: file /pnfs/GM2/scratch/daq/2021-10-29-18-14-40/data/gm2preproduction_full_49855919_44135.00292.root is truncated at 700743060 bytes: should be 1242600121, trying to recover
Warning in <TFile::Init>: no keys recovered, file has been made a Zombie
Unable to open file '/pnfs/GM2/scratch/daq/2021-10-29-18-14-40/data/gm2preproduction_full_49855919_44135.00292.root' for reading.
Skipping file.
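
The two sizes in the first message can be pulled out mechanically, which helps when triaging many such reports. This is a minimal sketch: the regular expression is an assumption based only on the one message quoted above, and `truncation_sizes` is a hypothetical helper name, not part of ROOT or dCache.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the single TFile::Init message in this report (assumption).
TRUNCATED = re.compile(r"is truncated at (\d+) bytes: should be (\d+)")


def truncation_sizes(message: str) -> Optional[Tuple[int, int]]:
    """Return (actual_size, expected_size) in bytes, or None if no match."""
    match = TRUNCATED.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


message = (
    "Error in <TFile::Init>: file /pnfs/GM2/scratch/daq/2021-10-29-18-14-40/data/"
    "gm2preproduction_full_49855919_44135.00292.root is truncated at 700743060 "
    "bytes: should be 1242600121, trying to recover"
)
actual, expected = truncation_sizes(message)
missing = expected - actual  # bytes that never reached the replica
```

The `actual` value can then be compared with the replica size that `rep ls` reports on the pool.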

The file is not in Error state:

[fndca3b] (PnfsManager@namespaceDomain) enstore > pnfsidof /pnfs/fs/usr/GM2/scratch/daq/2021-10-29-18-14-40/data/gm2preproduction_full_49855919_44135.00292.root
000093273C9C8B724B9CB3F12CB15F14D6B0
[fndca3b] (PnfsManager@namespaceDomain) enstore > \sl 000093273C9C8B724B9CB3F12CB15F14D6B0 rep ls 000093273C9C8B724B9CB3F12CB15F14D6B0
v-stkendca2003-2:
    000093273C9C8B724B9CB3F12CB15F14D6B0 <C----------L(0)[0]> 700743060 si={GM2.scratch}

But I see an upload error in billing:

billing=# select datestamp, protocol, errorcode, errormessage, initiator from billinginfo where pnfsid = '000093273C9C8B724B9CB3F12CB15F14D6B0' and isnew is true;
         datestamp          | protocol | errorcode |                                     errormessage                                      |                                   initiator                                    
----------------------------+----------+-----------+---------------------------------------------------------------------------------------+--------------------------------------------------------------------------------
 2021-10-29 20:51:08.827-05 | GFtp-2.0 |       666 | General problem: Problem while connected to 137.99.174.35:56498: Connection timed out | door:GFTP-stkendca2043-AAXPhlXymbg@gridftp-stkendca2043Domain:1635550758461000
(1 row)

*AND*, interestingly, I do not see a record associated with `door:GFTP-stkendca2043-AAXPhlXymbg@gridftp-stkendca2043Domain:1635550758461000` in doorinfo.

Houston, we have a problem.

paulmillar commented 2 years ago

I'd be interested to know what the client interactions were for this transfer.

Could you copy the corresponding lines from the access log file for this FTP session? (something like grep 1635550758461000 /var/log/dcache/gridftp-stkendca2043Domain.access).
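
If it helps, the same lookup can be scripted from the `initiator` value in the billing row. This is only a sketch: it assumes the initiator has the form `door:<cell>@<domain>:<session>` (as in the row above), that the access log lives at the path suggested in the grep example, and that the session number appears verbatim in the log lines. `split_initiator` and `session_lines` are hypothetical helper names.

```python
from typing import Iterable, Iterator, Tuple


def split_initiator(initiator: str) -> Tuple[str, str]:
    """Return (domain, session) from 'door:<cell>@<domain>:<session>'."""
    cell_and_domain, session = initiator.rsplit(":", 1)
    domain = cell_and_domain.split("@", 1)[1]
    return domain, session


def session_lines(lines: Iterable[str], session: str) -> Iterator[str]:
    """Yield the lines that mention the session, like `grep <session>`."""
    return (line for line in lines if session in line)


initiator = (
    "door:GFTP-stkendca2043-AAXPhlXymbg@gridftp-stkendca2043Domain"
    ":1635550758461000"
)
domain, session = split_initiator(initiator)
# Assumed location, following the grep example above.
log_path = f"/var/log/dcache/{domain}.access"
```

On the door host, opening `log_path` and passing the file object to `session_lines(f, session)` should give the same lines as the grep.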

My guess is that the client didn't provide any hint about the expected file size (or checksum).

It looks like the FTP mover knew there was a problem ("Connection timed out"), but the protocol-agnostic post-transfer handler doesn't know any better, so it considers the replica "good" and the transfer successful.
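
To make that guess concrete, here is a toy model of the decision. It is not dCache's actual post-transfer code: `replica_looks_complete` and its parameters are hypothetical, and the sizes are the ones from the report above.

```python
from typing import Optional


def replica_looks_complete(received: int,
                           expected: Optional[int] = None,
                           mover_error: Optional[str] = None,
                           trust_mover_error: bool = False) -> bool:
    """Hypothetical post-transfer check; illustrates the guess, nothing more."""
    # A stricter handler would treat any mover-level error as a failed upload.
    if trust_mover_error and mover_error is not None:
        return False
    # With a size hint from the client, a short upload is detectable.
    if expected is not None:
        return received == expected
    # No hint and the mover error ignored: nothing to compare against.
    return True


# No size hint, mover error not propagated: the truncated replica is accepted.
accepted = replica_looks_complete(700743060, mover_error="Connection timed out")

# With the expected size known, the same replica would be rejected.
rejected = not replica_looks_complete(700743060, expected=1242600121)
```

Under this model, either a size (or checksum) hint from the client or propagating the mover's error to the handler would be enough to flag the replica as broken.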

The missing door billing entry is also interesting: is anything logged by the door at around the time the transfer finished?