dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

WebDAV uploads leave behind broken files on failure #6455

Open DmitryLitvintsev opened 2 years ago

DmitryLitvintsev commented 2 years ago

two problems in one:

1) WebDAV transfers fail like so:

02 Feb 2022 19:26:58 (v-stkendca1806-1) [door:WebDAV-fndca4b-1@webdavDomain:AAXXEs+sAog WebDAV-fndca4b-1 PoolAcceptFile 00008FBFF2A3F3B542BCA50AD15FEEF94E90] Transfer failed: Channel idle for too long during upload.

The file is left behind in broken state:

00008FBFF2A3F3B542BCA50AD15FEEF94E90 <---------E-L(0)[0]> 709295696 si={GM2.scratch}

2) Users trying to access these files via NFS get transfers stuck in D state (rather than failing with an I/O error):

This is dCache 7.2. We recently switched from GFTP to HTTPS and see a rather substantial number of broken files due to issue (1):

    305 si={GM2.scratch}
     73 si={egp.scratch}
     19 si={uboone.scratch}
     17 si={icarus.scratch}
      7 si={dune.scratch}
      1 si={mu2e.scratch}
paulmillar commented 2 years ago

This really sounds like two problems, so should be two separate issues.

I have some questions about the first problem (bad WebDAV upload):

  1. When this problem is encountered, typically how much of the file was uploaded?
  2. Are there any commonalities between the cases; e.g., same subset of pools?, same subset of clients?, same client software?

For the second problem (stuck in D state), I suspect it is triggered by the issue described in #6456. If so, it should go away once that issue is fixed. However, I also believe there is some other problem, as the error-recovery procedure shouldn't leave clients in such a bad state.

DmitryLitvintsev commented 2 years ago

Ideally, if the file associated with the failed transfer were removed, the second issue would not have happened. The second issue is NFS-specific and can be classified as "why don't we fail fast if a file replica is in Error state?". Yes, I agree this deserves a separate issue. This issue is then purely about how WebDAV handles failing uploads. Even without additional client-side info I can tell that the failures were not limited to a particular pool or set of pools; the pools where the broken files are located seem random to me.

paulmillar commented 2 years ago

OK, so there's a lot of things going wrong here.

  1. The upload failed.

This is the main failure here. Fixing this resolves all the problems. My questions about commonality in client and pool nodes, and how much of the file was uploaded, were intended to help identify the cause so this could be fixed.

  2. The client is broken: it reacted badly to a failed upload.

Given the upload failed (dCache didn't return a 201 status code), there were subsequent attempts to read the file. This is clearly behaviour that dCache doesn't anticipate. It looks like the client is broken; something whoever is responsible for the client should fix. However, it could also be an expectation mismatch -- are our users expecting different behaviour from dCache?

  3. dCache forcing the client to delete a bad upload.

The pool does not allow a client to read data from a broken replica under any circumstances. This makes sense when there are multiple replicas of a file (e.g., a copy on tape); however, for a file that was just uploaded, is this the correct behaviour? If the file is freshly uploaded then there is only one replica, so there is no other replica for the client to use. If (for example) 99% of a file were uploaded over a slow link, the file's data might still contain useful information.

Keeping bad uploads is something (I believe) other storage systems do. A failed upload is still accepted as a valid upload. However, we should check this. Also, I believe dCache's clients do not expect dCache to check whether an upload was successful. They always request the uploaded file's checksum, to verify the file was uploaded successfully (again, something that would need checking).
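
As a minimal sketch of that client-side check, assuming the door honours RFC 3230 digest requests (the URL below is just a placeholder):

# ask the door for the file's checksum after the upload;
# if supported, the response carries a header such as: Digest: adler32=...
curl -I -H "Want-Digest: adler32" https://dcache.example.org:2880/path/to/uploaded-file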

Accepting bad uploads as valid files (i.e., not deleting them) would require changes in dCache, but it might be worth thinking about.

However, it's certainly true that WebDAV's current behaviour (keeping a broken upload file only to force the client to delete it) makes no sense. I think we should fix this and delete broken uploads, at least in the short-term. Medium-to-longer term, I think it may be worth revisiting this decision.

  4. The NFS mover code contains a bug, where attempts to read a corrupt replica are handled badly.

This is pretty obviously a bug and should be easy to fix.

  5. The recovery procedure from bugs is also buggy.

Bugs can happen in any code, including pool movers. The pool should be robust against such problems. However, in this case, the recovery procedure resulted in the pool attempting to close the ReadHandlerImpl twice, which is (itself) a bug. This, too, should be fixed.

So, there's (at least) five problems here.

DmitryLitvintsev commented 2 years ago

Statistically, the error is rather noticeable. Here are the daily counts:

          day           | count 
------------------------+-------
 2022-01-13 00:00:00-06 |     1
 2022-01-14 00:00:00-06 |   378
 2022-01-15 00:00:00-06 |  1538
 2022-01-16 00:00:00-06 |  1994
 2022-01-17 00:00:00-06 |  1264
 2022-01-18 00:00:00-06 |  1414
 2022-01-19 00:00:00-06 |   770
 2022-01-20 00:00:00-06 |   529
 2022-01-21 00:00:00-06 |   686
 2022-01-22 00:00:00-06 |   699
 2022-01-23 00:00:00-06 |  1031
 2022-01-24 00:00:00-06 |  2187
 2022-01-25 00:00:00-06 |  1788
 2022-01-26 00:00:00-06 |   907
 2022-01-27 00:00:00-06 |   155
 2022-01-28 00:00:00-06 |   562
 2022-01-29 00:00:00-06 |   249
 2022-01-30 00:00:00-06 |   312
 2022-01-31 00:00:00-06 |   857
 2022-02-01 00:00:00-06 |   900
 2022-02-02 00:00:00-06 |  1382
 2022-02-03 00:00:00-06 |  2032
 2022-02-04 00:00:00-06 |   495
 2022-02-05 00:00:00-06 |  1065
 2022-02-06 00:00:00-06 |   870
 2022-02-07 00:00:00-06 |   250
(26 rows)

The breakdown of errors by storage class that I posted previously was obtained for the so-called scratch pools (volatile). I am now checking whether we have these broken files scattered across other pools (like NEARLINE/CUSTODIAL and ONLINE/REPLICA pools).

DmitryLitvintsev commented 2 years ago

I went ahead and looked at all these failures (using the billing DB). Many of the files have Cached (non-broken) replicas. I can cp these files with no issues. Now this makes me nervous: some files are corrupted as a result of this error, and some (supposedly) are not. I will try to get users to read these files and see whether they are readable by their frameworks. These are mostly ROOT files.

paulmillar commented 2 years ago

I'm not sure I fully understand your last comment.

When you say "I can cp these files", is that reading the file via an NFS mounted dCache?

Have you found files exhibiting these symptoms where there are multiple replicas, with (at least) one replica that is broken and (at least) one replica that is not broken? The broken replica has the system-sticky flag and the non-broken replica is cache-only.

If so, it would be very interesting to learn the history of one such file. When were the replicas made (and why: client upload, resilience, ...)?

At the moment, I think there is too little information to offer any concrete suggestion.

Please try to collect more information and (ideally) try to reproduce the problem.

DmitryLitvintsev commented 2 years ago

I am just saying that some of the files that failed with this error end up with Broken replicas and some end up with OK replicas. And I mentioned that I can cp (yes, over NFS) one of the unbroken ones. This opens a Pandora's box: the files failed on upload, but some of them will go to tape and I am not sure they are readable. All these files have single replicas.

paulmillar commented 2 years ago

OK, just to check I understand correctly.

You see files for which the pool records an entry like: Transfer failed: Channel idle for too long during upload

However, the replica is not marked broken on the pool and you can successfully download the file.

Is that correct?

DmitryLitvintsev commented 2 years ago

Yes this is correct. Example:

[fndca3b] (PnfsManager@namespaceDomain) enstore > \sl 0000CBC5A7DCE51F43F5860E5B982A4D0B92 rep ls 0000CBC5A7DCE51F43F5860E5B982A4D0B92
p-sbn-stkendca2028-5:
    0000CBC5A7DCE51F43F5860E5B982A4D0B92 <C-------X--L(0)[0]> 29619695 si={sbn.persdata}

Billing:

billing=# select d.datestamp, b.protocol, b.datestamp, d.errorcode, d.errormessage, b.errorcode, b.errormessage from billinginfo b inner join  doorinfo d on b.initiator = d.transaction where b.pnfsid = '0000CBC5A7DCE51F43F5860E5B982A4D0B92' and d.pnfsid = '0000CBC5A7DCE51F43F5860E5B982A4D0B92' and b.isnew is true
;
          datestamp          | protocol |         datestamp         | errorcode |               errormessage               | errorcode |               errormessage
-----------------------------+----------+---------------------------+-----------+------------------------------------------+-----------+------------------------------------------
 2022-02-06 03:49:22.775-06  | Http-1.1 | 2022-02-06 03:49:22.77-06 |     10004 | Channel idle for too long during upload. |     10004 | Channel idle for too long during upload.
(1 row)

(The above query prints both the errorcode/errormessage for the door and for the pool.) This does not look too good.

log files:

pool:

grep '0000CBC5A7DCE51F43F5860E5B982A4D0B92' p-sbn-stkendca2028-5Domain.log-20220207
06 Feb 2022 03:49:22 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXVlNEkLg WebDAV-fndca4b-1 PoolAcceptFile 0000CBC5A7DCE51F43F5860E5B982A4D0B92] Transfer failed: Channel idle for too long during upload.

door access file:

level=INFO ts=2022-02-06T03:44:07.795-0600 event=org.dcache.webdav.request request.method=PUT request.url=https://fndca1.fnal.gov:2880/sbn/data/sbn_nd/poms_production/official/MCP2021Cv2_BNBCosmics/v09_37_01_03/prodoverlay_corsika_cosmics_proton_genie_nu_spill_gsimple-configh-v1_tpc/gen_g4_detsim_reco1/fc/25/gen_g4_detsim_reco1-c508015f-7767-46ad-aa3a-ccb6dccaf39f.root response.code=307 location=http://131.225.69.157:23548/pnfs/fnal.gov/usr/sbn/data/sbn_nd/poms_production/official/MCP2021Cv2_BNBCosmics/v09_37_01_03/prodoverlay_corsika_cosmics_proton_genie_nu_spill_gsimple-configh-v1_tpc/gen_g4_detsim_reco1/fc/25/gen_g4_detsim_reco1-c508015f-7767-46ad-aa3a-ccb6dccaf39f.root?dcache-http-uuid=a0e2600b-c8f6-4d74-9d0e-5efdf1d64d56 socket.remote=137.99.174.13:47468 user-agent=curl/7.64.1 user.dn="CN=685171643,CN=1505365588,CN=658300149,CN=1685675135,CN=sbndpro-sbndgpvm01.fnal.gov,OU=Fermilab,O=Fermi Research Alliance,L=Batavia,ST=Illinois,C=US,DC=incommon,DC=org" user.mapped=52084:9470,9767,8844,8623,9470 transaction=door:WebDAV-fndca4b-1@webdavDomain:AAXXVlNEkLg:1644140647757000 duration=80

door log file : nothing.

Interestingly, the pool log is filled with only errors like these:

06 Feb 2022 03:49:22 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXVlNEkLg WebDAV-fndca4b-1 PoolAcceptFile 0000CBC5A7DCE51F43F5860E5B982A4D0B92] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 04:16:11 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXVrFkYBA WebDAV-fndca4b-1 PoolAcceptFile 00003698E3F349D94829AF9ACD675A1ACCFA] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 04:34:42 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXVvShsXA WebDAV-fndca4b-1 PoolAcceptFile 000005A70250ABB34050B20B83B0F600E519] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 05:18:54 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXV5KAJwA WebDAV-fndca4b-1 PoolAcceptFile 0000BF539F92E6124FB1A795291FF72FFC6B] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 06:00:58 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXWCnXGlA WebDAV-fndca4b-1 PoolAcceptFile 00002545D3B51E6A4A88BACD0C78C862E221] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 06:20:02 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXWG3G/og WebDAV-fndca4b-1 PoolAcceptFile 00005487B3129B5F423998E3E6719B65EF43] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 07:25:21 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXWVfr8Vg WebDAV-fndca4b-1 PoolAcceptFile 0000D3ED6E6A55D845C9ABC437FF8DF04C4F] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 09:36:53 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXWy1UGDg WebDAV-fndca4b-1 PoolAcceptFile 00003A420B9B8A754A2994CC34C3D7C37AEA] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 09:38:46 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXWzAFEyA WebDAV-fndca4b-1 PoolAcceptFile 00003F2D833C76AD464089964D1BD182518C] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 09:52:28 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXW2Z3G9g WebDAV-fndca4b-1 PoolAcceptFile 00008E35C6860B824A5A8153FC8DF6E6259B] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 11:23:56 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXXKzcKVA WebDAV-fndca4b-1 PoolAcceptFile 0000840AF0D0E4C145E8B8CC4871051DEF3E] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 13:17:36 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXXkQB7aA WebDAV-fndca4b-1 PoolAcceptFile 0000DB0452DEE8E4406D806847B77294E91D] Transfer failed: Channel idle for too long during upload.
06 Feb 2022 15:21:12 (p-sbn-stkendca2028-5) [door:WebDAV-fndca4b-1@webdavDomain:AAXXX/1Dy+g WebDAV-fndca4b-1 PoolAcceptFile 0000BBB1790EA50544DEADE301F75FB5B13E] Transfer failed: Channel idle for too long during upload.

And all these files (with the exception of 3 that got deleted) are Cached, not broken:

 p-sbn-stkendca2028-5:
    0000CBC5A7DCE51F43F5860E5B982A4D0B92 <C-------X--L(0)[0]> 29619695 si={sbn.persdata}

p-sbn-stkendca2028-5:
    00003698E3F349D94829AF9ACD675A1ACCFA <C-------X--L(0)[0]> 31426959 si={sbn.persdata}

p-sbn-stkendca2028-5:
    000005A70250ABB34050B20B83B0F600E519 <C-------X--L(0)[0]> 2243056641 si={sbn.persdata}

(3) CacheException(rc=10001;msg=No such file or directory: 0000BF539F92E6124FB1A795291FF72FFC6B) from \sl
(3) CacheException(rc=10001;msg=No such file or directory: 00002545D3B51E6A4A88BACD0C78C862E221) from \sl
(3) CacheException(rc=10001;msg=No such file or directory: 00005487B3129B5F423998E3E6719B65EF43) from \sl
p-sbn-stkendca2028-5:
    0000D3ED6E6A55D845C9ABC437FF8DF04C4F <C-------X--L(0)[0]> 204690302 si={sbn.persdata}

p-sbn-stkendca2028-5:
    00003A420B9B8A754A2994CC34C3D7C37AEA <C-------X--L(0)[0]> 74271213 si={sbn.persdata}

p-sbn-stkendca2028-5:
    00003F2D833C76AD464089964D1BD182518C <C-------X--L(0)[0]> 60779678 si={sbn.persdata}

p-sbn-stkendca2028-5:
    00008E35C6860B824A5A8153FC8DF6E6259B <C-------X--L(0)[0]> 35158269 si={sbn.persdata}

p-sbn-stkendca2028-5:
    0000840AF0D0E4C145E8B8CC4871051DEF3E <C-------X--L(0)[0]> 725398353 si={sbn.persdata}

p-sbn-stkendca2028-5:
    0000DB0452DEE8E4406D806847B77294E91D <C-------X--L(0)[0]> 44530267 si={sbn.persdata}

p-sbn-stkendca2028-5:
    0000BBB1790EA50544DEADE301F75FB5B13E <C-------X--L(0)[0]> 1786032673 si={sbn.persdata}
DmitryLitvintsev commented 2 years ago

Just to reiterate what happened: clients switched from GFTP to HTTPS as the default, and this is the most noticeable aftermath. When using GFTP there have always been transfer failures, but we never (well... rarely) saw broken files left behind. We noticed the broken files only because the DH service hung when all its threads got stuck in D state trying to access them. The fact that we also have un-broken files corresponding to failed transfers has only just come to light as I dug into the issue.

I have asked the DH experts to gather client-side logs/info to see what the client experienced. We will see whether the client can handle the failure. The hope is that it got an error code, so we can ask them to try to delete the namespace entry on failure as a mitigation.

This seems to be how the process goes:

  1. jobs upload files to a directory
  2. some process "looks" at the files in that directory and does something

So (1) has produced broken files, and (2) failed to detect them because it just got stuck accessing them.

paulmillar commented 2 years ago

OK. I have a working theory for some of the symptoms. I'm not sure it explains everything, but perhaps it's something we can test.

OK, first the bad news.

The way the pool works is that a mover can abort a transfer (in this case, because a client was idle for too long). This results in the mover having an error code; however, this does NOT mean the replica will be marked as broken. The replica is marked as broken if (and only if) the post-transfer service detects the replica is corrupt. This happens if (and only if) the WriteHandleImpl#commit method throws an exception, which happens if the file size doesn't match, or the namespace rejects registering the new file attributes (e.g., checksum doesn't match). So, if the replica has the wrong size or the wrong checksum then the replica will be marked as broken, and the mover error code and message will be updated with the problem. Just to reiterate: a mover aborting a transfer will not (by itself) result in the replica being marked broken.

Whether or not post-processing identifies the replica as corrupt, billing is notified of the mover error code. So, it's perfectly possible for billing to record an error (from the mover aborting the transfer) but the pool believes it has a valid (non-corrupt) replica.

The Channel idle for too long during upload message indicates that a write file handle was still open when the pool disconnected the client. Taking this at face value (assuming there are no subtle bugs), this might mean the pool believes more data was possible (the file is possibly incomplete). However, post-transfer processing didn't mark the replica as bad. This means the pool didn't know either the file's expected checksum or its expected size.

Normal HTTP uploads require the client to specify the Content-Length header. From this, the pool will know the expected file's size. Once all the data is accepted, the file handle is closed. If the data is incomplete then the pool will know this and mark the replica as broken.

However, HTTP also supports a mode called "chunked encoding". Using this mode, there is no Content-Length header, so the pool doesn't know the expected file size. When using chunked encoding, an incomplete upload will be stored in dCache as a valid file and can be downloaded without any problem (I've verified this).

The way this works: chunked encoding provides a special chunk that indicates the upload is complete. It could be the client simply forgot to send that chunk (dCache received a complete upload) or that the client got stuck (dCache received an incomplete transfer). There's no way to tell.
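
For illustration, an incomplete chunked upload can be simulated by streaming part of a file through a pipe: curl then cannot know the intended size and uses chunked encoding (the URL and file name are placeholders):

# only part of the file is sent, yet the chunked upload ends "cleanly" from the pool's perspective
head -c 100M bigfile.root | curl -T - https://dcache.example.org:2880/path/to/incomplete-upload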

So, open questions and suggested next steps:

  1. My working hypothesis is that the client is using chunked encoding.

Curl must use chunked encoding for an upload if the content doesn't come directly from a file; e.g., grep -v "^DEBUG" output.txt | curl. Chunked encoding can also be enabled manually when uploading a file (there are various ways of doing this).

Dmitry, could you ask someone familiar with the upload procedure to carefully verify whether the client uses chunked encoding for the upload?

  2. I don't have an explanation for why some uploads have replicas marked broken.

Dmitry, could you provide the error code and error message from billing for an upload where the pool marked the replica as broken? The pool's log should also record this information (a message starting Transfer failed in post-processing: ...).

  3. I do not understand why the connection is idle for ~five minutes.

My experiments with chunked-encoding uploads haven't been able to reproduce this behaviour. My suspicion is that this is a client-side artefact. Dmitry, could you ask someone familiar with the upload procedure whether they can think of a reason why the client might freeze during the upload?

DmitryLitvintsev commented 2 years ago

I am pretty sure these are not chunked transfers. They just upload files from worker nodes to dCache after the job has finished. I do know about the Content-Length header. I am pretty sure they just run something like:

curl ...   -Tfoo https://fndcaitb4.fnal.gov:2880/bar

Unless curl does it automatically. Tomorrow I will post more info here once I get it.

The log file entry is just a single line:

[root@stkendca1806 dcache-log]# zgrep 'AAXXEs+sAog' v-stkendca1806-1Domain.log-20220203.gz
02 Feb 2022 19:26:58 (v-stkendca1806-1) [door:WebDAV-fndca4b-1@webdavDomain:AAXXEs+sAog WebDAV-fndca4b-1 PoolAcceptFile 00008FBFF2A3F3B542BCA50AD15FEEF94E90] Transfer failed: Channel idle for too long during upload.

No "Transfer failed in post-processing" record.

The bit that logs this message does throw FileCorruptedCacheException:

   public void userEventTriggered(ChannelHandlerContext ctx, Object event) throws Exception {
        if (event instanceof IdleStateEvent) {
            IdleStateEvent idleStateEvent = (IdleStateEvent) event;
            if (idleStateEvent.state() == IdleState.ALL_IDLE) {
                if (LOGGER.isInfoEnabled()) {
                    LOGGER.info("Connection from {} id idle; disconnecting.",
                          ctx.channel().remoteAddress());
                }
                releaseAllFiles(uploadsSeeError(
                      new FileCorruptedCacheException("Channel idle for too long during upload.")));
                ctx.close();
            }
        }
    }

That pnfsid is gone (deleted). Another one:

[fndca3b] (local) enstore > \sl 00003C8034ACFC73477AA5FFD4FE34479105 rep ls 00003C8034ACFC73477AA5FFD4FE34479105 
v-stkendca1806-5:
    00003C8034ACFC73477AA5FFD4FE34479105 <---------E-L(0)[0]> 52041700 si={egp.scratch}
chimera=# select * from t_inodes_checksum where inumber = (select inumber from t_inodes where ipnfsid = '00003C8034ACFC73477AA5FFD4FE34479105');
  inumber   | itype |   isum   
------------+-------+----------
 6891721447 |     1 | 2078ccd6
(1 row)

CRC of file on disk matches - 0x2078ccd6.
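
For reference, one way to recompute the checksum of the on-disk replica and compare it against the chimera value (a sketch: the pool data path is illustrative, and this assumes isum is an adler32 - dCache's default checksum - and that xrdadler32 from the xrootd clients is installed):

# path to the replica in the pool's data directory is hypothetical
xrdadler32 /srv/dcache/pool/data/00003C8034ACFC73477AA5FFD4FE34479105
# should print 2078ccd6 followed by the file name if the on-disk data is intact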

Just feeling like being adventurous:

[fndca3b] (v-stkendca1806-5@v-stkendca1806-5Domain) enstore > rep set cached 00003C8034ACFC73477AA5FFD4FE34479105 
[fndca3b] (v-stkendca1806-5@v-stkendca1806-5Domain) enstore > rep ls 00003C8034ACFC73477AA5FFD4FE34479105 
00003C8034ACFC73477AA5FFD4FE34479105 <C----------L(0)[0]> 152346336 si={egp.scratch}

Then:

[litvinse@fnisd1 ~]$ gfal-copy https://fndca1.fnal.gov:2880//pnfs/fs/usr/egp/scratch/output/delete/_Stage1Spheres327/CosmicsStage1_EGP_207.root . 
Copying https://fndca1.fnal.gov:2880//pnfs/fs/usr/egp/scratch/output/delete/_StaCopying https://fndca1.fnal.gov:2880//pnfs/fs/usr/egp/scratch/output/delete/_StaCopying https://fndca1.fnal.gov:2880//pnfs/fs/usr/egp/scratch/output/delete/_Stage1Spheres327/CosmicsStage1_EGP_207.root   [FAILED]  after 0s
gfal-copy error: 5 (Input/output error) - Result HTTP 500 : Unexpected server error: 500 , while  readding after 1 attempts
[litvinse@fnisd1 ~]$ 

(if only NFS behaved the same !) 

and the file is back to E state:

[fndca3b] (v-stkendca1806-5@v-stkendca1806-5Domain) enstore > rep ls 00003C8034ACFC73477AA5FFD4FE34479105 
00003C8034ACFC73477AA5FFD4FE34479105 <---------E-L(0)[0]> 52041700 si={egp.scratch}
paulmillar commented 2 years ago

Thanks for the update, Dmitry.

The file 00003C8034ACFC73477AA5FFD4FE34479105 looks interesting.

Some questions:

  1. Did the pool log any Transfer failed in post-processing message when the client uploaded this file?

  2. Does the 500 status code (that gfal-copy reported) come from the door or the pool? Is anything logged by that service, as a result of this interaction?

  3. Seemingly, the pool updated the replica, putting it back into error/broken state. Did it log why it did this?

DmitryLitvintsev commented 2 years ago

Statistically speaking, this is the breakdown of all errors for the HTTP protocol for today so far:

count  |                                      errormessage                                       
-------+-----------------------------------------------------------------------------------------
 31022 | No connection from client after 300 seconds. Giving up.
  1354 | java.io.IOException: Connection reset by peer
   832 | Channel idle for too long during upload.
    48 | Connection lost before end of file.
    19 | java.io.IOException: Broken pipe
     1 | Post-processing failed: No such file or directory: 00001F23738D41314265BB77AEDFA60EA6E1
     1 | Post-processing failed: No such file or directory: 00005EC445E9499B43D2838DF759106C27D4
     1 | Post-processing failed: No such file or directory: 000065C44DB2C5154486B003625BBF9A559C
     1 | Post-processing failed: No such file or directory: 0000743FFBBFF3E141CE8712912912E10747
     1 | Post-processing failed: No such file or directory: 0000852C9F525166400CA5F4D5F7C55A6D30
     1 | Post-processing failed: No such file or directory: 0000889D2A3E1882414A876849880A21BEED
     1 | Post-processing failed: No such file or directory: 00009103A75AD8774F9ABA8357638BB1301E
     1 | Post-processing failed: No such file or directory: 00009D3FAF2D131C44E0A2D774E1245A03E6
     1 | Post-processing failed: No such file or directory: 0000A1DCAF7AF7B346A09573728088B463BA
     1 | Post-processing failed: No such file or directory: 0000C8F07D9BE6B8479F8B4861E9459D2019
     1 | Post-processing failed: No such file or directory: 0000DA666686DEDD4D09A0377C844FB8EC45
     1 | Post-processing failed: No such file or directory: 0000E3E352767A7444E1BB048A313155B6B3
     1 | Post-processing failed: No such file or directory: 0000F8780EBDDA11415CB39916AD48E1B31B
     1 | Transfer forcefully killed: Active transfer cancelled: door timed out before pool
     1 | Post-processing failed: No such file or directory: 00000CD9357259654B9A897541A9BEB238F6
     1 | Post-processing failed: No such file or directory: 0000128C0791D5C74196BD3FA7197F1A665C
     1 | Post-processing failed: No such file or directory: 00001321FE534873458BA638A870C63EBB63
     1 | Post-processing failed: No such file or directory: 00001687FDF4C93A48D68985E7D5187DDD64

breakdown of errors for a VO that is, for some reason, particularly affected:

  protocol  | total  | fail | success | isnew 
------------+--------+------+---------+-------
 Http-1.1   | 923807 | 4413 |  919394 | t
 Xrootd-5.0 | 783082 |    3 |  783079 | f
 GFtp-2.0   | 240945 |   36 |  240909 | f
 GFtp-2.0   | 114473 |  103 |  114370 | t
 NFS4-4.1   |  57640 |    0 |   57640 | f
 Http-1.1   |  51491 |    3 |   51488 | f
 NFS4-4.1   |    218 |    0 |     218 | t
(7 rows)
DmitryLitvintsev commented 2 years ago

(Attached: www_cp.sh.txt.) OK, got hold of the client-side log; it looks like so:

ifdh cp -D gm2DQC_ana_29037383_44371.00255.root /pnfs/GM2/scratch/users/gm2pro/prod/DQCAna/5304A/Run4QuadRF-v9_74/2022-02-03-02-05-30/data
Thu Feb  3 08:18:38 2022
 program: www_cp.sh  /srv/gm2DQC_ana_29037383_44371.00255.root https://fndca1.fnal.gov:2880/GM2/scratch/users/gm2pro/prod/DQCAna/5304A/Run4QuadRF-v9_74/2022-02-03-02-05-30/data/gm2DQC_ana_29037383_44371.00255.rootexited status 55
delaying 0 ...
Thu Feb  3 08:18:38 2022
 retrying...
Thu Feb  3 08:18:39 2022
 program: www_cp.sh  /srv/gm2DQC_ana_29037383_44371.00255.root https://fndca1.fnal.gov:2880/GM2/scratch/users/gm2pro/prod/DQCAna/5304A/Run4QuadRF-v9_74/2022-02-03-02-05-30/data/gm2DQC_ana_29037383_44371.00255.rootexited status 22
delaying 34 ...
Thu Feb  3 08:19:13 2022
 retrying...

!!!Failed to copy gm2DQC_ana_29037383_44371.00255.root to /pnfs/GM2/scratch/users/gm2pro/prod/DQCAna/5304A/Run4QuadRF-v9_74/2022-02-03-02-05-30/data !!!

Thu Feb 03 08:19:14 UTC 2022   

The www_cp.sh script is a wrapper over curl and is attached above.

Essentially it calls:

curl -f -L --silent --capath .. --key .. --cacert .. --cert .. -T${src} https://fndca1.fnal.gov:2880${dst}

It gets code 55, then retries, gets code 22, and quits...

AND it looks like it fails the second PUT because the destination already exists. Indeed:

[litvinse@fnisd1 ~]$ curl -f  -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u8637 --cacert /tmp/x509up_u8637 --key  /tmp/x509up_u8637 -Tna49.root  https://fndca4a.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/t2.root
115337 bytes uploaded

[litvinse@fnisd1 ~]$ curl -f  -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u8637 --cacert /tmp/x509up_u8637 --key  /tmp/x509up_u8637 -Tna49.root  https://fndca4a.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/t2.root
curl: (22) The requested URL returned error: 409 Conflict

Isn't PUT supposed to replace an existing resource?!

Ah...

dcache.enable.overwrite=false

strikes again.

paulmillar commented 2 years ago

Thanks @DmitryLitvintsev , that's very interesting information.

A couple of observations:

  1. www_cp.sh accepts arbitrary curl arguments as command-line options.

These are (seemingly) not logged, so we don't know if this feature is being used.

One could run the script with an option like '-H "Transfer-Encoding: chunked"'. This would result in curl using chunked encoding, even though the file is uploaded using the -T option. So, I'm not sure we can completely eliminate the possibility that curl is using chunked encoding when uploading the file.

  2. The script uses chunked encoding for server-to-server copies.

Server-to-server copies are supported by the script using one curl process and piping the output to another curl process. This forces the second curl process (the upload) to use chunked encoding.

That said, you're absolutely right that for direct uploads (transferring a file from the local filesystem to remote storage) the script doesn't use chunked upload by default. The specific example here is an upload.

The first invocation of www_cp.sh fails with error code 55. The curl exit-code documentation says:

Failed sending network data. Sending data over the network is a crucial part of most curl operations and when curl gets an error from the lowest networking layers that the sending failed, this exit status gets returned. To pinpoint why this happens, some serious digging is usually required. Start with enabling verbose mode, do tracing and if possible check the network traffic with a tool like Wireshark or similar.

This suggests the problem is either at the networking layer, or something masquerading as such.
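
If the job wrapper can be tweaked, the verbose/trace output that the curl documentation suggests could be captured along these lines (the same options as www_cp.sh, certificate arguments elided as above; the trace file name is arbitrary):

curl -v -f -L --capath .. --key .. --cacert .. --cert .. -T${src} https://fndca1.fnal.gov:2880${dst}
# or, to write a full wire trace to a file:
curl --trace-ascii /tmp/curl-upload-trace.log -f -L --capath .. --key .. --cacert .. --cert .. -T${src} https://fndca1.fnal.gov:2880${dst}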

Questions about this update:

  1. Are we sure the client was redirected for this transfer (see access log file)?
  2. Was/is the replica marked broken in the repository (pool logs Transfer failed in post-processing:)?
  3. Does billing contain an error (from the pool) about this upload?
  4. Was anything recorded in the pool's log file about this transfer?

The second error is 22, which seems to be a generic return code for when the HTTP request solicits an error status code (>= 400). The above-mentioned page says:

HTTP page not retrieved. The requested url was not found or returned another error with the HTTP error code being 400 or above. This return code only appears if -f, --fail is used

As you mentioned, dcache.enable.overwrite=false (if set) is the most likely culprit. The WebDAV door's access log file should confirm that the error is coming from the door (and not the pool).

DmitryLitvintsev commented 2 years ago

Yes it is logged. The script is called with these arguments:

program: www_cp.sh  /srv/gm2DQC_ana_29037383_44371.00255.root https://fndca1.fnal.gov:2880/GM2/scratch/users/gm2pro/prod/DQCAna/5304A/Run4QuadRF-v9_74/2022-02-03-02-05-30/data/gm2DQC_ana_29037383_44371.00255.root

Local source, remote destination. So only the -f -L --silent --capath .. --key .. --cacert .. --cert .. curl options are used. Setting overwrite to true may help some. Still, broken files should not be left behind, don't you think?

paulmillar commented 2 years ago

OK, thanks for the info.

Yes, broken files should be removed: there's no reason to keep them.

However, I'm still curious what is causing these transfers to hang: we still need to fix that problem.

All things being equal, the pool will only accept the file if the client sent the expected number of bytes.

So, my best guess is that the client somehow sends all the file's data, but somehow there is no LastHttpContent event. It's this event that triggers the clean-up process and (eventually) responding to the client with the result of the PUT request (e.g., "201 Created" status line + the response headers).

I might be able to put some diagnostics into the pool, to capture more information, but I'm really just shooting in the dark.

It would be very helpful if you could reproduce the problem, or perhaps identify whether this is a one-off problem or something systemic with uploading that particular file.

One idea would be to update the door's configuration, allowing a client to overwrite the file. If the initial upload fails and the subsequent upload succeeds then we know the problem is somehow stochastic in nature.
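
A minimal sketch of that configuration change, using the property Dmitry quoted above (set it in dcache.conf, or in the WebDAV door's section of the layout file, and restart the door):

# allow a PUT to replace an existing file
dcache.enable.overwrite=true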

DmitryLitvintsev commented 2 years ago

An update on trying to reproduce the problem.

When using the exact curl command

[litvinse@fnisd1 scitokens]$ curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u8637 --cacert /tmp/x509up_u8637 --key  /tmp/x509up_u8637  -Tjunk https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.3

I wait for a mover to show up, and then suspend the transfer:

^Z
[1]+  Stopped                 curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u8637 --cacert /tmp/x509up_u8637 --key /tmp/x509up_u8637 -Tjunk https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.3

I wait for the pool to kill the mover, then resume the transfer:

[litvinse@fnisd1 scitokens]$ fg
curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u8637 --cacert /tmp/x509up_u8637 --key /tmp/x509up_u8637 -Tjunk https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.3
curl: (55) Send failure: Broken pipe
[litvinse@fnisd1 scitokens]$ 

After that I see file remaining in the namespace:

[root@fndcaitb2 scratch]# ls -al  /pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.3
-rw-rw-rw- 1 litvinse 9767 642252800 Feb 25 11:51 /pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.3
[root@fndcaitb2 scratch]# 

and the replica is broken on the pool :+1:

[fndcaitb3] (v-stkendca28a-5@v-stkendca28a-5Domain) enstore > rep ls 00006FC29A00A5E04691A349B257EC66B057 
00006FC29A00A5E04691A349B257EC66B057 <---------E-L(0)[0]> 642252800 si={test2.scratch}

And I see this in pinboard:

25 Feb 2022 11:46:29 [v-stkendca28a-5-26] [door:WebDAV-fndcaitb4-1@webdavDomain:AAXY20dKnVg WebDAV-fndcaitb4-1 PoolAcceptFile 00006FC29A00A5E04691A349B257EC66B057] Creating new entry for 00006FC29A00A5E04691A349B257EC66B057
25 Feb 2022 11:51:48 [http-net-0] [] Connection from /2620:6a:0:87:219:b9ff:fe03:7377:47750 id idle; disconnecting.
25 Feb 2022 11:51:48 [http-net-0] [door:WebDAV-fndcaitb4-1@webdavDomain:AAXY20dKnVg WebDAV-fndcaitb4-1 PoolAcceptFile 00006FC29A00A5E04691A349B257EC66B057] Transfer failed: Channel idle for too long during upload.
25 Feb 2022 11:51:48 [post-transfer-2] [door:WebDAV-fndcaitb4-1@webdavDomain:AAXY20dKnVg WebDAV-fndcaitb4-1 PoolAcceptFile 00006FC29A00A5E04691A349B257EC66B057] Marking pool entry 00006FC29A00A5E04691A349B257EC66B057 on v-stkendca28a-5 as BROKEN
25 Feb 2022 11:51:48 [post-transfer-2] [door:WebDAV-fndcaitb4-1@webdavDomain:AAXY20dKnVg WebDAV-fndcaitb4-1 PoolAcceptFile 00006FC29A00A5E04691A349B257EC66B057] Transfer failed in post-processing: File size mismatch (expected=837787900, actual=642252800)

GOOD NEWS: trying the same using gfal-copy DOES NOT leave a broken file behind. Exact command:

gfal-copy junk  https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4

Again, wait till the mover runs, suspend, wait for the mover kill, resume:

[root@fndcaitb2 scratch]# ls -al /pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4
ls: cannot access /pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4: No such file or directory

in the pinboard I see:

25 Feb 2022 11:40:00 [http-net-4] [door:WebDAV-fndcaitb4-1@webdavDomain:AAXY2x3fW8A WebDAV-fndcaitb4-1 PoolAcceptFile 000018DB9B63720F4364A61AD653810ACE63] Transfer failed: Channel idle for too long during upload.
25 Feb 2022 11:40:00 [post-transfer-23] [door:WebDAV-fndcaitb4-1@webdavDomain:AAXY2x3fW8A WebDAV-fndcaitb4-1 PoolAcceptFile 000018DB9B63720F4364A61AD653810ACE63] Marking pool entry 000018DB9B63720F4364A61AD653810ACE63 on v-stkendca19a-3 as BROKEN
25 Feb 2022 11:40:00 [post-transfer-23] [door:WebDAV-fndcaitb4-1@webdavDomain:AAXY2x3fW8A WebDAV-fndcaitb4-1 PoolAcceptFile 000018DB9B63720F4364A61AD653810ACE63] Transfer failed in post-processing: File size mismatch (expected=837787900, actual=218103808)
25 Feb 2022 11:42:39 [pool-7-thread-3] [cleaner PoolRemoveFiles] remove entry 000018DB9B63720F4364A61AD653810ACE63: cleaner@cleanerDomain [PoolRemoveFiles]

So, it looks like the gfal client explicitly removes the file on failure and curl obviously does not. Indeed, in the access file I see a DELETE request. Full sequence:

level=WARN ts=2022-02-25T11:34:53.389-0600 event=org.dcache.webdav.request request.method=HEAD request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=404 response.reason="Not Found" socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:35988 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=44 

level=WARN ts=2022-02-25T11:34:53.482-0600 event=org.dcache.webdav.request request.method=HEAD request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=404 response.reason="Not Found" socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:35988 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=19

level=WARN ts=2022-02-25T11:34:53.594-0600 event=org.dcache.webdav.request request.method=HEAD request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=404 response.reason="Not Found" socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:35988 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=17

level=WARN ts=2022-02-25T11:34:53.748-0600 event=org.dcache.webdav.request request.method=HEAD request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=404 response.reason="Not Found" socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:35988 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=16

level=WARN ts=2022-02-25T11:34:53.846-0600 event=org.dcache.webdav.request request.method=HEAD request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=404 response.reason="Not Found" socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:35988 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=16

level=WARN ts=2022-02-25T11:34:54.022-0600 event=org.dcache.webdav.request request.method=HEAD request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=404 response.reason="Not Found" socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:35988 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=17

level=INFO ts=2022-02-25T11:34:54.426-0600 event=org.dcache.webdav.request request.method=PUT request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=307 location=http://[2620:6a:0:4812:f0:0:69:31]:22390/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4?dcache-http-uuid=9f18ce8c-3e71-4b96-b676-009c1e96286e socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:35988 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 transaction=door:WebDAV-fndcaitb4-1@webdavDomain:AAXY2x3fW8A:1645810494234000 duration=258

level=INFO ts=2022-02-25T11:41:56.424-0600 event=org.dcache.webdav.request request.method=DELETE request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=204 socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:36842 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=70
level=WARN ts=2022-02-25T11:41:56.589-0600 event=org.dcache.webdav.request request.method=DELETE request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=404 response.reason="Not Found" socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:36842 user-agent="gfal2-util/1.6.0 gfal2/2.18.1 neon/0.0.29" user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 duration=16

level=INFO ts=2022-02-25T11:46:29.269-0600 event=org.dcache.webdav.request request.method=PUT request.url=https://fndcaitb4.fnal.gov:2880/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4 response.code=307 location=http://[2620:6a:0:4812:f0:0:69:35]:21324/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.4?dcache-http-uuid=48c7738e-0403-4c17-92b7-90a986909cfd socket.remote=[2620:6a:0:87:219:b9ff:fe03:7377]:37388 user-agent=curl/7.29.0 user.dn="CN=1231504422,CN=UID:litvinse,CN=Dmitry Litvintsev,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org" user.mapped=8637:9767,3200,9985,5063,9767,9320,8623,9617,9874,9010,8660,9620,9303,5111 transaction=door:WebDAV-fndcaitb4-1@webdavDomain:AAXY20dKnVg:1645811189120000 duration=237

NB: this failure is not exactly the kind seen in production - in many broken-replica cases the file size seems correct, and I did not see "Transfer failed in post-processing" together with "Channel idle for too long during upload." in the pool log files.

DmitryLitvintsev commented 2 years ago

To me the question is: should dCache remove the broken file resulting from a failed upload? Is that a good idea, and would it play nice with gfal-copy, which would also try to delete the failed destination?

paulmillar commented 2 years ago

The more I think about this, the more I'm coming to the opinion that dCache shouldn't delete bad uploads.

Instead, the client should delete "bad" uploads, if that's the client's preferred recovery strategy. (The client might have other recovery strategies).

Going further, I think the pool and doors should be updated so that, if dCache knows the upload is broken (upload too short, expected checksum doesn't match replica's checksum) then we return an error to the client; however, the replica is NOT marked broken and the door doesn't delete the file. In other words, we accept whatever data we're given and it's the client's responsibility to decide what to do if it's broken.

To be clear, there are several ways a file's replica can end up on a pool: client upload, pool2pool transfer, restore from tape. It's only for an initial upload (NFS can support multiple uploads for the same file) that dCache accepts whatever the client uploads. For the other operations (tape stage, pool2pool) the pool marks the replica as broken.

So, in summary, I believe the correct solution to this problem is:

  1. fix whatever underlying problem is causing the uploads to fail,
  2. update the client to delete the upload on bad uploads, if that's their chosen recovery strategy (a rough curl-based sketch of such clean-up is below).
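
A rough sketch of such client-side clean-up, in the spirit of the www_cp.sh wrapper above (certificate options elided as in the original script; whether to delete on failure is, of course, the client's choice):

if ! curl -f -L --capath .. --key .. --cacert .. --cert .. -T${src} https://fndca1.fnal.gov:2880${dst}; then
    # the PUT failed: remove whatever namespace entry was left behind,
    # mirroring the DELETE that gfal-copy was seen to issue in the access log above
    curl -f -L --capath .. --key .. --cacert .. --cert .. -X DELETE https://fndca1.fnal.gov:2880${dst}
fi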
DmitryLitvintsev commented 2 years ago

Here is how globus-url-copy behaves:

  1. Ctrl-C in the middle of the transfer:
[litvinse@fnisd1 scitokens]$ globus-url-copy -vb file:////`pwd`/junk gsiftp://fndcaitb4.fnal.gov//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp
Source: file://///home/litvinse/scitokens/
Dest:   gsiftp://fndcaitb4.fnal.gov//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/
  junk  ->  junk.gsiftp

^C  163577856 bytes        26.24 MB/sec avg        55.45 MB/sec inst
Cancelling copy...

No namespace entry remains:

[root@fndcaitb2 ~]# //pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp
-bash: //pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp: No such file or directory
[root@fndcaitb2 ~]# 

No explicit removal of target, just error on PUT:

level=INFO ts=2022-02-28T08:52:26.994-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTJgUXg@gridftp0-fndcaitb4Domain command="ENC{SITE CLIENTINFO scheme=gsiftp;appname=\"globus-url-copy\";appver=\"10.10 (gcc64, 1629915172-0) [Grid Community Toolkit 6.2]\";}" reply="ENC{250 OK}"
level=INFO ts=2022-02-28T08:52:26.996-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTJgUXg@gridftp0-fndcaitb4Domain command="ENC{TYPE I}" reply="ENC{200 Type set to I}"
level=INFO ts=2022-02-28T08:52:26.997-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTJgUXg@gridftp0-fndcaitb4Domain command="ENC{ALLO 837787900}" reply="ENC{200 ALLO command successful}"
level=INFO ts=2022-02-28T08:52:29.147-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTJgUXg@gridftp0-fndcaitb4Domain command="ENC{PUT path=//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp;pasv;}" reply="ENC{127 PORT (131,225,191,91,87,230)}"
level=INFO ts=2022-02-28T08:52:29.156-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTJgUXg@gridftp0-fndcaitb4Domain command="ENC{PUT path=//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp;pasv;}" reply="ENC{150 Opening BINARY data connection for /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp}"
level=INFO ts=2022-02-28T08:52:32.811-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTJgUXg@gridftp0-fndcaitb4Domain command="ENC{PUT path=//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp;pasv;}" reply="ENC{451 Aborting transfer due to session termination}"
level=INFO ts=2022-02-28T08:52:32.813-0600 event=org.dcache.ftp.disconnect host.remote=131.225.80.210:59792 session=door:GFTP0-fndcaitb4-AAXZFTJgUXg@gridftp0-fndcaitb4Domain
  2. Interrupting and waiting for the JTM to kill the mover:
    
    [litvinse@fnisd1 scitokens]$ globus-url-copy -vb file:////`pwd`/junk gsiftp://fndcaitb4.fnal.gov//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1
    Source: file://///home/litvinse/scitokens/
    Dest:   gsiftp://fndcaitb4.fnal.gov//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/
    junk  ->  junk.gsiftp.1

^Z  192937984 bytes        51.83 MB/sec avg        42.00 MB/sec inst
[1]+  Stopped                 globus-url-copy -vb file:////`pwd`/junk gsiftp://fndcaitb4.fnal.gov//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1
[litvinse@fnisd1 scitokens]$ fg
globus-url-copy -vb file:////`pwd`/junk gsiftp://fndcaitb4.fnal.gov//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1
  225443840 bytes         0.02 MB/sec avg         0.00 MB/sec inst
error: globus_ftp_client: the server responded with an error


No namespace entry left behind:

[root@fndcaitb2 ~]# ls -al /pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1
ls: cannot access /pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1: No such file or directory
[root@fndcaitb2 ~]#


There is no explicit DELETE executed by the client:

level=INFO ts=2022-02-28T08:54:37.392-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTotwTg@gridftp0-fndcaitb4Domain command="ENC{SITE CLIENTINFO scheme=gsiftp;appname=\"globus-url-copy\";appver=\"10.10 (gcc64, 1629915172-0) [Grid Community Toolkit 6.2]\";}" reply="ENC{250 OK}"
level=INFO ts=2022-02-28T08:54:37.393-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTotwTg@gridftp0-fndcaitb4Domain command="ENC{TYPE I}" reply="ENC{200 Type set to I}"
level=INFO ts=2022-02-28T08:54:37.395-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTotwTg@gridftp0-fndcaitb4Domain command="ENC{ALLO 837787900}" reply="ENC{200 ALLO command successful}"
level=INFO ts=2022-02-28T08:54:37.717-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTotwTg@gridftp0-fndcaitb4Domain command="ENC{PUT path=//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1;pasv;}" reply="ENC{127 PORT (131,225,69,31,92,204)}"
level=INFO ts=2022-02-28T08:54:37.721-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTotwTg@gridftp0-fndcaitb4Domain command="ENC{PUT path=//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1;pasv;}" reply="ENC{150 Opening BINARY data connection for /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1}"
level=INFO ts=2022-02-28T09:55:24.770-0600 event=org.dcache.ftp.response session=door:GFTP0-fndcaitb4-AAXZFTotwTg@gridftp0-fndcaitb4Domain command="ENC{PUT path=//pnfs/fnal.gov/usr/test/litvinse/world_readable/scratch/junk.gsiftp.1;pasv;}" reply="ENC{451 Transfer forcefully killed: Active transfer cancelled: killed by JTM}"
level=INFO ts=2022-02-28T12:24:48.952-0600 event=org.dcache.ftp.disconnect host.remote=131.225.80.210:60060 session=door:GFTP0-fndcaitb4-AAXZFTotwTg@gridftp0-fndcaitb4Domain



So namespace removal is handled by the door, I assume. I think this is more user-friendly (and better for the server side - no junk accumulates). Your proposed solution of not even marking the replica as broken does not strike me as ideal: users may never discover that they wrote junk until it is too late.