Closed todor-ivanov closed 1 year ago
In the course of extra tests I did while working on this bug fix, I explicitly checked whether gfal interprets the two different protocol errors (from the WebDAV and SRMv2 protocols) for missing files equally, and whether it returns the same error code for both of them [1]. This can clearly be seen in [1]: the error code is always 2, while only the error message changes, and gfal is not simply propagating an HTTP 404 error code in the exception. This is very good for us because we do not have to add extra error handling for the underlying protocol difference. Of course, this was the original justification for choosing to work with gfal in the first place, instead of a plethora of different clients for every possible protocol we can find at a site.
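A minimal sketch of how a caller could rely on that uniform code rather than on the protocol-specific message. It assumes that code 2 corresponds to the POSIX `ENOENT` value and that the exception exposes it as `.code` (as the session in [1] shows). `FakeGError` and `is_missing_entry` are hypothetical names used only so the example runs without gfal2 installed.

```python
import errno


def is_missing_entry(exc):
    """Return True if the exception carries the 'no such file or directory' code."""
    # getattr keeps the helper safe for exceptions that have no `.code` attribute.
    return getattr(exc, "code", None) == errno.ENOENT


class FakeGError(Exception):
    """Hypothetical stand-in for gfal2.GError: a message plus a numeric code."""

    def __init__(self, message, code):
        super().__init__(message)
        self.message = message
        self.code = code


# Messages differ per protocol, the code does not (values taken from [1]).
webdav_exc = FakeGError("HTTP 404 : File not found", 2)
srm_exc = FakeGError("[SE][Ls][SRM_INVALID_PATH] No such file or directory", 2)
# An unrelated failure, added only for contrast (assumed, not from the session).
other_exc = FakeGError("Permission denied", errno.EACCES)

results = [is_missing_entry(e) for e in (webdav_exc, srm_exc, other_exc)]
print(results)
```

Checking the numeric code keeps the caller independent of the wording each protocol plugin puts in the message.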
One more thing I noticed: the two protocols differ in the access control attributes they return for the directory entry [2]. But this is just a side note.
If we need more information on WebDAV internals and how it handles access control etc., here is the full documentation:
@amaltaro @vkuznet @khurtado
[1]
In [58]: try:
    ...:     ctx.listdir('davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/2800000')
    ...: except Exception as ex:
    ...:     excWebDav = ex
    ...:
In [59]: try:
    ...:     ctx.listdir('srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/2800000')
    ...: except Exception as ex:
    ...:     excSRMv2 = ex
    ...:
In [60]: excWebDav
Out[60]: gfal2.GError('HTTP 404 : File not found ', 2)
In [61]: excSRMv2
Out[61]: gfal2.GError('Error reported from srm_ifce : 2 [SE][Ls][SRM_INVALID_PATH] No such file or directory /pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/2800000', 2)
In [63]: excSRMv2.code
Out[63]: 2
In [64]: excWebDav.code
Out[64]: 2
[2]
In [29]: ctx.stat('srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/280000')
Out[29]:
uid: 52
gid: 53
mode: 40755
size: 512
nlink: 1
ino: 0
ctime: 1623223126
atime: 0
mtime: 1623267391
In [35]: ctx.stat('davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/280000/')
Out[35]:
uid: 0
gid: 0
mode: 40777
size: 0
nlink: 0
ino: 0
ctime: 1623223126
atime: 0
mtime: 1623267391
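To make the difference in [2] easier to read, here is a small sketch that decodes the two `mode` values with Python's standard `stat` module. It assumes the printed values are octal (the leading `40` matches the directory type bits); the helper name `describe` is hypothetical.

```python
import stat

# Mode values copied from the two outputs in [2], read as octal (assumption).
srm_mode = 0o40755
webdav_mode = 0o40777


def describe(mode):
    """Split a mode into the file-type check and the permission string."""
    return {"is_dir": stat.S_ISDIR(mode), "perms": stat.filemode(mode)}


srm_info = describe(srm_mode)
webdav_info = describe(webdav_mode)
print("SRMv2 :", srm_info)
print("WebDAV:", webdav_info)
```

Both entries decode as directories; only the permission bits differ, alongside the uid, gid, size and nlink fields, which are zero in the WebDAV output.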
Impact of the bug
MSUnmerged
Describe the bug
I am not quite sure yet if this is a real bug.
While redeploying the latest version of MSUnmerged I noticed one exception, [1], which was not supposed to be raised. This is an HTTP error raised by gfal2, related to a missing file/directory. If my interpretation of this error is correct, it is coming from this line: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L456
It is raised while the service is trying to list a directory which is missing. The problem is that just above this line we have already done a stat on this very same entry in order to figure out whether it is a file or a directory: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L439
This operation is already enclosed in a try - except block, and gfal2 did not raise any exception (be it a gfal one or a generic one). The result is that this error is handled in the pipeline and is treated as a general error from the pipeline at this line: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L295
This causes the cycle to drop the current RSE and jump to the next one.
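The situation described above, and one possible way to guard against it on the caller side, can be sketched as follows. This is not the MSUnmerged code. `FakeGError`, `FakeContext` and `list_if_present` are hypothetical names, and the fake context only mimics the observed behaviour (stat succeeds, listdir fails with code 2) so the example runs without gfal2.

```python
import errno


class FakeGError(Exception):
    """Hypothetical stand-in for gfal2.GError: a message plus a numeric code."""

    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


class FakeContext:
    """Hypothetical context: stat succeeds for every path, but listdir
    fails for paths that are not in `dirs`, mimicking the reported case."""

    def __init__(self, dirs):
        self.dirs = dirs

    def stat(self, path):
        return {"path": path}

    def listdir(self, path):
        if path not in self.dirs:
            raise FakeGError("HTTP 404 : File not found", errno.ENOENT)
        return list(self.dirs[path])


def list_if_present(ctx, path):
    """Return the directory listing, or None if the entry is missing."""
    try:
        # Both remote calls sit in the same try block, so a missing entry
        # is handled the same way whichever call reports it.
        ctx.stat(path)
        return ctx.listdir(path)
    except Exception as exc:
        if getattr(exc, "code", None) == errno.ENOENT:
            return None
        raise  # anything else is still a real error for the pipeline


ctx = FakeContext({"/store/unmerged/present": ["a.root", "b.root"]})
present = list_if_present(ctx, "/store/unmerged/present")
missing = list_if_present(ctx, "/store/unmerged/missing")
print(present, missing)
```

With a guard of this kind, a missing directory would be skipped rather than surfacing as a general pipeline error that drops the whole RSE. Whether that is the right fix depends on why stat did not raise in the first place.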
How to reproduce it
No idea yet.
Expected behavior
A missing file or directory entry should raise an exception equally for all remote operations (e.g. stat and listdir).
Additional context and error message
[1]