FYI @amaltaro
Thanks for creating this issue, Todor. Do I read it right that there are only 2 problems reported: 'No such file or directory' or 'Communication error on send'? If so, I guess the only ones that deserve attention are those reporting a communication error, right? It could be that this is a somewhat generic error from the gfal plugin though.
Hi @amaltaro, yes I think you are correct. I'll double-check today.
The more I look into those logs, the more confident I get that we should exclude the /store/unmerged/SAM directory from the tree to be cleaned. The contents there are too volatile: files are constantly written and deleted by the SAM tests, which in general seem to do a good job of cleaning up after themselves, and the contents are usually small anyway. If, on the other hand, we keep this part of the tree, we will constantly be blocked by one or two non-empty directories per site, and hence unable to announce some sites clean for quite a long time.
We are going to have a deployment in production today due to another minor configuration error I've found, so I am about to make the relevant configuration PRs for excluding this part of the tree too.
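To illustrate what such an exclusion amounts to, here is a hypothetical sketch (names like `excludeDirs` are illustrative, not the actual MSUnmerged configuration keys): it boils down to a prefix filter over the directories considered for deletion.

```python
# Hypothetical sketch of excluding a subtree (e.g. /store/unmerged/SAM)
# from the cleanup candidates; option names are illustrative only.
excludeDirs = ["/store/unmerged/SAM"]

def isExcluded(dirPath, excluded=excludeDirs):
    """Return True if dirPath is, or lives under, any excluded subtree."""
    return any(dirPath == d or dirPath.startswith(d.rstrip("/") + "/")
               for d in excluded)

dirs = ["/store/unmerged/SAM/testSRM", "/store/unmerged/RunIISummer20UL17RECO/X"]
print([d for d in dirs if not isExcluded(d)])
# -> ['/store/unmerged/RunIISummer20UL17RECO/X']
```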
FYI @amaltaro
And here are the relevant configuration changes:
@ivmfnal Igor, don't we have this path skipped from the Rucio ConMon system?
@todor-ivanov the deployment is just the deployment of the new configuration, right?
Hi @amaltaro,
> the deployment is just the deployment of the new configuration, right?
Almost... there is another minor configuration error (a typo) that @muhammadimranfarooqi and I found earlier today, and which actually triggered the need for this deployment; this one, though, should not have caused any troubles in the service run.
https://github.com/dmwm/CMSKubernetes/commit/0b09abbd4bb0ee3cc0811d1c921e7511edd1f3b8
And just for logging purposes, here follows a short list of all the RSEs that are missing from MongoDB due to various reasons, listed below:
T1_FR_CCIN2P3_Disk In non-final state in Rucio Consistency Monitor
T1_IT_CNAF_Disk In non-final state in Rucio Consistency Monitor
T2_BE_UCL In non-final state in Rucio Consistency Monitor
T2_ES_IFCA In non-final state in Rucio Consistency Monitor
T2_FR_CCIN2P3 Missing in stats records at Rucio Consistency Monitor
T2_GR_Ioannina In non-final state in Rucio Consistency Monitor
T2_PK_NCP In non-final state in Rucio Consistency Monitor
T2_RU_IHEP In non-final state in Rucio Consistency Monitor
T2_UK_London_Brunel In non-final state in Rucio Consistency Monitor
T3_BG_UNI_SOFIA Missing in stats records at Rucio Consistency Monitor
T3_CH_CERNBOX Missing in stats records at Rucio Consistency Monitor
T3_CH_CERN_OpenData Missing in stats records at Rucio Consistency Monitor
T3_CH_PSI Missing in stats records at Rucio Consistency Monitor
T3_FR_IPNL Missing in stats records at Rucio Consistency Monitor
T3_HR_IRB Missing in stats records at Rucio Consistency Monitor
T3_IR_IPM Missing in stats records at Rucio Consistency Monitor
T3_IT_MIB Missing in stats records at Rucio Consistency Monitor
T3_IT_Trieste Missing in stats records at Rucio Consistency Monitor
T3_KR_KISTI Missing in stats records at Rucio Consistency Monitor
T3_KR_KNU Missing in stats records at Rucio Consistency Monitor
T3_KR_UOS Missing in stats records at Rucio Consistency Monitor
T3_MX_Cinvestav Missing in stats records at Rucio Consistency Monitor
T3_TW_NCU Missing in stats records at Rucio Consistency Monitor
T3_TW_NTU_HEP Missing in stats records at Rucio Consistency Monitor
* T2T3_US:
T2_US_Caltech_Ceph is skipped due to a restriction set in msConfig
T2_US_Florida In non-final state in Rucio Consistency Monitor
T2_US_Nebraska In non-final state in Rucio Consistency Monitor
T3_US_Baylor Missing in stats records at Rucio Consistency Monitor
T3_US_Brown Missing in stats records at Rucio Consistency Monitor
T3_US_CMU Missing in stats records at Rucio Consistency Monitor
T3_US_Colorado Missing in stats records at Rucio Consistency Monitor
T3_US_FNALLPC Missing in stats records at Rucio Consistency Monitor
T3_US_MIT Missing in stats records at Rucio Consistency Monitor
T3_US_NERSC Missing in stats records at Rucio Consistency Monitor
T3_US_NotreDame Missing in stats records at Rucio Consistency Monitor
T3_US_OSU Missing in stats records at Rucio Consistency Monitor
T3_US_Princeton_ICSE Missing in stats records at Rucio Consistency Monitor
T3_US_PuertoRico Missing in stats records at Rucio Consistency Monitor
T3_US_Rice Missing in stats records at Rucio Consistency Monitor
T3_US_TAMU Missing in stats records at Rucio Consistency Monitor
T3_US_UMD Missing in stats records at Rucio Consistency Monitor
T3_US_UMiss Missing in stats records at Rucio Consistency Monitor
Hi @ivmfnal, could we check if this information I am extracting from our database and logs is correct? Sorry for causing additional trouble with such a request.
@amaltaro in order to fetch those errors I had to do several `set` operations to figure out which RSEs are actually missing from the database, and upon that I had to parse all those long logs we have in order to find the reasons. This is because all of those exit the pipeline by raising an `MSUnmergedException` quite early [1], and we have never considered actually putting a record in the database for such cases. Given how difficult and time consuming it was to track all that, I'd say we had better record this as an actual error in the RSE object and preserve it in the database. I may create yet another bugfix for that too. What do you think?
[1]
https://github.com/dmwm/WMCore/blob/90a3cb7703bebe2957f94a1887183da7bd6d8a55/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L491
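For illustration, the set arithmetic described above amounts to something like the following sketch (`expectedRSEs` and `recordedRSEs` are hypothetical names, not service APIs):

```python
# Minimal sketch of the set arithmetic described above, with illustrative inputs.
expectedRSEs = {"T1_UK_RAL_Disk", "T2_BE_UCL", "T3_KR_KISTI"}  # e.g. RSEs the service should handle
recordedRSEs = {"T1_UK_RAL_Disk"}                              # e.g. RSE names found in MongoDB

missing = expectedRSEs - recordedRSEs
print(sorted(missing))
# -> ['T2_BE_UCL', 'T3_KR_KISTI']; the reasons then have to be dug out of the logs
```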
Todor, you can see all the status and some history about CC here: https://cmsweb-k8s-prod.cern.ch/rucioconmon/index
Igor
Thank you @ivmfnal. We will track the information provided there too. I am also creating another issue just to address the missing records in our database, and I will put the link in the docstring inside the code, so that the sites in error can be linked from both systems.
With the latest version of MSUnmerged deployed in production, we can easily check the result of the above query to MongoDB through the info=ALL_DOCS REST API of the service. The link is here.
From what I can see in the current run, there are gfal errors, mostly of the following two types, at these RSEs:
T1_UK_RAL_Disk
T2_IN_TIFR
T2_CH_CERN
T2_US_Nebraska
Three out of those four RSEs above are actually marked as clean, because the number of directories found to be purged equals the number of directories which either have been successfully deleted or were already gone by the time MSUnmerged retried the RSE.
The fourth one, T1_UK_RAL_Disk, is a different story, which I explain below in the next comment.
The error we get for T1_UK_RAL_Disk is [1]. This is basically an error triggered while trying to decode the JSON document fetched from RucioConMon for the RSE. I did try to fetch the document by hand and it is enormous:
wget "https://cmsweb.cern.ch/rucioconmon/WM/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json"
...
-rw-r--r-- 1 827M Mar 10 15:13 T1_UK_RAL_Disk_wm_file_list.json
I did try to run the service in standalone mode in my VM, but every attempt to load this document into memory is killed by the kernel (most probably because of VM resource limitations, but anyway):
Mar 10 13:58:18 tivanov-unit02 kernel: Out of memory: Kill process 27141 (ipython) score 802 or sacrifice child
Mar 10 13:58:18 tivanov-unit02 kernel: Killed process 27141 (ipython), UID 71437, total-vm:4273580kB, anon-rss:2826508kB, file-rss:0kB, shmem-rss:0kB
Mar 10 14:00:02 tivanov-unit02 systemd: Created slice User Slice of root.
Checking the status of the RSE at RucioConMon, it seems like the record is completely broken, even though it says there that the RSE is in status done. There is no additional stat info other than the number of files.
So, this is kind of not unexpected: RAL has (if I am not wrong) a customized setup, which may somehow be affecting the consistency monitor run. @ivmfnal, do you know any more details on that? And should we even try to run the service on that RSE, or should we skip it completely? Maybe @KatyEllis can shed some light here too.
[1]
2022-03-10 10:15:37,127:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk Reading rse data from MongoDB.
2022-03-10 10:15:37,131:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk With old consistency record in Rucio Consistency Monitor. But the RSE has NOT been fully cleaned during the last Rucio Consistency Monitor polling cycle.Retrying cleanup in the current run.
2022-03-10 09:15:09,786:DEBUG:RucioConMon: Fetching data from WM/files?rse=T1_UK_RAL_Disk&format=json, with args None
2022-03-10 09:15:09,786:DEBUG:Service: Data is from the Service cache
2022-03-10 09:15:10,673:ERROR:MSUnmerged: plineUnmerged: General error from pipeline. RSE: T1_UK_RAL_Disk. Error: Unterminated string starting at: line 1 column 168804222 (char 168804221) Will retry again in the next cycle.
Traceback (most recent call last):
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 269, in _execute
pline.run(MSUnmergedRSE(rseName))
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in run
return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in <lambda>
return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 72, in __call__
return self.run(obj)
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 75, in run
return self.func(obj, *self.args, **self.kwargs)
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 550, in getUnmergedFiles
rse['files']['allUnmerged'] = self.rucioConMon.getRSEUnmerged(rse['name'])
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 103, in getRSEUnmerged
rseUnmerged = self._getResult(uri, callname=rseName)
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 66, in _getResult
results = json.loads(results)
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 168804222 (char 168804221)
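As a side note, a document of this size could in principle be consumed incrementally rather than with one giant `json.loads()` call; a sketch with the third-party `ijson` package, assuming the payload is a top-level JSON list of paths (this is not what RucioConMon.py currently does):

```python
import ijson  # third-party incremental JSON parser, pip install ijson

# Stream the LFNs one by one with O(1) memory instead of loading ~800MB at once.
# Assumes the document is a top-level JSON array of path strings.
with open("T1_UK_RAL_Disk_wm_file_list.json", "rb") as fd:
    for lfn in ijson.items(fd, "item"):
        pass  # hand each path to the cleanup pipeline here
```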
@todor-ivanov JSON is a text file. Even if it is 800MB you can still open it with vim and look up the malformed character. The key hint here is line 1 column 168804222, so in vim you just need to jump to column 168804222 and see your guy. The trick is to use `<number>|`: in your case it would be `168804222|`, and the vim cursor will jump to that column.
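An alternative when even vim struggles: seek straight to the reported offset and dump a little context around it (the offset below is the one from the traceback above):

```python
# Show the bytes around the offset reported by the decoder:
# "Unterminated string starting at: line 1 column 168804222 (char 168804221)"
offset = 168804221
with open("T1_UK_RAL_Disk_wm_file_list.json", "rb") as fd:
    fd.seek(max(0, offset - 80))
    print(fd.read(160))  # ~80 bytes of context on each side of the bad spot
```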
It is too bad that we have such an enormous JSON; a few comments about it:
I was able to download https://cmsweb-prod.cern.ch/rucioconmon/unmerged/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json and parse it as JSON
$ curl -k -o file_list.json --cert ~/certs/CERN_crt.pem --key ~/certs/CERN_key.pem "https://cmsweb-prod.cern.ch/rucioconmon/unmerged/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 826M 0 826M 0 0 10.5M 0 --:--:-- 0:01:18 --:--:-- 12.8M
$ python -m json.tool file_list.json | wc -l
3790380
Thanks @vkuznet and @ivmfnal. I'll continue checking whether there are any issues/limitations on our side. Valentin, do we have any statistics about resource consumption at the k8s clusters?
@todor-ivanov, the short answer is: we do have plenty; we monitor nodes and services. But it really depends on what the service metrics are. For instance, all python-based services under cmsweb report some stats, e.g. here is the reqmgr2 dashboard. It provides cpu, ram, fds, etc. Since you're querying rucioconmon, it is up to that service to provide its metrics.
Just a few days ago we added node metrics for the k8s cluster, see here and here. Whether they are useful for your use case, I have no idea. The monitoring comes from services: if a service provides relevant metrics we can easily plug them in; if nobody cares about service metrics then you have nothing.
I am going to implement ignore list filtering at the web server so that the file list is filtered before being sent to the client
I made changes to the web server: https://cmsweb.cern.ch/rucioconmon/WM/files/T1_UK_RAL_Disk_wm_file_list.txt?rse=T1_UK_RAL_Disk&format=json. For now, by default it filters out /store/unmerged/logs/*.
If include and exclude are present, the list will include only those paths matched by include, minus those matched by exclude.
Supported output formats:
* raw (old) - gzip-compressed original list from the scanner, without any of the filters described above applied
* text (new) - plain text, paths separated with newlines
* json (old) - JSON list of paths
* zip-stream (new) - zip stream (see https://docs.python.org/3/library/zlib.html#module-zlib)

Examples:
* https://...?rse=T1_UK_RAL_Disk&format=json - excludes /store/unmerged/logs/* by default
* https://...?rse=T1_UK_RAL_Disk&format=json&include=/store/unmerged&exclude=/store/unmerged/SAM
* https://...?rse=T1_UK_RAL_Disk&format=json&exclude=/store/unmerged/SAM/,/store/unmerged/data/logs/prod/
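The filtering semantics described above could look roughly like this (a sketch only; prefix matching is an assumption about how the server interprets the patterns):

```python
def filterPaths(paths, include=None, exclude=None):
    """Keep paths under any 'include' prefix, then drop those under any 'exclude' prefix."""
    if include:
        paths = [p for p in paths if any(p.startswith(pre) for pre in include)]
    if exclude:
        paths = [p for p in paths if not any(p.startswith(pre) for pre in exclude)]
    return paths

paths = ["/store/unmerged/SAM/x", "/store/unmerged/data/logs/prod/y", "/store/unmerged/RECO/z"]
print(filterPaths(paths,
                  include=["/store/unmerged"],
                  exclude=["/store/unmerged/SAM/", "/store/unmerged/data/logs/prod/"]))
# -> ['/store/unmerged/RECO/z']
```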
Thanks @ivmfnal. And indeed your change took immediate effect; RAL is now iterated properly. Here is an excerpt of the service logs:
2022-03-11 18:20:10,338:DEBUG:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL17RECO/TTJets_HT-1200to2500_TuneCP5_
GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: B
None,
None,
None,
None,
None,
None,
None,
None,
None]
2022-03-11 18:20:10,338:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL17RECO/TTJets_HT-1200to2500_TuneCP5_1
2022-03-11 18:20:10,338:DEBUG:MSUnmerged: Purging dirEntry: gsiftp://gridftp.echo.stfc.ac.uk:2811/cms:/store/unmerged/RunIISummer20UL17
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: -> [gfal_gridftp_statG]
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: -> [GridFTPModule::stat]
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: -> [Gridftp_stat_module::globus_gass_stat]
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: GSIFTP using certificate /etc/proxy/proxy
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: GSIFTP using private key /etc/proxy/proxy
2022-03-11 18:20:10,340:DEBUG:MSUnmerged: gridftp session for: gsiftp://gridftp.echo.stfc.ac.uk:2811 found in cache !
And here is the fresh record for the RSE in MongoDB upon the successful run.
The RSE is still not cleaned, but we are at least able to debug now. I do see plenty of gfal-related errors like the one above.
Just logging here something we have previously noticed, but it is now a question that needs to be answered.
While looking at the logs related to T1_UK_RAL_Disk, we noticed that more than one (in this case 2) different errors at the protocol level, with two different error messages but the same error code returned by gfal, have been mapped to the same posix error:
2022-03-11 18:17:17,450:DEBUG:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL16DIGIPremixAPV/BsToMuMuPhi_PhiToKK_BMuonFilter_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/GE
N-SIM-DIGI/106X_mcRun2_asymptotic_preVFP_v8-v1, delResult: [GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: Broken pipe ', 70),
GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: Broken pipe ', 70),
And those are both mapped to the posix error 70: ECOMM: Communication error on send, and reported as such in MongoDB by our service [1].
This is a question that has been asked to the gfal developers here, but we still have no answer.
FYI @amaltaro
[1] https://cmsweb.cern.ch/ms-unmerged/data/info?rse=T1_UK_RAL_Disk&detail=False
{"result": [
{
"wmcore_version": "2.0.1.pre6",
"microservice_version": "2.0.1.pre6",
"microservice": "MSManager",
"query": "rse=T1_UK_RAL_Disk&detail=False",
"rseData": [
{
"name": "T1_UK_RAL_Disk",
"pfnPrefix": "gsiftp://gridftp.echo.stfc.ac.uk:2811/cms:",
"rucioConMonStatus": "done",
"isClean": true,
"timestamps": {
"rseConsStatTime": 1646451101,
"prevStartTime": 1647322438.8978102,
"startTime": 1647326040.5946178,
"prevEndTime": 1647322438.8980052,
"endTime": 1647326040.5947526
},
"counters": {
"totalNumFiles": 20348,
"totalNumDirs": 1052,
"dirsToDelete": 235,
"filesToDelete": 0,
"filesDeletedSuccess": 5551,
"filesDeletedFail": 0,
"dirsDeletedSuccess": 235,
"dirsDeletedFail": 2,
"gfalErrors": {
"Communication error on send": 408
}
}
}
]
}]}
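As a side note, the counters in such a record are what back the "clean" determination described earlier: the RSE counts as clean once every directory slated for deletion is gone. A sketch of that check against the record above (the exact condition used inside MSUnmerged may differ):

```python
# Check the record above: all 235 directories slated for deletion are gone,
# so the RSE is announced clean despite the accumulated gfal errors.
counters = {
    "dirsToDelete": 235,
    "dirsDeletedSuccess": 235,
    "dirsDeletedFail": 2,
    "gfalErrors": {"Communication error on send": 408},
}
isClean = counters["dirsDeletedSuccess"] >= counters["dirsToDelete"]
print(isClean)  # -> True
```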
@todor-ivanov have I missed the different errors at the protocol level from your message above?
My naive and generic view of this error handling is the following though: if the posix error is good enough for us to track (a) problems in MSUnmerged and (b) problems at the storage level, then we should not try to micromanage those errors. To me, it looks fair to have a map of multiple protocol errors into the same posix error (since the posix errors are very limited as well).
Hi @amaltaro,
In general this should be the case, yes, and that's exactly why I believed this to be the correct mapping in the first place. We should only watch for permissions/authorisation-related errors from the protocol level, so that they are not masked inside the Communication error on send posix error. Here [1] is an example. (This error itself is actually triggered because of a broken uri construction for the gsiftp protocol, but we still need to pay attention to those.)
[1]
$ gfal-ls gsiftp://gridftp.echo.stfc.ac.uk:2811/cms/store/unmerged/RunIISummer20UL18DIGIPremix/DYJetsToEE_M-100To160_VBFFilter_TuneCP5_13TeV-amcatnloFXFX-pythia8/GEN-SIM-DIGI/106X_upgrade2018_realistic_v11_L1v1-v1
gfal-ls error: 70 (Communication error on send) - globus_ftp_client: the server responded with an error 500 Command failed : globus_l_gfs_ceph_stat: authorization error: 'MLST' operation not allowed
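A sketch of the kind of triage meant here: flag GError messages that point at permissions/authorisation problems instead of letting them be folded into the generic communication-error bucket (the keyword list is an assumption, not anything MSUnmerged currently implements):

```python
# Hypothetical triage of gfal GError messages: surface authorisation-related
# failures that would otherwise be masked by the generic ECOMM posix error.
AUTH_MARKERS = ("authorization error", "permission denied", "not allowed")

def isAuthError(gErrorMessage):
    msg = gErrorMessage.lower()
    return any(marker in msg for marker in AUTH_MARKERS)

msg = ("globus_ftp_client: the server responded with an error 500 Command failed : "
       "globus_l_gfs_ceph_stat: authorization error: 'MLST' operation not allowed")
print(isAuthError(msg))  # -> True: report separately, do not fold into ECOMM
```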
> have I missed the different errors at the protocol level from your message above?

I am not sure if you have... but maybe I should have pointed them out better. Here they are:
* ...globus_xio: An end of file occurred, error code 70 = posix error: Communication error on send
* ...globus_xio: A system call failed: Broken pipe, error code 70 = posix error: Communication error on send
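For reference, the mapping both messages collapse into can be verified with the standard errno/os modules (the values shown are for Linux):

```python
import errno
import os

# gfal returned code 70 for both protocol-level failures above;
# on Linux that is the ECOMM posix error our service reports.
print(errno.ECOMM)               # -> 70
print(errno.errorcode[70])       # -> 'ECOMM'
print(os.strerror(errno.ECOMM))  # -> 'Communication error on send'
```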
So far I managed to go through all the logs since the last change of the document structure in MongoDB, and the full list of RSEs having issues with deletions is:
T2_UK_London_Brunel
T2_BE_UCL
I am creating the relevant GGUS tickets for them.
And here is the list of GGUS tickets created for this [1]. I am closing the current issue here. We will follow up with the sites and the Site Support team. If any other changes are needed, another WMCore issue is to be opened.
[1] https://ggus.eu/?mode=ticket_info&ticket_id=156526 https://ggus.eu/?mode=ticket_info&ticket_id=156527
Impact of the new feature
MSUnmerged

Is your feature request related to a problem? Please describe.
This is NOT a feature request, nor a bug. After we deployed the latest version of MSUnmerged in production yesterday, we are already capable of tracking all the errors we have per RSE with a single query directly to the database [1]. The current issue is just for tracking purposes, in order to follow whether any of those errors are critical, so that we can announce the service completely functional later on.

Describe the solution you'd like
Follow up on all the critical errors that we observe and make sure we do not leave an RSE behind.

Describe alternatives you've considered
No alternative.

Additional context
[1]