dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

MSUnmerged: Track accumulated errors #10982

Closed. todor-ivanov closed this issue 2 years ago.

todor-ivanov commented 2 years ago

Impact of the new feature: MSUnmerged

Is your feature request related to a problem? Please describe.
This is NOT a feature request, nor a bug report.

After we deployed the latest version of MSUnmerged in production yesterday, we are now able to track all the errors we have per RSE with a single query directly to the database [1]. The current issue is just for tracking purposes, in order to follow up on whether any of those errors are critical, so that we can announce the service fully functional later on.

Describe the solution you'd like
Follow up on all the critical errors that we observe and make sure we do not leave any RSE behind.

Describe alternatives you've considered
No alternative.

Additional context
[1]

In [47]: filter = {}

In [48]: projection = {
    ...:     "_id": False,
    ...:     "name": True,
    ...:     "isClean": True,
    ...:     "counters": {
    ...:         "gfalErrors": True,
    ...:         "dirsToDelete": True,
    ...:         "dirsDeletedSuccess": True,
    ...:         "dirsDeletedFail": True}}

In [49]: list(msUnmergedCollProd.find(filter, projection))
Out[49]: 
[{'name': 'T1_US_FNAL_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 37,
   'dirsDeletedSuccess': 37,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_TW_NCHC',
  'isClean': False,
  'counters': {'dirsToDelete': 57,
   'dirsDeletedSuccess': 55,
   'dirsDeletedFail': 2,
   'gfalErrors': {'No such file or directory': 720}}},
 {'name': 'T2_US_Vanderbilt',
  'isClean': True,
  'counters': {'dirsToDelete': 302,
   'dirsDeletedSuccess': 302,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_PT_NCG_Lisbon',
  'isClean': True,
  'counters': {'dirsToDelete': 8,
   'dirsDeletedSuccess': 8,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 8}}},
 {'name': 'T2_UA_KIPT',
  'isClean': False,
  'counters': {'dirsToDelete': 70,
   'dirsDeletedSuccess': 69,
   'dirsDeletedFail': 1,
   'gfalErrors': {'No such file or directory': 279}}},
 {'name': 'T2_TR_METU',
  'isClean': True,
  'counters': {'dirsToDelete': 0,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_IT_Rome',
  'isClean': True,
  'counters': {'dirsToDelete': 111,
   'dirsDeletedSuccess': 111,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 440}}},
 {'name': 'T2_UK_SGrid_RALPP',
  'isClean': False,
  'counters': {'dirsToDelete': 569,
   'dirsDeletedSuccess': 561,
   'dirsDeletedFail': 8,
   'gfalErrors': {'No such file or directory': 4820}}},
 {'name': 'T1_UK_RAL_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 194,
   'dirsDeletedSuccess': 194,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Communication error on send': 61}}},
 {'name': 'T2_DE_DESY',
  'isClean': True,
  'counters': {'dirsToDelete': 194,
   'dirsDeletedSuccess': 194,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1376}}},
 {'name': 'T2_UK_London_IC',
  'isClean': True,
  'counters': {'dirsToDelete': 291,
   'dirsDeletedSuccess': 291,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 3533}}},
 {'name': 'T2_DE_RWTH',
  'isClean': False,
  'counters': {'dirsToDelete': 268,
   'dirsDeletedSuccess': 264,
   'dirsDeletedFail': 4,
   'gfalErrors': {'No such file or directory': 2260}}},
 {'name': 'T2_IT_Bari',
  'isClean': False,
  'counters': {'dirsToDelete': 166,
   'dirsDeletedSuccess': 165,
   'dirsDeletedFail': 1,
   'gfalErrors': {'No such file or directory': 670}}},
 {'name': 'T2_RU_INR',
  'isClean': True,
  'counters': {'dirsToDelete': 31,
   'dirsDeletedSuccess': 31,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 31}}},
 {'name': 'T2_FR_GRIF_LLR',
  'isClean': False,
  'counters': {'dirsToDelete': 593,
   'dirsDeletedSuccess': 565,
   'dirsDeletedFail': 28,
   'gfalErrors': {'No such file or directory': 8054}}},
 {'name': 'T2_US_Wisconsin',
  'isClean': False,
  'counters': {'dirsToDelete': 673,
   'dirsDeletedSuccess': 666,
   'dirsDeletedFail': 7,
   'gfalErrors': {'Communication error on send': 7139}}},
 {'name': 'T2_FR_IPHC',
  'isClean': False,
  'counters': {'dirsToDelete': 241,
   'dirsDeletedSuccess': 238,
   'dirsDeletedFail': 3,
   'gfalErrors': {'No such file or directory': 1383}}},
 {'name': 'T2_RU_ITEP',
  'isClean': False,
  'counters': {'dirsToDelete': 1,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Input/output error': 3}}},
 {'name': 'T2_IN_TIFR',
  'isClean': False,
  'counters': {'dirsToDelete': 176,
   'dirsDeletedSuccess': 171,
   'dirsDeletedFail': 5,
   'gfalErrors': {'No such file or directory': 3053}}},
 {'name': 'T1_DE_KIT_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 126,
   'dirsDeletedSuccess': 126,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1639}}},
 {'name': 'T1_RU_JINR_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 10,
   'dirsDeletedSuccess': 10,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 5}}},
 {'name': 'T1_ES_PIC_Disk',
  'isClean': False,
  'counters': {'dirsToDelete': 367,
   'dirsDeletedSuccess': 366,
   'dirsDeletedFail': 1,
   'gfalErrors': {'No such file or directory': 827758}}},
 {'name': 'T2_US_Caltech',
  'isClean': True,
  'counters': {'dirsToDelete': 126,
   'dirsDeletedSuccess': 126,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1317}}},
 {'name': 'T2_US_UCSD',
  'isClean': True,
  'counters': {'dirsToDelete': 54,
   'dirsDeletedSuccess': 54,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_IT_Pisa',
  'isClean': True,
  'counters': {'dirsToDelete': 92,
   'dirsDeletedSuccess': 92,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 245}}},
 {'name': 'T2_KR_KISTI',
  'isClean': True,
  'counters': {'dirsToDelete': 64,
   'dirsDeletedSuccess': 64,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 196}}},
 {'name': 'T2_RU_JINR',
  'isClean': True,
  'counters': {'dirsToDelete': 32,
   'dirsDeletedSuccess': 32,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 85}}},
 {'name': 'T2_AT_Vienna',
  'isClean': True,
  'counters': {'dirsToDelete': 107,
   'dirsDeletedSuccess': 107,
   'dirsDeletedFail': 0,
   'gfalErrors': {'Communication error on send': 506}}},
 {'name': 'T2_BR_UERJ',
  'isClean': True,
  'counters': {'dirsToDelete': 167,
   'dirsDeletedSuccess': 167,
   'dirsDeletedFail': 0,
   'gfalErrors': {'Communication error on send': 1}}},
 {'name': 'T2_FI_HIP',
  'isClean': True,
  'counters': {'dirsToDelete': 28,
   'dirsDeletedSuccess': 28,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 76}}},
 {'name': 'T2_UK_SGrid_Bristol',
  'isClean': False,
  'counters': {'dirsToDelete': 94,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 94,
   'gfalErrors': {'Communication error on send': 282}}},
 {'name': 'T2_US_MIT',
  'isClean': True,
  'counters': {'dirsToDelete': 113,
   'dirsDeletedSuccess': 113,
   'dirsDeletedFail': 0,
   'gfalErrors': {'Communication error on send': 36438}}},
 {'name': 'T2_US_Purdue',
  'isClean': True,
  'counters': {'dirsToDelete': 18,
   'dirsDeletedSuccess': 18,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T3_US_Rutgers',
  'isClean': True,
  'counters': {'dirsToDelete': 0,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_CN_Beijing',
  'isClean': True,
  'counters': {'dirsToDelete': 42,
   'dirsDeletedSuccess': 42,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 64}}},
 {'name': 'T2_EE_Estonia',
  'isClean': False,
  'counters': {'dirsToDelete': 185,
   'dirsDeletedSuccess': 184,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Communication error on send': 1666}}},
 {'name': 'T2_BE_IIHE',
  'isClean': True,
  'counters': {'dirsToDelete': 141,
   'dirsDeletedSuccess': 141,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 902}}},
 {'name': 'T2_IT_Legnaro',
  'isClean': True,
  'counters': {'dirsToDelete': 310,
   'dirsDeletedSuccess': 310,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1881}}},
 {'name': 'T2_PL_Swierk',
  'isClean': True,
  'counters': {'dirsToDelete': 36,
   'dirsDeletedSuccess': 36,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 2}}},
 {'name': 'T2_CH_CSCS',
  'isClean': True,
  'counters': {'dirsToDelete': 68,
   'dirsDeletedSuccess': 68,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 328}}},
 {'name': 'T2_BR_SPRACE',
  'isClean': False,
  'counters': {'dirsToDelete': 263,
   'dirsDeletedSuccess': 254,
   'dirsDeletedFail': 9,
   'gfalErrors': {'No such file or directory': 982}}},
 {'name': 'T2_HU_Budapest',
  'isClean': True,
  'counters': {'dirsToDelete': 85,
   'dirsDeletedSuccess': 85,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 465}}},
 {'name': 'T2_CH_CERN',
  'isClean': False,
  'counters': {'dirsToDelete': 245,
   'dirsDeletedSuccess': 244,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Communication error on send': 701}}},
 {'name': 'T2_FR_GRIF_IRFU',
  'isClean': False,
  'counters': {'dirsToDelete': 523,
   'dirsDeletedSuccess': 499,
   'dirsDeletedFail': 24,
   'gfalErrors': {'No such file or directory': 3811}}},
 {'name': 'T2_ES_CIEMAT',
  'isClean': False,
  'counters': {'dirsToDelete': 514,
   'dirsDeletedSuccess': 512,
   'dirsDeletedFail': 2,
   'gfalErrors': {'No such file or directory': 2586}}}]
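
A variation of the same query that narrows the result to only the problematic RSEs (a small sketch, reusing the projection defined above in this session):

# Reuse the projection above, but select only RSEs not yet fully cleaned:
list(msUnmergedCollProd.find({"isClean": False}, projection))
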
todor-ivanov commented 2 years ago

FYI @amaltaro

amaltaro commented 2 years ago

Thanks for creating this issue, Todor. Do I read it right that there are only 2 problems reported:

  1. 'No such file or directory'; or
  2. 'Communication error on send' ?

If so, I guess the only ones deserving attention are those reporting the communication error, right? It could be that this is a somewhat generic error from the gfal plugin, though.

todor-ivanov commented 2 years ago

Hi @amaltaro, yes, I think you are correct. I'll double-check today.

todor-ivanov commented 2 years ago

The more I look into those logs, the more confident I get that we should exclude the /store/unmerged/SAM directory from the tree to be cleaned. The contents there are too volatile: the files are constantly written and deleted by the SAM tests, which in general seem to do their job of cleaning up after themselves quite well, and the contents are usually small anyway. If we keep this part of the tree, on the other hand, we will constantly be blocked by one or two non-empty directories per site and hence unable to announce some of them clean for quite a long time.
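
For illustration, the exclusion amounts to a prefix filter along these lines (a sketch only; the actual knob is a path list in the MSUnmerged service configuration, not code like this):

# Drop any unmerged directory that lives under an excluded subtree.
EXCLUDED = ("/store/unmerged/SAM",)

def keepDir(dirPath, excluded=EXCLUDED):
    """Return True if dirPath is outside every excluded prefix."""
    return not dirPath.startswith(excluded)

dirs = ["/store/unmerged/SAM/test", "/store/unmerged/RunIISummer20UL17RECO/X"]
print([d for d in dirs if keepDir(d)])  # only the non-SAM directory survives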

We are going to have a deployment in production today due to another minor configuration error I've found, so I am about to make the relevant configuration PRs for excluding this part of the tree too.

FYI @amaltaro

todor-ivanov commented 2 years ago

And here are the relevant configuration changes:

cmsweb-k8s/services_config!127

cmsweb-k8s/services_config!126

amaltaro commented 2 years ago

@ivmfnal Igor, don't we have this path skipped from the Rucio ConMon system?

@todor-ivanov the deployment is just the deployment of the new configuration, right?

todor-ivanov commented 2 years ago

Hi @amaltaro,

the deployment is just the deployment of the new configuration, right?

Almost... Here is another minor configuration error (a typo) we found with @muhammadimranfarooqi earlier today, which actually triggered the need for this deployment, though this one should not have caused any trouble in the service run.

https://github.com/dmwm/CMSKubernetes/commit/0b09abbd4bb0ee3cc0811d1c921e7511edd1f3b8

todor-ivanov commented 2 years ago

And just for logging purposes, here follows a short list of all the RSEs that are missing from MongoDB for various reasons, listed in parallel below:

 * T1:

T1_FR_CCIN2P3_Disk: In non-final state in Rucio Consistency Monitor
T1_IT_CNAF_Disk: In non-final state in Rucio Consistency Monitor

 * T2T3:

T2_BE_UCL: In non-final state in Rucio Consistency Monitor
T2_ES_IFCA: In non-final state in Rucio Consistency Monitor
T2_FR_CCIN2P3: Missing in stats records at Rucio Consistency Monitor
T2_GR_Ioannina: In non-final state in Rucio Consistency Monitor
T2_PK_NCP: In non-final state in Rucio Consistency Monitor
T2_RU_IHEP: In non-final state in Rucio Consistency Monitor
T2_UK_London_Brunel: In non-final state in Rucio Consistency Monitor
T3_BG_UNI_SOFIA: Missing in stats records at Rucio Consistency Monitor
T3_CH_CERNBOX: Missing in stats records at Rucio Consistency Monitor
T3_CH_CERN_OpenData: Missing in stats records at Rucio Consistency Monitor
T3_CH_PSI: Missing in stats records at Rucio Consistency Monitor
T3_FR_IPNL: Missing in stats records at Rucio Consistency Monitor
T3_HR_IRB: Missing in stats records at Rucio Consistency Monitor
T3_IR_IPM: Missing in stats records at Rucio Consistency Monitor
T3_IT_MIB: Missing in stats records at Rucio Consistency Monitor
T3_IT_Trieste: Missing in stats records at Rucio Consistency Monitor
T3_KR_KISTI: Missing in stats records at Rucio Consistency Monitor
T3_KR_KNU: Missing in stats records at Rucio Consistency Monitor
T3_KR_UOS: Missing in stats records at Rucio Consistency Monitor
T3_MX_Cinvestav: Missing in stats records at Rucio Consistency Monitor
T3_TW_NCU: Missing in stats records at Rucio Consistency Monitor
T3_TW_NTU_HEP: Missing in stats records at Rucio Consistency Monitor

 * T2T3_US:

T2_US_Caltech_Ceph: skipped due to a restriction set in msConfig
T2_US_Florida: In non-final state in Rucio Consistency Monitor
T2_US_Nebraska: In non-final state in Rucio Consistency Monitor
T3_US_Baylor: Missing in stats records at Rucio Consistency Monitor
T3_US_Brown: Missing in stats records at Rucio Consistency Monitor
T3_US_CMU: Missing in stats records at Rucio Consistency Monitor
T3_US_Colorado: Missing in stats records at Rucio Consistency Monitor
T3_US_FNALLPC: Missing in stats records at Rucio Consistency Monitor
T3_US_MIT: Missing in stats records at Rucio Consistency Monitor
T3_US_NERSC: Missing in stats records at Rucio Consistency Monitor
T3_US_NotreDame: Missing in stats records at Rucio Consistency Monitor
T3_US_OSU: Missing in stats records at Rucio Consistency Monitor
T3_US_Princeton_ICSE: Missing in stats records at Rucio Consistency Monitor
T3_US_PuertoRico: Missing in stats records at Rucio Consistency Monitor
T3_US_Rice: Missing in stats records at Rucio Consistency Monitor
T3_US_TAMU: Missing in stats records at Rucio Consistency Monitor
T3_US_UMD: Missing in stats records at Rucio Consistency Monitor
T3_US_UMiss: Missing in stats records at Rucio Consistency Monitor

Hi @ivmfnal, could you check whether this information I am extracting from our database and logs is correct? Sorry for causing additional trouble with such a request.

@amaltaro in order to fetch those errors I had to do several `set` operations to figure out which RSEs are actually missing from the database, and then I had to parse all those long logs in order to find the reasons. This is because all of these RSEs exit the pipeline by raising an `MSUnmergedException` quite early [1], and we have never considered actually putting a record in the database for such cases. Given how difficult and time-consuming it was to track all that, I'd say we had better record this as an actual error in the RSE object and preserve it in the database. I may create yet another bugfix issue for that too. What do you think?

[1]
https://github.com/dmwm/WMCore/blob/90a3cb7703bebe2957f94a1887183da7bd6d8a55/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L491
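
A minimal sketch of what persisting such an early-exit reason could look like (the pipelineError field name is hypothetical; per the comment above, MSUnmerged does not write any record in these cases yet):

# Hypothetical: record why an RSE left the pipeline early, instead of only raising.
msUnmergedCollProd.update_one(
    {"name": "T3_US_Baylor"},
    {"$set": {"pipelineError": "Missing in stats records at Rucio Consistency Monitor"}},
    upsert=True)  # create the record even if the RSE never made it into the database
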
ivmfnal commented 2 years ago

Todor, you can see all the statuses and some history about CC here: https://cmsweb-k8s-prod.cern.ch/rucioconmon/index

Igor


todor-ivanov commented 2 years ago

Thank you @ivmfnal, we will track the information provided there too. I am also creating another issue just to address the missing records in our database, and I will put the link in the docstring inside the code so that the sites in error can be linked from both systems.

todor-ivanov commented 2 years ago

With the latest version of MSUnmerged deployed in production, we can easily check the result of the above query to MongoDB through the info=ALL_DOCS REST API of the service. The link is here.
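
For reference, the per-RSE flavour of the same REST API (used later in this thread) can be queried directly; a sketch, assuming a valid CMS grid certificate for authentication (the certificate paths are illustrative):

$ curl -k --cert ~/certs/CERN_crt.pem --key ~/certs/CERN_key.pem \
    "https://cmsweb.cern.ch/ms-unmerged/data/info?rse=T1_UK_RAL_Disk&detail=False"

The response carries the same fields as the MongoDB records above: name, isClean, and the counters block including gfalErrors.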

From what I can see in the current run, there are gfal errors mostly of the following two types:

 * 'No such file or directory'
 * 'Communication error on send'

Three out of those four RSEs above are actually marked as clean, because the number of directories found to be purged equals the number of directories which either have been successfully deleted or were already gone by the time MSUnmerged retried the RSE.

The fourth one, T1_UK_RAL_Disk, is a different story, which I explain below in the next comment.

todor-ivanov commented 2 years ago

The error we get for T1_UK_RAL_Disk is [1]. This is basically an error triggered while trying to decode the JSON document fetched from RucioConMon for the RSE. I did try to fetch the document by hand, and it is enormous:

wget "https://cmsweb.cern.ch/rucioconmon/WM/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json"
...
-rw-r--r--  1   827M Mar 10 15:13  T1_UK_RAL_Disk_wm_file_list.json

I did try to run the service in standalone mode in my VM, but every attempt to load this document into memory is killed by the kernel (most probably because of VM resource limitations, but anyway):

Mar 10 13:58:18 tivanov-unit02 kernel: Out of memory: Kill process 27141 (ipython) score 802 or sacrifice child
Mar 10 13:58:18 tivanov-unit02 kernel: Killed process 27141 (ipython), UID 71437, total-vm:4273580kB, anon-rss:2826508kB, file-rss:0kB, shmem-rss:0kB
Mar 10 14:00:02 tivanov-unit02 systemd: Created slice User Slice of root.
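
If loading the whole document in one go remains impractical, one possible mitigation (an illustration only, not what WMCore/RucioConMon currently does; the dump is assumed here to be a single top-level JSON array) would be a streaming parser such as the third-party ijson:

import ijson  # third-party streaming JSON parser

# Count top-level array items without holding the 827MB document in memory.
with open("T1_UK_RAL_Disk_wm_file_list.json", "rb") as fobj:
    nFiles = sum(1 for _ in ijson.items(fobj, "item"))
print(nFiles)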

Checking the status of the RSE at RucioConMon, it seems like the record is completely broken. Even though it says here that the RSE is in status done, there is no additional stat info other than the number of files.

So, this is, kind of, not unexpected: RAL has (if I am not wrong) a customized setup, which may somehow be affecting the consistency monitor run. @ivmfnal do you know any more details on that? And should we even try to run the service on that RSE, or should we skip it completely? Maybe @KatyEllis can shed some light here too.

[1]

2022-03-10 10:15:37,127:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk Reading rse data from MongoDB.
2022-03-10 10:15:37,131:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk With old consistency record in Rucio Consistency Monitor. But the RSE has NOT been fully cleaned during the last Rucio Consistency Monitor polling cycle.Retrying cleanup in the current run.
2022-03-10 09:15:09,786:DEBUG:RucioConMon: Fetching data from WM/files?rse=T1_UK_RAL_Disk&format=json, with args None
2022-03-10 09:15:09,786:DEBUG:Service: Data is from the Service cache
2022-03-10 09:15:10,673:ERROR:MSUnmerged: plineUnmerged: General error from pipeline. RSE: T1_UK_RAL_Disk. Error: Unterminated string starting at: line 1 column 168804222 (char 168804221) Will retry again in the next cycle.
Traceback (most recent call last):
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 269, in _execute
    pline.run(MSUnmergedRSE(rseName))
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in run
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in <lambda>
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 72, in __call__
    return self.run(obj)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 75, in run
    return self.func(obj, *self.args, **self.kwargs)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 550, in getUnmergedFiles
    rse['files']['allUnmerged'] = self.rucioConMon.getRSEUnmerged(rse['name'])
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 103, in getRSEUnmerged
    rseUnmerged = self._getResult(uri, callname=rseName)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 66, in _getResult
    results = json.loads(results)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 168804222 (char 168804221)
vkuznet commented 2 years ago

@todor-ivanov JSON is a text file. Even if it is 800MB, you can still open it with vim and look up the malformed character. The key line here is line 1 column 168804222, so in vim you just need to jump to column 168804222 and see your guy. The trick is to use <number>|; in your case it would be 168804222|, and the vim cursor will jump to that column.
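
The same spot can also be inspected without vim, by seeking to the reported offset (a minimal sketch):

# Print a window of bytes around the offset reported by the JSONDecodeError.
offset = 168804221  # "char 168804221" from the error message above
with open("T1_UK_RAL_Disk_wm_file_list.json", "rb") as fobj:
    fobj.seek(max(0, offset - 80))
    print(fobj.read(160).decode("utf-8", errors="replace"))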

It is too bad that we have such an enormous JSON; a few comments about it:

ivmfnal commented 2 years ago

I was able to download https://cmsweb-prod.cern.ch/rucioconmon/unmerged/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json and parse it as JSON

ivmfnal commented 2 years ago
$ curl -k -o file_list.json --cert ~/certs/CERN_crt.pem --key ~/certs/CERN_key.pem "https://cmsweb-prod.cern.ch/rucioconmon/unmerged/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  826M    0  826M    0     0  10.5M      0 --:--:--  0:01:18 --:--:-- 12.8M

$ python -m json.tool file_list.json | wc -l
 3790380
todor-ivanov commented 2 years ago

Thanks @vkuznet and @ivmfnal. I am still checking whether there are any issues/limitations on our side. Valentin, do we have any statistics about resource consumption at the K8s clusters?

vkuznet commented 2 years ago

@todor-ivanov, the short answer is we do have plenty; we monitor both nodes and services. But it really depends on what the service metrics are. For instance, all Python-based services under cmsweb report some stats; e.g., here is the reqmgr2 dashboard. It provides CPU, RAM, fds, etc. Since you're querying rucioconmon, it is up to that service to provide its metrics.

Just a few days ago we added node metrics for the k8s cluster, see here and here. Whether they are useful for your use case, I have no idea. The monitoring comes from the services: if a service provides relevant metrics, we can easily plug them in. If nobody cares about service metrics, then you have nothing.

ivmfnal commented 2 years ago

I am going to implement ignore-list filtering at the web server so that the file list is filtered before being sent to the client.

ivmfnal commented 2 years ago

I made changes to the web server:

Examples:

https://...?rse=T1_UK_RAL_Disk&format=json - excludes /store/unmerged/logs/* by default
https://...?rse=T1_UK_RAL_Disk&format=json&include=/store/unmerged&exclude=/store/unmerged/SAM
https://...?rse=T1_UK_RAL_Disk&format=json&exclude=/store/unmerged/SAM/,/store/unmerged/data/logs/prod/
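
What such prefix-based include/exclude filtering amounts to, sketched in Python (illustrative only, not the actual rucioconmon server code):

def filterLfns(lfns, include=(), exclude=()):
    """Yield LFNs under an include prefix (if any given) and not under any exclude prefix."""
    for lfn in lfns:
        if include and not lfn.startswith(tuple(include)):
            continue
        if exclude and lfn.startswith(tuple(exclude)):
            continue
        yield lfn

# Keep /store/unmerged but drop the volatile SAM area:
kept = list(filterLfns(["/store/unmerged/SAM/x", "/store/unmerged/logs/y"],
                       include=("/store/unmerged",),
                       exclude=("/store/unmerged/SAM/",)))
print(kept)  # ['/store/unmerged/logs/y']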

todor-ivanov commented 2 years ago

thanks @ivmfnal

And indeed your change took immediate effect. Now RAL is iterated properly. Here is an excerpt of the service logs:

2022-03-11 18:20:10,338:DEBUG:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL17RECO/TTJets_HT-1200to2500_TuneCP5_
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: B
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]
2022-03-11 18:20:10,338:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL17RECO/TTJets_HT-1200to2500_TuneCP5_1
2022-03-11 18:20:10,338:DEBUG:MSUnmerged: Purging dirEntry: gsiftp://gridftp.echo.stfc.ac.uk:2811/cms:/store/unmerged/RunIISummer20UL17

2022-03-11 18:20:10,339:DEBUG:MSUnmerged:   -> [gfal_gridftp_statG]
2022-03-11 18:20:10,339:DEBUG:MSUnmerged:  -> [GridFTPModule::stat] 
2022-03-11 18:20:10,339:DEBUG:MSUnmerged:  -> [Gridftp_stat_module::globus_gass_stat] 
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: GSIFTP using certificate /etc/proxy/proxy
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: GSIFTP using private key /etc/proxy/proxy
2022-03-11 18:20:10,340:DEBUG:MSUnmerged: gridftp session for: gsiftp://gridftp.echo.stfc.ac.uk:2811 found in  cache !

And here is the fresh record for the RSE in MongoDB after the successful run. The RSE is still not clean, but we are at least able to debug now. I do see plenty of gfal-related errors like the one above.

todor-ivanov commented 2 years ago

Just logging here something we have previously noticed, but it is now a question that needs to be answered.

While looking at the logs related to T1_UK_RAL_Disk, we noticed that two different errors at the protocol level, with two different error messages but the same error code returned by gfal, have been mapped to the same POSIX error:

2022-03-11 18:17:17,450:DEBUG:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL16DIGIPremixAPV/BsToMuMuPhi_PhiToKK_BMuonFilter_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/GE
N-SIM-DIGI/106X_mcRun2_asymptotic_preVFP_v8-v1, delResult: [GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: Broken pipe ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: Broken pipe ', 70),

And those are both mapped to the POSIX error 70: ECOMM: Communication error on send, and reported as such in MongoDB by our service [1].
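
For reference, the code 70 carried by those GError objects is simply the Linux errno value, which explains why different protocol-level messages collapse into a single label (a quick, Linux-specific check):

import errno
import os

code = 70                      # the GError code attached to both messages above
print(errno.errorcode[code])   # 'ECOMM' (on Linux)
print(os.strerror(code))       # 'Communication error on send'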

This question has been asked of the gfal developers here, but we still have no answer.

FYI @amaltaro

[1] https://cmsweb.cern.ch/ms-unmerged/data/info?rse=T1_UK_RAL_Disk&detail=False

{"result": [
 {
  "wmcore_version": "2.0.1.pre6",
  "microservice_version": "2.0.1.pre6",
  "microservice": "MSManager",
  "query": "rse=T1_UK_RAL_Disk&detail=False",
  "rseData": [
    {
      "name": "T1_UK_RAL_Disk",
      "pfnPrefix": "gsiftp://gridftp.echo.stfc.ac.uk:2811/cms:",
      "rucioConMonStatus": "done",
      "isClean": true,
      "timestamps": {
        "rseConsStatTime": 1646451101,
        "prevStartTime": 1647322438.8978102,
        "startTime": 1647326040.5946178,
        "prevEndTime": 1647322438.8980052,
        "endTime": 1647326040.5947526
      },
      "counters": {
        "totalNumFiles": 20348,
        "totalNumDirs": 1052,
        "dirsToDelete": 235,
        "filesToDelete": 0,
        "filesDeletedSuccess": 5551,
        "filesDeletedFail": 0,
        "dirsDeletedSuccess": 235,
        "dirsDeletedFail": 2,
        "gfalErrors": {
          "Communication error on send": 408
        }
      }
    }
  ]
}]}
amaltaro commented 2 years ago

@todor-ivanov have I missed the different errors at the protocol level from your message above?

My naive and generic view of this error handling is the following, though: if the POSIX error is good enough for us to track a) problems in MSUnmerged and b) problems at the storage level, then we should not try to micromanage those errors. To me, it looks fair to have a map of multiple protocol errors onto the same POSIX error (since the set of POSIX errors is very limited as well).

todor-ivanov commented 2 years ago

Hi @amaltaro. In general this should be the case, yes, and that's exactly why I believed this to be the correct mapping in the first place. We should only watch out for permissions/authorisation-related errors from the protocol level, so that they do not get masked inside the Communication error on send POSIX error. Here [1] is an example. (This error itself is actually triggered by a broken URI construction for the gsiftp protocol, but we still need to pay attention to those.)

[1]

$ gfal-ls gsiftp://gridftp.echo.stfc.ac.uk:2811/cms/store/unmerged/RunIISummer20UL18DIGIPremix/DYJetsToEE_M-100To160_VBFFilter_TuneCP5_13TeV-amcatnloFXFX-pythia8/GEN-SIM-DIGI/106X_upgrade2018_realistic_v11_L1v1-v1
gfal-ls error: 70 (Communication error on send) - globus_ftp_client: the server responded with an error 500 Command failed : globus_l_gfs_ceph_stat: authorization error: 'MLST' operation not allowed
todor-ivanov commented 2 years ago

have I missed the different errors at the protocol level from your message above

I am not sure if you have... but maybe I should have pointed them out better. Here they are:

GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70)
GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: Broken pipe ', 70)

todor-ivanov commented 2 years ago

So far I have managed to go through all the logs since the last change of the document structure in MongoDB, and the full list of RSEs having issues with deletions is:

T2_UK_London_Brunel
T2_BE_UCL

I am creating the relevant GGUS tickets for them.

todor-ivanov commented 2 years ago

And here is the list of GGUS tickets created for this [1]. I am closing the current issue here. We will follow up with the sites and the Site Support team. If any other changes are needed, another WMCore issue is to be opened.

[1] https://ggus.eu/?mode=ticket_info&ticket_id=156526 https://ggus.eu/?mode=ticket_info&ticket_id=156527