HEPCloud / decisionengine_modules

Apache License 2.0
2 stars 19 forks source link

Python3 encodes hex strings differently, resulting in malformed htcondor classads #263

Closed StevenCTimm closed 3 years ago

StevenCTimm commented 3 years ago

We have been observing while testing the DE with python3 glideinwms that the glideclientglobal_manifests classads were failing to be advertised to the factory condor collector because they were malformed.

I have done the following investigation:

The advertise in the python3 code fails, the advertise in the python2 code succeeds.

I believe that this is not a function of the glideinwms routines that were recently added, this is decision engine code that has not yet been exercised because we never got this far before.

The main idea is that there are a number of fields in the glideclientglobal classads that are encrypted secrets and stored in hex. In python2 these fields look like this:

02919d382e0f7c9b491a0a8615e4b112

In python3 these fields look like this:

b'7f8c2008cf6dae0de63832c9550b2359'

In the decision engine this code is all in decisionengine_modules/htcondor/publishers/publisher.py There is one publisher which is called both by decisionengine_modules/glideinwms/publishers/glideclientglobal.py and fe_group_classads.py and the same means of converting pandas data frames to python dictionaries to classads.

The root cause is not that the above code isn't python3 compatible, the root cause appears to be that pandas is encoding the values differently internally inside the data frame and thus when we try to decode it and extract it in the same way we were before, we get a malformed htcondor classad.

Moral of the story, I am pretty sure we have to somehow extract the b' ' wrapping of the hex string to make a well-formed htcondor classad.

I attach 400 lines of debug output which contain a dump of the classads that we are trying and failing to advertise.

StevenCTimm commented 3 years ago

excerpt.txt

DmitryLitvintsev commented 3 years ago

b'abcdef'

means bytestream in python3

Dmitry


From: Steven Timm notifications@github.com Sent: Friday, September 11, 2020 4:05 PM To: HEPCloud/decisionengine_modules Cc: Subscribed Subject: Re: [HEPCloud/decisionengine_modules] Pandas 1.x encodes hex strings differently, resulting in malformed htcondor classads (#263)

excerpt.txthttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_HEPCloud_decisionengine-5Fmodules_files_5211499_excerpt.txt&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=7PHi3TDlwkvpc07MjENbOxVFl0u_sEurf250JnUFWCU&m=J79GCDblqI23SHF6XUeedT1q0k27ohAJONqTtvHvydE&s=_JcgfgzZ7OFC7124MrFDUyW8nGHhFVE-UHUSGeRcfEM&e=

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_HEPCloud_decisionengine-5Fmodules_issues_263-23issuecomment-2D691311744&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=7PHi3TDlwkvpc07MjENbOxVFl0u_sEurf250JnUFWCU&m=J79GCDblqI23SHF6XUeedT1q0k27ohAJONqTtvHvydE&s=21JL5b_1yNMg5mRdboxO5RT4aMgTErJts5uQQKaDB_Q&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AAP6VB6A4FWEQZYNBDEMGFTSFKGKXANCNFSM4RIKDWIQ&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=7PHi3TDlwkvpc07MjENbOxVFl0u_sEurf250JnUFWCU&m=J79GCDblqI23SHF6XUeedT1q0k27ohAJONqTtvHvydE&s=W2L9yMQfNYM2jltkhHPx2ZIBwwOqmqPJz8bYdQ3Nbbc&e=.

StevenCTimm commented 3 years ago

I would be interested to hear how the glideinwms developers tackled this issue--maybe we can use the same trick they did.

StevenCTimm commented 3 years ago

I am told by the glideinwms developers that the right way to treat byte stream data is to use the var.encode method, for example for byte stream variable bvar, to use bvar.encode("latin-1") latin-1 preserves all 256 characters and does not add extras.

for the decision engine the module in question is htcondor/publishers/publisher.py. There may be others.

This is a blocking issue, next release should not be cut without fixing this problem.

StevenCTimm commented 3 years ago

This is the dataframe_to_classads function in htcondor/publishers/publishers.py (which is inherited by two different publishers, glideinwms/publishers/glideclientglobal.py and glideinwms/publishers/fe_group_classads.py)

The point is that already in the dataframe byte stream things are encoded as b'1341356146146' so all that this code is doing is doing a dataframe.to_dict function.

The fix will have to be made either when the dataframe is originally made, or before calling this function. (or possibly after calling that function).

def dataframe_to_classads(dataframe): """ Convert pandas dataframe to list of HTCondor classads

:type dataframe: :obj:`DataFrame`
"""
ads = []
records = dataframe.to_dict(orient='records')
for record in records:
    # NOTE: Pandas will add NaN for some of the values. This causes
    #       extremely undesired/unexpected issues. Better to remove
    #       NaN values before converting a dataframe row to classad.
    ad_dict = {}
    for key in record:
        if pandas.notnull(record[key]):
            ad_dict[key] = record[key]
    ad = classad.ClassAd()
    ad.update(ad_dict)
    ads.append(ad)
return ads
DmitryLitvintsev commented 3 years ago

what machine have you encountered it on. I want to try to fix it there.

StevenCTimm commented 3 years ago

fermicloud117

Look at resource_request.log the errors are there.

Steve


From: Dmitry Litvintsev notifications@github.com Sent: Tuesday, September 15, 2020 12:48 PM To: HEPCloud/decisionengine_modules decisionengine_modules@noreply.github.com Cc: Steven C Timm timm@fnal.gov; Author author@noreply.github.com Subject: Re: [HEPCloud/decisionengine_modules] Python3 encodes hex strings differently, resulting in malformed htcondor classads (#263)

what machine have you encountered it on. I want to try to fix it there.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_HEPCloud_decisionengine-5Fmodules_issues_263-23issuecomment-2D692873700&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=10BCTK25QMgkMYibLRbpYg&m=TapO4uyz2q32XuFFqGWuHtXSJ_fIpnaA3g9_p170FrE&s=RSLiib_mu1EuTna4DWdbNKFDA7uZbONX8UoAgQXd0Lw&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AGG4SOACQIZIENGUHFEROKTSF6SFNANCNFSM4RIKDWIQ&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=10BCTK25QMgkMYibLRbpYg&m=TapO4uyz2q32XuFFqGWuHtXSJ_fIpnaA3g9_p170FrE&s=-FrMoDIffpVo6ciVZC22xeQ_CxvBkuM9Pw9xN0Z14Qs&e=.

DmitryLitvintsev commented 3 years ago

I might have fixed or not. There is quite frequent

020-09-15 14:56:17,543 - root - SourceProxy - 28957 - BillingInfoSourceProxy - ERROR - Error getting datablock for AWSbilling 'Last generation id not found for taskmanager=AWSbilling taskmanager_id=3B9348E2-0B05-4325-BCCC-72CBCDC16D6E'

I don't think these are related.

StevenCTimm commented 3 years ago

Ok I will check.

There is something wrong with the retry mechanism in the AWS billing source proxy. It is retrying too fast but it is not related to the other problem.

Steve


From: Dmitry Litvintsev notifications@github.com Sent: Tuesday, September 15, 2020 2:57 PM To: HEPCloud/decisionengine_modules decisionengine_modules@noreply.github.com Cc: Steven C Timm timm@fnal.gov; Author author@noreply.github.com Subject: Re: [HEPCloud/decisionengine_modules] Python3 encodes hex strings differently, resulting in malformed htcondor classads (#263)

I might have fixed or not. There is quite frequent

020-09-15 14:56:17,543 - root - SourceProxy - 28957 - BillingInfoSourceProxy - ERROR - Error getting datablock for AWSbilling 'Last generation id not found for taskmanager=AWSbilling taskmanager_id=3B9348E2-0B05-4325-BCCC-72CBCDC16D6E'

I don't think these are related.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_HEPCloud_decisionengine-5Fmodules_issues_263-23issuecomment-2D692945622&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=10BCTK25QMgkMYibLRbpYg&m=i--fbm9CtqJQeNE8H7-7SCH7QmTwoeZZe0XtKqJZy_Q&s=6I_YweLDRd3n5yt-UJDc9qmGMN3_Rhz8Zxg-lx79Q1w&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AGG4SOEQKVOEFEUUPS2KDVTSF7BLHANCNFSM4RIKDWIQ&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=10BCTK25QMgkMYibLRbpYg&m=i--fbm9CtqJQeNE8H7-7SCH7QmTwoeZZe0XtKqJZy_Q&s=76PfBM0V9qP_tIPJNGJEvcHZ9zN6kxdzps8JUQX0iG4&e=.

StevenCTimm commented 3 years ago

Not fixed. There are still binary encodings in the classads and also the condor_advertise is failing the same way as before.

No, it's not fixed\, error message is same as before

2020-09-15 15:02:32,384 - root - publisher - 28957 - MainThread - ERROR - ['Trac eback (most recent call last):\n', ' File "/usr/lib/python3.6/site-packages/dec isionengine_modules/htcondor/publishers/publisher.py", line 149, in publish_to_h tcondor\n self.condor_advertise(ads, collector_host=collector)\n', ' File "/ usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publi sher.py", line 116, in condor_advertise\n self.nretries, self.retry_interval) \n', ' File "/usr/lib/python3.6/site-packages/decisionengine_modules/util/retry _function.py", line 36, in retry_wrapper\n raise e\n', ' File "/usr/lib/pyth on3.6/site-packages/decisionengine_modules/util/retry_function.py", line 27, in retry_wrapper\n return f()\n', ' File "/usr/lib/python3.6/site-packages/deci sionengine_modules/htcondor/publishers/publisher.py", line 95, in _condor_advert ise\n collector.advertise(ads, update_ad_command, True)\n', 'ValueError: Fail ed to advertise to collector\n']

There is still binary encoding in the pandas dataframes

StevenCTimm commented 3 years ago

(you can see the binary by doing de-client --print-product glideclientglobal_manifests and/or glideclient_manifests) If it's in the dataframe then it will go through to the classads.

DmitryLitvintsev commented 3 years ago

So, I suspect this has nothing to do with bytes vs strings. I implemented the code that does decode all these b'abcd' and the error remains.

BTW, looking at your excerpt.txt I see :

GlideinEncParamSecurityClass435299 = "b'3b301c503aaaca7e166d7b67f3736b7e'"

That is this is really a string that "encapsulates" the bytes.

The code I tried looked something like:

        for key, value in record.items():
            if pandas.isnull(value):
                continue
            if isinstance(value, bytes): 
                ad_dict[key] = value.decode('latin-1')
            elif isinstance(value, str): 
                ad_dict[key] = value.lstrip("b'").rstrip("'")
            else:
                ad_dict[key] = value

Above achieves the dictionary with only strings and numerics. Still fails with the same error. Needs more work.

StevenCTimm commented 3 years ago

yes it is possible that it has nothing to do with the b' ' quoted string. (seeing that the gwms frontend successfully advertises stuff with b' ' quotes to the factory, although the factory can't parse it properly) I had a print statement in the publisher that actually printed out the classads--there might be some other formatting thing. I will check more carefully and try to advertise a classad manually.

StevenCTimm commented 3 years ago

Next thing to check is if we are even using the right condor_advertise method now.. python3-condor wasn't installed on the DE up until now, now it is. But still not clear if we are calling the right method with the right arguments either.. attempt to write a test with the same args as the DE code is using, didn't work right but neither did it reproduce the same error the code was getting.

StevenCTimm commented 3 years ago

We are using the right condor_advertise method now. And python3-condor was previously installed on the DE after all, via pip. now via rpm. Now quite sure I am calling the method with the right arguments and can reproduce the same .. attaching test code which works with python2 and fails with python3. Have opened htcondor-admin ticket 101502 to ask the condor people what they know.

StevenCTimm commented 3 years ago

testclassad.py.txt

StevenCTimm commented 3 years ago

This is the collectorLog on cmssrv258 when it fails to advertise;

09/15/20 22:09:08 AUTHENTICATE: Exchanging keys with remote side. 09/15/20 22:09:08 AUTHENTICATE: Result of end of authenticate is 1. 09/15/20 22:09:08 DC_AUTHENTICATE: authentication of 131.225.155.64 complete. 09/15/20 22:09:08 DC_AUTHENTICATE: message authenticator enabled with key id cms srv258:2359496:1600225748:196927. 09/15/20 22:09:08 DC_AUTHENTICATE: Success. 09/15/20 22:09:08 PERMISSION GRANTED to hepcloudDE117@cmssrv258.fnal.gov from ho st 131.225.155.64 for command 58 (UPDATE_AD_GENERIC), access level DAEMON: reaso n: cached result for DAEMON; see first case for the full reason 09/15/20 22:09:08 DC_AUTHENTICATE: added incoming session id cmssrv258:2359496:1 600225748:196927 to cache for 80 seconds (lease is 3620s, return address is unkn own). 09/15/20 22:09:08 Command 58 on Sock not followed by ClassAd (or timeout occured ) 09/15/20 22:09:08 Received malformed ad from command (58). Ignoring. 09/15/20 22:09:08 DaemonCommandProtocol: Not enough bytes are ready for read. 09/15/20 22:09:08 DC_AUTHENTICATE: received DC_AUTHENTICATE from <131.225.240.95 :17263> 09/15/20 22:09:08 DC_AUTHENTICATE: resuming session id cmssrv258:2359496:1600225 745:196926:

StevenCTimm commented 3 years ago

TJ Knoeller got back to me, he said that on his installation (running the htcondor build of master) that my testclassad.py script worked for him . He suggested a couple debugging statements to add. I added them and discovered that although the server was authenticating my client fermicloud117 OK, the opposite was not true, mutual authentication was required and my client was rejecting the server.

I added an appropriate condor_mapfile on the decision engine and now the standalone test works in python3. However I have not yet been able to get the corresponding publisher.py code within the decision engine to do the same. Still working on it.

StevenCTimm commented 3 years ago

Now I can see glideclient and glideclientglobal classads in fermicloud117 but they are not yet being correctly processed by the factory into glideins. Have to figure out what is going on.

StevenCTimm commented 3 years ago

I would appreciate it if Marco Mambelli could please have a look at cmssrv258.fnal.gov We are seeing the glideclient and glideclientglobal classads in the factory but they are not getting processed into glideins. I can't see any obvious errors but such must be happening.

StevenCTimm commented 3 years ago

pub_key_obj <glideinwms.factory.glideFactoryConfig.GlideinKey instance at 0x7f2a53c22440>) [2020-09-21 12:59:29,544] ERROR: glideFactoryCredentials:168: Error occurred processing the globals classads. Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/glideinwms/factory/glideFactoryCredentials.py", line 145, in process_global sym_key_obj, frontend_sec_name = validate_frontend(classad, frontend_descript, pub_key_obj) File "/usr/lib/python2.7/site-packages/glideinwms/factory/glideFactoryCredentials.py", line 214, in validate_frontend sym_key_obj = get_key_obj(pub_key_obj, classad) File "/usr/lib/python2.7/site-packages/glideinwms/factory/glideFactoryCredentials.py", line 189, in get_key_obj raise CredentialError(error_str) CredentialError: Symmetric key extraction failed. [2020-09-21 12:59:29,544] ERROR: glideFactory:515: Error occurred processing the globals classads: Traceback (most recent call last): File "/usr/sbin/glideFactory.py", line 513, in spawn frontendDescript) File "/usr/lib/python2.7/site-packages/glideinwms/factory/glideFactoryCredentials.py", line 169, in process_global raise CredentialError(error_str) CredentialError: Error occurred processing the globals classads. [2020-09-21 12:59:29,956] DEBUG: glideFactoryMonitorAggregator:861: aggregate_data, missing total data from file rrd_Log_Completed.xml [2020-09-21 12:59:29,997] DEBUG: glideFactoryMonitorAggregator:861: aggregate_data, missing total data from file rrd_Log_Completed_Stats.xml [2020-09-21 12:59:30,054] DEBUG: glideFactoryMonitorAggregator:861: aggregate_data, missing total data from file rrd_Log_Completed_WasteTime.xml [2020-09-21 12:59:30,103] DEBUG: glideFactoryMonitorAggregator:861: aggregate_data, missing total data from file rrd_Log_Counts.xml

StevenCTimm commented 3 years ago

Could I be up against an m2crypto incompatibility? Currently using M2crypto 0.36 from pip3 on fermicloud117. factory cmssrv268 (still python2 based) has m2crypto version 0.21

StevenCTimm commented 3 years ago

Found and fixed three more problems with Marco's help: 0) I had not patched in the latest released glideinwms code to fermicloud117, there were a couple late bug fixes. Patched 1) there was a problem with frontend.xml whereby we were sending the wrong ID to the cmssrv258 factory, claiming to be hepcsvc03 instead of fermicloud117. Fixed 2) The frontend.xml contains match_expr that had python2-specific syntax in them, had to change job.has_keys to if "DESIRED_Sites" in job.

Once this was done, we observed that a glidein was submitted on cmssrv258. Still investigating a bit because we are getting a warning that some of the entries we are requesting, are supposedly in downtime

[2020-09-21 14:35:52,618] WARNING: glideFactoryEntry:1189: Security class frontend is currently in a downtime window for entry: CMSHTPC_T3_US_TACC. Ignoring request.

Have to look at the security class configuration on the factory and frontends and see what is happening here.

StevenCTimm commented 3 years ago

So now there is one more mystery to figure out: The resource_request section says we are sending 1 idle 4 max requests to most of the resources

2020-09-21 14:44:41,202 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 7( 10 7 7 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0012 CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:44:41,219 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 4( 20 4 4 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0045 CMSHTPC_T3_US_SDSC_osg-comet@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:44:41,228 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 1( 10 1 1 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0112 CMSHTPC_T3_US_Bridges@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:44:41,239 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 1( 10 1 1 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0900 FNAL_HEPCLOUD_K8S@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:44:41,460 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 0( 0 0 0 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 0 0 | Up INFINITY CMSHTPC_T3_US_TACC_FRONTERA@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:45:28,048 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 7( 10 7 7 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0012 CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:45:28,073 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 4( 20 4 4 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0045 CMSHTPC_T3_US_SDSC_osg-comet@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:45:28,083 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 1( 10 1 1 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0112 CMSHTPC_T3_US_Bridges@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:45:28,105 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 1( 10 1 1 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 1 4 | Up 0.0900 FNAL_HEPCLOUD_K8S@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov 2020-09-21 14:45:28,522 - root - glide_frontend_element - 21907 - GlideinRequestManifests - INFO - 0( 0 0 0 0) 0( 0 40000) | 0 0 0 0 | 0 0 0 | 0 0 | Up INFINITY CMSHTPC_T3_US_TACC_FRONTERA@gfactory_instance_cmssrv258@gfactory_service_cmssrv258@cmssrv258.fnal.gov

But in practice the only one that really gets requested is CMSHTPC_T3_US_NERSC_Cori_shared with the 1, 4 as above.. the others in the list above end up as 0, 2. (idle, total respectively)

That includes CMSHTPC_T3_US_Bridges, FNAL_HEPCLOUD_K8S, and CMSHTPC_T3_US_SDSC_osg-comet

mambelli commented 3 years ago

Hi, here is a section I wrote for GlideinWMS about bytes, str, and encoding/decoding: https://cdcvs.fnal.gov/redmine/projects/glideinwms/wiki/Wiki#section-19 In glideinwms the libraries about cryptography (pubCrypto.py, symCrupto.py) and encodings do accept bytes and at times (unicode) stings, but outputs are always bytes. This is the same as python std libraries (md5, base64, ...)

StevenCTimm commented 3 years ago

OK I can now confirm that the classads are being sent correctly by the Decision Engine. Dmitry had added a patch above to do extra decoding on the classads. I have now removed that patch and reverted to what is in trunk and verified that the classads are still working OK. All the issues with the number of classads mentioned above have already been resolved. We can close this issue.

StevenCTimm commented 3 years ago

Closing this one, this version now in production.