HEPCloud / decisionengine_modules

Exception in BillingInfoSourceProxy.py on gpde01. CRITICAL #106

Closed DmitryLitvintsev closed 4 years ago

DmitryLitvintsev commented 4 years ago

'BillingInfoSourceProxy': {
    'module': 'decisionengine_modules.AWS.sources.BillingInfoSourceProxy',
    'name': 'BillingInfoSourceProxy',
    'parameters': {
        'channel_name': 'gp_AWSbilling',
        'Dataproducts': ['AWS_Billing_Info', 'AWS_Billing_Rate'],
        'retries': 3,
        'retry_timeout': 20,
    }
},

Once I add the above configuration to gp_resource_request.conf on gpde01 I get the following exception:

2019-03-08 15:21:03,682 - decision_engine - TaskManager - BillingInfoSourceProxy - ERROR - Exception running source BillingInfoSourceProxy Could not get all data. Expected ['AWS_Billing_Info', 'AWS_Billing_Rate'] Filled []
2019-03-08 15:21:03,959 - decision_engine - TaskManager - MainThread - ERROR - Error occured during initial run of sources. Task Manager gp_resource_request exits

The debug log also shows:

2019-03-08 15:20:03,638 - decision_engine - SourceProxy - BillingInfoSourceProxy - DEBUG - KEYERROR KeyError('taskmanager_id=417 or generation_id=None or key=AWS_Billing_Info not found',)
2019-03-08 15:20:03,639 - decision_engine - SourceProxy - BillingInfoSourceProxy - DEBUG - KEYERROR KeyError('taskmanager_id=417 or generation_id=None or key=AWS_Billing_Rate not found',)
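For readers unfamiliar with the `retries` and `retry_timeout` parameters, here is a minimal sketch of the kind of polling loop they suggest. It is not the actual SourceProxy code: `fetch_with_retries`, `fetch`, and the `sleep` argument are hypothetical names, and the sketch assumes `retry_timeout` is a number of seconds to wait after each failed pass. The 60-second gap between the first KEYERROR line (15:20:03) and the final ERROR line (15:21:03) is at least consistent with three waits of 20 seconds, but the real implementation may differ.

```python
import time


def fetch_with_retries(fetch, expected_keys, retries=3, retry_timeout=20,
                       sleep=time.sleep):
    """Poll fetch(key) until every expected key is filled or retries run out.

    Hypothetical helper: `fetch` stands in for the lookup against the source
    channel's data block and must raise KeyError while a product is missing.
    """
    filled = {}
    for _ in range(retries):
        for key in expected_keys:
            if key in filled:
                continue
            try:
                filled[key] = fetch(key)
            except KeyError:
                pass  # product not published yet; try again on the next pass
        if len(filled) == len(expected_keys):
            return filled
        sleep(retry_timeout)  # assumed: wait in seconds between passes
    raise RuntimeError(
        "Could not get all data. Expected %s Filled %s"
        % (list(expected_keys), list(filled))
    )
```

With a loop like this, a source channel that publishes its products later than `retries * retry_timeout` seconds after the proxy starts would produce exactly the "Expected [...] Filled []" failure shown above.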


This same BillingInfoSourceProxy works on the other two decision engines hepcsvc03 and cmsde01 just fine. All three are configured to look at the same files.

The data blocks AWS_Billing_Info and AWS_Billing_Rate are being generated correctly by the gp_AWSbilling decision channel.

[root@gpde01 decisionengine]# de-client --print-product AWS_Billing_Info
Product AWS_Billing_Info: Found in channel gp_AWSbilling
+----+-----------------+-------------+---------------+---------------------------+-------------+----------------------+---------------+-------------------+-----------------+--------------------+------------------+-----------------------------+-----------------+-----------------+------------------+-----------------------------------+----------------------------+------------------------------+-------------------------------+-----------+---------------------+-------------------------+-----------+----------------+
| | AWSCloudTrail | AWSConfig | AWSIoT | AWSKeyManagementService | AWSLambda | AWSSupportBusiness | AccountName | AdjustedSupport | AdjustedTotal | AmazonCloudWatch | AmazonDynamoDB | AmazonElasticComputeCloud | AmazonGlacier | AmazonRoute53 | AmazonSimpleDB | AmazonSimpleNotificationService | AmazonSimpleQueueService | AmazonSimpleStorageService | AmazonSimpleWorkflowService | Balance | Date | EstimatedTotalDataOut | Total | TotalDataOut |
|----+-----------------+-------------+---------------+---------------------------+-------------+----------------------+---------------+-------------------+-----------------+--------------------+------------------+-----------------------------+-----------------+-----------------+------------------+-----------------------------------+----------------------------+------------------------------+-------------------------------+-----------+---------------------+-------------------------+-----------+----------------|
| 0 | 0.160235 | 0 | nan | 0 | 0 | 0.128996 | CMS | 0.0207718 | 0.21343 | 0 | 3.5e-08 | 0 | nan | 0 | nan | 0 | 0 | 0.0324231 | nan | 221870 | 2019-03-01 00:00:00 | 0 | 0.192658 | 0 |
| 1 | 0.276951 | 0 | nan | 0 | 0 | 1.31725 | Fermilab | 4.2057 | 43.2135 | 0.0124664 | 5.04e-08 | 38.6534 | nan | 0.9275 | nan | 0 | 0 | 0.0650356 | nan | 99694.8 | 2019-03-01 00:00:00 | 0.00549571 | 39.0078 | 0 |
| 2 | 0 | 0 | nan | 0 | 0 | 0.0122968 | NOvA | 0.0017213 | 0.0189343 | 0.00336022 | 0 | 0 | nan | 0 | nan | 0 | 0 | 0.0138528 | nan | 4207.99 | 2019-03-01 00:00:00 | 0 | 0.017213 | 0 |
| 3 | 0.186817 | 0.083475 | 1.50162e-05 | 0.031166 | 0 | 1.35381 | RnD | 1.44321 | 14.829 | 0.00587998 | 3.886e-07 | 12.3049 | 4.29e-08 | 0.46375 | 0 | 1.37e-08 | 0 | 0.309821 | 0 | 3392.37 | 2019-03-01 00:00:00 | 8.678e-07 | 13.3858 | 0 |
+----+-----------------+-------------+---------------+---------------------------+-------------+----------------------+---------------+-------------------+-----------------+--------------------+------------------+-----------------------------+-----------------+-----------------+------------------+-----------------------------------+----------------------------+------------------------------+-------------------------------+-----------+---------------------+-------------------------+-----------+----------------+
Found in channel gp_resource_request
'taskmanager_id=429 or generation_id=0 or key=AWS_Billing_Info not found'

============

From what I have been able to see so far by walking through the underlying code in SourceProxy.py, the class from which BillingInfoSourceProxy.py inherits, the code successfully finds the DataSpace, TaskManager, and DataBlock, but the keys we are looking for are not in the DataBlock.

We may be looking at some kind of race condition in which gpde01 processes the channels in a different order. We need developers to look into why the data block is not available to the SourceProxy, and why no number of restarts clears the condition.

Dmitry has already started to look. This is critical: we can't go live until this is resolved.

Steve Timm

DmitryLitvintsev commented 4 years ago

See https://hepcloud-git.fnal.gov:8443/hepcloud/decisionengine_modules/issues/169

StevenCTimm commented 4 years ago

We still get timeouts on this from time to time, but we can work around them by setting retries to 100 rather than 3. We can close this issue now.
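As a sketch of that workaround, here is the stanza from the original report with only `retries` changed, written as a Python dict so the effect can be computed. The name `billing_proxy` is just for illustration, and the wait calculation assumes `retry_timeout` is the number of seconds between attempts, which this thread does not confirm.

```python
# Same source-proxy stanza as in the report, with only `retries` raised.
billing_proxy = {
    "BillingInfoSourceProxy": {
        "module": "decisionengine_modules.AWS.sources.BillingInfoSourceProxy",
        "name": "BillingInfoSourceProxy",
        "parameters": {
            "channel_name": "gp_AWSbilling",
            "Dataproducts": ["AWS_Billing_Info", "AWS_Billing_Rate"],
            "retries": 100,       # was 3
            "retry_timeout": 20,  # assumed: seconds between attempts
        },
    }
}

params = billing_proxy["BillingInfoSourceProxy"]["parameters"]
# Upper bound on how long the proxy keeps waiting, under the assumption above.
max_wait_seconds = params["retries"] * params["retry_timeout"]
print(max_wait_seconds)  # 2000
```

Under that assumption the proxy would wait up to roughly 33 minutes for gp_AWSbilling to publish its products, instead of one minute. This hides the ordering problem but does not remove it.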