US-EPA-CAMD / easey-ui

Project Management repo for EPA Clean Air Markets Division (CAMD) Business Suite of applications
MIT License
0 stars 0 forks source link

Unit_Fact job failed again #6109

Open shahbazkhan1999 opened 7 months ago

shahbazkhan1999 commented 7 months ago

failed with this error.. FnName: Execute Direct -- [Informatica][ODBC PostgreSQL Wire Protocol driver]Socket closed., SQLSTATE [08S01]

Image

spetros-do commented 7 months ago

The 3/12 discussion b/w EPA/DPC identifies multiple potential issues contributing to the ETL/DM job's failure and explores various solutions:

  1. Extended Truncation Time:

    • A job attempted to truncate a table at 1:44 AM but failed at 4:23 AM, taking approximately three hours. This delay is hypothesized to be due to a streaming service API causing locks on the table.
  2. Socket Closed Error:

    • A job failed due to a "socket closed" error, speculated to be because another job might have prematurely closed the tunnel needed for database connection.
  3. Streaming API and External Queries:

    • The possibility of external users running intensive queries through a public API, potentially flooding the system. This led to considering rate limiting or setting a maintenance window to restrict access during critical job operations.
  4. Investigation and Next Action Items:

    • DPC team is tasked with further investigation, including adjusting API rate limits and potentially identifying users causing the load.
    • A consideration to move away from truncate and reload strategies towards using a log table approach for data management.
    • Implement a technical/policy level approach by limiting user queries during specific maintenance times.
    • Confirming driver versions and database compatibility, along with the potential need to update system configurations to support current operational requirements.
mark-hayward-erg commented 7 months ago

Follow-up question:

Another option for altering the reloading process:

annalbrecht commented 7 months ago

Blocked by #5997.

j-tafoya commented 7 months ago

"[Mike] and I configured the API gateway for staging to limit the streaming requests to 1 every 10 seconds. If a user exceeds this threshold they will receive an http status 429 “Over Rate Limit”. This does not effect any requests coming from the CAMPD application, only users that have their own API key. We’ll test this in staging for a bit and then plan for a rollout to production. If you are testing in staging and notice any strange behavior please let us know. "

alangmaid commented 4 months ago

@ibarra-michelle @maheese DPC/ERG suggest punting this to 2025 Q1

ibarra-michelle commented 4 months ago

@alangmaid and @annalbrecht , yes @maheese and I agree to punt to 2025 Q1, thx.

szintgraff commented 4 months ago

Record of DataMart Job Failure. Reruning this job at 9 am caused production issues. Refer to RT tickets: CAMPD Support #74781 and CAMPD Support #74782

s_m_Load_CAMD_UNIT_FACT_1719389958.log

DataMart Job Failure Notification.pdf

szintgraff commented 3 months ago

Record of DataMart Job Failure Notification Saturday 7/6 at 4:23 am.

s_m_Load_CAMD_UNIT_FACT_1720254150.log