dmwm / PHEDEX

CMS data-placement suite

ORA-01400: cannot insert NULL into ("CMS_TRANSFERMGMT"."T_AGENT_LOG"."MESSAGE") #1012

Closed nikmagini closed 8 years ago

nikmagini commented 8 years ago

On Oct 27th at 17:00 the central FileRouter agent became unable to reconnect to the DB because of the following error:

2015-10-28 16:33:42: FileRouter[11881]: alert: database error: DBD::Oracle::st execute failed: ORA-01400: cannot insert NULL into ("CMS_TRANSFERMGMT"."T_AGENT_LOG"."MESSAGE") (DBD ERROR: error possibly near <*> indicator at char 212 in 'insert into t_agent_log (time_update, reason, host_name, user_name, process_id, working_directory, state_directory, message) values (:now, :reason, :host_name, :user_name, :process_id, :working_dir, :state_dir, :<*>message)') [for Statement "insert into t_agent_log (time_update, reason, host_name, user_name, process_id, working_directory, state_directory, message) values (:now, :reason, :host_name, :user_name, :process_id, :working_dir, :state_dir, :message)" with ParamValues: :host_name='vocms0214.cern.ch', :message=undef, :now=1446050022.58719, :process_id=11881, :reason='AGENT RECONNECTED', :state_dir='/data/ProdNodes/Prod_Mgmt/state/mgmt-router/', :user_name='phedex', :working_dir='/data/ProdNodes'] at /data/ProdNodes/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.

The "MESSAGE" column should contain the output of ps -p $$ wwwwuh 2>/dev/null. I'm not sure how it could be NULL in the agent; running the command manually returned:

phedex 11881 29.6 35.9 3289188 2900480 ? S Sep29 12366:47 perl /data/ProdNodes/PHEDEX/Toolkit/Infrastructure/FileRouter -state /data/ProdNodes/Prod_Mgmt/state/mgmt-router/ -log /data/ProdNodes/Prod_Mgmt/logs/mgmt-router -db /data/ProdNodes/SITECONF/CH_CERN/PhEDEx/DBParam:Prod/CENTRAL -request-alloc BY_AGE -window-size 15
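To illustrate how an empty ps result can end up as a NULL bind value, here is a minimal sketch in Python. The real agent is written in Perl and its capture code is not shown in this thread, so the helper names capture_process_line and ps_command are hypothetical and only model the behaviour described above.

```python
import subprocess
import sys


def ps_command(pid):
    """Build the argument list equivalent to: ps -p <pid> wwwwuh"""
    return ["ps", "-p", str(pid), "wwwwuh"]


def capture_process_line(cmd):
    """Run cmd, discard stderr, and return stdout, or None if nothing was printed.

    Returning None models a NULL bind value: if the command prints nothing,
    there is no text to store in a NOT NULL column.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as 2>/dev/null
        text=True,
    )
    output = result.stdout.strip()
    return output or None


if __name__ == "__main__":
    import os

    # Describe the current process, as the agent does for itself with $$.
    print(capture_process_line(ps_command(os.getpid())))
```

Because stderr is discarded, a failing ps leaves no trace of why it failed; the caller only sees an empty result.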

nikmagini commented 8 years ago

Happened again on Dec 3rd, causing the FileRouter to stop working for 12h.

nikmagini commented 8 years ago

Happened again on Feb 19th, causing the FileRouter to stop working for 24h.

nikmagini commented 8 years ago

Happened again on April 4th; Jorge noticed the issue and restarted the agent after 12h.

With increased debug logging, we found that the ps command produces no output on either stdout or stderr.

Considering turning this into a fatal error in the agent, so that the agent at least stops itself and the Watchdog then restarts it automatically.
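The fail-fast idea can be sketched as follows. This is an illustrative Python sketch, not the PhEDEx change itself (the agent is Perl); FatalAgentError, build_log_row and log_or_exit are hypothetical names, and insert_row stands in for whatever performs the database insert.

```python
import sys


class FatalAgentError(Exception):
    """Hypothetical error type: the agent cannot continue and should exit."""


def build_log_row(reason, message):
    """Build a log row, refusing a missing message.

    ORA-01400 shows that MESSAGE is a NOT NULL column, so an empty or
    missing value is rejected here instead of at insert time.
    """
    if not message:
        raise FatalAgentError("empty process description for reason %r" % reason)
    return {"reason": reason, "message": message}


def log_or_exit(reason, message, insert_row):
    """Insert the row, or exit non-zero so a supervisor can restart the process."""
    try:
        insert_row(build_log_row(reason, message))
    except FatalAgentError as err:
        print("fatal: %s" % err, file=sys.stderr)
        sys.exit(1)
```

Exiting, rather than retrying the insert indefinitely, hands recovery to the supervising process, which is the role the Watchdog plays for the agents.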

nikmagini commented 8 years ago

Change released with PhEDEx 4.1.8 and deployed in production on the central agents on 2016-06-09. The agents now throw a 'fatal' error and exit when they enter this state; the Watchdog then restarts them automatically and they recover. Closing.