Closed nikmagini closed 8 years ago
Happened again on Dec 3rd, causing the FileRouter to stop working for 12h.
Happened again on Feb 19th, causing the FileRouter to stop working for 24h.
Happened again on April 4th; Jorge noticed the issue and restarted the agent after 12h
With increased debug logging, found out that the ps command is not producing any stdout or stderr
Considering turning this into a fatal error in the agent, so that at least the agent will stop itself, and then Watchdog will restart it automatically
Change released with PhEDEx 4.1.8 and deployed in production on the central agents on 2016-06-09 - the agents now throw a 'fatal' error and exit when they enter this state. The Watchdog then restarts them automatically and they recover on their own. Closing.
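For anyone reading this later, the idea behind the fix can be sketched roughly as follows. This is not the PhEDEx code (the agent is written in Perl); it is a minimal Python illustration, and the names `FatalAgentError` and `process_snapshot` are hypothetical. It assumes the only goal is to treat an empty process listing as fatal rather than passing an undefined value on to the `t_agent_log` insert.

```python
import subprocess


class FatalAgentError(RuntimeError):
    """Hypothetical error type: the caller is expected to log it and exit."""


def process_snapshot(command):
    """Run `command` and return its stdout with surrounding whitespace removed.

    Raises FatalAgentError when the command prints nothing, so the caller
    never ends up with an empty value for the NOT NULL MESSAGE column.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # mirrors the 2>/dev/null in the original call
        universal_newlines=True,
    )
    output = result.stdout.strip()
    if not output:
        raise FatalAgentError(
            "command %r produced no output (exit status %d)"
            % (command, result.returncode)
        )
    return output
```

In an agent this would be called with something like `["ps", "-p", str(os.getpid()), "wwwwuh"]`, and on `FatalAgentError` the process would exit with a non-zero status so that an external supervisor (here, the Watchdog) can restart it.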
On Oct 27th at 17:00 the central FileRouter agent became unable to reconnect to the DB because of the following error:
2015-10-28 16:33:42: FileRouter[11881]: alert: database error: DBD::Oracle::st execute failed: ORA-01400: cannot insert NULL into ("CMS_TRANSFERMGMT"."T_AGENT_LOG"."MESSAGE") (DBD ERROR: error possibly near <*> indicator at char 212 in 'insert into t_agent_log (time_update, reason, host_name, user_name, process_id, working_directory, state_directory, message) values (:now, :reason, :host_name, :user_name, :process_id, :working_dir, :state_dir, :<*>message)') [for Statement "insert into t_agent_log (time_update, reason, host_name, user_name, process_id, working_directory, state_directory, message) values (:now, :reason, :host_name, :user_name, :process_id, :working_dir, :state_dir, :message)" with ParamValues: :host_name='vocms0214.cern.ch', :message=undef, :now=1446050022.58719, :process_id=11881, :reason='AGENT RECONNECTED', :state_dir='/data/ProdNodes/Prod_Mgmt/state/mgmt-router/', :user_name='phedex', :working_dir='/data/ProdNodes'] at /data/ProdNodes/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.
The "MESSAGE" column should contain the output of
ps -p $$ wwwwuh 2>/dev/null
- not sure how it could be NULL in the agent; running it manually returned:
phedex 11881 29.6 35.9 3289188 2900480 ? S Sep29 12366:47 perl /data/ProdNodes/PHEDEX/Toolkit/Infrastructure/FileRouter -state /data/ProdNodes/Prod_Mgmt/state/mgmt-router/ -log /data/ProdNodes/Prod_Mgmt/logs/mgmt-router -db /data/ProdNodes/SITECONF/CH_CERN/PhEDEx/DBParam:Prod/CENTRAL -request-alloc BY_AGE -window-size 15