fermi-ad / controls

Central repo for reporting bugs, making feature requests, managing RFCs, and requesting seminar topics.
https://www-bd.fnal.gov/controls/

Unable to reboot Booster BPM chassis via ACNET FE clx38 #52

Closed kengell closed 5 months ago

kengell commented 5 months ago

The node that handles the remote reboots of the Booster crates is not working again. My guess is it needs a reboot to re-add them to its list of nodes.

-Bobby

From: Salah J Chaurize [chaurize@fnal.gov](mailto:chaurize@fnal.gov)
Sent: Monday, January 22, 2024 9:53 AM
To: Robert R. Santucci [rsantucc@fnal.gov](mailto:rsantucc@fnal.gov)
Cc: John C Kuharik [kuharik@fnal.gov](mailto:kuharik@fnal.gov); Cheng-Yang Tan [cytan@fnal.gov](mailto:cytan@fnal.gov)
Subject: Remote boots

None of the Booster BPM nodes will reboot remotely. Is the clx list bad, or is the clx that manages boots down? Thanks,

-Salah

kengell commented 5 months ago

Erlang logs on clx38 are filling rapidly each half hour with stack traces of failed BPM communications

$ ll erlang.log.*
-rw-r--r-- 1 frontend frontend 193K Feb 14 10:25 erlang.log.4
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:24 erlang.log.3
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:20 erlang.log.2
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:17 erlang.log.1
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:14 erlang.log.10
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:10 erlang.log.9
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:07 erlang.log.8
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:04 erlang.log.7
-rw-r--r-- 1 frontend frontend 977K Feb 14 10:00 erlang.log.6

Stack: [{public_key,pkix_verify_hostname_match_fun,[https],[]},
        {httpc,ssl_verify_host_options,1,[{file,"httpc.erl"},{line,471}]},
        {httpc,'-http_options_default/0-fun-5-',0,[{file,"httpc.erl"},{line,1015}]},
        {httpc,http_options,3,[{file,"httpc.erl"},{line,961}]},
        {httpc,handle_request,9,[{file,"httpc.erl"},{line,771}]},
        {tcp_ascii_client_gpib,get_data,3,
         [{file,"tcp_ascii_client_gpib.erl"},{line,1145}]},
        {tcp_ascii_client_gpib,update_cmd_result_loop,2,
         [{file,"tcp_ascii_client_gpib.erl"},{line,880}]},
        {tcp_ascii_client_gpib,data_pool_updater_loop,1,
         [{file,"tcp_ascii_client_gpib.erl"},{line,824}]}]
tcp_ascii_client_gpib:data_pool_updater_loop: "bbpl06-crate_Port_undefined"
Exception: undef

kengell commented 5 months ago

Erlang logs filling rapidly.

If we remove the call to a cmdlist, the log files fill less quickly. So something within commandset_wiener3 is not correct and needs to be fixed.

Previously the config was:

,{cmdlist,{commandset, commandset_wiener3}}

Now the configuration for each BPM crate looks like this, with the cmdlist line commented out:

{38, gpib_drv,
 [{server, "bbpl24-crate"}
 ,{interface, http}
 ,{log_level, 3}
 ,{reset, "*rst"}
 ,{send_timeout, 500}
 ,{recv_timeout, 500}
 %% ,{cmdlist,{commanset, commandset_wiener3}}
 ,{req_options,[{headers_as_is, true}]}
 ,{data_pool_update_rate_in_msec, 2000}
 ]

kengell commented 5 months ago

Somewhat related to the inability to reboot BPM chassis remotely...

We are unable to get readbacks of any kind from clx38e. When accessing a BPM device (e.g. B:BBPL24VME) we get no readbacks at all, only zero bytes. The front end is generating stack traces like the one below and filling all the log files within 30 minutes.

Stack: [{public_key,pkix_verify_hostname_match_fun,[https],[]},
        {httpc,ssl_verify_host_options,1,[{file,"httpc.erl"},{line,471}]},
        {httpc,'-http_options_default/0-fun-5-',0,[{file,"httpc.erl"},{line,1015}]},
        {httpc,http_options,3,[{file,"httpc.erl"},{line,961}]},
        {httpc,handle_request,9,[{file,"httpc.erl"},{line,771}]},
        {tcp_ascii_client_gpib,get_data,3,
         [{file,"tcp_ascii_client_gpib.erl"},{line,1145}]},
        {tcp_ascii_client_gpib,update_cmd_result_loop,2,
         [{file,"tcp_ascii_client_gpib.erl"},{line,880}]},
        {tcp_ascii_client_gpib,data_pool_updater_loop,1,
         [{file,"tcp_ascii_client_gpib.erl"},{line,824}]}]

rneswold commented 5 months ago

undef usually means an attempt was made to call a function that hasn't been defined. Looking at the stack trace, I'd say the missing function is public_key:pkix_verify_hostname_match_fun. What's weird is that it doesn't give an arity (i.e. number of arguments) for the function. I wonder if the front-end config needs to specify that the public_key application be loaded in order for the function to be there. I'll look at the .rel file.
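One way to test this guess is to ask the running node directly. In Erlang stack traces the first frame may carry the argument list in the position where the other frames show an arity, and here that list is [https], which suggests a one-argument call. The sketch below assumes that arity of 1; the module name check_public_key is hypothetical and is not part of acsys-fe.

```erlang
-module(check_public_key).
-export([check/0]).

%% Reports whether the public_key module can be loaded on this node and
%% whether it exports the function named in the first stack frame.
%% The arity of 1 is an assumption read off the [https] argument list.
check() ->
    %% {module, public_key} on success, {error, Reason} otherwise.
    Loaded = code:ensure_loaded(public_key),
    %% true only if the module is loaded and exports the function.
    Exported = erlang:function_exported(public_key,
                                        pkix_verify_hostname_match_fun, 1),
    #{loaded => Loaded,
      exported => Exported,
      %% Path to the beam file, or the atom non_existing.
      path => code:which(public_key)}.
```

Calling check_public_key:check() from a shell attached to the front end would show an {error, Reason} tuple under loaded and false under exported if the release does not include public_key.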

kengell commented 5 months ago

@rneswold, with the upgrade of the Erlang libraries from 21 to 26, I believe Erlang added public key/certificate handling to the httpc.erl library.

Dennis has generated a new Erlang tar file, and now (on clx38) we are able to read/set parameters via the acsys-fe-gpib-tcp library.

kengell commented 5 months ago

We discovered, somewhat belatedly, that the Erlang library dependencies related to http usage have changed. Hence, I’ve created a new runtime distribution and put it in ~frontend/clx_sync_master (and distributed it to clx38e, which does http). If you have been running with code that hadn’t been formally “make install”ed, it is in danger of being overwritten (stomped on, if it hasn’t been already).

If you’re making your own versions of the Erlang runtime tarballs, please note and pick up the latest changes to acsys_fe.rel:
https://cdcvs.fnal.gov/redmine/projects/acsys-fe/repository/frontends/changes/acsys_fe.rel?rev=master
https://cdcvs.fnal.gov/redmine/projects/acsys-fe/repository/frontends/revisions/ea07e26bb3faf6d68b55bc3229b47e165e57c211

Dennis
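For anyone building their own tarball, the general shape of such a dependency change in a release resource file might look like the sketch below. This is a hypothetical excerpt, not the contents of acsys_fe.rel: the release version is a placeholder, the application list is an assumption based on the modules in the stack trace (httpc from inets calling into public_key), and the version strings are believed to match a stock OTP 26.0 install and must be checked against the lib directory of the runtime actually in use. The linked revision is the authoritative change.

```erlang
%% Hypothetical .rel excerpt for illustration only; NOT the real acsys_fe.rel.
{release,
 {"acsys_fe", "0.0.0"},     % placeholder release version
 {erts, "14.0"},            % assumed; must match the installed ERTS
 [{kernel, "9.0"},          % all versions below are assumptions too
  {stdlib, "5.0"},
  {crypto, "5.2"},
  {asn1, "5.1"},
  {public_key, "1.14"},     % supplies pkix_verify_hostname_match_fun
  {ssl, "11.0"},
  {inets, "9.0"}            % supplies httpc
 ]}.
```

An application that is missing from this list is not packaged into the runtime tarball, so calls into it fail with undef at run time.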