epics-base / pvAccessCPP

pvAccessCPP is an EPICS V4 C++ module
https://epics-base.github.io/pvAccessCPP/

Support for TCP Searches #192

Open sveseli opened 10 months ago

sveseli commented 10 months ago

This PR has been modified from its original scope (support for direct TCP connections) to include support for TCP searches.

Server side changes:

Client side changes:

Here are some examples of how things work. In terminal 1, we run the test server using non-default ports:

daq-dss02> EPICS_PVA_SERVER_PORT=11111 EPICS_PVA_BROADCAST_PORT=22222 ./modules/pvAccess/testApp/O.linux-x86_64/testServer
pvAccess Server v7.1.8-SNAPSHOT
Active configuration (w/ defaults)
EPICS_PVAS_INTF_ADDR_LIST = 0.0.0.0:11111
EPICS_PVAS_BEACON_ADDR_LIST = 
EPICS_PVAS_AUTO_BEACON_ADDR_LIST = YES
EPICS_PVAS_BEACON_PERIOD = 15
EPICS_PVAS_BROADCAST_PORT = 22222
EPICS_PVAS_SERVER_PORT = 11111
EPICS_PVAS_PROVIDER_NAMES = local
...

In terminal 2, on a second machine, we run a client. If we do not specify anything, the client fails to connect:

daq-qss21> pvget testCounter
Timeout
testCounter

If we specify the broadcast port, the channel connects using UDP discovery:

daq-qss21> EPICS_PVA_BROADCAST_PORT=22222 pvget testCounter
testCounter 2024-03-22 14:01:33.332  27 
daq-qss21> EPICS_PVA_BROADCAST_PORT=22222 pvget -d testCounter
2024-03-22T14:01:41.838 Configured PVA address list: 
2024-03-22T14:01:41.838 Configured server port: 5075
2024-03-22T14:01:41.838 Configured name server address list: 
2024-03-22T14:01:41.838 Configured broadcast port: 22222
...
2024-03-22T14:01:41.839 Server address decoded as 10.6.13.8:11111, transport type is udp
...
2024-03-22T14:01:41.841 Connected to PVA server: 10.6.13.8:11111.
2024-03-22T14:01:41.841 Unregistering search instance: testCounter
testCounter 2024-03-22 14:01:41.333  35 
...

If we specify the EPICS_PVA_NAME_SERVERS variable, a TCP search results in a channel connection:

daq-qss21> EPICS_PVA_NAME_SERVERS=daq-dss02:11111 pvget testCounter
testCounter 2024-03-22 14:14:16.483  790 
daq-qss21> EPICS_PVA_NAME_SERVERS=daq-dss02:11111 pvget -d testCounter
2024-03-22T14:14:23.016 Configured PVA address list: 
2024-03-22T14:14:23.016 Configured server port: 5075
2024-03-22T14:14:23.016 Configured name server address list: daq-dss02:11111
2024-03-22T14:14:23.016 Configured broadcast port: 5076
2024-03-22T14:14:23.018 Creating datagram socket from: 0.0.0.0:53776.
...
2024-03-22T14:14:23.019 Getting name server transport for address 10.6.13.8:11111
...
2024-03-22T14:14:23.021 Searching for channel: testCounter
2024-03-22T14:14:23.021 Server address decoded as 10.6.13.8:11111, transport type is tcp
...
2024-03-22T14:14:23.021 Connecting to PVA server: 10.6.13.8:11111.
...
testCounter 2024-03-22 14:14:22.485  796 
...
sveseli commented 4 months ago

... Is this PR okay to be merged ...

No. There are several aspects of this proposed design which I do not like, e.g. a blocking connect() not on a dedicated thread. Pointing these out piecemeal does not seem to be moving us towards a resolution, but I have not found the time to write out an alternate design.

The most recent commits (two months ago) addressed the issue of blocking connect you brought up. The connections are made via a separate set of timers. If there are other issues that need to be resolved, I would be happy to address them.

mdavidsaver commented 4 months ago

The most recent commits (two months ago) addressed the issue of blocking connect you brought up. ...

Then what is going on with this wait(0.1)?

    // Wait to get transport if we do not have it
    m_nsTransportEvent.wait(MAX_NS_TRANSPORT_WAIT_TIME);

As for the rest: I have a limited amount of time to spend on EPICS community work, and many projects asking for my attention. I would have an easier time reviewing your proposed changes if you would say more about what you are trying to achieve and how you intend to achieve it. Correctly inferring intent from code is both difficult and error-prone.

... released as soon as they are no longer needed ...

e.g. Do you intend that an idle client will not maintain name server connections? Why not maintain NS connections continually? Are you making a trade-off of resource use vs. (re)connect latency?

e.g. Why are connections to name servers treated differently from other PVA servers? Do you see this as a "need", a convenience, ...?

sveseli commented 4 months ago

The most recent commits (two months ago) addressed the issue of blocking connect you brought up. ...

Then what is going on with this wait(0.1)?

    // Wait to get transport if we do not have it
    m_nsTransportEvent.wait(MAX_NS_TRANSPORT_WAIT_TIME);

If you look closely, the earlier code had an actual connect() in this thread. Now a name server connect timer is scheduled; it performs the actual connect and signals the corresponding event once the connection is established. I use a short wait on this event just to see whether the connection can be established quickly. This greatly improves search performance when the name server responds, and does not hurt much when it does not, since in that case we have no channel connection anyway. I could certainly ensure that this wait happens only when the name server connection timer starts, rather than on every name server search attempt, but I suspect this would have very little effect on channel timeouts, etc. (see my examples of what happens when trying to connect to channels on non-existent machines).
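To make the handshake concrete, here is a minimal sketch of the pattern described above, written with standard C++ threading primitives rather than the actual pvAccess timer and event classes; NsTransportState and connectTimerTask are illustrative names, not identifiers from the PR. A stand-in timer thread performs the blocking connect and signals an event, while the search thread waits at most 0.1 s (cf. MAX_NS_TRANSPORT_WAIT_TIME) before moving on:

    // Illustrative sketch only: standard C++ stand-ins for the PR's connect
    // timer and event. NsTransportState and connectTimerTask are hypothetical
    // names, not identifiers from the PR.
    #include <chrono>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    struct NsTransportState {
        std::mutex mtx;
        std::condition_variable cv;
        bool connected = false;     // set once the connect timer succeeds
    };

    // Stand-in for the name server connect timer: performs the blocking
    // connect() off the search thread, then signals the event.
    void connectTimerTask(NsTransportState& st) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50)); // pretend connect()
        {
            std::lock_guard<std::mutex> lk(st.mtx);
            st.connected = true;
        }
        st.cv.notify_all();
    }

    int main() {
        NsTransportState st;
        std::thread timer(connectTimerTask, std::ref(st));

        // Search-thread side: wait briefly (cf. MAX_NS_TRANSPORT_WAIT_TIME,
        // the 0.1 s questioned above) for the transport; if it is not ready
        // in time, move on and retry on a later search attempt.
        {
            std::unique_lock<std::mutex> lk(st.mtx);
            bool ready = st.cv.wait_for(lk, std::chrono::milliseconds(100),
                                        [&st] { return st.connected; });
            std::cout << (ready ? "transport ready, sending TCP search\n"
                                : "transport not ready yet, retry later\n");
        }
        timer.join();
        return 0;
    }

The point of the pattern is that a slow or dead name server costs the search thread at most the short wait, never an unbounded blocking connect().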

As for the rest: I have a limited amount of time to spend on EPICS community work, and many projects asking for my attention. I would have an easier time reviewing your proposed changes if you would say more about what you are trying to achieve and how you intend to achieve it. Correctly inferring intent from code is both difficult and error-prone.

... released as soon as they are no longer needed ...

e.g. Do you intend that an idle client will not maintain name server connections? Why not maintain NS connections continually? Are you making a trade-off of resource use vs. (re)connect latency?

This code is intended to work in the general case, where the name server connection is distinct from PVA server connections. Rather than having every client maintain name server connections (and thus keep those TCP sockets alive), I chose to close the connections as soon as they were no longer needed, in order to conserve resources. There is obviously a trade-off here between resource use and reconnection latency. One can argue either way about which matters more, but given that a name server can have hundreds or even thousands of client connections at any given time, I decided to prioritize resources rather than worry about reconnecting a bit faster every once in a while.
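As an illustration of this lifecycle (not the PR's actual code), a holder that reference-counts the name server connection and drops it when the last pending search resolves might look like the following; NsConnection and SearchManager are hypothetical names:

    // Illustrative sketch only: one possible shape for "close the name
    // server connection as soon as it is not needed". NsConnection and
    // SearchManager are hypothetical, not the PR's classes.
    #include <iostream>
    #include <memory>
    #include <set>
    #include <string>

    struct NsConnection {
        NsConnection()  { std::cout << "opening NS socket\n"; }
        ~NsConnection() { std::cout << "closing NS socket\n"; } // socket released here
    };

    class SearchManager {
        std::set<std::string> pending;       // channels still being searched
        std::shared_ptr<NsConnection> ns;    // held only while searches exist
    public:
        void startSearch(const std::string& channel) {
            if (!ns)
                ns = std::make_shared<NsConnection>(); // (re)connect on demand
            pending.insert(channel);
        }
        void searchResolved(const std::string& channel) {
            pending.erase(channel);
            if (pending.empty())
                ns.reset();                  // last search done: drop the socket
        }
    };

    int main() {
        SearchManager mgr;
        mgr.startSearch("testCounter");      // opens the NS socket
        mgr.searchResolved("testCounter");   // closes it again
        return 0;
    }

A grace period before the reset (as suggested later in this thread) would trade a few extra seconds of socket lifetime for faster handling of channels requested in bursts.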

e.g. Why are connections to name servers treated differently from other PVA servers? Do you see this as a "need", a convenience, ...?

This stems partly from the fact that connections to name servers differ in general from connections to PVA servers, in the sense that we use them only in the discovery phase, and also (for the reasons mentioned above) I chose to close them as soon as possible. Another reason is to minimize the amount of new code and the impact of the required changes on the existing code, and to ensure that whatever happens with the name server connections does not affect client behavior or existing PVA server connections.

I do appreciate the feedback and suggestions, and I am willing to make any changes that are feasible and do not require a major rewrite of the existing code.

AppVeyorBot commented 4 months ago

✅ Build pvAccessCPP 1.0.110 completed (commit https://github.com/epics-base/pvAccessCPP/commit/a6f58f8049 by @sveseli)

anjohnson commented 4 months ago

Meeting:

MD: All servers must support "_server" for RPC to get the name list.
SV: This PR now just adds support for PVA searches over TCP; it does not implement a name server. That will come in a separate PR (it was removed from this PR earlier).
Agreed: MD will see the implementation of a future name server before SV distributes it.
MD: QOS_REPLY_REQUIRED does not need to be set; a searching application cancelling the search is the only thing that can cause the number of searches to go to 0. This should reduce traffic. Be careful to avoid stalling the search process if the name server dies.
SV: The client only sends 3 searches to a connected name server before giving up and trying the next name server in its list.
ANJ: Rather than "the name server", we want to be able to query multiple name servers at the same time, to reduce search latency.
SV: Want to save resources by not keeping sockets open when there are no names left to find.
MD: Having multiple open sockets is likely to need fewer resources than multiple threads.
ANJ: Okay with closing NS sockets when all searches are done, but wait for some period after the last name is found in case the client asks for more channels.
MD: Define test cases to prove that this feature does what is intended. Measure that UDP searches and TCP searches do not take significantly different amounts of time to complete. How long does it take to search for 10,000 PVs over both transports? Use the client API for the measurement, so it is not a micro-benchmark (see the sketch after these notes).
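One way the requested measurement might look, sketched against the existing pvac client API (pva/client.h); the PV name pattern, channel count, and timeout are placeholders, and the same binary would be run once with UDP search configured and once with EPICS_PVA_NAME_SERVERS set:

    // Rough sketch of the requested measurement using the pvac client API.
    // "test:pv<i>" names and the 30 s timeout are placeholders.
    #include <pva/client.h>

    #include <chrono>
    #include <iostream>
    #include <sstream>
    #include <vector>

    int main() {
        pvac::ClientProvider provider("pva");
        const size_t nChannels = 10000;      // figure from the meeting notes

        const auto start = std::chrono::steady_clock::now();

        std::vector<pvac::ClientChannel> channels;
        channels.reserve(nChannels);
        for (size_t i = 0; i < nChannels; ++i) {
            std::ostringstream name;
            name << "test:pv" << i;          // hypothetical PV names
            channels.push_back(provider.connect(name.str()));
        }

        // Force a round trip on every channel so the measurement covers
        // actual connection, not just queued search requests; get() throws
        // on timeout.
        for (size_t i = 0; i < channels.size(); ++i)
            channels[i].get(30.0);

        const double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        std::cout << nChannels << " channels connected and read in "
                  << elapsed << " s\n";
        return 0;
    }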

Action items on SV: