epics-base / pvAccessCPP

pvAccessCPP is an EPICS V4 C++ module
https://epics-base.github.io/pvAccessCPP/

Support for TCP Searches #192

Open sveseli opened 10 months ago

sveseli commented 10 months ago

This PR has been modified from the original (support for direct tcp connections) to include support for tcp searches.

Server side changes:

Client side changes:

Here are some examples of how things work. In terminal 1, we run test server using non-default ports:

daq-dss02> EPICS_PVA_SERVER_PORT=11111 EPICS_PVA_BROADCAST_PORT=22222 ./modules/pvAccess/testApp/O.linux-x86_64/testServer
pvAccess Server v7.1.8-SNAPSHOT
Active configuration (w/ defaults)
EPICS_PVAS_INTF_ADDR_LIST = 0.0.0.0:11111
EPICS_PVAS_BEACON_ADDR_LIST = 
EPICS_PVAS_AUTO_BEACON_ADDR_LIST = YES
EPICS_PVAS_BEACON_PERIOD = 15
EPICS_PVAS_BROADCAST_PORT = 22222
EPICS_PVAS_SERVER_PORT = 11111
EPICS_PVAS_PROVIDER_NAMES = local
...

In terminal 2, on a second machine, we run the client. If we do not specify anything, the client fails to connect:

daq-qss21> pvget testCounter
Timeout
testCounter

If we specify the broadcast port, the channel connects using udp discovery:

daq-qss21> EPICS_PVA_BROADCAST_PORT=22222 pvget testCounter
testCounter 2024-03-22 14:01:33.332  27 
daq-qss21> EPICS_PVA_BROADCAST_PORT=22222 pvget -d testCounter
2024-03-22T14:01:41.838 Configured PVA address list: 
2024-03-22T14:01:41.838 Configured server port: 5075
2024-03-22T14:01:41.838 Configured name server address list: 
2024-03-22T14:01:41.838 Configured broadcast port: 22222
...
2024-03-22T14:01:41.839 Server address decoded as 10.6.13.8:11111, transport type is udp
...
2024-03-22T14:01:41.841 Connected to PVA server: 10.6.13.8:11111.
2024-03-22T14:01:41.841 Unregistering search instance: testCounter
testCounter 2024-03-22 14:01:41.333  35 
...

If we specify the EPICS_PVA_NAME_SERVERS variable, a tcp search results in a channel connection:

daq-qss21> EPICS_PVA_NAME_SERVERS=daq-dss02:11111 pvget testCounter
testCounter 2024-03-22 14:14:16.483  790 
daq-qss21> EPICS_PVA_NAME_SERVERS=daq-dss02:11111 pvget -d testCounter
2024-03-22T14:14:23.016 Configured PVA address list: 
2024-03-22T14:14:23.016 Configured server port: 5075
2024-03-22T14:14:23.016 Configured name server address list: daq-dss02:11111
2024-03-22T14:14:23.016 Configured broadcast port: 5076
2024-03-22T14:14:23.018 Creating datagram socket from: 0.0.0.0:53776.
...
2024-03-22T14:14:23.019 Getting name server transport for address 10.6.13.8:11111
...
2024-03-22T14:14:23.021 Searching for channel: testCounter
2024-03-22T14:14:23.021 Server address decoded as 10.6.13.8:11111, transport type is tcp
...
2024-03-22T14:14:23.021 Connecting to PVA server: 10.6.13.8:11111.
...
testCounter 2024-03-22 14:14:22.485  796 
...

AppVeyorBot commented 10 months ago

:white_check_mark: Build pvAccessCPP 1.0.80 completed (commit https://github.com/epics-base/pvAccessCPP/commit/2da7698c75 by @sveseli)

kasemir commented 10 months ago

So with PVXS and the new java lib, this is how you would configure TCP-only searches:

EPICS_PVA_NAME_SERVERS="IP1:PORT1 IP2:PORT2"
EPICS_PVA_AUTO_ADDR_LIST=NO
EPICS_PVA_ADDR_LIST=""

If only setting EPICS_PVA_NAME_SERVERS, it will contact those TCP addresses plus still send UDP searches.
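
As a rough illustration of the value format shown above, the sketch below splits a whitespace-separated EPICS_PVA_NAME_SERVERS string into (host, port) pairs. The helper name `parse_name_servers` is hypothetical and not part of any PVA library. Falling back to port 5075 when an entry has no port is an assumption, based on the default server port that appears in the debug output elsewhere in this thread. IPv6 literals are not handled.

```python
DEFAULT_PVA_SERVER_PORT = 5075  # assumption: default PVA TCP port seen in this thread's logs


def parse_name_servers(value, default_port=DEFAULT_PVA_SERVER_PORT):
    """Split 'HOST1:PORT1 HOST2' into [(host, port), ...] (hypothetical helper)."""
    servers = []
    for entry in value.split():
        host, _sep, port_text = entry.partition(":")
        # Entries without an explicit port use the default server port.
        port = int(port_text) if port_text else default_port
        servers.append((host, port))
    return servers


if __name__ == "__main__":
    print(parse_name_servers("daq-dss02:11111 10.127.127.2"))
```

An unset or empty variable yields an empty list, which a client could treat as "no TCP searches configured".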

anjohnson commented 10 months ago

@kasemir is there an existing PVA name server?

@sveseli is looking to write one based on pvaPy if there isn't one already out there, but we'd prefer to use something that already exists.

mdavidsaver commented 10 months ago

I feel bound to point out that PVXS already has a more complete "direct connection" (aka. bypassing the search phase entirely). One usage is in pvxlist to connect to the un-searchable server PV. Unlike pvlist in pvAccessCPP, pvxlist is implemented using public API.

Also, fair warning. As I recall pvAccessCPP does not correctly handle reconnection of "direct" channels. So imo. it is not suitable for use with subscriptions specifically and may cause problems with long running clients generally.

sveseli commented 10 months ago

From what I have seen, before this PR a direct connection to a channel was not possible at all: the code relied entirely on UDP searches, and if those did not result in channel discovery, no connection was made. In my tests, this PR handles monitor re-connections correctly, as long as the server comes up on the same machine and same port, regardless of whether UDP search is available or not.

kasemir commented 10 months ago

is there an existing PVA name server?

The Java PVA lib contains a command line demo, https://github.com/ControlSystemStudio/phoebus/blob/master/core/pva/src/test/java/org/epics/pva/server/SearchMonitorDemo.java

It takes a list of PVs and their IP:port on the command line and then replies to searches for those PVs with the provided name. That's of course only usable as a demo for one or two PVs, but it allows testing EPICS_PVA_NAME_SERVERS, so clients go to the name server via TCP, get the IP:port of the IOC and then connect via TCP to the IOC.

The plan was somewhat like this:

  1. Extend the channel finder's IOC tool (reccaster?) to provide the TCP port. I think that's been completed. Channel finder will thus not only know the PV names but also the TCP IP and port.
  2. Next build a name server that starts like that demo but instead of taking info for 1 or 2 PVs from the command line it queries the channel finder.

anjohnson commented 10 months ago

Thanks Kay. Given that the PVA protocol has the ability for a name-server to query the servers for their complete list of PV names I don't think an oracle for PV names would be essential, although it might be good for performance reasons.

There are 2 modes that I could see it using, maybe both at once:

  • When the name-server sees a search request for a PV name that it doesn't recognize it would request that name through its client-side API, and would handle any response by asking the server that answered for all of its PVs.
  • Alternatively/also it could monitor for beacons from new servers coming up, and proactively ask them for their names.

The first mode would be needed for it to work with dynamic servers which could add new PVs at runtime; if it sees a response from a known server it would refresh the list of names for that server.

I'm not quite sure how to handle dynamic servers that can drop PV names; somehow the name-server needs to know that's happened. I suspect the pvlist request can't be issued as a monitor, although that would be really helpful.

It's important for the name-server to cache the requested PV names that it doesn't have a server for, so it doesn't replicate every random request (the CA name-server has such a cache). Supporting a configurable list of regexps that match names to be forwarded and/or to not be forwarded might be a reasonable alternative to the Channel Finder to help with that.
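
To make the caching and regexp idea concrete, here is a minimal sketch of a filter that decides whether an unanswered search should be forwarded. Everything in it (the class name `ForwardFilter`, the `ttl` parameter, the example patterns) is hypothetical, and it is not taken from the CA name-server or from this PR. It only shows one way the two ideas could combine.

```python
import re
import time


class ForwardFilter:
    """Decides whether a name server should forward a search it cannot answer.

    Hypothetical sketch: names that recently went unanswered are suppressed
    for `ttl` seconds, and only names matching an allow pattern are forwarded.
    """

    def __init__(self, allow_patterns, ttl=60.0, clock=time.monotonic):
        self._allow = [re.compile(p) for p in allow_patterns]
        self._ttl = ttl
        self._clock = clock
        self._misses = {}  # PV name -> time the miss was recorded

    def should_forward(self, name):
        if not any(p.fullmatch(name) for p in self._allow):
            return False  # name is outside the configured forward list
        missed_at = self._misses.get(name)
        if missed_at is not None and self._clock() - missed_at < self._ttl:
            return False  # recently searched without an answer
        return True

    def record_miss(self, name):
        self._misses[name] = self._clock()

    def record_hit(self, name):
        self._misses.pop(name, None)
```

The clock is injectable so the expiry can be exercised without sleeping. A real name server would also need to bound the size of the miss table.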

anjohnson commented 10 months ago

@mdavidsaver Unfortunately APS would need several FTEs of effort to convert all our DAQ software to the PVXS API, and we also rely on the plugin abilities currently unique to the pvAccess and pvDatabase libraries to be able to handle the high data rates coming from those DAQ systems. The data distributor plugin lets us accept updates from one data stream and fan them out to multiple clients to be processed in parallel. This PR provides the ability to connect through firewalls.

I do agree with the need for tests.

kasemir commented 10 months ago

Andrew, you're correct, a name server might either rely on some type of name database (Channel Finder, Oracle, ...), or it might build that database itself. It could issue "pvlist" requests, or operate similar to the gateway by sending its own search request and then memorizing the reply.

@shroffk those are options we could consider if we ever get back to working on a name server.

shroffk commented 10 months ago

We could consider a combination of those actions

I have a very basic name server which uses ChannelFinder populated with recsync and the PVA port. https://github.com/ChannelFinder/cfNameserver

I had imagined that the fallback mechanism for the name server, if it fails to find the name resolution in CF, would be to do its own name resolution search.
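
A minimal sketch of that fallback order, under the assumption that both the directory lookup and the search can be expressed as plain callables returning an (IP, port) tuple or None. The names `make_resolver`, `directory_lookup` and `fallback_search` are hypothetical and do not come from cfNameserver.

```python
def make_resolver(directory_lookup, fallback_search):
    """Return resolve(name) that tries the directory first, then falls back."""
    learned = {}  # answers obtained from the fallback search

    def resolve(name):
        address = directory_lookup(name)
        if address is None:
            address = learned.get(name)
        if address is None:
            address = fallback_search(name)
            if address is not None:
                learned[name] = address  # remember it for the next request
        return address

    return resolve


if __name__ == "__main__":
    directory = {"testCounter": ("10.6.13.8", 11111)}
    resolve = make_resolver(directory.get, lambda name: None)
    print(resolve("testCounter"), resolve("unknownPV"))
```

This sketch never expires learned entries, so a real implementation would need some way to notice servers that moved or went away.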

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.82 failed (commit https://github.com/epics-base/pvAccessCPP/commit/1190cb7e1c by @sveseli)

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.83 failed (commit https://github.com/epics-base/pvAccessCPP/commit/d9c8c7b9f4 by @sveseli)

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.84 failed (commit https://github.com/epics-base/pvAccessCPP/commit/606e616ead by @sveseli)

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.85 failed (commit https://github.com/epics-base/pvAccessCPP/commit/8cc23a8eda by @sveseli)

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.86 failed (commit https://github.com/epics-base/pvAccessCPP/commit/368e9f2535 by @sveseli)

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.87 failed (commit https://github.com/epics-base/pvAccessCPP/commit/e953f5b60b by @sveseli)

AppVeyorBot commented 8 months ago

:white_check_mark: Build pvAccessCPP 1.0.88 completed (commit https://github.com/epics-base/pvAccessCPP/commit/1f33448b68 by @sveseli)

mdavidsaver commented 8 months ago

Looking again...

  • Server respects EPICS_PVA_UDP_SENDER_PORT variable; default is 0 (random port), as before

How is EPICS_PVA_UDP_SENDER_PORT meant to be used?

Is this intended to address #159?

  • Specifying EPICS_PVA_ADDR_LIST will result with direct tcp connection attempts using the specified addresses, in addition to the regular udp search

EPICS_PVA_ADDR_LIST is defined as a list of UDP endpoints. How can appropriate TCP ports be deduced from this list without a search phase?

the PR contains a simple name server utility that is capable of discovering PVA servers on the network and polling them for the list of available channels at a regular intervals.

How would this polling scale wrt. bandwidth usage and maximum reply size?

When I think about something like this, I imagine using a subscription to push incremental changes to a server's PV list. (similar to what recsync does)

sveseli commented 8 months ago

Looking again...

  • Server respects EPICS_PVA_UDP_SENDER_PORT variable; default is 0 (random port), as before

How is EPICS_PVA_UDP_SENDER_PORT meant to be used?

Is this intended to address #159?

This is one of the possible solutions for #159, although at this point one can also use tcp searches. By default, the udp sender port is arbitrary (0) as before. If you set this variable, the sender port is configured to that specific value.

  • Specifying EPICS_PVA_ADDR_LIST will result with direct tcp connection attempts using the specified addresses, in addition to the regular udp search

EPICS_PVA_ADDR_LIST is defined as a list of UDP endpoints. How can appropriate TCP ports be deduced from this list without a search phase?

If udp searches work on the network, there is no difference in behavior as direct connection to the server will not happen. If udp searches do not work, we try to connect directly to servers on the list. If no port is specified as part of the address, we take the default PVA server port.

the PR contains a simple name server utility that is capable of discovering PVA servers on the network and polling them for the list of available channels at a regular intervals.

How would this polling scale wrt. bandwidth usage and maximum reply size?

When I think about something like this, I imagine using a subscription to push incremental changes to a server's PV list. (similar to what recsync does)

I suppose the intention was that this works in a way that is as unobtrusive as possible, but I have not done any large scaling tests yet. For APS DAQ system, we have probably on the order of 40-50 PVA servers, which would translate to the same number of "pvlist" calls per polling period, and should result in a relatively low bandwidth usage. Hopefully, @anjohnson can deploy this soon and see how things work.

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.89 failed (commit https://github.com/epics-base/pvAccessCPP/commit/0de1feb60e by @sveseli)

anjohnson commented 8 months ago

Our CA name-servers currently connect to over 800 IOCs, and we're coming up on 2 million record names. Repeatedly polling probably doesn't scale to that many names or servers and would be a waste of bandwidth and power. That won't be a problem for us today or in the near future, though, since we aren't planning on building many IOCs with QSRV right now.

My preference would be for the name-server to be able to subscribe to each server's pvlist and get updates when PVs are added or removed. That obviously isn't necessary for the current IOC with QSRV, but other PVA servers can already be dynamic, and we might allow IOCs to add, remove or rename records at some point.
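
To sketch what a name-server could do with such pushed updates, the example below keeps a name table and applies incremental add/remove changes per server. The update shape (`added` and `removed` lists) is purely an assumption for illustration, and the class `NameTable` is hypothetical. Nothing in this thread says PVA currently offers a pvlist subscription.

```python
class NameTable:
    """Maps PV names to the server address that hosts them (hypothetical sketch)."""

    def __init__(self):
        self._by_name = {}

    def apply_update(self, server, added=(), removed=()):
        """Apply one incremental change pushed by `server`."""
        for name in removed:
            # Only drop the entry if this server is the one that owns it.
            if self._by_name.get(name) == server:
                del self._by_name[name]
        for name in added:
            self._by_name[name] = server

    def drop_server(self, server):
        """Forget every name hosted by a server that went away."""
        self._by_name = {n: s for n, s in self._by_name.items() if s != server}

    def lookup(self, name):
        return self._by_name.get(name)
```

With this shape, the cost of an update is proportional to the number of changed names rather than to the full record list of the server.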

The other change to avoid the need for polling is to have an API for clients to discover when a new beacon (or a beacon anomaly in CA parlance) is seen, indicating a new server coming online. The name-server could then introspect new servers as soon as it sees them, and there's no polling delay before it knows about them. I believe PVA's beacon behavior is slightly different from CA's, though. Is it as reliable as CA in the event that a network segment gets disconnected and reconnected later without the servers beyond it rebooting?

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.90 failed (commit https://github.com/epics-base/pvAccessCPP/commit/8439383ffb by @sveseli)

AppVeyorBot commented 8 months ago

:x: Build pvAccessCPP 1.0.91 failed (commit https://github.com/epics-base/pvAccessCPP/commit/1ba616cee1 by @sveseli)

AppVeyorBot commented 7 months ago

:x: Build pvAccessCPP 1.0.92 failed (commit https://github.com/epics-base/pvAccessCPP/commit/1568875dfe by @sveseli)

AppVeyorBot commented 7 months ago

:white_check_mark: Build pvAccessCPP 1.0.93 completed (commit https://github.com/epics-base/pvAccessCPP/commit/d8b541884e by @sveseli)

AppVeyorBot commented 7 months ago

:white_check_mark: Build pvAccessCPP 1.0.94 completed (commit https://github.com/epics-base/pvAccessCPP/commit/cd5a90ca54 by @sveseli)

AppVeyorBot commented 7 months ago

:x: Build pvAccessCPP 1.0.95 failed (commit https://github.com/epics-base/pvAccessCPP/commit/91e0d54d7f by @sveseli)

AppVeyorBot commented 7 months ago

:x: Build pvAccessCPP 1.0.96 failed (commit https://github.com/epics-base/pvAccessCPP/commit/c6266331f2 by @sveseli)

AppVeyorBot commented 7 months ago

:white_check_mark: Build pvAccessCPP 1.0.97 completed (commit https://github.com/epics-base/pvAccessCPP/commit/71514533a4 by @sveseli)

mdavidsaver commented 7 months ago

I think this PR is getting too large, with several distinct changes.

Adding support for EPICS_PVA_NAME_SERVERS is a reasonable addition, and I think should be uncontentious. Can this change be extracted as a separate PR?

As for the "Direct Connection" change(s). Do I understand correctly that the goal here is to bypass the search phase entirely? That is, send CREATE_CHANNEL without first receiving a positive SEARCH_REPLY?

If so, I can understand fixing bugs in the client so that eg. direct channels created through the existing API will reconnect automatically.

Adding support for URI-like direct connect syntax to pvget and friends might also be acceptable. (I am not enthusiastic, but I suspect at least one person would be)

However, I think that beginning to treat EPICS_PVA_ADDR_LIST as a list of TCP addresses would be unacceptable. The *_ADDR_LIST variables for both PVA and CA have always been lists of UDP endpoints, including multicast and broadcast addresses. Changing this definition now seems to me likely to cause problems with other code and/or user expectations.

wrt. EPICS_PVA_UDP_SENDER_PORT. I am not sure this would actually help with #159, and would anyway be a burden on sysadmins. I think a proper solution to #159 is for a server to always send a search reply back through the same socket which received that search request. This should be a (tedious) exercise in interface re-plumbing. One which I think will anyway be necessary to handle EPICS_PVA_NAME_SERVERS (search over TCP must reply over the same connection).

wrt. pvans, I think it is too soon to include something like this in pvAccessCPP. It sounds to me like this tool is at the stage of a small-scale prototype.

anjohnson commented 7 months ago

@mdavidsaver We will do this:

  1. Comment out the direct connection code and the use of EPICS_PVA_ADDR_LIST as a list of TCP addresses. The direct connection code was already present, and reconnections did work using it, but we don't actually need it.
  2. Separate out the name-server program and internal classes to implement the new API it uses. This would become a separate PR.
  3. What would it take to either allow or drop the EPICS_PVA_UDP_SENDER_PORT parameter? Issues #159 and #128 may have said "here be dragons" on duplicating the port number, but how do we discover what's a real problem? Would you be happy to keep it in if we just changed the default value from zero (use a random reply port) to the value of the EPICS_PVA_BROADCAST_PORT parameter?

Is that likely to be acceptable, or are there more changes you'd like to see to merge this functionality?

Thanks, Andrew & Sinisa

mdavidsaver commented 7 months ago

2. name-server program ... the new API it uses

A proposed design of any new/extended APIs seems like a good starting point. eg. beginning with the question of where the current ChannelProvider interface is insufficient?

Also, I would strongly encourage that Sinisa and Co. study the design of recsync, which is able to avoid eg. repeatedly polling the record list by maintaining a persistent connection with each server.

fyi. redesigning recsync around PVA in addition to (and eventually instead of) my custom protocol would be a useful improvement. Since it is in active use, the pluses and minuses of that data model are known.

mdavidsaver commented 7 months ago

3. What would it take to either allow or drop the EPICS_PVA_UDP_SENDER_PORT parameter ...

Maybe we could have an example with Linux (iptables or nft)? How can this new parameter be used to configure a firewall on a host with a variable number of PVA clients and servers? eg. must each client process be manually assigned a unique sender port?

... may have said "here be dragons" on duplicating the port number ...

These "dragons" lurk in the dark sprawling internal APIs of pvAccessCPP, and present an entirely artificial barrier between the code wishing to send a UDP search reply, and the socket on which the corresponding search request was received.

imo. there is no need for a separate sendTransport socket. The best course of action (short of using PVXS) would be to eliminate that object, redesigning the internal API so that UDP replies are always sent through the socket which received the request.

A painful task, but I think preferable to burdening every admin with additional configuration.
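
For illustration only, here is the "reply through the receiving socket" idea with plain UDP sockets on loopback. This is not pvAccessCPP code and the payloads are placeholders rather than real PVA messages. The function name `serve_one_search` is hypothetical.

```python
import socket


def serve_one_search(sock):
    """Receive one datagram and answer it through the same socket."""
    data, client_addr = sock.recvfrom(1024)
    # Replying on the receiving socket means the reply's source port is the
    # port the client sent to, so no extra sender port has to be configured.
    sock.sendto(b"reply:" + data, client_addr)


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # loopback, OS-chosen port, demo only
    server.settimeout(2.0)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2.0)
    client.sendto(b"search testCounter", server.getsockname())
    serve_one_search(server)
    reply, source = client.recvfrom(1024)
    # The reply arrives from the very address the request was sent to.
    print(reply, source == server.getsockname())
    server.close()
    client.close()
```

Because the reply comes from the address the request went to, a stateful firewall can match it to the outgoing request without any extra per-process port assignment.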

AppVeyorBot commented 7 months ago

:white_check_mark: Build pvAccessCPP 1.0.98 completed (commit https://github.com/epics-base/pvAccessCPP/commit/ae0a281dcf by @sveseli)

sveseli commented 7 months ago

Please add some test coverage.

Tests have been added.

sveseli commented 7 months ago

@mdavidsaver We will do this:

1. Comment out the direct connection code, and using EPICS_PVA_ADDR_LIST as a list of TCP addresses. The direct connection code was already present, and reconnections did work using it, but we don't actually need it.

2. Separate out the name-server program and internal classes to implement the new API it uses. This would become a separate PR.

3. What would it take to either allow or drop the `EPICS_PVA_UDP_SENDER_PORT` parameter — issues [PVA Server's use of random 'sendTransport' UDP port makes it impractical for gateway #159](https://github.com/epics-base/pvAccessCPP/issues/159) and [Client and server unintentionally sharing a UDP source port #128](https://github.com/epics-base/pvAccessCPP/issues/128) may have said "here be dragons" on duplicating the port number, how to discover it what's a real problem? Would you be happy to keep it in if we just changed the default value from zero (use a random reply port) to the value of the `EPICS_PVA_BROADCAST_PORT` parameter?

Is that likely to be acceptable, or are there more changes you'd like to see to merge this functionality?

Thanks, Andrew & Sinisa

The only thing that is left as part of the PR is the support for tcp channel searches via the EPICS_PVA_NAME_SERVERS variable. If this is not set, the code should behave as before.

shroffk commented 7 months ago

Sorry about the accidental close

AppVeyorBot commented 7 months ago

:white_check_mark: Build pvAccessCPP 1.0.99 completed (commit https://github.com/epics-base/pvAccessCPP/commit/1b3a3f5095 by @sveseli)

anjohnson commented 7 months ago

Core Group: MAD will review this and provide feedback before the next meeting.

AppVeyorBot commented 7 months ago

:white_check_mark: Build pvAccessCPP 1.0.103 completed (commit https://github.com/epics-base/pvAccessCPP/commit/9d7930478a by @sveseli)

AppVeyorBot commented 6 months ago

:white_check_mark: Build pvAccessCPP 1.0.104 completed (commit https://github.com/epics-base/pvAccessCPP/commit/c81093cb2c by @sveseli)

AppVeyorBot commented 6 months ago

:white_check_mark: Build pvAccessCPP 1.0.105 completed (commit https://github.com/epics-base/pvAccessCPP/commit/84eb3d6400 by @sveseli)

mdavidsaver commented 6 months ago

With @sveseli's recent pruning, I think I see what is going on wrt. name server handling. A couple of questions to check my understanding.

Why are connections to name servers treated differently to other connections?

Is each client context limited to connecting to one name server at a time?

wrt. releaseNameServerTransport(). This looks like manual ref. counting. Can it be avoided?

wrt. the design around getNameServerSearchTransport(). Can you avoid looping for blocking connect() from the search timer callback?

eg. doing so stalls searching while name server(s) are offline.

$ time ./bin/linux-x86_64/pvget foo
Timeout
foo 
real    0m5.029s
$ time EPICS_PVA_NAME_SERVERS=10.127.127.2 ./bin/linux-x86_64/pvget foo
Timeout
foo 
real    0m9.227s
$ time EPICS_PVA_NAME_SERVERS="10.127.127.2 10.127.127.3" ./bin/linux-x86_64/pvget foo
Timeout
foo 
real    0m12.285s

Compare with:

$ time EPICS_PVA_NAME_SERVERS="10.127.127.2 10.127.127.3" ./bin/linux-x86_64/pvxget foo
2024-05-19T14:17:47.878231057 ERR pvxs.tcp.io connection to Server 10.127.127.3:5075 closed with socket error 113 : No route to host
2024-05-19T14:17:47.878889508 ERR pvxs.tcp.io connection to Server 10.127.127.2:5075 closed with socket error 113 : No route to host
Timeout with 1 outstanding

real    0m5.031s

AppVeyorBot commented 6 months ago

:white_check_mark: Build pvAccessCPP 1.0.106 completed (commit https://github.com/epics-base/pvAccessCPP/commit/8fc112b9b2 by @sveseli)

sveseli commented 6 months ago

The blocking call for name server connections has been removed. Name server connections now use a separate set of timers, so that they can be established at the same time. In the examples below I used non-existent hosts:

$ time pvget foo
Timeout
foo 
real    0m5.021s
user    0m0.011s
sys     0m0.015s

$ time EPICS_PVA_NAME_SERVERS="192.168.0.112:11111" pvget foo
Timeout
foo 
real    0m5.128s
user    0m0.010s
sys     0m0.023s

$ time EPICS_PVA_NAME_SERVERS="192.168.0.112:11111 192.168.0.113:22222" pvget foo
Timeout
foo 
real    0m5.113s
user    0m0.000s
sys     0m0.021s

The client can now establish connections to multiple name servers at the same time, and they are released as soon as they are no longer needed or aren't useful (hence the methods that release those connections). Name server connections reuse existing classes (e.g. TransportRegistry, BlockingTCPConnector, etc.) as much as possible.
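
The effect seen in the timings above (total wait close to a single timeout, regardless of how many name servers are unreachable) can be sketched with concurrent connection attempts. The example uses Python's asyncio as a stand-in and is not how this PR implements it, which uses timers inside pvAccessCPP. The function names are hypothetical.

```python
import asyncio


async def _try_connect(host, port, timeout):
    """Return True if a TCP connection to (host, port) opens within `timeout`."""
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except (OSError, asyncio.TimeoutError):
        return False  # refused, unreachable, or timed out
    writer.close()
    await writer.wait_closed()
    return True


async def probe_name_servers(servers, timeout=5.0):
    """Attempt all connections concurrently; total wait is about one timeout."""
    results = await asyncio.gather(
        *(_try_connect(host, port, timeout) for host, port in servers)
    )
    return dict(zip(servers, results))


if __name__ == "__main__":
    servers = [("192.168.0.112", 11111), ("192.168.0.113", 22222)]
    print(asyncio.run(probe_name_servers(servers, timeout=1.0)))
```

Since every attempt shares the same deadline, adding more unreachable name servers does not lengthen the overall wait, unlike sequential blocking connects.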

AppVeyorBot commented 5 months ago

:white_check_mark: Build pvAccessCPP 1.0.107 completed (commit https://github.com/epics-base/pvAccessCPP/commit/5f53b171aa by @sveseli)

sveseli commented 5 months ago

@mdavidsaver Is there anything else you would like to see done or modified before this PR can be merged?

sveseli commented 4 months ago

@mdavidsaver @anjohnson Is this PR okay to be merged, or would you like to see some other changes? I would like to make sure that this goes into the next release, if that would be possible.

mdavidsaver commented 4 months ago

... Is this PR okay to be merged ...

No. There are several aspects of this proposed design which I do not like. eg. blocking connect() not on a dedicated thread. Pointing these out piecemeal does not seem to be moving us towards a resolution, but I have not found the time to write out an alternate design.