Registrar aliveness problem and text

Ran across a generic issue (IMHO) of the stateless proxy while updating BRSKI discovery... I think:

Example:

A proxy, using whatever form of discovery, discovers e.g. 2 possible registrars (this does not need anything advanced from BRSKI discovery; whatever minimum discovery we have in the stateless proxy draft will suffice).

When the pledge sends a packet to the proxy, the proxy forwards it to the registrar it selected, e.g. registrar 1.

Unfortunately, that registrar is dead. For example, DNS-SD may be configured with a DNS hold time of several minutes for efficiency reasons, so if the server dies during this period, the client has to discover this through its connection attempts, conclude that the server is dead, and then select the next-best server.

Here is some initial text intended to solve this for the draft:

When a proxy selects one out of two or more possible registrars through some discovery mechanism, including but not limited to the ones described in this specification, that registrar may not be alive/responsive because the discovered information is stale. For example, in DNS-SD the information may be minutes old while still within its TTL.

Proxies SHOULD automatically and promptly switch to the next-best registrar when they observe a non-responsive registrar and have discovered alternative registrar(s).

In stateful mode, registrar unresponsiveness can be discovered through timeouts of TCP connection attempts, and the proxy can then connect to the next-best discovered registrar, transparently to the pledge. If the connection was already established and the registrar becomes unresponsive, the proxy MUST close the connection from the pledge. When the pledge then re-attempts to connect, the proxy needs to connect to the next-best discovered registrar.
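
To make the stateful case concrete, here is a rough Python sketch of "try the next-best registrar when the connection attempt fails". It is only an illustration, not proposed draft text: the function name connect_to_first_responsive, the 3 second timeout and the injectable connect parameter are my own assumptions, and a real proxy would also have to relay data between pledge and registrar.

```python
import socket


def connect_to_first_responsive(registrars, timeout=3.0,
                                connect=socket.create_connection):
    """Try discovered registrars in preference order.

    `registrars` is a list of (host, port) tuples, best first.
    Returns (connection, address) for the first registrar that accepts.
    The name, the timeout value and the `connect` hook are hypothetical.
    """
    for address in registrars:
        try:
            # OSError covers timeouts, refused connections and unreachable hosts.
            return connect(address, timeout=timeout), address
        except OSError:
            continue  # registrar looks dead, fall through to the next-best one
    raise ConnectionError("no discovered registrar is responsive")
```

The pledge never sees the failed attempts. It only sees a longer connection setup time, bounded by the timeout multiplied by the number of dead registrars ahead of the first live one.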

In stateless mode, aliveness SHOULD be supported using a stateless method. For example, the proxy can maintain a count of packets forwarded to the discovered registrar within the last 10 seconds. If no packets are received back from the registrar for 3 consecutive periods in which the proxy did forward packets to the registrar, then the registrar should be considered to be unresponsive and the next best registrar should be used.
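
The counting scheme above could be sketched roughly as follows in Python. This is a minimal illustration under assumptions, not proposed draft text: the class and method names are made up, and I assume that periods in which the proxy forwarded nothing are simply skipped (they neither count toward nor reset the 3 silent periods), which the text above does not spell out.

```python
from dataclasses import dataclass

PERIOD_SECONDS = 10       # period length from the text above
MAX_SILENT_PERIODS = 3    # silent periods before declaring unresponsiveness


@dataclass
class RegistrarAliveness:
    """Per-registrar counters; hypothetical names, no per-pledge state."""
    forwarded: int = 0       # packets sent to the registrar this period
    received: int = 0        # packets received back this period
    silent_periods: int = 0  # consecutive forwarding periods with no reply

    def on_forward(self) -> None:
        self.forwarded += 1

    def on_receive(self) -> None:
        self.received += 1

    def end_period(self) -> bool:
        """Call once every PERIOD_SECONDS.

        Returns True while the registrar is still considered responsive.
        """
        if self.received > 0:
            self.silent_periods = 0
        elif self.forwarded > 0:
            self.silent_periods += 1
        # Assumption: idle periods (nothing forwarded) leave the count as is.
        self.forwarded = 0
        self.received = 0
        return self.silent_periods < MAX_SILENT_PERIODS
```

The state kept here is three integers per registrar, independent of the number of pledges, so the proxy stays stateless with respect to pledges. When end_period() returns False, the proxy would switch to the next-best registrar and start a fresh counter for it.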

ICMP/ICMPv6 messages from the registrar indicating non-responsiveness of the registrar (such as port unreachable) SHOULD equally lead to using the next best registrar.

Some notes here:

The described issue looks similar to how a Pledge can discover multiple Join Proxies and needs to "switch" to a new Join Proxy if the one it selected isn't making any progress. See RFC 8995 Section 4.1 (3 paragraphs on the topic).
Current draft only defines CoAP and GRASP Registrar discovery - if DNS-SD is to be added, this would need to be decided first I think. I'm inclined to actually remove GRASP (since that's only securely applicable to ANIs) and only support CoAP.
Current draft assumes only one Registrar is available/selected. We could stay with this assumption to keep things simple, and handle multi-Registrar cases in the future (e.g. in brski-discovery). In practice, a Join Proxy might not even need discovery if the IP address of the Registrar is provided as network configuration in the mesh network. Or set using some DNS/DHCP/ND Option.
Doing multi-Registrar in this way adds a lot of trouble and complications on the JP that we would need to describe/define.
Noting that if Registrars are discovered using mDNS (instead of infrastructure-managed DNS-SD) there's a higher risk of entities posing as Registrar, where the JP might select the attacker as its Registrar.
The constrained JP draft assumes data is sent to the Registrar over UDP (not TCP).

anima-wg / constrained-join-proxy
