Open leonardb opened 2 years ago
I don't know why 1026 doesn't show up in the documentation, but the description in the code is just "Network connection failed" so that's not really any help either.
The _bytes
variants don't seem to be documented, but glancing through the C client code it does seem like they're still available. What version of FDB are you using? Back in 6.2.x I believe there was a rewrite of the TLS stack.
Hi @kocolosk
This was testing with 6.2.28-1 (and now with 6.3.23 as well)
Noticed that in fact tls_ca_path
is currently not supported, only tls_ca_bytes
. erlfdb will error when trying to use path.
In erlfdb_nif the following tls ca/key/cert options are types for network_option()
| tls_cert_bytes
| tls_cert_path
| tls_key_bytes
| tls_key_path
| tls_ca_bytes
| tls_ca_path
But, tls_ca_path
does not exist in erlfdb_network_set_option
in main.c, while all the others are.
^^ This is probably an oversight. The following patch adds it to network options handling
index 8cb03b0..fc27e00 100644
--- a/c_src/main.c
+++ b/c_src/main.c
@@ -488,6 +488,8 @@ erlfdb_network_set_option(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
option = FDB_NET_OPTION_CLIENT_BUGGIFY_SECTION_FIRED_PROBABILITY;
} else if(IS_ATOM(argv[0], tls_ca_bytes)) {
option = FDB_NET_OPTION_TLS_CA_BYTES;
+ } else if(IS_ATOM(argv[0], tls_ca_path)) {
+ option = FDB_NET_OPTION_TLS_CA_PATH;
} else if(IS_ATOM(argv[0], tls_password)) {
option = FDB_NET_OPTION_TLS_PASSWORD;
I've also had a look at a few of the 'blessed' language bindings at https://github.com/apple/foundationdb/tree/main/bindings and they all seem to support the bytes
options. Also seems they're all there in the native api code. https://github.com/apple/foundationdb/blob/1800259d7cb35be79541b848ddb83fd62d1ddd15/fdbclient/NativeAPI.actor.cpp#L2031-L2063
Good catch @leonardb. I think we just need to add the option. It looks like we validate the network options so it's surprising it didn't blow up at https://github.com/apache/couchdb-erlfdb/blob/main/c_src/main.c#L514 from return enif_make_badarg(env)
if you try to set that option.
https://github.com/apache/couchdb-erlfdb/blob/main/src/erlfdb_nif.erl#L105-L106
@kocolosk is right that we may have to wrangle the correct API versions here if it's not documented or was available only earlier/later we could resort to ifdefs possibly.
BTW did not mean to close the issue, sorry.
I'm really just worried about the tls_cert_bytes
and tls_key_bytes
options not working at all.
They're accepted when setting the network options (no errors) but obviously 'something' is not working as erlfdb is not connecting properly to the cluster. Maybe it's a bug in the C lib itself, but that's beyond my capabilities to track down.
Enabling trace on the connection gives very limited info. All I can see is a connection failure.
<?xml version="1.0"?>
<Trace>
<Event Severity="10" Time="1643300975.803927" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TRACE_ENABLE" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804052" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_CA_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804202" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_KEY_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804275" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_CERT_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805842" DateTime="2022-01-27T16:29:35Z" Type="Net2Starting" ID="0000000000000000" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.806136" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="SUPPORTED_CLIENT_VERSIONS" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805868" DateTime="2022-01-27T16:29:35Z" Type="Net2Running" ID="0000000000000000" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805868" DateTime="2022-01-27T16:29:35Z" Type="ProcessTimeOffset" ID="0000000000000000" ProcessTime="1643300975.807528" SystemTime="1643300975.807528" OffsetFromSystemTime="-0.000001" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="ClientStart" ID="0000000000000000" SourceVersion="47b9a81d1c10897c863098fe2d66e827fed0d239" Version="6.3.23" PackageName="6.3" ClusterFile="/opt/mfdb/test/fdb.local.cluster" ConnectionString="foo:bar@10.0.0.205:4500:tls" ActualTime="1643300975" ApiVersion="630" ImageOffset="0x7ff1b2c6b000" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" TrackLatestType="Original" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="MachineLoadDetail" ID="0000000000000000" User="31710408" Nice="169636" System="8478089" Idle="475535396" IOWait="3923774" IRQ="0" SoftIRQ="1135099" Steal="0" Guest="0" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="/etc/foundationdb/fdb.pem" KeyPath="/etc/foundationdb/fdb.pem" HasPassword="0" VerifyPeers="Check.Valid=1" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="NotifyAddressHealthy" ID="0000000000000000" SuppressedEventCount="0" Address="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.812328" DateTime="2022-01-27T16:29:35Z" Type="CodeCoverage" ID="0000000000000000" File="/root/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbclient/ReadYourWrites.actor.cpp" Line="1357" Condition="true" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.812701" DateTime="2022-01-27T16:29:35Z" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" PeerReferences="2" FailureStatus="OK" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="20" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="N2_ConnectHandshakeError" ID="aac42de923f58eed" SuppressedEventCount="0" ErrorCode="104" Message="Connection reset by peer" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="ConnectionTimedOut" ID="aac42de923f58eed" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="ConnectionClosed" ID="aac42de923f58eed" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
I see the logs are referring to paths Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="/etc/foundationdb/fdb.pem" KeyPath="/etc/foundationdb/fdb.pem" .../>
did it find both or are those just default values?
To separate the values being mismanaged in the erlfdb erlang/c code vs fdb client not handling it properly see if you can pass the exact values via the environment variables https://apple.github.io/foundationdb/tls.html
This did not help so I decided to just nuke my entire local DB and start from scratch.
After setting up a fresh install of 6.3.23 with new certs it appears that passing the _bytes
variants works fine.
So, not a bug and apologies for the distraction.
I do still have an open question about the ability to defer loading and initialization of the NIF though.
Currently I have a branch which:
application:env(erlfdb, init, auto)
and if it's anything but manual
if continues the init as usual.init_manual/1
function from erlfdb_nif
which allows me to manually init with all the cert detailsWould this be considered as a contribution? It does not change or alter default behavior but it does give consumers the ability to control initialization and network settings outside of settings in sys.config
Fascinating. I was just going to comment that I cannot seem to get the _bytes
variants to work in either Erlang or Python, so now I wonder what I was doing wrong!
"I wish I knew....." what I was doing wrong previously. All I can add is that when I added {trace_enable, <<"/tmp/">>}
to network options to have a separate trace file for the client connection I was seeing 'tls' errors and the KeyPath="/etc/foundationdb/fdb.pem"
@nickva pointed out seemed very off since I don't even have that file on my machine
That sent me down a rabbit-hole of checking with different settings and keys and finally re-installing/initializing with newly generated keys, at which point I discovered you now have to manually add all the key settings to the backup_agent command in foundationdb.conf to stop that throwing a ton of errors.
Once I had a 6.3.23 fresh cluster up and could connect, erlfdb just started working with all the _bytes
options.
I honestly wish I knew why though
Also added {api_version, 630}
to the erlfdb env in my config
OK I figured out the issue on my side, just some whitespace issues with the way I was producing the byte strings for the key and cert to pass to the application environment.
Agreed that tls_ca_path
was just an oversight, clearly that should exist.
Regarding the loading and initialization behavior for the NIF ... I'm wondering if we ought to just always defer the part that sets up the network thread until the first call to erlfdb:open/1
. That seems to be more in line with e.g. the behavior of the Python bindings. Might be a little work around the error handling for the sentinel that ensures we only attempt to set up the network thread once, but not a big deal.
FDB allows for:
tls_ca_path
tls_ca_bytes
tls_key_path
tls_key_bytes
tls_cert_path
tls_cert_bytes
When erlfdb inits using the
path
env values I can connect to the cluster and use the DB. If I use thebytes
options erlfdb 'seems' to connect and returns a database whenerlfdb:open(Cluster).
, but all further operations just hang. A simple transactionalset
ends up stuck in erlfdb:wait and never returns.The only error I see in the fdb trace logs is a 1026 network error which is of course helpfully undocumented.
Rationale: I currently have a branch which does a lazy init of erlfdb_nif and passes in the network options as they are pulled from a key server on init of the parent application. Currently I'm having to store keys to disk and pass in the
path
variants, but would definitely prefer that keys do no get saved to disk at all.