apache / couchdb-erlfdb

Erlang API for FoundationDB
https://www.foundationdb.org
Apache License 2.0
28 stars 20 forks source link

Using `bytes` options for TLS rather than `path` does not appear to work #52

Open leonardb opened 2 years ago

leonardb commented 2 years ago

FDB allows for:

tls_ca_path tls_ca_bytes tls_key_path tls_key_bytes tls_cert_path tls_cert_bytes

When erlfdb inits using the path env values I can connect to the cluster and use the DB. If I use the bytes options erlfdb 'seems' to connect and returns a database when erlfdb:open(Cluster)., but all further operations just hang. A simple transactional set ends up stuck in erlfdb:wait and never returns.

The only error I see in the fdb trace logs is a 1026 network error which is of course helpfully undocumented.

Rationale: I currently have a branch which does a lazy init of erlfdb_nif and passes in the network options as they are pulled from a key server on init of the parent application. Currently I'm having to store keys to disk and pass in the path variants, but would definitely prefer that keys do no get saved to disk at all.

kocolosk commented 2 years ago

I don't know why 1026 doesn't show up in the documentation, but the description in the code is just "Network connection failed" so that's not really any help either.

The _bytes variants don't seem to be documented, but glancing through the C client code it does seem like they're still available. What version of FDB are you using? Back in 6.2.x I believe there was a rewrite of the TLS stack.

leonardb commented 2 years ago

Hi @kocolosk

This was testing with 6.2.28-1 (and now with 6.3.23 as well)

Noticed that in fact tls_ca_path is currently not supported, only tls_ca_bytes. erlfdb will error when trying to use path.

In erlfdb_nif the following tls ca/key/cert options are types for network_option()

    | tls_cert_bytes
    | tls_cert_path
    | tls_key_bytes
    | tls_key_path
    | tls_ca_bytes
    | tls_ca_path

But, tls_ca_path does not exist in erlfdb_network_set_option in main.c, while all the others are. ^^ This is probably an oversight. The following patch adds it to network options handling

index 8cb03b0..fc27e00 100644
--- a/c_src/main.c
+++ b/c_src/main.c
@@ -488,6 +488,8 @@ erlfdb_network_set_option(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
         option = FDB_NET_OPTION_CLIENT_BUGGIFY_SECTION_FIRED_PROBABILITY;
     } else if(IS_ATOM(argv[0], tls_ca_bytes)) {
         option = FDB_NET_OPTION_TLS_CA_BYTES;
+    } else if(IS_ATOM(argv[0], tls_ca_path)) {
+        option = FDB_NET_OPTION_TLS_CA_PATH;
     } else if(IS_ATOM(argv[0], tls_password)) {
         option = FDB_NET_OPTION_TLS_PASSWORD;

I've also had a look at a few of the 'blessed' language bindings at https://github.com/apple/foundationdb/tree/main/bindings and they all seem to support the bytes options. Also seems they're all there in the native api code. https://github.com/apple/foundationdb/blob/1800259d7cb35be79541b848ddb83fd62d1ddd15/fdbclient/NativeAPI.actor.cpp#L2031-L2063

nickva commented 2 years ago

Good catch @leonardb. I think we just need to add the option. It looks like we validate the network options so it's surprising it didn't blow up at https://github.com/apache/couchdb-erlfdb/blob/main/c_src/main.c#L514 from return enif_make_badarg(env) if you try to set that option.

https://github.com/apache/couchdb-erlfdb/blob/main/src/erlfdb_nif.erl#L105-L106

@kocolosk is right that we may have to wrangle the correct API versions here if it's not documented or was available only earlier/later we could resort to ifdefs possibly.

kocolosk commented 2 years ago

BTW did not mean to close the issue, sorry.

leonardb commented 2 years ago

I'm really just worried about the tls_cert_bytes and tls_key_bytes options not working at all.

They're accepted when setting the network options (no errors) but obviously 'something' is not working as erlfdb is not connecting properly to the cluster. Maybe it's a bug in the C lib itself, but that's beyond my capabilities to track down.

Enabling trace on the connection gives very limited info. All I can see is a connection failure.

<?xml version="1.0"?>
<Trace>
<Event Severity="10" Time="1643300975.803927" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TRACE_ENABLE" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804052" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_CA_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804202" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_KEY_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804275" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_CERT_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805842" DateTime="2022-01-27T16:29:35Z" Type="Net2Starting" ID="0000000000000000" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.806136" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="SUPPORTED_CLIENT_VERSIONS" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805868" DateTime="2022-01-27T16:29:35Z" Type="Net2Running" ID="0000000000000000" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805868" DateTime="2022-01-27T16:29:35Z" Type="ProcessTimeOffset" ID="0000000000000000" ProcessTime="1643300975.807528" SystemTime="1643300975.807528" OffsetFromSystemTime="-0.000001" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="ClientStart" ID="0000000000000000" SourceVersion="47b9a81d1c10897c863098fe2d66e827fed0d239" Version="6.3.23" PackageName="6.3" ClusterFile="/opt/mfdb/test/fdb.local.cluster" ConnectionString="foo:bar@10.0.0.205:4500:tls" ActualTime="1643300975" ApiVersion="630" ImageOffset="0x7ff1b2c6b000" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" TrackLatestType="Original" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="MachineLoadDetail" ID="0000000000000000" User="31710408" Nice="169636" System="8478089" Idle="475535396" IOWait="3923774" IRQ="0" SoftIRQ="1135099" Steal="0" Guest="0" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="/etc/foundationdb/fdb.pem" KeyPath="/etc/foundationdb/fdb.pem" HasPassword="0" VerifyPeers="Check.Valid=1" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="NotifyAddressHealthy" ID="0000000000000000" SuppressedEventCount="0" Address="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.812328" DateTime="2022-01-27T16:29:35Z" Type="CodeCoverage" ID="0000000000000000" File="/root/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbclient/ReadYourWrites.actor.cpp" Line="1357" Condition="true" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.812701" DateTime="2022-01-27T16:29:35Z" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" PeerReferences="2" FailureStatus="OK" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="20" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="N2_ConnectHandshakeError" ID="aac42de923f58eed" SuppressedEventCount="0" ErrorCode="104" Message="Connection reset by peer" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="ConnectionTimedOut" ID="aac42de923f58eed" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="ConnectionClosed" ID="aac42de923f58eed" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
nickva commented 2 years ago

I see the logs are referring to paths Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="/etc/foundationdb/fdb.pem" KeyPath="/etc/foundationdb/fdb.pem" .../> did it find both or are those just default values?

To separate the values being mismanaged in the erlfdb erlang/c code vs fdb client not handling it properly see if you can pass the exact values via the environment variables https://apple.github.io/foundationdb/tls.html

leonardb commented 2 years ago

This did not help so I decided to just nuke my entire local DB and start from scratch.

After setting up a fresh install of 6.3.23 with new certs it appears that passing the _bytes variants works fine.

So, not a bug and apologies for the distraction.

I do still have an open question about the ability to defer loading and initialization of the NIF though.

Currently I have a branch which:

Would this be considered as a contribution? It does not change or alter default behavior but it does give consumers the ability to control initialization and network settings outside of settings in sys.config

kocolosk commented 2 years ago

Fascinating. I was just going to comment that I cannot seem to get the _bytes variants to work in either Erlang or Python, so now I wonder what I was doing wrong!

leonardb commented 2 years ago

"I wish I knew....." what I was doing wrong previously. All I can add is that when I added {trace_enable, <<"/tmp/">>} to network options to have a separate trace file for the client connection I was seeing 'tls' errors and the KeyPath="/etc/foundationdb/fdb.pem" @nickva pointed out seemed very off since I don't even have that file on my machine

That sent me down a rabbit-hole of checking with different settings and keys and finally re-installing/initializing with newly generated keys, at which point I discovered you now have to manually add all the key settings to the backup_agent command in foundationdb.conf to stop that throwing a ton of errors.

Once I had a 6.3.23 fresh cluster up and could connect, erlfdb just started working with all the _bytes options.

I honestly wish I knew why though Also added {api_version, 630} to the erlfdb env in my config

kocolosk commented 2 years ago

OK I figured out the issue on my side, just some whitespace issues with the way I was producing the byte strings for the key and cert to pass to the application environment.

Agreed that tls_ca_path was just an oversight, clearly that should exist.

Regarding the loading and initialization behavior for the NIF ... I'm wondering if we ought to just always defer the part that sets up the network thread until the first call to erlfdb:open/1. That seems to be more in line with e.g. the behavior of the Python bindings. Might be a little work around the error handling for the sentinel that ensures we only attempt to set up the network thread once, but not a big deal.