Attn: @cjjdespres
This would be optionally creating an SSL connection and communicating over that, like the JITServer does?
Yes that's the intent
Since Listener.cpp already creates an SSL context for the JITServer, that context could be shared with the metrics thread. That would require that the SSL keys be the same between the metrics and JITServer threads, though.

> That would require that the SSL keys be the same between the metrics and JITServer threads, though.
That's ok if it works from Prometheus' point of view. A JVM client connecting securely to JITServer needs `-XX:JITServerSSLRootCerts=cert.pem`. Will this work the same for Prometheus?
I'm not that familiar with Prometheus, but the FAQ and the linked documentation seem to suggest that it would, if I understand it correctly: the Prometheus server would be acting as a client when scraping, and so would be using the CA certificate. It looks like the scrape request config also allows you to set the SSL key and cert, but I'm not sure if that's necessary.
The key is the scrape_config, which has a field:
# Configures the scrape request's TLS settings.
tls_config:
[ [<tls_config>](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tls_config) ]
That link leads to:
# CA certificate to validate API server certificate with.
[ ca_file: [<filename>](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#filename) ]
which I am hoping will play the same role as `JITServerSSLRootCerts` does for a JVM client, but I cannot guarantee it.
I noticed in the OpenSSL documentation that `BIO_read` can still block even if `poll` has been used. That might be an issue given the current `poll`/`read` structure of the server. That page does say that non-blocking IO can be used with `poll` instead.
I was also curious about where a good place to free the `SSL_CTX` would be. Right now I've changed things so it's created in `startJITServer` in `HookedByTheJIT` and stored in `CompilationInfo`, but it's still being freed in `serveRemoteCompilationRequests` in `Listener`. I think it might not be a good idea to leave it there now that the context is also going to be used in the metrics server, since the threads may stop at different times.
`BIO_read` being sometimes blocking is not an issue for the compilation requests. Before the poll was added, we had blocking operations. `poll` was added to be able to shut down nicely, so the only thing that can happen is to delay the shutdown until the data comes.
For the metrics exporter this could sometimes be problematic. We say that we allow 4 concurrent requests through `poll`. If, for whatever reason, one of the `BIO_read` calls blocks forever, we lose the metrics exporter functionality. It may be worth looking into async IO, though that would increase the complexity of the implementation.
Regarding the `SSL_CTX`, are you planning on reusing the context for the metrics exporter? Where would the `SSL_CTX` be stored in this case? We also don't know which piece is going to create the `SSL_CTX`, since the listener thread may work in parallel with the metrics thread, and we also have to deal with synchronization between the two. Is there a downside to creating two contexts even though they use the same certificates/keys? I know it's a bit of extra overhead, but are there any other show-stoppers?
Yes, I was planning on reusing the `SSL_CTX`. I thought it could be stored in the `CompilationInfo` and initialized just before the listener and metrics threads were spawned. The OpenSSL documentation describes it as a

> global context structure which is created by a server or client once per program life-time and which holds mainly default values for the SSL structures which are later created for the connections

which prompted me to look at sharing it between the two threads, but I haven't seen anything that recommends against creating two with the same keys.
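For what it's worth, creating a second context from the same certificate and key should only be a few extra calls at startup. A minimal sketch, assuming PEM files on disk and the file-based loaders (the real code may load the key material differently, and the helper name here is mine):

```c
#include <openssl/ssl.h>

/* Hypothetical helper: build a server SSL_CTX from a PEM cert/key pair.
 * The listener and the metrics thread could each call this once if we
 * decide to keep two separate contexts. */
static SSL_CTX *
createServerContext(const char *certFile, const char *keyFile)
{
    SSL_CTX *ctx = SSL_CTX_new(SSLv23_server_method());
    if (NULL == ctx) {
        return NULL;
    }
    if ((1 != SSL_CTX_use_certificate_file(ctx, certFile, SSL_FILETYPE_PEM))
        || (1 != SSL_CTX_use_PrivateKey_file(ctx, keyFile, SSL_FILETYPE_PEM))
        || (1 != SSL_CTX_check_private_key(ctx))) {
        SSL_CTX_free(ctx);
        return NULL;
    }
    return ctx;
}
```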
As for the blocking issue, the documentation for SSL_read and SSL_get_error seems to suggest that if the socket in a BIO is non-blocking, a program can recover from a failure to read from that BIO due to insufficient data by waiting again with `poll()`:

> For socket BIOs (e.g. when SSL_set_fd() was used), select() or poll() on the underlying socket can be used to find out when the TLS/SSL I/O function should be retried.
I don't think that would require too much change to the implementation, though those functions will need to be added to the OpenSSL lib loading.
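Concretely, I'm imagining the read path looking something like the rough sketch below (names and structure are illustrative only, not the actual server code):

```c
#include <openssl/ssl.h>
#include <poll.h>

/* Sketch: read from an SSL connection whose underlying socket is
 * non-blocking, retrying with poll() when OpenSSL reports that it
 * needs more data (or needs to write, e.g. during a renegotiation).
 * Returns the number of bytes read, or -1 on error. */
static int
readWithRetry(SSL *ssl, int fd, char *buf, int len)
{
    for (;;) {
        int ret = SSL_read(ssl, buf, len);
        if (ret > 0) {
            return ret;
        }

        struct pollfd pfd = { fd, 0, 0 };
        switch (SSL_get_error(ssl, ret)) {
        case SSL_ERROR_WANT_READ:
            pfd.events = POLLIN;
            break;
        case SSL_ERROR_WANT_WRITE:
            pfd.events = POLLOUT;
            break;
        default:
            return -1; /* genuine error or the peer closed the connection */
        }
        if (poll(&pfd, 1, -1) < 0) {
            return -1;
        }
    }
}
```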
Looking at the threading issue a little more, I see the following in the OpenSSL FAQ:
> Is OpenSSL thread-safe?
> Yes but with some limitations; for example, an SSL connection cannot be used concurrently by multiple threads. This is true for most OpenSSL objects.
> For version 1.1.0 and later, there is nothing further you need do.
> For earlier versions than 1.1.0, it is necessary for your application to set up the thread callback functions. To do this, your application must call CRYPTO_set_locking_callback(3) and one of the CRYPTO_THREADID_set… API's.
The library initialization function itself must be called only once as well. (This is distinct from the context initialization, which can happen multiple times.) They also say that (for 1.1.0 and up):

> OpenSSL can be safely used in multi-threaded applications provided that support for the underlying OS threading API is built-in. Currently, OpenSSL supports the pthread and Windows APIs.
It may also be the case that the system OpenSSL was built without threading support, but I believe that can be detected by checking a particular header constant.
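The constant I have in mind is `OPENSSL_THREADS` (an assumption on my part that this is the right one to check); a compile-time check against the headers we build with, rather than the library we load at runtime, could look like this:

```c
#include <openssl/opensslconf.h>

#if !defined(OPENSSL_THREADS)
#error "This OpenSSL build has no threading support"
#endif
```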
All of this is necessary, I think, because the library has some global state that needs to be kept consistent. There's also this worrying section in the documentation for 1.0.2 threading:

> OpenSSL can generally be used safely in multi-threaded applications provided that at least two callback functions are set, the locking_function and threadid_func. Note that OpenSSL is not completely thread-safe, and unfortunately not all global resources have the necessary locks.
but they don't go into further detail, so I don't know if that affects what we're using.
Is there a cross-platform function to set a socket to be non-blocking? I see the internal `setNonBlocking` in `port/unix/j9process.c`, at least. Though, isn't JITServer only supported on Linux and Z?
That `setNonBlocking` function is marked `static`, so it cannot be used outside that file.
https://stackoverflow.com/questions/1150635/unix-nonblocking-i-o-o-nonblock-vs-fionbio is a good read and it seems to recommend:
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
There is also our portlib for socket operations in omrsock.c:
/**
 * Set socket non_block and asynchronous.
*
* @param[in] portLibrary The port library.
* @param[out] sock Pointer to the socket structure.
* @param[in] arg The socket flag you want to set on the socket.
* \arg OMRSOCK_O_NON_BLOCK
* \arg OMRSOCK_O_ASYNC
*
* @return 0, if no errors occurred, otherwise return an error.
*/
int32_t
omrsock_fcntl(struct OMRPortLibrary *portLibrary, omrsock_socket_t sock, int32_t arg)
On Unix it has this implementation:
int32_t
omrsock_fcntl(struct OMRPortLibrary *portLibrary, omrsock_socket_t sock, int32_t arg)
{
int32_t existingFlags = 0;
int32_t socketFlag = get_os_socket_flag(arg);
if (NULL == sock || 0 == socketFlag){
return OMRPORT_ERROR_INVALID_ARGUMENTS;
}
existingFlags = fcntl(sock->data, F_GETFL, 0);
if (fcntl(sock->data, F_SETFL, existingFlags | socketFlag) < 0) {
return portLibrary->error_set_last_error(portLibrary, errno, OMRPORT_ERROR_FILE_OPFAILED);
}
return 0;
}
Ideally, you would use this port library, but then you would have to change all of the existing implementation to use omrsock. That's a bigger item for another day.
> I think, because the library has some global state that needs to be kept consistent. There's also this worrying section in the documentation for 1.0.2 threading
What exactly can cause multithreading issues? For the metrics exporter there is exactly one thread that handles all the reading and writing. For the compilation request/response dialog there are multiple threads, but each thread works on its own socket (meaning that a socket is not shared by different threads). Is that a problem?
According to the documentation (for one of the callbacks that needs to be implemented in 1.0.2):

> locking_function(int mode, int n, const char *file, int line) is needed to perform locking on shared data structures. (Note that OpenSSL uses a number of global data structures that will be implicitly shared whenever multiple threads use OpenSSL.) Multi-threaded applications will crash at random if it is not set.
In later versions there is still some global state that needs to be guarded with locks, but the locking is all implemented automatically, assuming that the library has been compiled with threading support (which it usually is).
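For reference, the setup the 1.0.2 documentation asks for would be roughly the following (a sketch using pthreads; the helper names are mine):

```c
#include <stdlib.h>
#include <pthread.h>
#include <openssl/crypto.h>

/* Sketch of the callbacks OpenSSL 1.0.2 requires in a multi-threaded program. */
static pthread_mutex_t *opensslLocks;

static void
lockingCallback(int mode, int n, const char *file, int line)
{
    if (mode & CRYPTO_LOCK) {
        pthread_mutex_lock(&opensslLocks[n]);
    } else {
        pthread_mutex_unlock(&opensslLocks[n]);
    }
}

static void
threadIdCallback(CRYPTO_THREADID *id)
{
    CRYPTO_THREADID_set_numeric(id, (unsigned long)pthread_self());
}

static int
setupOpenSSLThreading(void)
{
    int i;
    opensslLocks = malloc(CRYPTO_num_locks() * sizeof(pthread_mutex_t));
    if (NULL == opensslLocks) {
        return -1;
    }
    for (i = 0; i < CRYPTO_num_locks(); i++) {
        pthread_mutex_init(&opensslLocks[i], NULL);
    }
    CRYPTO_set_locking_callback(lockingCallback);
    CRYPTO_THREADID_set_callback(threadIdCallback);
    return 0;
}
```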
Based on a brief internet search, I haven't seen other mention of the "not completely thread-safe" warning, and people seem to have been using 1.0.2 in a multithreaded setting with only the callbacks set, so it's possible that the warning only applies to more obscure library functionality, if at all.
Otherwise I don't think there's a problem with multi-threading, as long as the SSL connections themselves are not shared between threads, and we wouldn't be doing that.
It's interesting that this hasn't come up before, since as you mentioned there can be multiple threads all using OpenSSL at once even now.
> locking_function(int mode, int n, const char *file, int line) is needed to perform locking on shared data structures
What would those shared data structures be? I see the context being shared, but BIO is per thread.
I think it's the internal, global data structures that OpenSSL initializes when `SSL_library_init` is called that they're referring to, not the context or the BIOs. The documentation doesn't really go into detail about what the global data structures are, only that they're implicitly shared between threads. This only affects 1.0.2, according to the documentation, and only when those couple of callbacks aren't set.
It's possible that the functions we call don't actually use the implicit global state in a way that would cause a crash. The documentation is not very detailed on this point; it only insists that these callbacks need to be set when OpenSSL is used in a multithreaded application.
I've been implementing this feature with the following differences in structure from the current server:

- The `SSL_CTX` is created when the server starts to serve metrics requests.
- `HttpGetRequest` objects now buffer the server response to the get request.

The process of interacting with the connection sockets is similar to what happens now. The main changes are:

- With SSL enabled, a read can fail because of insufficient data even though `poll()` indicated that data was available. When this happens the server just resumes waiting for data.
- If a response cannot be written all at once, the server keeps the `HttpGetRequest` object in the `_incompleteRequests` collection, then will wait for a `POLLOUT` event on the relevant socket (roughly as in the sketch below).
- A write can similarly fail even though `poll()` indicated that the socket was ready for writing. Again, the server will simply resume waiting if that happens.

The server appears to be functioning properly with my changes and with OpenSSL disabled if I query it with `curl`. I had planned to start testing the server's behaviour with OpenSSL enabled, and its interaction with Prometheus, tomorrow.
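The write side I described above would look roughly like this sketch (the names are illustrative, not the actual `HttpGetRequest` members):

```c
#include <openssl/ssl.h>
#include <poll.h>

/* Sketch of the buffered write path: try to send the rest of a buffered
 * response; if OpenSSL cannot make progress, report which poll() event
 * the main loop should wait for before retrying. Returns 1 when the
 * response is fully written, 0 to retry later, -1 on error. */
static int
sendBufferedResponse(SSL *ssl, const char *buf, int len, int *bytesSent, short *eventsOut)
{
    while (*bytesSent < len) {
        int ret = SSL_write(ssl, buf + *bytesSent, len - *bytesSent);
        if (ret > 0) {
            *bytesSent += ret;
            continue;
        }
        switch (SSL_get_error(ssl, ret)) {
        case SSL_ERROR_WANT_WRITE:
            *eventsOut = POLLOUT; /* keep the request around and wait for POLLOUT */
            return 0;
        case SSL_ERROR_WANT_READ:
            *eventsOut = POLLIN; /* e.g. a renegotiation wants data first */
            return 0;
        default:
            return -1; /* give up on this connection */
        }
    }
    return 1;
}
```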
> Specifying SSL keys and certificates on the command line will enable SSL for the metrics server implicitly. This is currently what happens with SSL for JITServer. I added command line flags to disable SSL for JITServer and the metrics server separately.
I find this a bit confusing: the user provides SSL keys and certificates to enable SSL globally and then he has to disable part of it. Since we are using 2 different contexts we can have two different SSL keys/certificates (or the same key/certificate). Thus we only need additional options for metrics encryption. I am open to other suggestions, but in general I would prefer to not change the current behavior that is described in the docs and in blog posts.
> Sockets that have been opened to handle connections with the server are set to non-blocking when they are created
Do we have to do this? Meaning, is the current implementation flawed? I thought not, and if that's the case I would prefer to do this change in a separate PR.
The current implementation isn't flawed as it is, no. To keep the current server `poll()` structure, when SSL is enabled the sockets have to be non-blocking. Otherwise the SSL reads and writes could block even when `poll()` is used. If that change is made then the handling of reading and writing is very similar in the SSL enabled and disabled cases; the main change is that the server has to `poll()` for writes and the responses have to be buffered. I can write an initial PR that first changes the sockets to be non-blocking without adding SSL, sure.
Otherwise I can look into changing the server to use async IO like you mentioned, which would also solve this problem.
I should mention that the initial `SSL_accept` that establishes an SSL connection may also fail due to insufficient data for reading or insufficient capacity for writing, so that also has to be taken into account in the `poll()` loop.
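Handling that would be much like the read and write cases, something along these lines (sketch only, with made-up names):

```c
#include <openssl/ssl.h>
#include <poll.h>

/* Sketch: drive SSL_accept() on a non-blocking socket from the poll() loop.
 * Returns 1 when the handshake is complete, 0 if the caller should wait for
 * the event written to *eventsOut and retry, and -1 on failure. */
static int
tryAccept(SSL *ssl, short *eventsOut)
{
    int ret = SSL_accept(ssl);
    if (1 == ret) {
        return 1;
    }
    switch (SSL_get_error(ssl, ret)) {
    case SSL_ERROR_WANT_READ:
        *eventsOut = POLLIN;
        return 0;
    case SSL_ERROR_WANT_WRITE:
        *eventsOut = POLLOUT;
        return 0;
    default:
        return -1;
    }
}
```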
JITServer can export internal performance metrics to a monitoring agent like Prometheus (#14762). At the moment these metrics are sent in plain text over the network. It would be best if we offer the user the possibility of encrypting this network traffic. I am thinking that the options to encrypt performance metrics should be separate from the options to encrypt compilation requests and responses. That is, a user may elect to encrypt only the performance metrics or only the client-server communication.