Splitting the monolithic server into multiple microservices

shankari commented 4 years ago

The phone native code is already modularized into several plugins. But the server code is currently in one monolithic repository. While this is a simple design to begin with, it is problematic for several reasons:

as the server functionality grows, it becomes more and more unwieldy
the size of the server repo grows substantially, taking up expensive disk space in cloud environments
it is hard to compose a system by choosing alternate implementations
we can't mix and match technologies - e.g. maybe using node.js for the load/store component since it does not need any in-house analysis.

Further, specially as we move towards the UPC architecture, we really want to have a microservices architecture. Many of the services share common functionality, like in the diagram below.

python_decomposition

So the next question is, what are the best tools to split up the architecture? Some choices that I have considered, along with their limitations are:

Create a python library for the core and register it in PyPi, but that seems to introduce a barrier to others editing/improving the data model
Use git sub-modules (https://github.blog/2016-02-01-working-with-submodules/), but they seem to be tricky to use, particularly wrt updates
Use numpy.distutils.Component(), similar to sklearn, but that doesn't seem to affect the code, only the packaging

@atton16 @jf87 @kafitz @PatGendre @stephhuerre any thoughts on this?

Do you think we will have a lot more additions to the data model?
Have you worked with git sub-modules before?
Have you seen prior examples of including python modules that you felt were particularly elegant?

kafitz commented 4 years ago

I've become increasingly a fan of the first alternative of having a PyPI library, even though I was also hesitant about the friction it introduced. In our case, I've separated the data processing to its own library (https://github.com/TRIP-Lab/itinerum-tripkit) and reference that when needed--it's allowed me to both offer it as a CLI tool and pull it into services as needed. With the ease of pinning, it's also given me more confidence in using it in new projects I hadn't anticipated, or that have a short shelf-life now but I want to continue working for future reference.

That said, I've been very careful about trying to keep large dependencies low. One example I've wanted to emulate is Sentry.io's deprecated raven package which used to have you install add-ons for flask like pip install raven[flask]. Perhaps that could be done with packages that require a C/C++ compiler so builders can choose how much baggage they want to incorporate.

It's probably my own experience with Git, but I'd rather deal with resolving an errant pip dependency than tracking down an issue with Git submodules. I prefer Git submodules in the case of some monolith repo where I'm truly just glueing together complete components at a very high level.

shankari commented 4 years ago

@kafitz thanks for the feedback! For the record, the raven documentation is here https://docs.sentry.io/clients/python/advanced/

Let me see if I can figure out how they accomplished that.

shankari commented 4 years ago

here's the answer. I am not sure we will need this, but it is good to know how to set it up. https://github.com/getsentry/raven-python/blob/master/setup.py#L40

shankari commented 4 years ago

@kafitz good points about the pip install. After I wrote out the options, I realized that in terms of modularity, we don't actually want other services to add new data models to the core.

The core data models are the ones for the incoming data and the basic trip diary. Any new service should have its own data models in its own module. There might be a way for the new service to register its data model with the core module at runtime for greater interoperability.

I need to think through this some more.

shankari commented 4 years ago

I am going to think through this in two separate steps:

what are the core modules,
where are they used now,
what are the use cases for using and extending them in the future

shankari commented 4 years ago

I wrote out a big answer to this question, but then I ran out of memory and had to reboot and lost it. Here I go again.

List of core modules

The main core modules that are used across multiple services are:

the data model wappers
the database/storage calls

Current usage

Data model wrappers

The data model wrappers are used as classes to wrap existing information. So it makes more sense to include them as a library. This is a good candidate for pulling out in PyPi, although we need to figure out how this can be extended (see below).

Database/storage calls

This makes sense to model as a microservice. The service will accept an encryption key, decrypt the information, and make it available via a standard interface to other services. This makes the database layer very similar to the existing XBOS/SMAP layer, which focuses on efficient storage and data accessibility and should make it easier to merge projects in the future. I believe that the current prototype for the UPC already has the database as a separate service - @njriasan can you confirm?

Extension

One of the big differences between e-mission and other platforms is that e-mission is designed to be extensible. This means that people can add new functionality, both from the analysis side, and from the sensing side.

Data model wrappers

We should note that the server-side data model wrapper only represent the server-side representation. There is also a similar client side representation, currently in the data collection plugin. Over the long-term, we should really structure this as follows:

each data type should have its own model in python, Java and Objective C
each service should register for the data types that it consumes, which should automatically include the appropriate wrappers in the build

However, this is fairly complicated, and I am afraid of overengineering a solution in the time we have left. Instead, when any user wants to add a new data type, they will add it in the plugin and in the repository for the wrappers. The python version of the wrappers will be a library, installable via pip install, that will contain all known wrappers. The wrapper classes themselves are fairly small, so this should not involve too much of a burden.

Database/storage calls

I will refactor this officially into a microservice. This should be fairly straightforward since all access to the database should be through the defined Timeseries interface anyway. The methods from the abstract timeseries interface will be API methods for the container. The database will be in the internal docker network, and be accessed using the internal network only. This will also root out any lingering instances where we are not using the Timeseries interface. One challenge will be around dealing with cursors, where we want to iterate through the data without retrieving it all.

@njriasan any thoughts on this?

shankari commented 4 years ago

Another component that we should really split out is the webapp component, which contains the HTML and Javascript for the server UI, including the aggregate heatmap. That should really be served completely separately from the API, which makes calls into python.

The obvious fix would be to have two servers - one for the server UI (web tier), and the other for API layer (app tier). But that would involve having two separate ports, which opens up multiple holes in the firewall and which is generally worse from a security perspective.

An alternative would be to have one front-facing server that serves up the presentation layer and forwards other connections to the underlying API layer. The front-facing server can include a list of calls that should be forwarded, or it can forward everything that it doesn't handle and let the API layer reject it.

shankari commented 4 years ago

Although we don't want to get bogged down in this, it is also worthwhile considering how all this will work in the decentralized world. In the decentralized world, each API call (e.g. /usercache/put, /datastreams/get) is ideally a separate microservice. We want to be able to run multiple microservices at the same time. The phone will establish a connection to the microservice as part of the authentication handshake.

This seems to imply that, of the multiple microservices running in parallel, each of them will need to listen to a separate port.

@njriasan what is the firewall story wrt the cloud cluster? Will we even be able to have a regular firewall, given that each user will have multiple services, each of which will need its own port?

shankari commented 4 years ago

Also, in order to reduce maintenance costs, it would be good if we could standardize on microservices and run them in both decentralized and centralized environments. In the centralized environment, something like https://www.express-gateway.io/, open sourced at https://github.com/ExpressGateway seems like a good solution.

Of course, since our API proxy is fairly basic, we could just implement something simple in python if it turns out that switching to the MEAN stack is too complicated.

shankari commented 4 years ago

Related python project is https://github.com/Miserlou/Zappa which appears to be a wrapper around AWS API Gateway resource and Lambda services. Which is great, but I am not sure I want to build in that kind of dependency.

njriasan commented 4 years ago

Sorry I didn't see this until just now. The database is currently extracted into a separate docker container and run in conjunction in the UPC architecture. I think a service which interacts with a user's database/storage layer is reasonable. I think the alternative approach could be that there is a central UPC database service and that the Database/storage cells rather than being fed the private key send a query to a core UPC component. Similarly I think you could design the database/storage cells to request the data from the UPC instance rather than assuming some internal docker database (unless that's what you mean). These are just other options to consider because I think what you listed sounds great.

I think the firewall situation is complicated at best and its probably not feasible to give each user its own firewall. The exact authentication protocol probably needs to be discussed in more detail, but aside from only making the services addressable from inside the cluster I'm not sure how we can have a true firewall with dynamic ports.

shankari commented 4 years ago

In case anybody is tracking progress here, I got sidetracked by a conda regression (https://github.com/e-mission/e-mission-docs/issues/511), but I have worked around it now, dockerized the testing infrastructure, and got it to work with Github Actions and Travis CI (https://github.com/e-mission/e-mission-server/pull/731).

Now that I have testing in place, I can move out the first part, which is the simulation code, and create a dockerized setup with the OTP setup from Helsinki. Onward!

shankari commented 2 years ago

Removed all the old webapp code from the server pending creation of a separate, modular webapp. https://github.com/e-mission/e-mission-server/pull/854

This should make the server a lot smaller, and ensure that the dependabot alerts disappear since all of them were related to the javascript code.

e-mission / e-mission-docs