cortoproject / corto

A hierarchical object store for connecting realtime machine data with web applications, historians & more
https://www.corto.io
MIT License
Positioning this framework / SDK #530

Closed KrishnaPG closed 6 years ago

KrishnaPG commented 7 years ago

I came across the Corto SDK and went through the tutorials.

I am trying to understand where exactly this SDK fits in the bigger IoT landscape.

For example, one prominent concept I see (from the tutorials and issues here) is the "store". In a typical IoT deployment you are looking at ingesting the data into a DB through MQs. It is not clear whether Corto is trying to replace the traditional DBs (such as InfluxDB, Postgres etc.). If so, what is the advantage of the Corto store over those DBs?

The website has this tagline ...bridging of data between IoT protocols, databases, webservices ...

It is not clear what is meant by "bridging of data between some X and database". A database is supposed to be "the store" of data. I am not sure how Corto injecting another store before the database is helpful.

If this is more like an "in-memory datastore" in front of the traditional DBs, then one would have to ask: how is this Corto "in-memory store" better than Redis or similar in-memory databases?

Any info that clarifies this product positioning would greatly help the community adopt this at a much faster pace.

SanderMertens commented 7 years ago

@KrishnaPG all great questions! You catch us at a great time; after being in development for almost 5 years we're close to releasing version 1.0. Note I'm updating our tutorial, since I agree the current one doesn't highlight very well why you would use Corto.

TL;DR: Corto is not a replacement for postgres, influxdb, redis or any other existing database. It is a lightweight library (~400 KB) that connects datasources and provides a uniform API to access/modify/bridge data.

Think of Corto as a community-powered toolkit for IoT platform developers. Corto handles the mundane "data plumbing", like fetching data from multiple sources, forwarding it from one source to another, making it available to (web) APIs, transforming between different formats and installing/managing/updating connector packages.

The long version: Data in Corto is organized in an object hierarchy which emits CRUD-events whenever a change occurs. These events can be generated and caught by anyone, from connectors (to bridge data from one technology to another) to application developers.
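A minimal sketch of the idea, written in Python rather than against the Corto C API. Every name here (`Store`, `observe`, `put`, `delete`, the event names) is hypothetical and only illustrates an object hierarchy that emits create/update/delete events to observers of a scope:

```python
from collections import defaultdict


class Store:
    """Toy hierarchical store: objects live at '/'-separated paths."""

    def __init__(self):
        self.objects = {}
        self.observers = defaultdict(list)  # scope path -> list of callbacks

    def observe(self, scope, callback):
        """Register a callback for every change at or below `scope`."""
        self.observers[scope].append(callback)

    def _notify(self, kind, path, value):
        for scope, callbacks in self.observers.items():
            prefix = scope.rstrip("/") + "/"
            if path == scope or path.startswith(prefix):
                for callback in callbacks:
                    callback(kind, path, value)

    def put(self, path, value):
        """Create or update an object and emit the matching event."""
        kind = "UPDATE" if path in self.objects else "CREATE"
        self.objects[path] = value
        self._notify(kind, path, value)

    def delete(self, path):
        """Remove an object and emit a DELETE event with its last value."""
        value = self.objects.pop(path)
        self._notify("DELETE", path, value)
```

In this sketch a connector is simply a callback registered with `observe("/sensors", ...)`; it sees every change under `/sensors` and nothing outside that scope.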

For example, if I want to store sensor data from MQTT in a historian like influxdb, I would set up an MQTT connector and an influxdb connector on the same endpoint. Corto then takes care of forwarding the data from MQTT to influxdb, and also translates automatically between the native formats of the connectors (for example, between JSON and the line format of influxdb).
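To make the format translation concrete, here is a small Python sketch that turns a JSON payload into one line of InfluxDB line protocol (`measurement,tag=value field=value timestamp`). It is not the Corto connector code. The payload shape, the `sensor` tag key and the `ts` key are assumptions for illustration, only numeric fields are handled, and escaping of spaces and commas is left out:

```python
import json


def json_to_line(measurement, payload, tag_keys=("sensor",), ts_key="ts"):
    """Convert a flat JSON object into one InfluxDB line-protocol line."""
    record = json.loads(payload)

    # Tags are appended to the measurement name as ,key=value pairs.
    tags = "".join(f",{k}={record[k]}" for k in tag_keys if k in record)

    # Every remaining numeric value becomes a float field.
    fields = ",".join(
        f"{k}={float(v)}"
        for k, v in sorted(record.items())
        if k not in tag_keys
        and k != ts_key
        and isinstance(v, (int, float))
        and not isinstance(v, bool)
    )
    if not fields:
        raise ValueError("line protocol requires at least one field")

    line = f"{measurement}{tags} {fields}"
    if ts_key in record:
        # The timestamp is passed through unchanged; its precision is
        # whatever the producer and the database agree on.
        line += f" {int(record[ts_key])}"
    return line
```

For instance, `json_to_line("weather", '{"sensor": "temp1", "value": 21.5, "ts": 1500000000}')` yields `weather,sensor=temp1 value=21.5 1500000000`.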

Corto allows connectors to be "ephemeral", meaning that data doesn't have to be stored in the (RAM) object store. Connectors implement an interface which lets corto request records/objects iteratively (one-by-one, for an offset+limit). That way, it loads only what you need, and you can iterate through datasets with millions of records, even on small devices.
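The offset+limit pattern can be sketched as a generator. This is a generic illustration in Python, not the Corto connector interface; `fetch_page` is a hypothetical callback that a connector would supply:

```python
def iterate(fetch_page, limit=100):
    """Yield records one by one, holding at most `limit` of them in memory.

    `fetch_page(offset, limit)` must return a list with up to `limit`
    records starting at `offset`, and an empty list once the data runs out.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            return
        yield from page
        offset += len(page)
```

A caller that stops iterating early never triggers the remaining fetches, which is what keeps memory use bounded on small devices.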

Corto's API and framework are designed to keep as little data in RAM as possible, and should be regarded as a thin data-access wrapper, not "yet another datastore".

It is our goal to keep Corto a free, community-powered project. We added rudimentary package management facilities in the library that make it easy to share connectors between users. Currently we have (open source) connectors to postgresql, influxdb, mqtt, omg dds, HTTP/REST, DDP, sigar, an HTML5 data browser, edmunds and openweatherapi, and we are about to release mongodb and lmdb.

Hopefully that all makes sense :) Please reach out if you have any follow-up questions!

KrishnaPG commented 7 years ago

Thank you @SanderMertens for taking time to explain the details. It is helpful.

I think the source of the confusion came from the fact that the tutorial on the website started (and ended) with explaining about the store part and not clearly demonstrating the connectivity part.

If I am not much mistaken, the differentiation you are offering through Corto is the inter-connectivity part, whereby, in C, I could gather data from disparate sources (MQTT, REST clients etc.), do some processing, and then send the data on to disparate sinks (say InfluxDB, ElasticSearch ...) without actually worrying about the connectivity drivers/interfaces. Correct?

If that is indeed the case, then this is very good work. In that case, perhaps I would have to take a deeper look into this to see how it can fit some use-cases we have.

For example,

The concern I have is: Performance.

C is very good for CPU-oriented work. But the I/O part is always tricky for it to handle, which is where the underlying event loop clearly determines the performance. When you mentioned events and connectors, how are the underlying network communications implemented? For example, frameworks such as SeaStar take full advantage of multi-core machines to achieve high performance with a shared-nothing memory model. On the other extreme we have the libUV style of model, which is single-threaded with callbacks that reuse the CPU (the NodeJS model, which chokes the CPU when given CPU-bound operations).

In typical IoT middleware scenarios, it is always a combination of CPU + IO bound operations. How does Corto handle this combination?

One example case: data comes from sensors, and Corto sits between MQTT and InfluxDB to do anomaly detection and raise an alert series that goes into, say, RethinkDB (which is monitored by browser clients), while also calculating moving averages as another new series that goes into, say, ES. Can Corto handle that kind of load pressure / performance criticality?

Perhaps any simple demo app that can demonstrate the connectivity part with one or two of the existing connectors you have listed, could help much.

SanderMertens commented 7 years ago

> ...without actually worrying about the connectivity drivers/interfaces. Correct?

That is correct! The Corto API offers a set of common data management operations which are mapped to disparate connectors/datasources. Developers thus only need to know the Corto API.

> you have mentioned about "change events". What are their limits? Would they work across browser connected clients (something like pubsub)? Or, is it more like they work within the process boundary.

Corto implements a pub/sub event model where datasources are decoupled from observers. It relies on connectors to cross process boundaries. The MQTT connector (https://github.com/cortoproject/mqtt) is a simple example of a connector that can forward events between different processes.

> What is the ingestion rate? for example, can this be put in direct line of high-frequency sensor data (to do, say, event processing to detect anomalies, or compute moving averages etc. in real-time), or is it more like for offline processing

I did a few simple benchmarks today to give you a correct answer. Here are the results:

benchmark        time / event        events / sec
0 observers      31 nanoseconds      32 million
1 observer       44 nanoseconds      23 million
5 observers      86 nanoseconds      11 million
10 observers     147 nanoseconds     6.7 million
50 observers     575 nanoseconds     1.7 million
100 observers    1103 nanoseconds    0.9 million

The measurements were obtained with one process using a release build of Corto on a 2.5 GHz i7. The code can be found here: https://github.com/cortoproject/examples/tree/master/c/PerfTest. Connectors in Corto are observers, so 100 observers equate to 100 different technologies observing a datapoint.
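As a quick check on the arithmetic, the events/sec column is the reciprocal of the time per event, and the spread between 0 and 100 observers works out to roughly 10.7 nanoseconds per additional observer. The short Python snippet below only reproduces that calculation from the numbers in the table; the function names are made up for this illustration:

```python
# Time per event in nanoseconds, keyed by observer count (from the table above).
TIME_NS = {0: 31, 1: 44, 5: 86, 10: 147, 50: 575, 100: 1103}


def events_per_second(time_ns):
    """Throughput is the reciprocal of the per-event time."""
    return 1e9 / time_ns


def cost_per_observer(times=TIME_NS, low=0, high=100):
    """Average extra nanoseconds that each additional observer adds."""
    return (times[high] - times[low]) / (high - low)


for observers, ns in TIME_NS.items():
    rate = events_per_second(ns) / 1e6
    print(f"{observers:>3} observers: {rate:.1f} million events/sec")
```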

The time per event will go up as observers need to do more (in the benchmarks the observers are empty). If an observer has to do a lot of work or relies on I/O, it could be better to offload it to a different thread, which brings me to:

> In the typical IOT middleware scenarios, it is always a combination of CPU + IO bound operations. How does Corto handle this combination?

Excellent question! Corto supports both single-threaded and multithreaded notifications. By default, observers are triggered within the same thread as where the event occurs. However, to make optimal use of multi-core architectures, Corto allows observers to use a so-called dispatcher.

Dispatchers are entities that intercept events at a low level and allow users to override how an observer should be invoked. For example, a dispatcher could distribute events to a threadpool to distribute load evenly across cores. Dispatchers don't enforce a particular threading strategy, but rather provide a mechanism that allows freedom in how to implement threading.
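The dispatcher idea can be sketched as follows. This is Python, not the Corto C API, and `InlineDispatcher`, `PoolDispatcher`, `post` and `publish` are hypothetical names chosen for the illustration. The point is that the code publishing an event does not decide on which thread an observer runs; the dispatcher does:

```python
from concurrent.futures import ThreadPoolExecutor


class InlineDispatcher:
    """Default behaviour: run the observer in the publishing thread."""

    def post(self, observer, event):
        observer(event)


class PoolDispatcher:
    """Hand each observer invocation to a pool of worker threads."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def post(self, observer, event):
        return self.pool.submit(observer, event)

    def close(self):
        # Wait for queued observer calls to finish before returning.
        self.pool.shutdown(wait=True)


def publish(event, observers, dispatcher):
    """Deliver one event to every observer through the given dispatcher."""
    for observer in observers:
        dispatcher.post(observer, event)
```

Swapping `InlineDispatcher()` for `PoolDispatcher(workers=8)` changes the threading strategy without touching the observers, at the cost of losing ordering guarantees between events.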

> can Corto handle that kind of load pressure / performance criticality ?

It is usually not Corto that is the bottleneck. As shown in the benchmarks, the overhead for events is very low. The connectors typically determine how fast a system can go.

For example, Mosquitto will start choking if you try to send messages at >1000Hz. InfluxDB ingests data with REST, and keeps time with a precision of 1 second, so storing data at higher rates doesn't make sense. RDBMSs and NoSQL stores have to be ACID, so insertion times are usually measured in microseconds, not nanoseconds. Even a fast protocol like DDS still takes 6 microseconds per write, which is two orders of magnitude higher than an update in Corto.

> Perhaps any simple demo app that can demonstrate the connectivity part with one or two of the existing connectors you have listed, could help much.

This 80-lines-of-code example shows bridging between DDS and MQTT (both IoT protocols): https://github.com/cortoproject/ospl/blob/master/examples/mqttbridge/src/mqttbridge.c

The connectors can be found here: DDS -> https://github.com/cortoproject/ospl MQTT -> https://github.com/cortoproject/mqtt

The DDS and MQTT connectors connect to the same data endpoint (topicScope), which causes data to be automatically bridged between them. The MQTT connector configures the corto sampleRate policy. DDS typically publishes at much higher frequencies on a LAN, and this policy ensures you don't swamp the MQTT broker with traffic.
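How Corto implements the sampleRate policy is not described in this thread. As a generic illustration of the idea only, a per-key rate limiter that drops updates arriving faster than a configured rate could look like this in Python (`SampleRateLimiter` and `allow` are hypothetical names):

```python
import time


class SampleRateLimiter:
    """Let through at most `rate_hz` updates per second for each key."""

    def __init__(self, rate_hz, clock=time.monotonic):
        self.min_interval = 1.0 / rate_hz
        self.clock = clock          # injectable so the limiter can be tested
        self.last_sent = {}         # key -> time of the last forwarded update

    def allow(self, key):
        """Return True if an update for `key` may be forwarded now."""
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False            # too soon after the previous update: drop
        self.last_sent[key] = now
        return True
```

A bridge would call `allow(topic)` before publishing to the slower side, so a topic updated at several kHz on DDS reaches the MQTT broker at only the configured rate.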

Perhaps noteworthy at this point is that Corto implements mechanisms at a low level to prevent feedback loops between connectors. Without those, the above example would start looping data between DDS and MQTT.

I hope to finalize work on the InfluxDB connector this weekend, at which point I will host a small demo on the Corto website that shows data flowing from MQTT into InfluxDB, and exposes it through REST.

A lengthy reply. Hope it helps!

KrishnaPG commented 7 years ago

Thank you @SanderMertens for the detailed response. The benchmark results are impressive.

I will try to check out the examples you have pointed to. I have one small nagging concern though: why rake and why not cmake?

cmake is usually the go-to choice for C/C++ build systems these days. Is there any specific reason for not using it? I would be happy to do a PR for CMake, but I am not sure whether this project has specific dependencies on rake that cmake cannot provide.

SanderMertens commented 7 years ago

Initially I was using make, but its limited portability, combined with the sometimes obscure scripting, made me look for alternatives. I considered cmake, but eventually went with rake because it is available on many platforms, and mostly because it is based on ruby and thus has a "proper" and widely adopted scripting language.

What do you see as the downside of rake vs. cmake? Note that rake is only required for building corto, for your own projects you can use any buildsystem (though most corto users do use the corto buildsystem, because they like its ease of use).

SanderMertens commented 7 years ago

@KrishnaPG a new version of Corto has been released

SanderMertens commented 7 years ago

@KrishnaPG I have pushed a new version of corto which contains many new features and bugfixes. If you'd like, you can try it out by just repeating:

sudo curl corto.io/install | sh

Also, I have just uploaded the new tutorial, which explains in greater depth how to build a connector, and the underlying architecture. You can find it here: https://corto.io/doc/tutorial.html