Hi cricket007,
thanks for asking! StreamSets is a great tool as well, but there are some conceptual and technical differences between StreamPipes and StreamSets, as well as differences in purpose and target audience.
First of all, StreamSets is primarily focused on continuous data harmonization targeting users with a data engineering focus. In contrast, the main goal of StreamPipes is to support non-technical people, e.g., production experts, in designing data analytics pipelines. Many users use StreamPipes for (often stateful) tasks such as condition monitoring, situation detection or IoT analytics. A main focus of StreamPipes is to provide real-time analytics based on reusable pipeline elements, for example, to monitor a production plant. Of course, the rather generic architecture of StreamPipes allows for other use cases and also data harmonization tasks (we also provide features like field renaming or hashing), but you won't find pipeline elements that can be configured on a rather technical level as it is possible with StreamSets. While most data sinks in StreamSets are third-party applications that you'd like to route data to, sinks in StreamPipes are often also things like visualizations, notifications or web services connected to actuators.
From a technical perspective, StreamPipes uses the concept of runtime wrappers. Those wrappers can be used to extend the functionality of StreamPipes with new algorithms in the form of pipeline elements (which are standalone elements that connect to the editor using REST interfaces and exchange data using message brokers/protocols like Kafka or MQTT). Currently we have wrappers for Flink, Spark, Kafka Streams, Java, and Siddhi (although some of them are not yet fully ready for production usage). Wrappers can be extended for any kind of data processing technology so that StreamPipes can keep up with the fast-changing ecosystem of stream processors.
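Purely to illustrate the idea of a pipeline element behind a runtime wrapper, here is a minimal Java sketch. The interface and class names (`PipelineElementProcessor`, `TemperatureFilter`) are hypothetical, not the actual StreamPipes SDK; the point is that the element only implements event-at-a-time logic, while the wrapper takes care of deployment and wiring to the broker.

```java
// Illustrative sketch only -- the interface and class names below are
// hypothetical and do not mirror the real StreamPipes SDK. They show the
// general shape of a "pipeline element": an event-at-a-time processor whose
// lifecycle (deployment, wiring to Kafka/MQTT topics) is handled by a
// runtime wrapper rather than by the element itself.

import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical contract a runtime wrapper would call into. */
interface PipelineElementProcessor {
    /** Called once when the pipeline is started, with user-provided parameters. */
    void onInvocation(Map<String, Object> parameters);

    /** Called for every incoming event; results are handed to the wrapper-managed collector. */
    void onEvent(Map<String, Object> event, Consumer<Map<String, Object>> collector);

    /** Called when the pipeline is stopped. */
    void onDetach();
}

/** Example element: forwards only events whose temperature exceeds a threshold. */
class TemperatureFilter implements PipelineElementProcessor {
    private double threshold;

    @Override
    public void onInvocation(Map<String, Object> parameters) {
        this.threshold = ((Number) parameters.getOrDefault("threshold", 30.0)).doubleValue();
    }

    @Override
    public void onEvent(Map<String, Object> event, Consumer<Map<String, Object>> collector) {
        Object value = event.get("temperature");
        if (value instanceof Number && ((Number) value).doubleValue() > threshold) {
            collector.accept(event); // wrapper forwards this to the configured output broker/topic
        }
    }

    @Override
    public void onDetach() {
        // release resources if needed
    }
}
```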
In addition, a major benefit of StreamPipes (in our opinion) is the advanced matching support provided to users when building pipelines, based on semantics-based consistency checking. Each pipeline element and data source can be enriched with semantic metadata, which StreamPipes uses to ensure that only semantically correct pipelines are built (e.g., you can add a requirement for a minimum frequency of events in a stream or require that an event provides some "temperature" value measured in the unit "degree celsius"). The system supports users during the definition of pipelines by recommending processors and also helps to avoid building inconsistent pipelines. Such semantic descriptions can easily be added using the SDK. In addition, we are currently working on the release of a graphical model editor that is connected to a code generator in the backend.
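As a rough illustration of the kind of metadata meant here, the sketch below declares two requirements for an input stream: a temperature property in degree celsius and a minimum event frequency. The `StreamRequirements` builder is a made-up stand-in, not the real SDK API; it only shows what the editor would check before letting a stream be connected to a processor.

```java
// Hypothetical sketch of how semantic requirements could be declared for a
// pipeline element. The builder below is illustrative only and is not the
// actual StreamPipes SDK.

import java.util.ArrayList;
import java.util.List;

class StreamRequirements {
    final List<String> requiredProperties = new ArrayList<>();
    Double minFrequencyHz; // minimum event frequency the input stream must provide

    StreamRequirements requireProperty(String semanticType, String unit) {
        requiredProperties.add(semanticType + " [" + unit + "]");
        return this;
    }

    StreamRequirements requireMinFrequency(double hz) {
        this.minFrequencyHz = hz;
        return this;
    }
}

class TemperatureMonitoringRequirements {
    public static void main(String[] args) {
        // "Only connect me to streams that carry a temperature value in degree
        // celsius and deliver at least one event per second."
        StreamRequirements req = new StreamRequirements()
                .requireProperty("http://example.org/vocab/Temperature", "degree celsius")
                .requireMinFrequency(1.0);
        System.out.println(req.requiredProperties + ", min frequency: " + req.minFrequencyHz + " Hz");
    }
}
```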
Although StreamPipes can be used for data routing and transformation tasks, we mainly focus on generating value out of raw, continuous, preferably sensor/time-series data by enabling non-technical users to build data pipelines.
Having said that, since both StreamPipes and StreamSets have been developed over the last couple of years, there is clearly some overlap regarding the overall idea and some data processors that are provided by both tools.
We hope that answers your question! If you have any feedback, feature suggestions or things you'd like to see in StreamPipes, please let us know (and feel free to contact us directly on our mailing list or Slack (streampipes-community.slack.com))!
Thanks for the response! It would be useful if you could add such differentiators on your site so that others won't be wondering as well. I've got too many Slack teams nowadays, unfortunately 😞
A few comments:
tasks such as condition monitoring, situation detection or IoT analytics
I could see companies using StreamSets for this as well. I know at least one client using it for IoT already.
won't find pipeline elements that can be configured on a rather technical level as it is possible with StreamSets
I assume you are referring to the Groovy Evaluator here?
sinks ... are often also things like visualizations, notifications or web services ...
This would use some combination of HTTP / REST / Kafka / MQTT / other, I assume. Not sure if "connected to actuators" really is a critical detail for differentiating the product offerings.
Currently we have wrappers for Flink, Spark, Kafka Streams, Java, and Siddhi
I didn't dig too far into the documentation, but does this mean an API layer for these tools that you call out to make a request to a "StreamPipes API", or that you can say "run StreamPipes pipelines within these tools"; basically, like an "embedded mode"?
Hi Jordan,
thanks!
It would be useful if you could add such differentiators on your site so that others won't be wondering as well.
That's a good idea. We are currently in the process of releasing a new version next week and will then also work on updated documentation. We'll add a section to our FAQ that compares StreamPipes to other products (actually, we are also often asked about the difference to Node-RED).
This would use some combination of HTTP / REST / Kafka / MQTT / other, I assume. Not sure if "connected to actuators" really is a critical detail for differentiating the product offerings.
Yes, often some of these protocols are used under the hood when we connect sinks to actuators or third-party systems. The main idea is that we try to abstract away those technical details for the end user (e.g., by providing an "alarm light" sink that triggers the alarm at some machine).
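For illustration, here is a minimal sketch of what such an "alarm light" sink could do internally, using the Eclipse Paho MQTT client. The broker URL, topic and payload format are assumptions made up for the example; the real sink hides all of this behind a simple configuration dialog.

```java
// Minimal sketch of an "alarm light" style sink: the user just picks a sink
// in the pipeline editor, while the element publishes a command to an
// actuator over MQTT. Broker URL, topic and payload are illustrative
// assumptions; the sketch uses the Eclipse Paho MQTT client.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class AlarmLightSink {

    private final MqttClient client;
    private final String topic;

    public AlarmLightSink(String brokerUrl, String topic) throws MqttException {
        this.client = new MqttClient(brokerUrl, MqttClient.generateClientId());
        this.topic = topic;
        this.client.connect();
    }

    /** Called by the pipeline whenever a monitored condition is detected. */
    public void onEvent() throws MqttException {
        MqttMessage command = new MqttMessage("{\"light\":\"on\"}".getBytes());
        client.publish(topic, command);
    }

    public void stop() throws MqttException {
        client.disconnect();
    }

    public static void main(String[] args) throws MqttException {
        // Hypothetical broker and topic for a shop-floor alarm light
        AlarmLightSink sink = new AlarmLightSink("tcp://broker.local:1883", "plant/line1/alarm-light");
        sink.onEvent();
        sink.stop();
    }
}
```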
to make a request to a "StreamPipes API", or that you can say "run StreamPipes pipelines within these tools"; basically, like an "embedded mode"?
You can think of a pipeline element as a self-contained component that exposes its description as a JSON-LD graph via REST. In addition, a pipeline element implements its processing logic using one of the wrappers. Once a pipeline is started, StreamPipes invokes each pipeline element (again using a JSON-LD graph that now includes information on the data sources to connect to, the broker/topic where results should be sent, and other information such as user-provided parameters). Afterwards, the pipeline element itself executes the processing logic based on this configuration. Wrappers that rely on a distributed system (e.g., Flink) submit the processing logic to the configured cluster. At runtime, pipeline elements communicate with each other by using a message broker (usually Kafka, but other brokers could be used as well). One advantage of this architecture is that geographically distributed pipelines can be built (e.g., some pipeline elements are deployed at the edge and others run centrally in the cloud).
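To make the runtime part of this flow a bit more concrete, here is a minimal sketch of what a pipeline element could do after being invoked, using the plain Kafka clients API. The field names (`kafkaHost`, `inputTopic`, `outputTopic`) are hypothetical stand-ins for what the JSON-LD invocation graph actually carries, and the REST endpoint that would receive the invocation is omitted.

```java
// Sketch of the post-invocation runtime behaviour under stated assumptions:
// the configuration values below would normally arrive in the invocation
// request sent by StreamPipes. The element then consumes from its input
// topic, applies its logic and forwards results to the output topic.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class InvokedPipelineElement {

    public static void main(String[] args) {
        // Values that would normally come from the invocation graph (hypothetical names)
        String kafkaHost = "kafka:9092";
        String inputTopic = "pipeline.abc.input";
        String outputTopic = "pipeline.abc.output";

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", kafkaHost);
        consumerProps.put("group.id", "pipeline-element-abc");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", kafkaHost);
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList(inputTopic));

            // Run until the pipeline is stopped (simplified to an endless loop here)
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Placeholder for the element's actual processing logic
                    String result = record.value();
                    producer.send(new ProducerRecord<>(outputTopic, record.key(), result));
                }
            }
        }
    }
}
```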
Hope that helps! Don't hesitate to ask if you have any further questions and we really appreciate your feedback!
-Dominik
Hey, so I stumbled across this project and it looked interesting. Digging around, I see it's just a graphical editor on top of a bunch of Java SDKs, so I am really curious what the differences might be and why someone might choose your product over another?