Set up and review functional scope of Logstash

richard-jones commented 3 years ago

[x] Install Logstash on test infrastructure
[x] Compare import pipeline requirements to Logstash capabilities.
[x] Produce report detailing what can be done in Logstash and what cannot
[ ] Open new issues to cover the work required to deliver the import pipeline

RK206 commented 3 years ago

Logstash has been set up on the test infrastructure.

RK206 commented 3 years ago

Compare import pipeline requirements to Logstash capabilities.

Logstash can be part of the import pipeline. It supports multiple pipelines, each defined in its own config file, so data can be imported from different sources. Logstash can be connected to different connectors such as Kafka, Beats etc. For example, different sources send data to different Kafka topics, either directly or through ETL code (if the source tool does not support Kafka directly). A separate Logstash pipeline can then be created for each Kafka topic to send the data on to Elasticsearch. There can be one or more Logstash instances, depending on the pipeline requirements. See the attached graphical view of the Logstash pipeline.
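To make the one-pipeline-per-topic layout concrete, here is a minimal Python sketch that renders the Logstash config files such a setup might use. The topic names, host addresses and the `/etc/logstash/conf.d` directory are assumptions for illustration only; none of them come from our test infrastructure. The sketch relies on the `pipeline.id` and `path.config` keys of `pipelines.yml` and on the `kafka` input and `elasticsearch` output plugins.

```python
from typing import Iterable


def pipeline_conf(topic: str,
                  kafka: str = "localhost:9092",
                  es: str = "http://localhost:9200") -> str:
    """Render a pipeline config that reads one Kafka topic and writes to Elasticsearch.

    The default hosts are placeholders, not real infrastructure.
    """
    lines = [
        "input {",
        "  kafka {",
        f'    bootstrap_servers => "{kafka}"',
        f'    topics => ["{topic}"]',
        "  }",
        "}",
        "output {",
        "  elasticsearch {",
        f'    hosts => ["{es}"]',
        f'    index => "{topic}"',  # assumption: one index per topic
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


def pipelines_yml(topics: Iterable[str],
                  conf_dir: str = "/etc/logstash/conf.d") -> str:
    """Render pipelines.yml with one pipeline entry per Kafka topic."""
    lines = []
    for topic in topics:
        lines.append(f"- pipeline.id: {topic}")
        lines.append(f'  path.config: "{conf_dir}/{topic}.conf"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    topics = ["usage-events", "citation-events"]  # hypothetical topic names
    print(pipelines_yml(topics))
    print(pipeline_conf(topics[0]))
```

Each generated `.conf` file would be saved under the path named in `pipelines.yml`, so adding a new source means adding one topic and one pipeline entry.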

RK206 commented 3 years ago

Produce report detailing what can be done in Logstash and what cannot

The following can be done with Logstash:

1) Logstash has a lot of input plugins, which should be sufficient for most requirements. Input plugins include beats, file, kafka, http, tcp etc.
2) Data can be parsed using filters before it is sent to the output.
3) As an example of parsing: if the input is a log file or stream with one log entry per line, a filter can be added to parse each entry into JSON format before sending it to the output.
4) Parallel pipelines can be created for multiple input sources using multiple config files.
5) Output can be sent to different destinations by using different output plugins.
6) Logstash integrates very well with Elasticsearch.
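The parsing step in point 3 can be illustrated outside Logstash with a short Python sketch. This is not a Logstash filter, only an approximation of what such a filter does to each line. The log layout (`timestamp LEVEL message`) and the `parse_failure` tag are assumptions made for the example.

```python
import json
import re

# Assumed log layout: "<timestamp> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_log_line(line: str) -> str:
    """Turn one log line into a JSON document.

    Lines that do not match the assumed layout are passed through
    unparsed and tagged, so nothing is silently dropped.
    """
    text = line.strip()
    match = LOG_PATTERN.match(text)
    if match is None:
        return json.dumps({"message": text, "tags": ["parse_failure"]})
    return json.dumps(match.groupdict())


if __name__ == "__main__":
    print(parse_log_line("2021-03-01T12:00:00Z INFO pipeline started"))
    print(parse_log_line("not a structured line"))
```

Tagging unparseable lines rather than discarding them keeps the failures visible in the output, where they can be counted or investigated later.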

The following cannot be done with Logstash:

1) Logstash does not have its own clustering. That means if the Logstash instance is down, the pipeline is down.
2) We need to manage our own cluster if we need Logstash in High Availability mode.
3) By default Logstash uses in-memory bounded queues. Data can be persisted by using the persistent queue feature, but there is still a chance of data loss for input plugins that do not support acknowledgement to the sender, for example TCP and UDP.
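The acknowledgement point can be shown with a toy model in Python. This is a simplified simulation of a bounded queue, not Logstash's actual queue implementation, and the function names are invented for the example. It contrasts a sender that never learns whether an event was accepted with one that keeps unacknowledged events for a later retry.

```python
import queue
from typing import List


def send_fire_and_forget(q: queue.Queue, events: List[str]) -> List[str]:
    """Sender without acknowledgement: events that do not fit are lost.

    Returns the lost events only so the demo can display them; a real
    sender of this kind would never find out.
    """
    lost = []
    for event in events:
        try:
            q.put_nowait(event)
        except queue.Full:
            lost.append(event)
    return lost


def send_with_ack(q: queue.Queue, events: List[str]) -> List[str]:
    """Sender with acknowledgement: stops at the first rejected event.

    Returns the events still pending, which the sender can retry later.
    """
    pending = list(events)
    while pending:
        try:
            q.put_nowait(pending[0])
        except queue.Full:
            break
        pending.pop(0)
    return pending


if __name__ == "__main__":
    q1 = queue.Queue(maxsize=2)
    print("lost:", send_fire_and_forget(q1, ["a", "b", "c"]))

    q2 = queue.Queue(maxsize=2)
    pending = send_with_ack(q2, ["a", "b", "c"])
    q2.get_nowait()  # the consumer drains one event
    pending = send_with_ack(q2, pending)
    print("still pending after retry:", pending)
```

In the first case the third event is gone for good, while in the second it is delivered once the queue has room, which is why inputs that can acknowledge to the sender are the safer choice for an import pipeline.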

richard-jones commented 3 years ago

From our discussions today:

We are not sure what is happening with Logstash with regard to future forks by Open Distro, and this may have an impact on our choice to use it.
It may not be easy to build the custom bits of the pipeline, such as connecting to the Reporting Context, though it is possible to write Logstash steps in Ruby.

We think that we should spec out the import pipeline and the pre-compute pipelines before taking this one further (I will open that as a separate issue).

richard-jones commented 3 years ago

We have had a number of discussions and investigations into Logstash internally, and following recent news from Elastic on their further licensing and functional changes to their various components, we have decided that pursuing Logstash is no longer a suitable Open Source approach for this project.

To that end, we have evaluated a number of alternatives, and concluded that Apache Kafka https://kafka.apache.org/ is the best alternative solution. In fact, this gives us more flexibility than Logstash, and may be integrated into the Event API directly, to create a high performance import pipeline.

@Pcolar I'm assigning this one to you just to make sure that you get the information. We can leave this open to discuss at our next call, or if you are happy as-is you can close it off.

Pcolar commented 3 years ago

IMO - kafka is an excellent choice for event topic queues and feeds. I will see if there are any queries from the NGLP side and close this issue.

richard-jones commented 3 years ago

Just to finalise this issue:

We made the decision based on the ongoing licensing shenanigans at Elastic to drop Logstash from the project.

Instead we have adopted Apache Kafka, which is a high scale/high throughput event handler. It is providing us real-time event processing, meaning that events reported to the Event API are available in the reporting interfaces in near real-time.

I'm closing this issue off now.

NGLPteam / NGLP-Analytics

Set up and review functional scope of Logstash #1
