elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Make metricbeat sql module more versatile #22779

Closed dagwieers closed 12 months ago

dagwieers commented 3 years ago

The metricbeat sql module is a lifesaver, we are very happy with this inclusion. \o/

But we think there are a few features that could make this even better.

Support a unique identifier for _id

We like to perform queries and updates docs in Elastic. For this we would need to provide the _id value, so it would be nice if we could configure which field is to be used as the _id value. This would allow us to have a rolling window for updating documents (e.g. every hour process the aggregated data for the previous X hours) and have it update those entries in Elastic.

It may be possible to derive this from the primary key of a table, but a configurable column would be more convenient IMO.
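To illustrate the request, here is a hypothetical sketch building on the module's existing configuration. The `document_id` option does not exist today; its name and the column names are my invention:

```yaml
- module: sql
  metricsets: ["query"]
  period: 1h
  hosts: ["postgres://user:pass@localhost:5432/mydb"]
  driver: "postgres"
  sql_query: "SELECT bucket_id, bucket_hour, total FROM hourly_aggregates"
  sql_response_format: table
  # Hypothetical option: use this column's value as the Elasticsearch _id,
  # so re-ingesting the same bucket updates the existing document.
  document_id: "bucket_id"
```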

Support an identifier for the @timestamp

We already use a timestamp provided by the query as the timeField in the index pattern, and this works very well. But I think it could be useful to have the default @timestamp selected from the query as well, so this works out of the box.

For metrics taken at the time of the query this offers no value, but for time-based aggregations in tables, having a way to influence @timestamp without needing ingest pipelines would be very useful.
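As a workaround today, the (beta) `timestamp` processor can already copy a query column into @timestamp. A minimal sketch, assuming the query returns an `event_time` column; the field name and layout are illustrative:

```yaml
processors:
  - timestamp:
      # Assumed field name for a column returned by the sql module
      field: sql.query.event_time
      # Layouts use Go's reference time (Mon Jan 2 15:04:05 2006)
      layouts:
        - '2006-01-02 15:04:05'
      test:
        - '2020-11-27 10:00:00'
  - drop_fields:
      fields: ["sql.query.event_time"]
```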

Storing and using the timestamp of the last successful ingest

We are processing aggregated data grouped by hour. The first run should ingest the historical information (everything), but subsequent queries should only cover the most recent windows. If there were a variable holding the timestamp of the last queried/ingested time, we could use it to calculate the timeframe we need to process. That way, if we fail to run the queries in a timely fashion (e.g. the agent was down), it would pick up where it left off (just like Filebeat does when processing logfiles).

cc @amandahla @jsoriano

Originally posted by @dagwieers in https://github.com/elastic/beats/issues/13257#issuecomment-734846337

elasticmachine commented 3 years ago

Pinging @elastic/integrations-services (Team:Services)

halfa commented 3 years ago

This usage (getting data from an SQL table in the form of logs) would fit better in Filebeat than in Metricbeat, as the spirit is closer to how log collection operates (a single instance of each record) than to metrics (a collection of sensor readings at a certain point in time).

jsoriano commented 3 years ago

> The metricbeat sql module is a lifesaver, we are very happy with this inclusion. \o/

Thanks for your feedback, happy to see you like this module :slightly_smiling_face:

> This usage (getting data from an SQL table in the form of logs) would fit better in Filebeat than in Metricbeat, as the spirit is closer to how log collection operates (a single instance of each record) than to metrics (a collection of sensor readings at a certain point in time).

Yeah, I tend to agree with this. The features requested here can be problematic in Metricbeat (especially the one about setting @timestamp, and the one about the last successful ingest). They may be more suitable as a Filebeat input, which has a registry to continue from a given timestamp. Or for this use case, maybe Logstash could be used, with the JDBC input plugin.
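For reference, the Logstash JDBC input already implements the "continue from the last value" behavior via `:sql_last_value`. A minimal sketch; connection details, table, and column names are placeholders:

```conf
input {
  jdbc {
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "user"
    schedule => "0 * * * *"   # run hourly
    statement => "SELECT * FROM hourly_aggregates WHERE bucket_hour > :sql_last_value"
    use_column_value => true
    tracking_column => "bucket_hour"
    tracking_column_type => "timestamp"
    # The last seen value is persisted here, so a restart picks up where it left off
    last_run_metadata_path => "/var/lib/logstash/.sql_last_value"
  }
}
```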

Regarding the support for a unique identifier: something like this may already be possible by using the fingerprint processor (not tested). It can create a unique fingerprint based on a list of other fields; you could then store it in @metadata._id so it is used as the id of the document. But even if possible, this can have problems. Indexes created by default by Metricbeat are not intended for this kind of document: they are rotated using ILM, and on rotation, documents with the same _id can exist in different indexes. There are also discussions about allowing indexes without _id (see https://github.com/elastic/elasticsearch/issues/48699); in the future, indexes/data streams storing time series might not have an _id at all.
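A sketch of this untested workaround with the `fingerprint` processor; the source field names are illustrative:

```yaml
processors:
  - fingerprint:
      # Columns that together uniquely identify a row/bucket
      fields: ["sql.query.bucket_id", "sql.query.bucket_hour"]
      # Writing the hash to @metadata._id makes the Elasticsearch output
      # use it as the document id, so re-ingested rows update in place.
      target_field: "@metadata._id"
```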

So I don't think this should be supported in the sql module (or any other Metricbeat module), but there are workarounds that may already work using processors and custom indexes, or using Logstash.

dagwieers commented 3 years ago

@jsoriano I get your point wrt. the _id not being unique across rotated indexes, but that would be a concern for the specific implementation. For some use cases there would be no roll-over, and in our case we would only push X historical records (e.g. the past 6 hours), so that a restart of the agent would recover from this kind of downtime.

But I agree with @halfa that Filebeat would be a better place for such an SQL module.

BTW, we use a different timestamp in the definition of the index pattern and that works fine as well.

botelastic[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

botelastic[bot] commented 1 year ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!