Closed dagwieers closed 12 months ago
Pinging @elastic/integrations-services (Team:Services)
This usage (getting data from an SQL table in the form of logs) would be better in filebeat rather than metricbeat, as the spirit is closer to how log collection operates (a single instance of each record) rather than metrics (a collection of sensor readings at a certain point in time).
The metricbeat sql module is a lifesaver, we are very happy with this inclusion. \o/
Thanks for your feedback, happy to see you like this module :slightly_smiling_face:
> This usage (getting data from an SQL table in the form of logs) would be better in filebeat rather than metricbeat, as the spirit is closer to how log collection operates (a single instance of each record) rather than metrics (a collection of sensor readings at a certain point in time).
Yeah, I tend to agree with this; the features requested here can be problematic in Metricbeat (especially the one about setting the `@timestamp`, and the one about the last successful ingest). They may be more suitable as a filebeat input, which has a registry to continue from a given timestamp. Or, for this use case, maybe Logstash could be used, with the JDBC input plugin.
Regarding the support for a unique identifier: something like this may already be possible by using the `fingerprint` processor (not tested). It can create a unique fingerprint based on a list of other fields, which you could then rename to `@metadata.id` so it is used as the id of the document.
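A sketch of what that could look like in `metricbeat.yml`, using the `fingerprint` processor's `target_field` option to write the hash directly into the event metadata (untested, as noted above; the field names under `fields` are hypothetical placeholders for columns returned by the query):

```yaml
processors:
  - fingerprint:
      # hypothetical fields produced by the sql module query
      fields: ["sql.query.customer_id", "sql.query.window_start"]
      # write the hash where the Elasticsearch output picks up the document id
      target_field: "@metadata._id"
      method: "sha256"
```

With this in place, re-ingesting the same aggregation window would overwrite the existing document instead of creating a duplicate, subject to the ILM-rotation caveat discussed below.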
But even if possible, it can have problems. Indexes created by default with Metricbeat are not intended for this kind of document: these indexes are rotated using ILM, and after rotation, documents with the same `_id` can exist in different indexes.
Also, there are some discussions about allowing indexes without `_id` (see https://github.com/elastic/elasticsearch/issues/48699); in the future, indexes/data streams storing time series might not have `_id` at all.
So I don't think this should be supported in the `sql` module (or any other Metricbeat module), but there are workarounds that may already work using processors and custom indexes, or using Logstash.
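For reference, the Logstash JDBC input already covers both the last-successful-ingest and the document-id requests: it persists `:sql_last_value` between runs and the Elasticsearch output can set `document_id`. A minimal sketch (connection details, table, and the `aggregate_id` column are hypothetical placeholders):

```
input {
  jdbc {
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    schedule => "*/5 * * * *"
    # :sql_last_value is persisted between runs, so after downtime
    # the query picks up where it left off
    statement => "SELECT * FROM hourly_aggregates WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # hypothetical unique column used as the document _id
    document_id => "%{aggregate_id}"
  }
}
```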
@jsoriano I get your point w.r.t. the id not being unique beyond the index used, but that would be a concern for the specific implementation. For some use-cases there would be no roll-over, and in our case we would also push only X historical records (e.g. the past 6 hours), so that a restart of the agent would recover from this kind of downtime.
But I agree with @halfa that filebeat would be a better place for such an sql module.
BTW, we use a different timestamp in the index pattern definition and that works fine as well.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi! We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as `Stale` to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!
The metricbeat sql module is a lifesaver, we are very happy with this inclusion. \o/
But we think there are a few features that could make this even better.
**Support a unique identifier for `_id`**
We like to perform queries and update docs in Elastic. For this we would need to provide the `_id` value, so it would be nice if we could configure which field is to be used as the `_id` value. This would allow us to have a rolling window for updating documents (e.g. every hour process the aggregated data for the previous X hours) and have it update those entries in Elastic.

It may be possible to relate this to the primary key of a table, but a configurable column would be more convenient IMO.
**Support an identifier for the `@timestamp`**
We already use a timestamp provided in the query as the timeField in the index pattern, and this works very well. But I think it could be useful to have the default `@timestamp` selected from the query as well, so this works out of the box.

For metrics taken at the time of the query this offers no value, but for time-based aggregations in tables, having a way to influence `@timestamp` without the need for ingest pipelines would be very useful.

**Storing and using the timestamp of the last successful ingest**
We are processing aggregated data grouped by hour. The first time we would like to get the historical information (all of it), but subsequent queries should only cover the most recent windows. In fact, if there were a variable holding the timestamp of the last queried/ingested time, we could use it to calculate the timeframe we need to process. In that case, if we failed to run the queries in a timely fashion (agent was down), it would pick up where it left off (just like filebeat does when processing logfiles).
cc @amandahla @jsoriano
Originally posted by @dagwieers in https://github.com/elastic/beats/issues/13257#issuecomment-734846337