elastic / ecs

Elastic Common Schema
https://www.elastic.co/what-is/ecs
Apache License 2.0
1.01k stars 418 forks source link

log.offset not ECS compliant? #2234

Open nicpenning opened 1 year ago

nicpenning commented 1 year ago

Summary

The field log.offset seems to be a standard when using any of the Elastic Integrations and the file input type. It seems logical that this become a standard ECS field or something equivalent because the line number is very helpful when ingesting from files.

Motivation:

The motivation came from when I was developing my first Elastic integration and was surprised to see that log.offset (which automatically got added to my data ingest) was not found in the ECS. I was using a sample ingested event that contained this field. I ended up removing it from my sample event because I imagine something under the hood in Elastic is creating and mapping this field.

Detailed Design:

log.offset

120

Type: Long

About any file that is ingested with filebeat will contain this field.

Really, I am just asking the question, why shouldn't this exist in the ECS?

Does log.origin.file.line replace it?

ebeahan commented 1 year ago

Does log.origin.file.line replace it?

No, log.origin.file.line is the line number read from the original file. I believe log.offset is a cursor offset in bytes from the beginning of the file.

log.offset was originally in ECS but removed pre-1.0.0: https://github.com/elastic/ecs/pull/131. With log.offset populated by the log (and now filestream) input, it may have been considered too low-level to capture in a common schema at the time. There are other input-specific fields not defined in ECS, like input.type.

nicpenning commented 1 year ago

Okay, that makes sense. So it is possible that fields that could be used very commonly won't need to be ECS compliant? I thought this could be a slam fun for a field but looks like that road has already been navigated :)

nicpenning commented 1 year ago

Also, is using this custom field an issue since log.* is reserved for ECS?

nicpenning commented 1 year ago

If there are no plans to address the log.offset I can go ahead to close this Issue and move on. I just thought it was an interesting find, but apparently not so much haha

ebeahan commented 1 year ago

Also, is using this custom field an issue since log.* is reserved for ECS?

Yes, it's breaking the conventions a bit. There's a handful of legacy fields in places, like Beats, that are commonly used, predate ECS, but for one reason or another never formally added to the schema.

If there are no plans to address the log.offset I can go ahead to close this Issue and move on.

No plans currently, but we can also leave this discussion open if anyone has thoughts to share around log.offset.

nicpenning commented 1 year ago

I think it could be a worthwhile discussion. I will leave it open for a bit until I get a better understanding of the scope of the situation. Thanks for your input!

zez3 commented 10 months ago

@ruflin do you remember perhaps what was the reasoning behind removing log.offset type deninition and how should we continue or not with this ?

ruflin commented 10 months ago

The main reason was that for the initial release of ECS we wanted to keep the scope to a minimum. It is easy to add the fields, but really hard to remove them later, see also https://github.com/elastic/ecs/pull/131.

It leaves open the question, what should we do with log.offset. For anything shipped to Elastic, we should make sure the mapping is accurate even if it is not part of ECS. @felixbarny Should it be mapped by a default template? @AlexanderWert Is there anything similar in semconv? @cmacknz I assume filestream uses the same field?

StephanErb commented 10 months ago

Mapping with a default template sounds reasonable to me. The offset fields takes notable amount of space and is rarely searched on, so disabling indexing would make sense.

AlexanderWert commented 10 months ago

@AlexanderWert Is there anything similar in semconv?

In semconv there's there's nothing similar. There's a log.record.uid but that's different, representing a unique record id.

ruflin commented 10 months ago

The offset fields takes notable amount of space and is rarely searched on, so disabling indexing would make sense.

Agree, also something we are discussing internally. @StephanErb If I remember correctly, you use it for sorting if timestamps are identical? Will you still need it with nano precision or does the field then become obsolete?

cmacknz commented 10 months ago

@cmacknz I assume filestream uses the same field?

Yes

StephanErb commented 10 months ago

We planned to do the sorting based on the offset, but never got it fully working. As you have said, once nanoseconds come in the offset would not really be needed any longer.

zez3 commented 10 months ago

We planned to do the sorting based on the offset, but never got it fully working. As you have said, once nanoseconds come in the offset would not really be needed any longer.

So how do we stop this field from being created, as close as possible to the filebeat(source) level? I have some hundreds of Agents configured with different policies that contain file log collection/ingestion. What would be the best/recommend approach here?

ruflin commented 10 months ago

Ideally, I think we should make the field "opt-in" on the edge (in filebeat). The problem is that some users will consider this a breaking change, so we need to figure out how to do it gracefully.

I did a quick try with filebeat on how to do it manually and the following seems to work:

processors:
- drop_fields:
    fields: ["log.offset"]
    ignore_missing: true

This drops the field directly on the edge.

ruflin commented 10 months ago

As a follow up I have filed https://github.com/elastic/elastic-agent/issues/3934