elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Add a new input type to backfill gzipped logs #637

Open lminaudier opened 8 years ago

lminaudier commented 8 years ago

Hi,

Following this discussion on the filebeat forum, I would like to ask if it is possible to implement a solution to easily backfill old gzipped logs with filebeat.

The proposed solution mentioned in the topic is to add a new dedicated input_type.

It is also mentioned in the topic that when filebeat reaches the end of input on stdin it does not return control but keeps waiting for new lines, which makes backfilling hard to script.

What are your thoughts on this?

Thanks for your hard work.

ruflin commented 8 years ago

I would see the implementation as follows:

This would be nice to have, but I think it is not at the top of our priority list. A community contribution here would be more than welcome.

For your second issue, about running filebeat only until it completes a first pass, let's refer to this issue: https://github.com/elastic/filebeat/issues/219

lminaudier commented 8 years ago

Thanks for the fast reply and the pointer to the issue.

I will try to look at your implementation proposal. I am still quite new to Golang and the project, so I can't promise anything :)

ruflin commented 8 years ago

@lminaudier Always here to help.

mryanb commented 8 years ago

This would be a great feature addition. Currently the splunk-forwarder does something similar and will automatically index log-rotated files that have been gzipped.

Ragsboss commented 8 years ago

+1. Is anyone working on this? If not, I could possibly take it up.

ruflin commented 8 years ago

@Ragsboss I don't think anyone is working on that. Would be great to have a PR for that to discuss the details.

cFire commented 8 years ago

@Ragsboss @ruflin I'm delighted to see there's someone looking to pick this up. Is this happening? The reason I ask is that I may be able to spend some time helping out with this in lieu of building another solution for gzipped logs to use internally.

Ragsboss commented 8 years ago

@cFire please feel free to take this up. I've started looking at the code, familiarizing myself with Go in general and the Filebeat code, but I haven't started the real work yet. I'll try to help you in any way you want, just let me know. A few thoughts I had: from a purely functional viewpoint, defining a new input_type doesn't seem ideal, as that would force users to author a new prospector in the config file. Instead, I felt it may be better for the code to automatically deal with compressed files as long as they match the given filename patterns in the config file. The code could instantiate a different harvester (IIUC) based on the file extension/type, as sketched below. But if this turns out to be difficult from an implementation viewpoint, I think it's ok to burden/ask the users for some extra config...
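
Purely to illustrate the extension-based idea above (this is a hypothetical sketch, not the actual Beats code; all names are made up), the selection could look roughly like this in Go:

```go
// Hypothetical sketch: pick a plain or gzip-aware reader based on the file
// extension, so a single path pattern could cover both kinds of files.
package gzipinput

import (
	"compress/gzip"
	"io"
	"os"
	"strings"
)

// gzipFileReader closes both the gzip stream and the underlying file.
type gzipFileReader struct {
	*gzip.Reader
	file *os.File
}

func (r *gzipFileReader) Close() error {
	err := r.Reader.Close()
	if cerr := r.file.Close(); err == nil {
		err = cerr
	}
	return err
}

// openLogReader returns a reader for either a plain or a gzipped log file.
func openLogReader(path string) (io.ReadCloser, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	if !strings.HasSuffix(path, ".gz") {
		return f, nil
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, err
	}
	return &gzipFileReader{Reader: gz, file: f}, nil
}
```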

ruflin commented 8 years ago

From an implementation point of view I think we should go the route of having a separate input type. Currently filebeat is designed so that a prospector and a harvester type are tightly coupled, so a prospector only starts one type of harvester. It is ok if the gzip harvester reuses lots of code from the log harvester (which I think it will), but tailing a log and reading a file completely only once are, from my perspective, two quite different behaviours. The question that will also come up is whether the gzip files will change their name over time (meaning they have to be tracked based on inode/device) or whether it is enough to just store the filename and a read/unread flag in the registry.
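
To make the "read completely, only once" behaviour concrete, here is a minimal sketch of what the core loop of such a gzip harvester could look like (my own illustration under the assumptions above, not code from filebeat):

```go
// Hypothetical sketch of a "read once, no tailing" harvester for a gzipped
// log file; the names and structure are illustrative, not the Beats API.
package gzipinput

import (
	"bufio"
	"compress/gzip"
	"os"
)

// harvestOnce reads a gzipped file from start to finish exactly once and
// hands every line to publish. Unlike the log harvester there is no tailing:
// reaching EOF means the file is done and could be marked as read in the
// registry.
func harvestOnce(path string, publish func(line string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	for scanner.Scan() {
		publish(scanner.Text())
	}
	return scanner.Err()
}
```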

ruflin commented 8 years ago

Here is the PR related to the above discussions: https://github.com/elastic/beats/pull/2227

willsheppard commented 7 years ago

We would like filebeat to be able to read gzipped files, too.

Our main use of filebeat would be to take a set of rotated, gzipped logs that represent the previous day's events, and send them to Elasticsearch.

No tailing or running as a service is required, so a "batch mode" would also be good, but other workarounds solely using filebeat would also be acceptable.

ruflin commented 7 years ago

@willsheppard Thanks for the details. For the batch mode you will be interested in https://github.com/elastic/beats/pull/2456 (in progress)

ruflin commented 7 years ago

Now that https://github.com/elastic/beats/pull/2456 is merged this feature got even more interesting :-)

collabccie7 commented 7 years ago

Hello,

Has there been any update regarding support for gzip files? Please let me know.

Thanks.

cFire commented 7 years ago

Idk about the others, but I've not gotten any time to work on this.

ruflin commented 7 years ago

No progress yet on this from my side.

willsheppard commented 7 years ago

This gzip input filter would be the killer feature for us. We're being forced to consider writing an Elasticsearch ingest script from scratch which writes to the Bulk API, because we need to operate on logs in-place (no space to unzip them), and we would be using batch-mode (#2456) to ingest yesterday's logs from our web clusters.

maddazzaofdudley commented 7 years ago

Has there been any movement on this? This is the killer feature for me too.

woodchalk commented 7 years ago

Throwing in my support for this feature.

plumpNation commented 7 years ago

Would be an awesome feature to have.

ruflin commented 7 years ago

There is an open PR that still needs some work and also involves quite a bit of discussion: https://github.com/elastic/beats/pull/3070

jordansissel commented 7 years ago

> Harvesters would only be opened based on filenames (no inode etc.).

@ruflin I believe inodes may need to be tracked, because logrotate (assuming this is a target use case) renames files and reuses file names, unless another tracking mechanism is used (for example, when is 'hello.txt.1.gz' a "new file" in the example below?).

Example:

% ls -il /tmp/hello.txt*
103196 -rw-rw-r--. 1 jls jls 12 Jan 24 03:17 /tmp/hello.txt
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.1.gz

% cat test.conf
/tmp/hello.txt {
  rotate 5
  compress
}

% logrotate -s /tmp/example.logrotate -f test.conf

% ls -il /tmp/hello.txt*
103218 -rw-rw-r--. 1 jls jls 32 Jan 24 03:17 /tmp/hello.txt.1.gz
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.2.gz

^^ Above, 'hello.txt.2.gz' is the same file (inode) as the previous 'hello.txt.1.gz'.

We can probably achieve this without tracking inodes (tracking modification time and only reading .gz files after they have been idle for some small time?), but I think the filename alone is not enough because file names are reused, right?
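
A rough sketch of that heuristic (the function name and the idea of a fixed grace period are my own, just to illustrate the suggestion):

```go
// Hypothetical "idle long enough" check: only treat a .gz file as ready for
// harvesting once it has gone unmodified for some grace period, so logrotate
// is unlikely to rename or replace it again while it is being read.
package gzipinput

import (
	"os"
	"time"
)

func readyToHarvest(path string, idleFor time.Duration) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return time.Since(info.ModTime()) >= idleFor, nil
}
```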

ruflin commented 7 years ago

I hit exactly this issue during the implementation. That is why the implementation in https://github.com/elastic/beats/pull/3070 is not fully consistent with the initial theory in this thread.

The main difference now compared to a "normal" file is that a gz file is expected to never change, and if it does change, the complete file is read again from the beginning.
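
As an illustration of that rule (a hypothetical registry shape, not the format actually used in the PR), the registry would only need enough metadata to notice that an already-read file has changed:

```go
// Hypothetical registry entry for a gzipped file: a read/unread flag plus
// enough metadata to detect that the file changed after it was read, in
// which case it is read again from the beginning.
package gzipinput

import (
	"os"
	"time"
)

type gzRegistryEntry struct {
	Path    string
	Size    int64
	ModTime time.Time
	Done    bool // the whole file has been read once
}

func needsFullRead(entry gzRegistryEntry, info os.FileInfo) bool {
	return !entry.Done ||
		info.Size() != entry.Size ||
		!info.ModTime().Equal(entry.ModTime)
}
```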

jpdaigle commented 7 years ago

+1 vote for gzip input

smsajid commented 7 years ago

+1 for gzip support

ruflin commented 7 years ago

@jpdaigle @smsajid Could you share more details on how exactly you would use this feature?

jpdaigle commented 7 years ago

@ruflin In our case, log files from hundreds of servers are streamed to a central log processor, which then outputs gzipped logs in chunks of a few seconds on a host that's accessible to developers. This is where filebeat would come into play: prospect for the appearance of new .gz files, grab them, and process them through a logstash pipeline.

The above is pretty specific to a single use case, though. In more generic terms, we would use a .gz input filter to grab the output of one logging system and "glue" it to the input of a logstash pipeline.

christiangalsterer commented 7 years ago

We would also be interested in this feature, as we also have scenarios where zipped files are generated by an intermediate system and we can't read the plain files directly. It would be great if this could be combined with #4373 so that we could specify, based on how many days back and/or on file name patterns, which zipped files to read.

smsajid commented 7 years ago

@ruflin My use case requires back-filling of gzipped log files. We have a requirement to keep original log files for a minimum of 6 months. Right now, the only option for us is to unzip the log files from the central location to a staging location and read the files from there using filebeat. This has the disadvantage that we need additional storage and regular cleanup of the unzipped files once they are processed.

christiangalsterer commented 7 years ago

A very similar requirement here as well, which seems to be not so uncommon in regulated environments.

ruflin commented 7 years ago

@jpdaigle @smsajid @christiangalsterer Thanks for sharing. The use case you mention where you "only" have gzip files and no other files could actually be covered by https://github.com/elastic/beats/pull/3070. Now that we have prospectors better abstracted out, it should also be possible to make it a separate prospector type.

@christiangalsterer Which issue did you want to refer to above? Not sure if the one you put in is the right one.

strootman commented 7 years ago

Good Luck

eperpetuo commented 6 years ago

+1 for gzip support

ruflin commented 6 years ago

@eperpetuo Could you also share your detailed use case? I want to keep collecting data on how people would use the feature to then validate a potential implementation.

eperpetuo commented 6 years ago

@ruflin In my case, files are automatically gzipped during log rotation. That is, the current log file is compressed and renamed to foo.log.gz, and a new foo.log file is created. New log events start to be written to this new foo.log file.

Now, I have experienced some delay during high-throughput periods. Imagine filebeat is 2 seconds behind the last line in the log file. When log rotation occurs and the file is gzipped, filebeat is not able to continue reading the file.

Although new lines written to the just-created foo.log file are collected by filebeat from the beginning, the last few lines of the now-gzipped file are never shipped to Elasticsearch, so in this case information is lost.

We currently use Splunk and it also shows some delay. However, the splunk-forwarder is able to collect all events even during log rotation and no message loss occurs, which is what matters most.

This situation is preventing us from moving the solution to the production environment, because we can't afford to lose messages.

praseodym commented 6 years ago

@eperpetuo For now, a workaround could be to set up logrotate to only gzip the second time a file is rotated (i.e. you end up with log, log.1, log.2.gz, log.3.gz, etc.). This is how many Linux distros do rotation as well.

eperpetuo commented 6 years ago

@praseodym Thanks. We'll look into this workaround. :metal:

ruflin commented 6 years ago

@eperpetuo If you disable close_renamed and close_removed, filebeat should continue reading the file as it will keep it open. This should give you the behaviour you expect. Feel free to open a topic on Discuss to discuss this further.

jhnoor commented 6 years ago

+1 for gzip support. We have a ton of gzipped XML back logs that I would like to send to logstash. Unzipping everything would take up 30x more space.

sly4 commented 6 years ago

+1 for gzip support. I have a pile of alerts.json.gz files I would like to re-run.

brennentsmith commented 6 years ago

+1 as well. Our use case is that our CDN provider (Edgecast) gzips logs before shipping them to us, so we have no way of receiving them raw. We could have a process that takes the gzipped objects and decompresses them into a second directory that filebeat watches, but that's frankly wasteful when we're dealing with terabytes of logs.

ruflin commented 6 years ago

@sly4 @jhnoor @brennentsmith The cases you mentioned should be easier to cover as there is no overlap between zipped and unzipped files. Thanks for sharing the use cases.

djjoshuad commented 6 years ago

+1 on behalf of the DFIR world. ELK is/was my go-to for ingesting large dumps of log data in whatever crazy format a customer provides them in. I have no use for "tailing" a log file; I only need to read in static files. Those files are almost always gzipped, and rarely is decompressing them a viable option. As an example, the case I'm working right now involves (among many other things) about 200G of compressed log evidence spanning the last 18 months. Decompression is not a viable option. I'd love to adopt the beats way of doing things, if it becomes possible to do so.

yodog commented 6 years ago

+1

georgezoto commented 6 years ago

+1 for "reading gzipped files should be supported out of the box"

kevin0211 commented 6 years ago

+1

arenard commented 6 years ago

Any updates on this?

iahmad-khan commented 6 years ago

+1 for gz files as input

mishat-realityi commented 6 years ago

Are there any updates on this?

cando-1p commented 6 years ago

If I understand the documentation for filebeat here, filebeat can take logs from stdin or a UDP socket. So for most of the cases talked about here, you could set up the config file appropriately and do something in bash like:

for f in *.gz ; do
   zcat "$f" | filebeat   # filebeat configured with the stdin input type
done

The important use case I see for reading gzip files is logs that are rotated before they are shipped in the cases of a prolonged network outage. For example on my busy server I have log rotation that happens every minute and I have delaycompress enabled. If there is a network outage to where filebeat sends the logs for more than a minute, It could end up missing logs because they were rotated into a gzip file. I understand the difficulties of tracking between the original file and the gzipped file and you don't want to read logs from a gzipped file that the logs have already been read from. So I think it would be great if filebeat could take care of doing the rotating. :-D That way it could update the registry info with the file that it caused to be gzipped. The rotation could have nice rich support for rotating with date formatted path names and cleanup.