elastic / beats

🐠 Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

filebeat: feature request, delete files after harvesting finished #714

Closed nheath closed 8 years ago

nheath commented 8 years ago

Has there been any consideration to having filebeat be able to delete files on disk after harvesting is done? This use case isn't strictly part of harvesting and forwarding log file contents, but it's related, and ultimately part of a complete solution for log file maintenance. Additionally, filebeat is already monitoring these directories, and having filebeat do the deletion would ensure files aren't deleted before harvesting is finished.

Alternatively, if this is out of scope for this project, has anyone had good experiences with tools that fulfill this function on Windows? We currently have a variety of solutions for various pieces, including an old home-grown central service that monitors remote directories and zips up files, local scheduled tasks running PowerShell scripts and/or forfiles.exe, etc. I'm sure there are better approaches, but if a service we're already installing (filebeat) can accomplish the same job, that would be ideal.

ruflin commented 8 years ago

Beats are currently focused on fetching data and shipping it to elasticsearch or logstash. Your request above would mean modifying and even deleting data, which I think is a completely different area and brings a lot of new challenges and risks; dealing with access rights is only one of them. From my perspective this is out of scope for filebeat, but I will leave the issue open for others to comment with their view.

For the deletion: I would actually expect that your log rotation tool should take care of that.

nheath commented 8 years ago

Thanks for the feedback, I fully understand. Interested to hear any other user opinions. Feel free to close this issue at your discretion.

For context on my case, it has been my experience that windows systems don't have a log rotation tool that handles both rotation and deletion as linux systems do. Instead, logging frameworks (log4net, enterprise library) internally handle log rotation, leaving deletion/archival up to a different tool. I've used multiple solutions over the years including scheduled tasks, custom services, extending the logging frameworks, etc depending on the requirements. We currently have a mix of these solutions, that I'm looking to improve and standardize.

ruflin commented 8 years ago

Interesting to hear that you are on Windows. As I mainly use Linux, I wasn't aware of this "log rotation" issue on Windows. In case you use Windows event logs, have a look at https://github.com/elastic/beats/tree/master/winlogbeat

ruflin commented 8 years ago

@nheath Thanks for sharing your ideas with us. I'm going to close this one as it will not make it into the near-term roadmap. I hope you find a tool that can do the above job for you on Windows.

nathannis commented 8 years ago

@nheath We use Windows and have used several home-grown file cleanup tools as well. All simple, and they work OK. What is your current favorite? I am planning to use scheduled tasks that delete files that are much older (2 weeks) than what we would allow filebeat to import (filebeat should be importing nearly immediately after a log entry is made; however, we will allow it to import anything up to a week old).

ruflin commented 8 years ago

@nathannis Could you post a link to these tools here in case they are public? I'm sure others will also stumble over this GitHub issue and search for them.

dokki767 commented 7 years ago

+1 for auto-deletion of files when harvesting completes.

yvmster commented 7 years ago

+1 for auto-deletion of files when harvesting completes. It would be a complete solution for log file maintenance.

ruflin commented 7 years ago

For everyone not using log rotation, it should be possible to create a script in {your-language} that reads the registry file from filebeat and compares it to the existing files and their offsets. If offset = file-size and the file has not been modified for your predefined time, the script can remove it from disk.
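As an illustration (not an official tool), a minimal Go sketch of this idea, assuming the pre-7.x registry layout: a single JSON file holding an array of states with `source` and `offset` fields. The registry path and the age threshold are placeholders to adapt:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// state mirrors only the registry fields this sketch needs.
type state struct {
	Source string `json:"source"`
	Offset int64  `json:"offset"`
}

func main() {
	const registryPath = "/var/lib/filebeat/registry" // placeholder: adjust to your install
	minAge := 30 * time.Minute                        // the "predefined time"

	raw, err := os.ReadFile(registryPath)
	if err != nil {
		log.Fatal(err)
	}
	var states []state
	if err := json.Unmarshal(raw, &states); err != nil {
		log.Fatal(err)
	}
	for _, s := range states {
		info, err := os.Stat(s.Source)
		if err != nil {
			continue // file already gone or unreadable
		}
		// offset == size means filebeat has read everything; the age
		// check guards against a file that is still being appended to.
		if s.Offset == info.Size() && time.Since(info.ModTime()) > minAge {
			if err := os.Remove(s.Source); err != nil {
				log.Printf("remove %s: %v", s.Source, err)
			}
		}
	}
}
```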

nathannis commented 7 years ago

Nice idea. I'll try to make a quick PowerShell script and post it back.

cryptooman commented 7 years ago

+1 for deleting logs after harvesting. This is an optional feature and is only for convenience. Fewer tools means less support. And it wouldn't stop anyone from using a log rotation solution instead.

sektorcap commented 7 years ago

Thanks @ruflin for your suggestion to develop a script with the "offset = file-size" logic. We implemented it, but unfortunately we ran into the issue that is well described here.

In our use case we have a lot of files to ingest. No tailing, no log rotation: just a bulk of files that we receive in batches during the night.

Deleting logs after harvesting seems to solve every problem ;-) Any other suggestions?

sektorcap commented 7 years ago

Hi @ruflin, can you take a look at this patch?

ruflin commented 7 years ago

I left a minor comment on the patch, but I saw you already addressed it in a follow-up PR. Please be aware that this could work in some very specific use cases, but would probably break in lots of others, for example use cases which include file rotation.

sektorcap commented 7 years ago

Thanks for your review. I tested it, also in a production environment, and it seems to work.

Let me know if you, as Elastic, are interested in this patch. If so, I can continue to work on it, adding tests and verifying all the other use cases.

ruflin commented 7 years ago

Great to hear that you got it working in your environment. My view on the scope of filebeat is still the same as stated here: https://github.com/elastic/beats/issues/714#issuecomment-171564683

Instead of introducing file deletion, I think we should improve the options people have to circumvent inode reuse. More and more people use filebeat not in the way it was initially designed for, which is reading log files in real time, but for example for batch imports or small files. This problem is actually often simpler, as no file rotation happens. That means the unique identifier does not have to be an inode but can just be the file name, which is not only easier to track but also prevents inode reuse as long as file names are unique.

sektorcap commented 7 years ago

Ok, are you thinking of introducing a new prospector type, or a new option which changes the behavior of the log prospector so that it checks the file name instead of the inode and device id?

ruflin commented 7 years ago

Yes, or even a config option in the log prospector to define what should be used as the unique identifier and let the user pick. But be aware this is only theoretical at the moment.

Moscagus commented 7 years ago

Hi @sektorcap, can you share the script with the "offset = file-size" logic? I think it's better to use that script to move the file to ".PROCESSED" instead of deleting it. In the configuration, add: exclude_files: ['\.PROCESSED$']. This will clean those files from the registry in the next scan. Then a second script can verify that the ".PROCESSED" files are no longer in the registry and remove them, as in the sketch below. This would avoid the problem of inode reuse.
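A hedged Go sketch of that second pass, under the same pre-7.x registry assumption as the sketch earlier in this thread (the registry path and the glob are illustrative placeholders):

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

func main() {
	raw, err := os.ReadFile("/var/lib/filebeat/registry") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	var states []struct {
		Source string `json:"source"`
	}
	if err := json.Unmarshal(raw, &states); err != nil {
		log.Fatal(err)
	}
	inRegistry := make(map[string]bool, len(states))
	for _, s := range states {
		inRegistry[s.Source] = true
	}
	// Remove renamed files only once filebeat has dropped them from
	// its registry, which avoids the inode reuse problem described above.
	processed, err := filepath.Glob("/var/log/myapp/*.PROCESSED") // placeholder glob
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range processed {
		if !inRegistry[p] {
			if err := os.Remove(p); err != nil {
				log.Printf("remove %s: %v", p, err)
			}
		}
	}
}
```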

lishengxian commented 7 years ago

Hi @Moscagus, did you get the scripts? Waiting for someone to share.

sektorcap commented 7 years ago

Hi all, here is https://github.com/andrea-romagnoli/filebeat-cleaner, a project developed by a colleague of mine.

Bye

sektorcap commented 7 years ago

Remember, everything works fine if you move the files within the same partition; if you move them to another partition, this will not prevent the inode reuse issue.

Moscagus commented 7 years ago

Thanks @sektorcap, I started a script in bash. I am going to compare it with the one you sent me. I also plan to add a check so files are deleted only once they are no longer in the registry, for example in case filebeat is down and the prospector has not cleared the registry after the move. Thanks again.

Oliboy50 commented 6 years ago

will this be handled by Filebeat one day? 🤞

DarrienG commented 5 years ago

I'm interested in this as well.

I would not be opposed to writing a pull request that would support this if the maintainers are ok with it.

In my mind it would be triggered after close_inactive or ignore_older triggers.

@ruflin What do you think?

DarrienG commented 5 years ago

Bump

My team is interested in this feature as well now. We are slowly migrating from old reporting tools to ELK.

I have no problem trying to implement this at work, but I want the go-ahead because I am certainly not going to fork Filebeat and maintain my own separate branch.

I'd propose a new flag: delete_inactive

The user could define:

```yaml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /home/arastra/.jenkins/jobs/*/builds/*/log
  close_inactive: 30h
  clean_inactive: 30h
  delete_inactive: true
  close_removed: true
```

delete_inactive would follow the same logic as [close,clean]_inactive, except it would actually delete the file once it hits this point.

It is assumed this would also close the file, so that no file descriptors leak.

Can I at least get a "we're interested" from someone?

edit: I don't do devops anymore (thank god). I won't write the feature, but feel free to thumbs-up to show interest.

DarrienG commented 5 years ago

Regarding some comments above: we do not have an issue with running out of inodes, but our Jenkins instance alone generates terabytes of data every day. We literally cannot buy enough disk space to keep up.

Our old solution is to run a cron every few days to delete files older than a few days. But this doesn't guarantee Filebeat is done uploading (although it probably is).

Likewise, any solution that reads from the filebeat registry (a) polls, and (b) depends on the internal Filebeat registry structure. If that ever changes, the whole solution is hosed.

Adding a hook at the point where Filebeat marks a file as inactive is the most logical (and maintainable) solution.

Yet again, I am for keeping things small in scope, but I don't think deleting a file when it is finished harvesting is too outrageous. Filebeat already closes file handles; why not just delete the files too? It is the next logical step.

ManouchehrRasoulli commented 5 years ago

Hi guys, I wrote a module and put it into the filebeat source tree to remove the log contents after harvesting.

pop line module:

```go
package log

import (
	"bytes"
	"io"
	"os"
	"sync"
)

// define the file structure
type Popcorn struct {
	sync.Mutex
	path string
	file *os.File
}

func (f *Popcorn) Initial(path string) { f.path = path }

func (f *Popcorn) close() error { return f.file.Close() }

func (f *Popcorn) open() error {
	file, e := os.OpenFile(f.path, os.O_RDWR|os.O_CREATE, 0666)
	if e != nil {
		return e
	}
	f.file = file
	return nil
}

// PopLine reads the file, removes its first line, writes the rest
// back, and returns the removed line.
func (f *Popcorn) PopLine(size int64) ([]byte, error) {
	// lock the critical section
	f.Lock()
	// unlock after checking out the data
	defer f.Unlock()

	// read the file
	err := f.open()
	if err != nil {
		return nil, err
	}

	buf := bytes.NewBuffer(make([]byte, 0, size))

	_, err = f.file.Seek(0, os.SEEK_SET)
	if err != nil {
		return nil, err
	}
	_, err = io.Copy(buf, f.file)
	if err != nil {
		return nil, err
	}
	line, err := buf.ReadString('\n')
	if err != nil && err != io.EOF {
		return nil, err
	}

	// write the remaining content back and truncate the file to it
	_, err = f.file.Seek(0, os.SEEK_SET)
	if err != nil {
		return nil, err
	}
	nw, err := io.Copy(f.file, buf)
	if err != nil {
		return nil, err
	}
	err = f.file.Truncate(nw)
	if err != nil {
		return nil, err
	}
	err = f.file.Sync()
	if err != nil {
		return nil, err
	}

	_, err = f.file.Seek(0, os.SEEK_SET)
	if err != nil {
		return nil, err
	}
	err = f.close()
	if err != nil {
		return nil, err
	}
	return []byte(line), nil
}
```

so you need to change your harvester code to call this function after forwarding the log message

harvester.go:

```go
// initiate the popcorn to pop the content of the log file
if data.Event.Fields["popenable"] == true {
	popcorn := Popcorn{}
	popcorn.Initial(state.Source)
	_, err := popcorn.PopLine(state.Offset)
	if err != nil {
		fmt.Println(err)
	} else {
		state.Offset = 0
	}
}

// Update state of harvester as successfully sent
h.state = state
```

Thanks for reading.

ManouchehrRasoulli commented 5 years ago

Also, these changes are temporary, and you need to have a backup of your data: any network problem may cause data loss, and you would need to restore your data if that happened.

Slawka commented 4 years ago

> Also, these changes are temporary, and you need to have a backup of your data: any network problem may cause data loss, and you would need to restore your data if that happened.

Storing backup data is a separate process; filebeat is used to transfer data to processing systems, and it holds the data until receipt is confirmed.

g13013 commented 4 years ago

I don't understand why this issue is closed!

ltmleo commented 4 years ago

I would like a feature to delete inactive files as well. To me, it is more intuitive for filebeat to handle this than scripts or applications.

sektorcap commented 4 years ago

With the 6.x versions, the filebeat registry can be easily managed because there is just one entry per file, so you can use the "offset = file-size" logic.

With version 7.x (I don't know the exact version), the registry is a "log", so that approach cannot be used anymore.

@ruflin do you have any suggestions?

ruflin commented 4 years ago

@urso is probably best placed to comment on this, or @kvch ^?

urso commented 4 years ago

> With version 7.x (I don't know the exact version), the registry is a "log", so that approach cannot be used anymore.

A new registry format was introduced with 7.9. The registry contains a snapshot plus a log. The snapshot is still created every 5 minutes, or once the log file reaches 10MB. The active.dat file contains the path to the current snapshot file, and the snapshot file still uses the same format as the old registry file. In case you don't want to wait for the snapshot file to be written, the strategy to reconstruct filebeat's in-memory state is to read the active snapshot file and stream through the log file, applying all set/remove operations. The python tests contain a helper class that can read the registry. For new entries the encoding of timestamps has changed; the get_registry helper function applies some filtering and parses the timestamps.

All in all, the on-disk registry format is an internal implementation detail without any official interface, so it can change in the future. Instead of having users access the file directly, we might have to consider other alternatives in the future, e.g. some kind of API that would allow scripts to query the internal registry state.

sektorcap commented 4 years ago

Thanks @urso for the details.

I agree with you that I should not use the registry, but at the moment, since there is no safe way to remove files after harvesting has finished, parsing the registry information seems the only solution. Please consider adding this feature in an upcoming release and re-opening this issue.

However, I can see the active.dat file and the checkpoint file only when the log file reaches 10MB, but never after 5 minutes. Is it a bug or am I missing something?

Thanks.

g13013 commented 4 years ago

@urso Or at least consider #20410

urso commented 4 years ago

> However, I can see the active.dat file and the checkpoint file only when the log file reaches 10MB, but never after 5 minutes. Is it a bug or am I missing something?

Hm, originally it was planned to checkpoint every 5 minutes, but checking the code, this capability was eventually removed. Currently the best way is to parse the log file, as is done in the Registry class in the system tests. The format is quite straightforward: an entry always consists of 2 JSON objects. The first object contains a sequence number, which must be greater than the one in the current data file (otherwise the entry must be ignored), and the operation type. The second object has a K and a V field; the V is the actual entry to be inserted, and the K is stored as "_key" in the documents in the JSON file. When reading/merging, one just reads all entries into a hashtable.
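A rough Go sketch of that replay, purely for illustration: the on-disk format is an internal implementation detail as noted above, and the field names and path here are inferred from the description in this thread rather than from any official interface:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"os"
)

// operation header: the operation type plus its sequence number.
type op struct {
	Op string `json:"op"`
	ID uint64 `json:"id"`
}

// payload: K is the state key, V the actual entry to insert.
type entry struct {
	K string          `json:"K"`
	V json.RawMessage `json:"V"`
}

func main() {
	f, err := os.Open("/var/lib/filebeat/registry/filebeat/log.json") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// A full implementation would first load the snapshot named in
	// active.dat and ignore ops whose id is not greater than its id.
	states := map[string]json.RawMessage{} // key -> latest state
	dec := json.NewDecoder(f)
	for {
		var o op
		if err := dec.Decode(&o); err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		var e entry
		if err := dec.Decode(&e); err != nil {
			break // a truncated trailing entry just ends the replay
		}
		switch o.Op {
		case "set":
			states[e.K] = e.V
		case "remove":
			delete(states, e.K)
		}
	}
	fmt.Printf("replayed %d live entries\n", len(states))
}
```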

I understand reading from the log file can be a hassle, especially if there is not much load on the registry file. Feel free to open an issue/enhancement request for filebeat to be able to query the registry state and ping me on it. I will be on vacation for the next 2 weeks, but would like to pick this issue up when I'm back.

> since there is no safe way to remove files after harvesting has finished...

Normally Filebeat keeps files open, even if they have been deleted. As long as filebeat is not restarted, logs won't get lost. Some close_* settings that enforce an early close before the file has been finished can indeed lead to data loss. If files get deleted by the log rotation strategy faster than they can be shipped, that is often a bandwidth problem. But every setup/use case is different.