dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

crawling on recently created files in addition to periodic folder scan #943

Open ghost opened 4 years ago

ghost commented 4 years ago

Is your feature request related to a problem? Please describe.

Crawling a very deep and large folder usually takes a lot of time. Currently fscrawler seems to traverse the whole folder on every scan, even if only one file has been updated.

Describe the solution you'd like

A list of file paths could be given to fscrawler via stdin, or maybe an HTTP API, in addition to the existing crawling path settings.

Fscrawler would put the received paths in a queue and process them first, then resume periodic scanning once there are no paths left in the queue.
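For example (purely hypothetical, since fscrawler has no stdin mode today and the flag name below is made up), the interface could be used like:

```sh
# Hypothetical usage only: fscrawler currently has no stdin mode, and
# the --paths-from-stdin flag is invented here to illustrate the idea.
# Some external tool emits one path per line; fscrawler would index
# those paths first, then resume its normal periodic scan.
printf '%s\n' \
  /mnt/share/incoming/report.pdf \
  /mnt/share/incoming/scan-0042.tiff \
  | fscrawler my_job --paths-from-stdin
```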

Describe alternatives you've considered

It would be even better if fscrawler could listen for file update events from the OS and put updated files into the queue automatically.

ghost commented 4 years ago

I was playing with the existing REST API and it seems to be almost ready for this job.

Paths can be overridden using a JSON document, and the index name can be overridden with curl ... -F "file=@test.txt" -F "tags=@tags.txt" -F "index=my-index".

The only thing is that the actual file must be submitted via the REST API.
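For reference, today that looks roughly like the following (assuming fscrawler was started with the --rest option and is listening on its default endpoint; the job name, index name, and file path are just examples):

```sh
# Start the job with the REST service enabled (example job name):
#   fscrawler my_job --rest

# Push one file's content to the upload endpoint, overriding the index:
curl -F "file=@/path/to/file.pdf" \
     -F "index=my-index" \
     "http://127.0.0.1:8080/fscrawler/_upload"
```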

It would be perfect if fscrawler could automatically read the content from the file on disk, and set the virtual/real paths, when we call the API with something like curl ... -F "file=/path/to/file" -F "index=my-index".

dadoonet commented 4 years ago

It would work only if FSCrawler has access to the local/mount disk, right?

Is the goal to manually upload a file which has been missed during the last run? Or because you don't want to wait for the next run?

ghost commented 4 years ago

@dadoonet My goal was to manually add the paths of recently created files that were missed during the last periodic scan of the local/mounted disk, so they get indexed.

The REST API is great for many purposes, but I don't think it's ideal for this case.

The REST API takes only one file at a time, and it requires the actual content of the file to be submitted via HTTP. To use this API I have to write a script that globs the files within a given path and sends each of them with an HTTP request. The script also has to deal with path filtering, which is somewhat redundant since fscrawler already has built-in support for that.
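Such a script would look roughly like the sketch below (bash, assuming the same default REST endpoint as above; the directory, filename filter, and age threshold are placeholders):

```sh
#!/usr/bin/env bash
# Sketch of the workaround described above: find files modified in the
# last 10 minutes under a directory and push each one to fscrawler's
# REST upload endpoint, one HTTP request per file.
FSCRAWLER_URL="http://127.0.0.1:8080/fscrawler/_upload"

find /mnt/share/docs -type f -name '*.pdf' -mmin -10 -print0 |
while IFS= read -r -d '' f; do
  curl -sS -F "file=@${f}" -F "index=my-index" "$FSCRAWLER_URL"
done
```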

I think the suggested API could serve as a supplement to the periodic scan of the local/mounted disk, while the existing API is good for someone who wants to index files that are not on disk at all.

dadoonet commented 4 years ago

I see. The thing is that path today only supports one entry; it's not available as an array of paths to monitor. But this is a valid ask.

Adding this as a feature request.

ghost commented 4 years ago

The reason I suggested a stdin API is that I thought I could use fswatch, which takes care of listening for file updates. It doesn't require any scripting; on Linux we could do something like fswatch /path/to/watch | fscrawler. But I realized that it's not a good fit for Windows users.

Then I tried the REST API, since I have to write some scripts anyway. It's kind of working, but it means I effectively bypass fscrawler's own crawling.
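For example, something like this glues fswatch to the existing REST endpoint (a sketch; the fswatch -0 option and the default REST endpoint are assumptions about the setup):

```sh
# Sketch: fswatch reports changed paths (null-separated); each regular
# file is pushed to fscrawler's REST upload endpoint as it changes,
# bypassing fscrawler's own crawling entirely.
fswatch -0 /path/to/watch |
while IFS= read -r -d '' f; do
  [ -f "$f" ] && curl -sS -F "file=@${f}" "http://127.0.0.1:8080/fscrawler/_upload"
done
```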

It would be even better if fscrawler could listen for file update events from the OS and put updated files into the queue automatically.

Maybe that would be a better solution than providing an API?

dadoonet commented 4 years ago

> It would be even better if fscrawler could listen for file update events from the OS and put updated files into the queue automatically.
>
> Maybe that would be a better solution than providing an API?

That's the goal with #399. Something I'd like to start at some point.