egyptian-geeks / Activities

A repo to collect ideas for activities/projects to be done by the group members

Multi-machine Desktop Search Engine with Web UI based on ElasticSearch #7

Open meladawy opened 9 years ago

meladawy commented 9 years ago

By @AmrEldib

Multi-machine Desktop Search Engine with Web UI based on ElasticSearch. Let me explain: desktop search sucks at best. To add to that, desktop search over multiple machines sucks even more. The idea is to use ElasticSearch as the backend for indexing and search. It's very powerful and can sync indices across multiple machines. A Windows service would monitor the file system and feed the content of the files to ElasticSearch to index. A plugin system can be used to add support for different file formats (PDF plugin, ZIP plugin, ISO plugin, Images plugin, etc.). Then a web UI uses the ElasticSearch API to search the index and display the results. Check out ElasticSearch: http://www.elasticsearch.org/
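For illustration only, a minimal sketch of the indexing step could look like the following (assuming a recent elasticsearch-py client, a local instance on localhost:9200, and an index called "files"; none of these names come from the proposal itself):

```python
# Hypothetical sketch: push one file's metadata and extracted text into
# Elasticsearch. Index name "files" and the document fields are assumptions.
from pathlib import Path
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

def index_file(path: str) -> None:
    p = Path(path)
    doc = {
        "path": str(p.resolve()),
        "name": p.name,
        "extension": p.suffix.lower(),
        "size_bytes": p.stat().st_size,
        "modified": p.stat().st_mtime,
        # A format plugin (PDF, ZIP, ISO, ...) would extract text here;
        # plain text is read directly as the simplest case.
        "content": p.read_text(errors="ignore") if p.suffix == ".txt" else "",
    }
    # Using the path as the document id makes re-indexing an upsert.
    es.index(index="files", id=doc["path"], document=doc)

if __name__ == "__main__":
    index_file("example.txt")
```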

kashwaa commented 9 years ago

I'm in for this one

mtayseer commented 9 years ago

This is a great idea. It can be a freemium product.

kashwaa commented 9 years ago

Ok, let's discuss that in a little more detail. There would be a search server that hosts Elasticsearch and a web interface for searching. Each client will have a small program that initially scans the drives for files and feeds them to the search server for indexing, and then keeps running to trace file changes and feed them as they occur; something that looks like this drawing
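As a rough sketch of that client-side agent (scan once, then watch for changes and push them to the server), something like the following could work. The watchdog library and the /ingest endpoint URL are assumptions for illustration, not decisions from this thread:

```python
# Hypothetical client agent: initial scan plus a file-system watcher that
# pushes change records to the search server for indexing.
import os
import time
import requests
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

SERVER = "http://search-server:8000/ingest"  # hypothetical server endpoint

def push(path: str, action: str) -> None:
    # Send a small change record; the server side does the actual indexing.
    requests.post(SERVER, json={"path": path, "action": action}, timeout=5)

class ChangeHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            push(event.src_path, "created")

    def on_modified(self, event):
        if not event.is_directory:
            push(event.src_path, "modified")

    def on_deleted(self, event):
        if not event.is_directory:
            push(event.src_path, "deleted")

def initial_scan(root: str) -> None:
    # One-time walk of the drive/folder to seed the index.
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            push(os.path.join(dirpath, name), "created")

if __name__ == "__main__":
    root = os.path.expanduser("~")
    initial_scan(root)
    observer = Observer()
    observer.schedule(ChangeHandler(), root, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()
```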

The search interface would look like this drawing 2

meladawy commented 9 years ago

Yeah, I like the idea as well. We still need more discussion about controlling privacy & accessibility, and about what comes next after finding the files: should users be able to download them, and if so, how would a file be downloaded directly from a private computer, etc.

More to come, ISA!

kashwaa commented 9 years ago

I thought about privacy as well; we have several issues to consider. The user will have the ability to select specific directories to index, and system, config, and password files will be excluded by default. I don't think we should let the searcher download the file, neither from the search server nor directly from the clients; or at least we should have more discussion about this point.
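To make that concrete, the filter could be as simple as the sketch below: only user-selected roots are indexed, and common system/config/password file patterns are excluded by default. The concrete roots and patterns here are illustrative assumptions, not an agreed list:

```python
# Hypothetical privacy filter: index only inside user-chosen roots and skip
# sensitive-looking files by default.
from fnmatch import fnmatch
from pathlib import Path

INCLUDED_ROOTS = [Path.home() / "Documents", Path.home() / "Pictures"]
EXCLUDED_PATTERNS = [
    "*.sys", "*.ini", "*.conf", "*.cfg",
    "*passwd*", "*shadow*", "*.kdbx", "*.pem", "*.key",
]

def should_index(path: Path) -> bool:
    # Must live under one of the user-selected roots...
    in_scope = any(str(path).startswith(str(root)) for root in INCLUDED_ROOTS)
    # ...and must not match any excluded pattern.
    blocked = any(fnmatch(path.name.lower(), pat) for pat in EXCLUDED_PATTERNS)
    return in_scope and not blocked
```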

AmrEldib commented 9 years ago

Thanks @kashwaa for creating that diagram. I'm working on a version that reflects what I had in mind, which is slightly different but very much along the same lines. @meladawy I agree with @kashwaa that downloading the file is out of scope for a search engine. You can view a 'cached version' of the file, but it will only be the content that the engine indexed; it won't be a copy of the file. Say the engine can index the content of PDF files. For that to happen, the engine keeps a stripped copy of the file's content. Users can view that, but not a copy of the file. This idea is mainly for computers inside a LAN. The main scenario is one user (or multiple users with the same trust level); maybe later on, it can be built to support multiple users inside a small company, for example.
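In code terms, the 'cached version' would just be the text stored in the index at indexing time, never the file itself; a tiny sketch (reusing the assumed "files" index and "content" field from the earlier example) might look like this:

```python
# Hypothetical "cached view": return only the text that was extracted when
# the file was indexed, not the file itself.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def cached_view(doc_id: str) -> str:
    hit = es.get(index="files", id=doc_id)
    return hit["_source"].get("content", "")
```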

AmrEldib commented 9 years ago

Okay, let me explain the diagram. desktop search engine diagram 2014 08 28

The system has 3 main components:

The indexer monitors the folders, reads the files, and adds them to Elasticsearch. The indexer usually indexes the local machine but can also index an external drive or NAS.

The web page consumes the index stored in Elasticsearch and displays results in a browser. The web page can sit on top of an API instead of directly consuming Elasticsearch. This allows for later creating desktop clients and 3rd-party apps (launchers, mobile apps, integration with file browsers, etc.).
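A minimal sketch of that API layer could look like the following; Flask, the /search route, and the query shape are assumptions for illustration, not part of the design above:

```python
# Hypothetical search API: the web page (or later a desktop client, launcher,
# or mobile app) calls this endpoint instead of talking to Elasticsearch
# directly.
from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")  # assumed local instance

@app.route("/search")
def search():
    q = request.args.get("q", "")
    resp = es.search(
        index="files",
        query={"multi_match": {"query": q, "fields": ["name", "path", "content"]}},
        size=20,
    )
    hits = [
        {"path": h["_source"]["path"], "score": h["_score"]}
        for h in resp["hits"]["hits"]
    ]
    return jsonify(hits)

if __name__ == "__main__":
    app.run(port=8000)
```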

In the diagram, you can see how the different components are connected and the main scenario I have in mind. The laptop is mobile and can often be disconnected from the network, so it has its own copy of the index and its own UI. The problem is that a copy of the index can be huge. I'm not sure how we can get around a problem like this.

The laptop indexer indexes the laptop (duh) and stores the data in the laptop's Elasticsearch. The Elasticsearch instance on the laptop syncs with the server's Elasticsearch: the laptop index gets copied to the server, and the server/desktop index gets copied to the laptop. When the laptop is disconnected, the user can still find the items stored on the desktop and server. This lets the user answer the question "where's my file?". Getting a hold of the file is a different matter; however, a number of tools already exist to solve this problem (Skydrive, Tonido, etc.). Once the user knows where the file is, they can get it. It's not in the scope of this project to solve an already-solved problem.
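The thread doesn't settle how the two Elasticsearch instances would stay in sync. One possible mechanism (reindex-from-remote, which appeared in later Elasticsearch versions and is not something confirmed here) would let each side pull the other's index over HTTP; hosts and index names below are assumptions, and the remote host would also have to be whitelisted in the local cluster's configuration:

```python
# Hypothetical one-way sync: ask the local cluster to copy an index from the
# remote cluster via the _reindex API (reindex-from-remote).
import requests

def pull_remote_index(local_es: str, remote_es: str, index: str) -> None:
    body = {
        "source": {"remote": {"host": remote_es}, "index": index},
        "dest": {"index": index},
    }
    requests.post(f"{local_es}/_reindex", json=body, timeout=600).raise_for_status()

# e.g. on the server, pull the laptop's index:
# pull_remote_index("http://localhost:9200", "http://laptop:9200", "files-laptop")
```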

The desktop and server are always connected, so having one copy (one instance of Elasticsearch) and one UI makes total sense. The indexer must be able to send its findings to a remote machine. Having two indexers also makes sense because it splits the indexing effort across two machines. It's reasonable to assume that, for brief periods of time, the server might not be available. The desktop indexer has to be able to "queue" all the requests that should have been sent to the server but weren't. Then, when the server is back online, those requests are completed.
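That queuing behaviour could be as simple as the sketch below: failed requests go into a local journal and get replayed once the server answers again. The journal location, record shape, and /ingest endpoint are assumptions for illustration:

```python
# Hypothetical offline queue for the desktop indexer: persist requests that
# couldn't reach the server and flush them later.
import json
import os
import requests

SERVER = "http://search-server:8000/ingest"   # hypothetical server endpoint
JOURNAL = os.path.expanduser("~/.indexer-queue.jsonl")

def send_or_queue(record: dict) -> None:
    try:
        requests.post(SERVER, json=record, timeout=5).raise_for_status()
    except requests.RequestException:
        # Server unreachable: persist the request locally for later.
        with open(JOURNAL, "a") as f:
            f.write(json.dumps(record) + "\n")

def flush_queue() -> None:
    # Called periodically; replays queued requests once the server is back.
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as f:
        pending = [json.loads(line) for line in f if line.strip()]
    remaining = []
    for record in pending:
        try:
            requests.post(SERVER, json=record, timeout=5).raise_for_status()
        except requests.RequestException:
            remaining.append(record)
    with open(JOURNAL, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in remaining)
```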

That's what I have in mind for now. What do you think?

AmrEldib commented 9 years ago

What I wanted to highlight in this diagram is:

AmrEldib commented 9 years ago

If you would like to play around with the diagram above and make modifications, I've uploaded the original .pdn file (use Paint.net) to OneDrive.

kashwaa commented 9 years ago

@AmrEldib I think that having multiple indices complicates things (unless, of course, Elasticsearch has a built-in means to synchronize multiple instances). How about creating 2 versions of the project: one that works locally, replacing the old desktop search, which can be given away for free; and another that works across multiple machines, with a central server holding one index that all clients connect to, which can later be monetized.

AmrEldib commented 9 years ago

I think Elasticsearch makes things simpler, but let me get proof that this can actually be done.

MohamedAlaa commented 9 years ago

I think making the indexer work alone and index everything might be a bit useless, as you will index system files! My recommendation is to make it something like a folder watcher where you can tell it: OK, scan my Documents, Desktop, and another folder.

I'm also not sure about the 2 Elasticsearch instances. You could make the small app the file indexer first: it keeps watching these folders and sends them to a Redis database in the cloud, and when you ask for something, you search directly in the Redis database; that would be more efficient and faster.

Once you get the basics right, you can start thinking about crawling file contents and building more on top of the basic approach.

AmrEldib commented 9 years ago

The indexer service is a folder watcher with the ability to re-index everything on demand. When the service indexes something (on demand or because of a file change), it sends an updated description of the item (file/folder) to ElasticSearch, which actually "indexes" the item. You can think of the indexer service as an advanced crawler, because it crawls all the files at the start and on demand, and watches them for changes. Indexing system files is optional and not enabled by default. You can control that by:

@MohamedAlaa I'm not sure where the Redis database can fit in this implementation. It seems to be replacing ElasticSearch. Please clarify.

I should be writing this stuff in the projects page. I'll try to do that soon.