Project: Attachment parsing and indexing

Project info

Title: Attachment parsing and indexing
Goals: Build a private full-text search system for a large volume of emails and attachments
Priority: high but not critical

Description

We have a table (ip, login, password) of email accounts. We need a crawler to download and index all mail and attachments. Services:

Distributed crawler that downloads attachments and saves all files to private cloud file storage. It needs multiple crawling nodes and a mechanism for adding new nodes. New tasks are added through a message queue (MQ), so every node can take tasks from the queue and mark them as done. If a node fails, another node can take the same task from the queue once the timeout expires, provided the task has not been marked as done.
Parsing service (also distributed nodes with a queue). Converts all files (MIME, DOC, DOCX, XLS, XLSX, PDF, RTF, etc.) to text format and stores all text content in the full-text search engine.
Indexing service (may be combined with the previous service). Creates a full-text search index for all texts. We can use Sphinx, Apache Lucene, Elasticsearch, or another open-source engine.
Search interface. Authorized
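
The task hand-off described for the crawler (take a task, mark it as done, let another node retry after a timeout) could be sketched roughly as follows. This is only an in-memory illustration of the lease idea, not a proposed implementation: the names `Task`, `LeaseQueue`, `lease_seconds`, and the sample payload are all hypothetical, and a real deployment would rely on whichever MQ system the team selects.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    task_id: str
    payload: dict
    done: bool = False
    leased_until: float = 0.0  # time until which some node holds the task


@dataclass
class LeaseQueue:
    """Hypothetical in-memory stand-in for the MQ system (assumption)."""
    lease_seconds: float = 30.0
    tasks: dict = field(default_factory=dict)

    def add(self, task_id: str, payload: dict) -> None:
        self.tasks[task_id] = Task(task_id, payload)

    def take(self, now: Optional[float] = None) -> Optional[Task]:
        """Return the first task that is not done and whose lease has expired."""
        now = time.monotonic() if now is None else now
        for task in self.tasks.values():
            if not task.done and task.leased_until <= now:
                task.leased_until = now + self.lease_seconds
                return task
        return None

    def mark_done(self, task_id: str) -> None:
        self.tasks[task_id].done = True


# Usage with explicit timestamps so the timeout behaviour is easy to follow.
queue = LeaseQueue(lease_seconds=30.0)
queue.add("account-1", {"ip": "192.0.2.10", "login": "user"})  # sample data

first = queue.take(now=0.0)       # node A takes the task
blocked = queue.take(now=10.0)    # lease still held, nothing to hand out
retry = queue.take(now=31.0)      # node A timed out, node B gets the same task
queue.mark_done(retry.task_id)
finished = queue.take(now=100.0)  # task is done, so it is never handed out again
```

Passing `now` explicitly is a choice made for this sketch so the timeout can be exercised without sleeping; crawler nodes would normally call `take()` with no arguments.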

Required resources

Add your thoughts on the required resources:

Time: 7 days for a prototype with no scaling, plus 5 days for a scaled implementation; next steps will be discussed afterwards
Money and other resources: to be estimated
People: 3-7 developers (5 is optimal), 1 lead with management and coordination skills, no testers or QA for the first release
Technologies: languages, dependencies and tools will be selected by team and lead

What we have as of now

A C# implementation that parses all the file types we need. The author and the sources are available on request.

How to join

https://github.com/Cartesianism/Manifesto/blob/main/README.md
