Backend Management of Document Import

This is a tracking issue that collects the tasks and concepts involved in creating our document import path.

The setup

There are three major pieces to importing documents into Krang.

1: Document gathering

In this step we gather documents from agencies and put them in an S3 bucket, using a predetermined directory structure. This is only the document collection piece, and it operates independently of the other steps.
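
The directory structure itself is not specified in this issue. As a rough sketch only, the helper below builds object keys under an assumed `agency/YYYYMMDD/filename` layout; the function name `document_key`, the layout, and the date format are all hypothetical and would need to match whatever structure is actually chosen.

```python
from datetime import date
from pathlib import PurePosixPath


def document_key(agency_slug: str, gathered_on: date, filename: str) -> str:
    """Build an object key under an ASSUMED 'agency/YYYYMMDD/filename' layout."""
    # Keep only the final path component of the supplied filename.
    safe_name = PurePosixPath(filename).name
    if safe_name in ("", ".."):
        raise ValueError("filename must name a file")
    # date supports strftime-style format specs inside f-strings.
    return f"{agency_slug}/{gathered_on:%Y%m%d}/{safe_name}"


key = document_key("department-of-state", date(2015, 3, 1), "report.pdf")
print(key)
```

Keeping key construction in one function means the gathering step and the later steps agree on where a document lives without sharing any other code.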

2: Extracting text and processing documents

A script processes the newly gathered documents, using the document processing toolkit to convert them into text and potentially extract other metadata (such as titles). The manifest is created in this step.
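
A minimal sketch of what this step might produce, under stated assumptions: the manifest format is not defined here, so the field names (`source_key`, `text_key`, `title`) are hypothetical. The text converter is passed in as a plain callable because the toolkit's API is not described in this issue, and the first-non-empty-line title heuristic is only one possible starting point for #677.

```python
import json
from typing import Callable, Dict, Optional


def guess_title(text: str) -> Optional[str]:
    """Hypothetical heuristic: use the first non-empty line as the title."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return None


def build_manifest(
    documents: Dict[str, bytes],
    extract_text: Callable[[bytes], str],
) -> dict:
    """Convert each document to text and record one manifest entry per document.

    `documents` maps an object key to the raw file contents.
    `extract_text` stands in for the document processing toolkit.
    """
    entries = []
    for key in sorted(documents):
        text = extract_text(documents[key])
        entries.append(
            {
                "source_key": key,
                "text_key": key + ".txt",  # assumed naming for the extracted text
                "title": guess_title(text),
            }
        )
    return {"documents": entries}


# Usage with a stand-in converter that treats the bytes as UTF-8 text.
sample = {"doj/20150301/memo.pdf": b"\n  Annual FOIA Report \nBody text"}
manifest = build_manifest(sample, lambda raw: raw.decode("utf-8"))
print(json.dumps(manifest, indent=2))
```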

[ ] #677 Explore extracting a document title
[ ] #680 Running the document processing on documents in an S3 bucket.

3: Importing documents

This script imports the new documents from step 2 into Krang. For now, importing copies the documents into the Krang production document bucket. The import step also ensures that the Elasticsearch index is updated.

[ ] #679 Import script (reads S3 bucket)
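
The import flow described above could be sketched roughly as follows. This is an illustration, not the planned implementation: it assumes a manifest shaped like `{"documents": [{"source_key": ...}, ...]}`, and the copy and indexing operations are passed in as callables (`copy_to_production`, `index_document`, both hypothetical names) so the sketch does not presume any particular S3 or Elasticsearch client API.

```python
from typing import Callable, List, Set


def import_documents(
    manifest: dict,
    already_imported: Set[str],
    copy_to_production: Callable[[str], None],
    index_document: Callable[[dict], None],
) -> List[str]:
    """Import every manifest entry that has not been imported yet.

    Returns the source keys that were imported on this run.
    """
    imported = []
    for entry in manifest["documents"]:
        key = entry["source_key"]
        if key in already_imported:
            continue  # skip documents handled by an earlier run
        copy_to_production(key)  # copy into the production document bucket
        index_document(entry)    # keep the search index up to date
        imported.append(key)
    return imported


# Usage with in-memory stand-ins for the bucket copy and the search index.
copied, indexed = [], []
manifest = {"documents": [{"source_key": "a/1.pdf"}, {"source_key": "a/2.pdf"}]}
new_keys = import_documents(manifest, {"a/1.pdf"}, copied.append, indexed.append)
print(new_keys)
```

Copying before indexing means a failed copy never leaves an index entry pointing at a document that is missing from the production bucket.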

18F / 2015-foia-hub