18F / 2015-foia-hub

A consolidated FOIA request hub.

Backend Management of Document Import #678

Open khandelwal opened 9 years ago

khandelwal commented 9 years ago

This is a collection of issues that identify various tasks and concepts in the creation of our document import path.

The setup

There are three major pieces to importing documents into Krang.

1: Document gathering

In this step we gather documents from agencies and put them in an S3 bucket, following a predetermined directory structure. This is only the document collection piece, and it operates independently of the other steps.
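The directory structure is not spelled out in this issue. Purely as an illustration, the sketch below builds an S3 object key from an assumed `<agency>/<office>/<YYYYMMDD>/<filename>` layout. The layout, the function name `document_key`, and the sample values are all hypothetical, not the project's actual convention.

```python
from datetime import date
from pathlib import PurePosixPath


def document_key(agency: str, office: str, gathered: date, filename: str) -> str:
    """Build an S3 object key for a gathered document.

    Hypothetical layout: <agency>/<office>/<YYYYMMDD>/<filename>.
    """
    # Keep only the final path component so a local path does not leak into the key.
    name = PurePosixPath(filename).name
    return str(PurePosixPath(agency, office, gathered.strftime("%Y%m%d"), name))


key = document_key("department-of-state", "foia-office", date(2015, 3, 1), "/tmp/report.pdf")
print(key)
```

Deriving the key from a single function keeps the gathering step and the later steps in agreement about where a document lives, whatever layout is eventually chosen.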

2: Extracting text and processing documents

A script processes the newly gathered documents, using the document processing toolkit to convert them into text and, potentially, to extract other metadata (such as titles). The manifest is created in this step.
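A minimal sketch of what creating the manifest could look like, assuming the text has already been extracted and is available as a mapping from document key to text. The manifest fields (`document`, `title`, `text_length`), the first-line title heuristic, and the JSON output are assumptions made for illustration; the issue does not define the manifest format or the toolkit's API.

```python
import json
from pathlib import PurePosixPath
from typing import Dict, List


def guess_title(text: str, fallback: str) -> str:
    """Use the first non-empty line of the extracted text as the title."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return fallback


def build_manifest(extracted: Dict[str, str]) -> List[dict]:
    """Turn {document key: extracted text} into a list of manifest entries."""
    entries = []
    for key in sorted(extracted):  # sorted for a stable, diff-friendly manifest
        text = extracted[key]
        entries.append({
            "document": key,
            # Fall back to the file name (without extension) when the text is empty.
            "title": guess_title(text, fallback=PurePosixPath(key).stem),
            "text_length": len(text),
        })
    return entries


# Hypothetical extracted text standing in for the toolkit's output.
extracted = {
    "agency/office/20150301/report.pdf": "\nAnnual Report\nBody text...",
    "agency/office/20150301/scan.pdf": "",
}
manifest = build_manifest(extracted)
print(json.dumps(manifest, indent=2))
```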

3: Importing documents

A script imports the newly processed documents from step 2 into Krang. Currently, the import copies the documents into the Krang production document bucket. The import step also ensures that the Elasticsearch index is updated.
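The import flow described here (copy each document, then update the search index) can be sketched as follows. This is only an outline under stated assumptions: the manifest entry shape and the `import_documents` function are hypothetical, and the copy and index operations are passed in as callables so that no particular S3 or Elasticsearch client API is implied.

```python
from typing import Callable, Dict, Iterable, List


def import_documents(
    manifest: Iterable[Dict],
    copy_document: Callable[[str], None],
    index_document: Callable[[Dict], None],
) -> List[str]:
    """Copy each manifest entry to the production bucket, then index it.

    Returns the keys of the documents that were imported.
    """
    imported = []
    for entry in manifest:
        key = entry["document"]
        copy_document(key)     # stand-in for the copy into the production bucket
        index_document(entry)  # stand-in for the search index update
        imported.append(key)
    return imported


# Stand-ins that only record what would have happened.
copied: List[str] = []
indexed: List[Dict] = []
manifest = [{"document": "agency/office/20150301/report.pdf", "title": "Annual Report"}]
result = import_documents(manifest, copied.append, indexed.append)
print(result)
```

Copying before indexing means the index never points at a document that is missing from the production bucket, though a real script would also need to decide what to do when either call fails.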