cc-archive / open-ledger

Prototype code and examples for work on the Creative Commons "CC Search" project
MIT License

Creative Commons Search prototype

This is an in-progress prototype for a consolidated "front-door" to the Commons of visual imagery. The project has two near-term goals:

It is not the goal of this project to:

Ancillary benefits of this project may include:

Installation for development

Configuration

Create some local configuration data by copying the example file:

cp openledger/local.py.example openledger/local.py

You will want to set the following settings:

# Make this a long random string
SECRET_KEY = 'CHANGEME'

# Get these from the AWS config for your account
AWS_ACCESS_KEY_ID = 'CHANGEME'
AWS_SECRET_ACCESS_KEY = 'CHANGEME'
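
One way to produce the "long random string" for SECRET_KEY is Python's standard `secrets` module. This is a minimal sketch, not part of the project; the helper name `generate_secret_key` and the default of 50 random bytes are illustrative assumptions.

```python
import secrets


def generate_secret_key(num_bytes: int = 50) -> str:
    """Return a URL-safe random string built from num_bytes of OS randomness.

    Hypothetical helper: the name and the default length are assumptions,
    not something the project defines.
    """
    return secrets.token_urlsafe(num_bytes)


# Print a fresh key to paste into openledger/local.py as SECRET_KEY.
print(generate_secret_key())
```

Run it once and paste the printed value into openledger/local.py; do not commit the result.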

Docker

The easiest way to run the application is through Docker Compose. Install Docker, then run:

docker-compose up

If everything works, this should produce some help output:

docker-compose exec web python3 manage.py

Elasticsearch

Create the Elasticsearch index, named openledger by default. You can change the name in settings/openledger.py.

curl -XPUT 'localhost:9200/openledger?pretty' -H 'Content-Type: application/json' -d '
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}
'
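
The same request can be built from Python, which may be handy when scripting the setup. This is a sketch using only the standard library; the function name `build_create_index_request` is hypothetical, and the host, index name, and shard and replica counts are copied from the curl command above.

```python
import json
import urllib.request


def build_create_index_request(index="openledger", host="localhost:9200",
                               shards=3, replicas=2):
    """Build (but do not send) the PUT request that creates the index.

    Hypothetical helper; defaults mirror the curl example in this README.
    """
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return urllib.request.Request(
        url=f"http://{host}/{index}?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request()
# To send it against a running Elasticsearch:
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the URL and body before touching the cluster.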

PostgreSQL

Set up the database:

docker-compose exec db createdb -U postgres openledger
docker-compose exec web python manage.py migrate
docker-compose exec web python manage.py createcachetable

This should create the database tables. Everything should work locally, though you won't have any content yet. Visit http://localhost:8000 to see the site.

Testing

Verify that the test suite runs:

docker-compose exec web python manage.py test

All tests should always pass. Tests assume that both Postgres and Elasticsearch are running locally.
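
Because the tests need both services, a quick reachability check before running them can save a confusing failure. The sketch below assumes the services listen on localhost at their default ports (5432 for Postgres, 9200 for Elasticsearch) and that those ports are reachable from where the script runs; the helper name `is_port_open` is illustrative.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


# Assumed default ports; adjust if your setup maps them differently.
SERVICES = {"Postgres": 5432, "Elasticsearch": 9200}

for name, port in SERVICES.items():
    status = "up" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} (localhost:{port}): {status}")
```

A successful TCP connection only shows that something is listening; it does not confirm that the service is healthy.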

Travis CI runs the tests automatically on every commit to master. When getting started with the app, it's still a good idea to run the tests locally first to avoid unnecessary pushes to master.

Deployment

Elastic Beanstalk deployment

Install the EC2 keypair associated with the Elastic Beanstalk instance (this will be shared privately among technical staff).

Install the AWS CLI tools: https://aws.amazon.com/cli/

In the openledger directory, run:

eb init

When you are ready to deploy, run the tests first.

If tests pass, commit your changes locally to git.

Then deploy to staging:

eb deploy open-ledger-3

Verify that your changes worked as expected on staging by clicking the thing you changed.

If that works out, deploy to production:

eb deploy open-ledger-prod

Don't forget to push your changes upstream!
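
The "tests first, then staging" part of this routine could be scripted so that a failing test run stops the deploy. The following is a rough sketch, not a project script: `first_failure` and `STAGING_STEPS` are hypothetical names, and the `-T` flag (which disables TTY allocation so `docker-compose exec` works from a non-interactive script) is an addition not shown in the commands above.

```python
import subprocess

# Commands taken from this README; the -T flag is an assumption added so that
# docker-compose exec can run without an interactive terminal.
STAGING_STEPS = [
    ["docker-compose", "exec", "-T", "web", "python", "manage.py", "test"],
    ["eb", "deploy", "open-ledger-3"],
]


def first_failure(steps, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    Returns the failing command, or None if every step succeeded.
    The runner argument exists so the sequence can be exercised without
    actually invoking docker-compose or eb.
    """
    for step in steps:
        result = runner(step)
        if result.returncode != 0:
            return step
    return None
```

Calling `first_failure(STAGING_STEPS)` would run the tests and deploy to staging only if they pass. Production deployment is left out on purpose, since it follows a manual check of staging.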

EC2 Data Loader

At times it will be necessary to spin up purpose-built EC2 instances to perform one-off tasks such as large data-loading jobs.

Fabric is set up to do a limited amount of management of these instances. You'll need SSH keys that are registered with AWS:

fab launchloader

This will spin up a single instance of INSTANCE_TYPE, provision its packages, and install the latest version of the code from GitHub (make sure local changes are pushed!).

The code will expect a number of environment variables to be set, including:

export OPEN_LEDGER_LOADER_AMI="XXX" # The AMI name
export OPEN_LEDGER_LOADER_KEY_NAME="XXX" # An SSH key name registered with Amazon
export OPEN_LEDGER_LOADER_SECURITY_GROUPS="default,open-ledger-loader"
export OPEN_LEDGER_REGION="us-west-1"
export OPEN_LEDGER_ACCOUNT="XXX"  # The AWS account for CC
export OPEN_LEDGER_ACCESS_KEY_ID="XXX" # Use an IAM that can reach these hosts, like 'cc-openledger'
export OPEN_LEDGER_SECRET_ACCESS_KEY="XXX"

...and most of the same Django-level configuration variables expected in local.py.example. These values can be extracted from the Elastic Beanstalk config by using the AWS console.
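
Since a missing variable tends to surface only after an instance has already launched, it can help to check the environment up front. This is a minimal sketch; `REQUIRED_LOADER_VARS` and `missing_loader_vars` are hypothetical names, and the list covers only the variables shown above, which the text describes as a partial list.

```python
import os

# Only the variables listed in this README; the real set may be larger.
REQUIRED_LOADER_VARS = [
    "OPEN_LEDGER_LOADER_AMI",
    "OPEN_LEDGER_LOADER_KEY_NAME",
    "OPEN_LEDGER_LOADER_SECURITY_GROUPS",
    "OPEN_LEDGER_REGION",
    "OPEN_LEDGER_ACCOUNT",
    "OPEN_LEDGER_ACCESS_KEY_ID",
    "OPEN_LEDGER_SECRET_ACCESS_KEY",
]


def missing_loader_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_LOADER_VARS if not environ.get(name)]


missing = missing_loader_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All listed loader variables are set.")
```

Running this before `fab launchloader` reports any gaps without contacting AWS.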

Open Images dataset

To include the Google-provided Open Images dataset from https://github.com/openimages/dataset, you can either download the files locally (faster) or use the versions on the CC S3 bucket (used by the AWS deployments).

  1. Download the files linked as:
  1. Run the database import script as a Django management command:

The script expects:

. venv/bin/activate

./manage.py loader /path/to/openimages/images_2016_08/validation/images.csv openimages images
./manage.py loader /path/to/openimages/dict.csv openimages tags
./manage.py loader /path/to/openimages/human_ann_2016_08/validation/labels.csv openimages image-tags

(This loads the smaller "validation" subset; the "train" files are the full 9 million set.)

This loader is invoked in production using the Fabric task, above:

fab launchloader --set datasource=openimages-small

See fabfile.py for complete documentation on loader tasks, including loading of other image sets.

  1. Index the newly imported data in Elasticsearch.
    ./manage.py indexer