NewSpring / Heighliner

A GraphQL Server for NewSpring Web

Data Processing for Heighliner #109

Closed johnthepink closed 7 years ago

johnthepink commented 8 years ago

Data Processing for Heighliner

Heighliner currently functions really well as a uniform way to access data from all of our data stores. However, sometimes it will be necessary to make additional modifications to data beyond resolving it to a predictable schema. These types of modifications are additions or supplements to the existing data stores that are necessary for applications querying Heighliner, but are not essential to the data stores/applications being queried.

There are currently two cases we need to handle for data processing, both of which I think we can leverage AWS Lambda functions for.

  1. updating existing stores with new information
  2. triggering data processing from Heighliner

Updating

Updating an existing data store shouldn't require Heighliner's direct involvement. This type of processing is needed when a new requirement is added to an existing data store and old data needs to be updated to meet that requirement. It should basically take the form of a one-time script that updates all of the old data entries.

Triggering

There may also be instances where putting a new requirement on an existing data store is not possible/ideal. In this case, Heighliner itself will need to trigger the processing of the data, and handle the logic behind providing this data to the requesting application. This processing should not block the response to the application, but may initially return data that is incomplete or less than ideal. That way the application can go about its business. The processing should also be done external to the Heighliner application, so as not to consume resources allocated to Heighliner's main concern of responding to client requests.

Triggering this type of processing may look like this (a rough sketch follows the list):

  1. Application queries Heighliner
  2. Heighliner fetches data available from stores
  3. Heighliner determines data fetched is inadequate for the application making the request
  4. Heighliner responds to request with data available
  5. Heighliner triggers external processing action
  6. Action provides better data via callback, or by uploading to some store
  7. Heighliner provides better data in the future
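
Steps 3 through 6 are the interesting part: respond with what we have, and fire the processing off without waiting on it. A rough sketch of that shape, assuming the Node AWS SDK and treating the function name, payload, and the isAdequate check as placeholders:

```js
// Sketch only: nothing here is wired into a real Heighliner resolver yet.
const AWS = require("aws-sdk");

const lambda = new AWS.Lambda({ region: "us-east-1" });

async function resolveWithProcessing(fetchFromStores, isAdequate, args) {
  // 2. fetch whatever the stores have right now
  const data = await fetchFromStores(args);

  // 3-5. if the data isn't good enough, respond with it anyway and kick off
  // external processing without blocking the response
  if (!isAdequate(data)) {
    lambda.invoke({
      FunctionName: "heighliner-process-data", // hypothetical function name
      InvocationType: "Event",                 // async invoke, fire-and-forget
      Payload: JSON.stringify({ args }),
    }, (err) => {
      if (err) console.error("failed to trigger processing", err);
    });
  }

  // 4/7. respond with what we have; better data shows up on later requests
  return data;
}
```

The "Event" invocation type is what keeps step 5 from blocking step 4.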

Actions can probably be handled using AWS Lambda functions. This will require adding a way for Heighliner to trigger these actions. After the Lambda function finishes, it should make the newly generated data available. Because we are assuming that updating the existing data store isn't possible/ideal, we will need another way to make this data available. This may be a good case for a Mongo collection. Files will need to be uploaded to S3, but querying S3 directly doesn't sound ideal. So, files will probably need to live in S3 with pointers in Mongo.
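
To make the "pointers in Mongo" idea concrete, the record could be as small as an S3 location keyed by the source entry; everything below (collection shape, bucket, key layout) is illustrative, not decided:

```js
// One possible shape for a pointer record in a Mongo collection
// (collection and field names are illustrative, nothing is decided yet)
const pointer = {
  sourceId: "12345",                          // entry_id in the original store (e.g. EE)
  bucket: "heighliner-processed",             // hypothetical S3 bucket for generated files
  key: "processed/12345/some-generated-file", // S3 object key for the generated file
  createdAt: new Date(),
};

// At query time Heighliner would look this record up and build an S3 URL
// from bucket + key instead of listing or searching S3 directly.
```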

Use Cases

Sermon Audio

I think updating the existing data store is the best solution for this case. We will need to:

  1. Add a new duration field to EE
  2. Write SQL that pulls all sermons from EE
  3. Iterate over the results and trigger a Lambda function (sketched after this list) that:
    1. downloads the sermon audio
    2. runs an mp3 duration calculation (since there is no metadata)
    3. updates EE with the duration
    4. dumps basic errors (visible in the Lambda interface)
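
A minimal sketch of that Lambda, using the mp3-duration package from the resources below; the event shape ({ entryId, audioUrl }) and the updateDuration helper are assumptions rather than an existing contract:

```js
const https = require("https");
const mp3Duration = require("mp3-duration");

// placeholder: swap in the real EE update (SQL via mysql/Sequelize, or an API call)
function updateDuration(entryId, duration) {
  console.log("would set duration for entry " + entryId + " to " + duration + "s");
  return Promise.resolve();
}

exports.handler = (event, context, callback) => {
  const entryId = event.entryId;
  const audioUrl = event.audioUrl;

  // 1. download the sermon audio into memory
  https.get(audioUrl, (res) => {
    const chunks = [];
    res.on("data", (chunk) => chunks.push(chunk));
    res.on("end", () => {
      // 2. the files have no duration metadata, so decode and measure
      mp3Duration(Buffer.concat(chunks), (err, duration) => {
        if (err) return callback(err); // 4. errors land in the Lambda logs

        // 3. write the duration (in seconds) back to the new EE field
        updateDuration(entryId, duration)
          .then(() => callback(null, { entryId: entryId, duration: duration }))
          .catch(callback);
      });
    });
  }).on("error", callback);
};
```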

Adding the new field should be trivial, but should be done first. Writing the SQL can be done in isolation, and we probably already have that somewhere. I'm not sure where this script needs to be run. We shouldn't need to wait for each Lambda function to finish before running the next, so I think we can run the script and trigger the Lambda functions all at once from a local machine. It would be nice if this were able to handle our future needs for updating EE as well. So, maybe set up a small Node program that runs Sequelize and triggers the Lambda functions over the AWS API. Running this program might look like:

heighliner-migration name-of-action
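
Internally, that command could be little more than a Sequelize query feeding async Lambda invocations; the credentials, the query, and the function name below are all placeholders:

```js
// Rough sketch of `heighliner-migration`: query EE with Sequelize, then trigger
// one async Lambda invocation per row.
const Sequelize = require("sequelize");
const AWS = require("aws-sdk");

const sequelize = new Sequelize(process.env.EE_DB, process.env.EE_USER, process.env.EE_PASS, {
  host: process.env.EE_HOST,
  dialect: "mysql",
});
const lambda = new AWS.Lambda({ region: "us-east-1" });

// trimmed-down version of the sermon query later in this thread
const SQL = "SELECT entry_id, field_id_675 AS audio FROM exp_channel_data WHERE channel_id IN (3, 61)";

sequelize.query(SQL, { type: Sequelize.QueryTypes.SELECT })
  .then((rows) => {
    rows.forEach((row) => {
      // InvocationType "Event" is fire-and-forget, so we don't wait on each call
      lambda.invoke({
        FunctionName: "heighliner-sermon-duration", // hypothetical function name
        InvocationType: "Event",
        Payload: JSON.stringify({ entryId: row.entry_id, audioUrl: row.audio }),
      }, (err) => {
        if (err) console.error("failed to trigger for entry", row.entry_id, err);
      });
    });
    console.log("triggered " + rows.length + " lambda invocations");
  })
  .catch((err) => {
    console.error(err);
    process.exit(1);
  });
```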

The Lambda Function should be able to handle the influx of requests without issue. We can write this in Node, test it locally, and deploy it. Resources:

  1. Node Lambda Template: https://github.com/motdotla/node-lambda-template
  2. Working lambda function using the template: https://github.com/johnthepink/bitminter-lambda
  3. MP3 duration package (used currently in Heighliner alpha): https://github.com/ddsol/mp3-duration
  4. Sequelize for getting sermons: http://docs.sequelizejs.com/en/latest/
  5. Node AWS SDK: https://aws.amazon.com/sdk-for-node-js/

Image Processing

We need to provide compressed images optimized for mobile and other sizes. Expression Engine has a method for providing compressed images, but it lives at the template layer of the application. These images are uploaded to S3 using a naming convention, and are not stored in the database anywhere. We could get into the business of listing and searching S3 for these images, but that feels dirty to me.

For this case, I think we could go the route of updating, triggering, or a combination. It really depends on when we want the images generated, and what should trigger the processing.

Possible places for triggering image processing:

  1. On expression engine upload
  2. On image added to S3 bucket
  3. When Heighliner doesn't receive necessary images

On expression engine upload

Triggering image processing on Expression Engine upload would require tying into the CEImage plugin. From there, we could take the necessary steps to produce the images we need by triggering a Lambda function, uploading to S3, and storing a pointer in Mongo. This would also require us to back process our existing images. I don't think this is the best place to trigger the action.

On image added to S3 bucket

Lambda can trigger an action when something is added to an S3 bucket. So, we could watch a bucket for new images, and then trigger a Lambda function that processes them, uploads the results to S3, and stores a pointer in Mongo. Taking this route will also require us to back process our existing images.
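
A sketch of what that S3-triggered function might look like, using sharp purely as an example resizer; the output bucket, sizes, and the Mongo pointer write are still open questions:

```js
const AWS = require("aws-sdk");
const sharp = require("sharp");

const s3 = new AWS.S3();

exports.handler = (event, context, callback) => {
  // standard S3 event shape: one record per object added to the watched bucket
  const record = event.Records[0].s3;
  const bucket = record.bucket.name;
  const key = decodeURIComponent(record.object.key.replace(/\+/g, " "));

  s3.getObject({ Bucket: bucket, Key: key }).promise()
    .then((obj) => sharp(obj.Body).resize(640).toBuffer()) // 640px wide, for example
    .then((resized) => s3.putObject({
      Bucket: "heighliner-processed-images", // hypothetical output bucket
      Key: "640/" + key,
      Body: resized,
      ContentType: "image/jpeg",
    }).promise())
    // a pointer record for the new object would get written to Mongo here
    .then(() => callback(null, { processed: key }))
    .catch(callback);
};
```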

When Heighliner doesn't receive necessary images

When Heighliner receives a request for images, and determines it doesn't have the best images, we could trigger a Lambda function at that point, upload to S3, and store in Mongo. This would not require back processing, which I'm not sure is good or bad at this point.

johnthepink commented 8 years ago

The plan for now is this:

  1. create a simple Node CLI for querying data stores and triggering Lambda functions
  2. require sermon audio duration going forward, and back process using Lambda
  3. trigger the image processing Lambda function when EE adds an image to the S3 bucket, and back process using the same Lambda function

delianides commented 8 years ago

@johnthepink Are you using EE to trigger lambda?

jbaxleyiii commented 8 years ago

@delianides the thought is to use S3 to trigger it

johnthepink commented 8 years ago

@delianides Lambda lets you trigger functions when something is added to an S3 bucket, so the idea is to watch the bucket that the raw files are being uploaded to by EE, and trigger the lambda function that way

delianides commented 8 years ago

Ok, good. That's what I thought you meant. Just wanted to be sure.

jbaxleyiii commented 8 years ago

@johnthepink 😽

SELECT
  d.entry_id,
  c.channel_id,
  d.field_id_675 as audio,
  d.field_id_1554 as sermon_audio
FROM
  exp_channel_data as d
LEFT JOIN
  exp_channels as c
    ON d.channel_id = c.channel_id
WHERE
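    -- channel 3 = newspring sermons, channel 61 = fuse sermons
    -- only rows where the sermon_audio field (field_id_1554) is still empty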
    (c.channel_id = 3 OR c.channel_id = 61) AND d.field_id_1554 = '';

c.channel_id = 3 is newspring sermons; c.channel_id = 61 is fuse sermons.

johnthepink commented 8 years ago