
ETL-LambdaBuilder

CDM Builder leveraging AWS Lambda

Getting Started

The ETL-LambdaBuilder consists of two AWS Lambda functions and the ETL command-line tool.

Prerequisites:

  • Visual Studio with the AWS Toolkit installed
  • The .NET 8 SDK
  • An AWS account with access to Lambda and S3

Publish to Lambda

  1. Open org.ohdsi.cdm.sln with Visual Studio

  2. Set up your AWS profile in AWS Explorer (Menu: View -> AWS Explorer). If AWS Explorer is missing, install the AWS Toolkit for Visual Studio first


  3. In Solution Explorer, right-click the org.ohdsi.cdm.presentation.lambdabuilder project and select Publish to AWS Lambda...


  4. Use the following settings:

    • Runtime: .NET 8
    • Architecture: ARM
    • Function name: CDMBuilder
    • Handler: org.ohdsi.cdm.presentation.lambdabuilder::org.ohdsi.cdm.presentation.lambdabuilder.Function::FunctionHandler


Publish the Merge function (the org.ohdsi.cdm.presentation.lambdamerge project) in the same way.

  5. Upload the functions to AWS (a command-line alternative is sketched below)
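
As an alternative to the Visual Studio publish dialog, the deployment can also be scripted with Amazon.Lambda.Tools. This is a minimal sketch, not taken from the original instructions: the IAM role my-lambda-role is a placeholder, and the Merge function can be deployed the same way using the lambdamerge project and handler.

# One-time install of the .NET Lambda CLI tool
dotnet tool install -g Amazon.Lambda.Tools

# Run from the org.ohdsi.cdm.presentation.lambdabuilder project folder
dotnet lambda deploy-function CDMBuilder \
  --function-runtime dotnet8 \
  --function-architecture arm64 \
  --function-handler org.ohdsi.cdm.presentation.lambdabuilder::org.ohdsi.cdm.presentation.lambdabuilder.Function::FunctionHandler \
  --function-role my-lambda-role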

Configuring CDMBuilder Lambda function

  1. Open the AWS Console, Lambda functions page
  2. Open the CDMBuilder function
  3. Add an S3 trigger
    • Bucket: use a dedicated trigger bucket; every PUT event in this bucket will invoke your function
    • Event types: s3:ObjectCreated:Put


  4. Set up the environment variables
  5. Set up the general configuration; the settings depend on the source data set, recommended settings below (a CLI sketch follows):

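
Parts of this setup can also be scripted with the AWS CLI; a hedged sketch follows. The permission grant is a prerequisite for the S3 trigger, the memory size follows the 3000 MB figure used in the chunk-size formula later in this document, and the 900-second timeout is an assumption. The trigger rule itself is shown after the Merge section, because both triggers share the bucket's single notification configuration.

# Allow S3 to invoke the function (my-trigger-bucket is a placeholder)
aws lambda add-permission \
  --function-name CDMBuilder \
  --statement-id s3-put-trigger \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::my-trigger-bucket

# General configuration: memory per the chunk-size formula; timeout assumed
aws lambda update-function-configuration \
  --function-name CDMBuilder \
  --memory-size 3008 \
  --timeout 900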

Configuring Merge Lambda function

  1. Open the Merge function

  2. Add an S3 trigger

    • Bucket: use the CDMBuilder trigger bucket; keep this trigger bucket separate from the data buckets, since every PUT event in it invokes a function
    • Event types: s3:ObjectCreated:Put
    • Prefix: merge/
  3. Set up the environment variables

    • Bucket: bucket for result
    • CDMFolder: cdmCSV
    • ResultFolder: cdmCSV
    • S3AwsAccessKeyId: AccessKeyId for result bucket
    • S3AwsSecretAccessKey: SecretAccessKey for result bucket
  4. Set up the general configuration; the recommended settings depend on the source data set (see the CLI sketch after this list):
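
A hedged AWS CLI equivalent of the Merge setup (bucket names, keys, region, and account id are placeholders). Note that put-bucket-notification-configuration replaces the bucket's entire notification document, so the CDMBuilder and Merge rules must be submitted together; if S3 rejects them as overlapping, give the CDMBuilder rule a prefix of its own. Merge also needs an add-permission grant like the one shown for CDMBuilder.

# Environment variables for the Merge function
aws lambda update-function-configuration \
  --function-name Merge \
  --environment 'Variables={Bucket=my-result-bucket,CDMFolder=cdmCSV,ResultFolder=cdmCSV,S3AwsAccessKeyId=AKIA...,S3AwsSecretAccessKey=...}'

# Both PUT triggers on the shared trigger bucket; Merge fires only under merge/
aws s3api put-bucket-notification-configuration \
  --bucket my-trigger-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [
      {
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:CDMBuilder",
        "Events": ["s3:ObjectCreated:Put"]
      },
      {
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:Merge",
        "Events": ["s3:ObjectCreated:Put"],
        "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "merge/"}]}}
      }
    ]
  }'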

Publish command line tool

  1. In Solution Explorer, right-click the org.ohdsi.cdm.presentation.etl project and select Publish...


  2. Configure the publish settings; the settings shown below are one working combination, others may be used


Run ETL conversion

  1. Open the command-line tool folder
  2. Update the appsettings.json file with your settings:
{
  "AppSettings": {
    "s3_aws_access_key_id": "your result bucket access_key_id",
    "s3_aws_secret_access_key": "your result bucket secret_access_key",
    "bucket": "your result bucket",

    "s3_messages_access_key_id": "your trigger bucket access_key_id",
    "s3_messages_secret_access_key": "your trigger bucket secret_access_key",
    "messages_bucket": "your trigger bucket",
    "messages_bucket_merge": "your trigger bucket/merge",

    "raw_folder": "raw",
    "cdm_folder_csv": "cdmCSV",
    "cdm_folder": "cdmCSV",
    "vendor_settings": "",

    "spectrum_cluster": "",
    "spectrum_db": "",

    "iam_role": "",

    "parallel_queries": "1",
    "parallel_chunks": "5",

    "local_path": ""
  },
  "ConnectionStrings": {
    "Builder": "Data Source=builder.db",
    "Vocabulary": "Connection string to database that contains OHDSI Vocabulary tables (concept, concept_ancestor, source_to_concept_map...)",
    "Source": "Connection string to source database"
  }
}

Connection string example:

Driver={Amazon Redshift (x64)};Server=your server name;Database={db};Sc={sc};UID=your user;PWD=your pswd;Port=5439;SSL=true;Sslmode=Require;UseDeclareFetch=1;Fetch=10000000;UseMultipleStatements=1

The {db} and {sc} placeholders are replaced at tool startup with the corresponding command-line parameters (--rawdb and --rawschema).
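
For example, with the parameters used in the run command below (--rawdb ccae --rawschema native), the effective connection string contains Database=ccae and Sc=native.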

  3. Run org.ohdsi.cdm.presentation.etl with the appropriate parameters.

Example:

org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ccae  --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false  --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true

Batch size (size of the chunk)

Lambda functions are limited in memory, so the ETL tool splits the source data into chunks; each chunk is further divided into smaller parts according to the number of slices in the source Redshift cluster.

To check the number of slices, use the query below:

select node, slice from stv_slices;

The chunk size depends on the source dataset and the number of slices. A larger chunk gives better performance but can cause an out-of-memory error in the Lambda function; if that happens, reduce the chunk size or increase the Lambda memory.

The approximate chunk size (for a CDMBuilder function with 3000 MB of memory) can be calculated with this formula:

batch = number of slices * 100,000
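
For example, a source cluster with 16 slices gives batch = 16 * 100,000 = 1,600,000, and the --batch 500000 used in the example commands corresponds to a 5-slice cluster under this rule of thumb.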

A couple of examples of conversion time depending on cluster type for IBM CCAE:

ETL tool parameters example:

org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ibm_ccae --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false  --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true

Lambda function log output

When the chunk data becomes available on S3, the ETL tool triggers the Lambda function by creating N files in the S3 trigger bucket, where the number of files (and therefore function invocations) equals the number of slices. The ETL output shows messages like the following:

...
[Moving raw data] Lambda functions for chunkId=21 were triggered | 24 functions
...

You can check the log of each Lambda function in Amazon CloudWatch; the CDMBuilder and Merge functions each have their own log group. The file that triggered a Lambda invocation is deleted automatically if the conversion succeeds.

The ETL tool output also reports the total number of chunks, for example:

...
***** Chunk ids were generated and saved, total count=36 *****
...

and current progress, for example:

*** 6:17 AM | Running: lambda 7 local 0; Validating 0; Valid 16; Timeout 0; Invalid 0; Error 0; Total 23

Export CDM from s3 to Redshift

  1. Create the CDM database DDL
  2. Use the following template to move data from S3 into a Redshift table:
copy "schema_name"."cdm_table" 
from 's3://your_result_bucket/vendor_name/conversionId/cdmCSV/cdm_table/cdm_table.' 
credentials 'aws_access_key_id=your_result_bucket_access_key_id;aws_secret_access_key=your_result_bucket_secret_access_key' DELIMITER ',' CSV QUOTE AS '"' GZIP
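
As an illustration, assuming the conversionId path segment corresponds to the --versionid parameter and "cdm" stands in for your schema name, loading the person table for the ccae example run above could look like:

copy "cdm"."person" 
from 's3://your_result_bucket/ccae/1/cdmCSV/person/person.' 
credentials 'aws_access_key_id=your_result_bucket_access_key_id;aws_secret_access_key=your_result_bucket_secret_access_key' DELIMITER ',' CSV QUOTE AS '"' GZIP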

Important