CDM Builder leveraging AWS Lambda
The ETL-LambdaBuilder consists of two AWS Lambda functions (CDMBuilder and Merge) and the ETL command line tool.
Prerequisites: Visual Studio with the AWS Toolkit for Visual Studio, an AWS profile with access to Lambda and S3, a source database (e.g. Amazon Redshift), and a database containing the OHDSI vocabulary tables.
Open org.ohdsi.cdm.sln with Visual Studio
In AWS Explorer, set up your AWS profile (Menu: View -> AWS Explorer). If the AWS Toolkit for Visual Studio is not installed, install it first.
In the Solution Explorer, right-click the org.ohdsi.cdm.presentation.lambdabuilder project and click Publish to AWS Lambda...
Use the following settings:
Publish the Merge function, the org.ohdsi.cdm.presentation.lambdamerge project, in the same way.
S3AwsSecretAccessKey: the SecretAccessKey for the result bucket
Open the Merge function
Add an S3 trigger
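If you prefer to wire the trigger from the command line rather than the console, a rough AWS CLI sketch could look like the following (the function name, bucket name, the region/account in the ARN, and the merge/ prefix are assumptions; the prefix mirrors the messages_bucket_merge setting shown below):
aws lambda add-permission --function-name your-merge-function --statement-id s3-merge-trigger --action lambda:InvokeFunction --principal s3.amazonaws.com --source-arn arn:aws:s3:::your-trigger-bucket
aws s3api put-bucket-notification-configuration --bucket your-trigger-bucket --notification-configuration '{"LambdaFunctionConfigurations":[{"LambdaFunctionArn":"arn:aws:lambda:us-east-1:123456789012:function:your-merge-function","Events":["s3:ObjectCreated:*"],"Filter":{"Key":{"FilterRules":[{"Name":"prefix","Value":"merge/"}]}}}]}'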
Set up the environment variables
Set up the general configuration; the settings depend on the source data set. Recommended settings are below:
{
  "AppSettings": {
    "s3_aws_access_key_id": "your result bucket access_key_id",
    "s3_aws_secret_access_key": "your result bucket secret_access_key",
    "bucket": "your result bucket",
    "s3_messages_access_key_id": "your trigger bucket access_key_id",
    "s3_messages_secret_access_key": "your trigger bucket secret_access_key",
    "messages_bucket": "your trigger bucket",
    "messages_bucket_merge": "your trigger bucket/merge",
    "raw_folder": "raw",
    "cdm_folder_csv": "cdmCSV",
    "cdm_folder": "cdmCSV",
    "vendor_settings": "",
    "spectrum_cluster": "",
    "spectrum_db": "",
    "iam_role": "",
    "parallel_queries": "1",
    "parallel_chunks": "5",
    "local_path": ""
  },
  "ConnectionStrings": {
    "Builder": "Data Source=builder.db",
    "Vocabulary": "Connection string to database that contains OHDSI Vocabulary tables (concept, concept_ancestor, source_to_concept_map...)",
    "Source": "Connection string to source database"
  }
}
Connection string example:
Driver={Amazon Redshift (x64)};Server=your server name;Database={db};Sc={sc};UID=your user;PWD=your pswd;Port=5439;SSL=true;Sslmode=Require;UseDeclareFetch=1;Fetch=10000000;UseMultipleStatements=1
The {db} and {sc} parameters are replaced with values from the command line parameters at tool startup.
Example:
org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ccae --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true
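With the parameters above, the connection string would resolve to something like the following (a sketch, assuming --rawdb fills {db} and --rawschema fills {sc}):
Driver={Amazon Redshift (x64)};Server=your server name;Database=ccae;Sc=native;UID=your user;PWD=your pswd;Port=5439;SSL=true;Sslmode=Require;UseDeclareFetch=1;Fetch=10000000;UseMultipleStatements=1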
Lambda functions are limited in memory; for this reason the ETL tool splits the source data into chunks, and each chunk is further divided into smaller parts according to the number of slices in the source Redshift cluster.
To check the number of slices, use the query below:
select node, slice from stv_slices;
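If you only need the total number of slices (the value used in the batch formula below), count the rows of the same system view:
select count(*) as slices from stv_slices;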
The chunk size depends on the source dataset and the number of slices. A larger chunk size gives better performance, but can cause an out-of-memory error in the Lambda function; in that case, reduce the chunk size or increase the Lambda memory.
The approximate chunk size (for a CDMBuilder function with 3000 MB of memory) can be calculated with this formula:
batch = number of slices * 100,000
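For example, on a cluster reporting 16 slices the formula gives batch = 16 * 100,000 = 1,600,000, i.e. --batch 1600000 on the ETL tool command line (the 16-slice cluster is only an illustration).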
A couple of examples of conversion time depending on cluster type for IBM CCAE:
ETL tool parameters example:
org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ibm_ccae --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true
When the chunk data is available on S3, the ETL tool triggers the Lambda functions by creating N files in the S3 trigger bucket; the number of functions/files equals the number of slices. The ETL output shows messages like the one below:
...
[Moving raw data] Lambda functions for chunkId=21 were triggered | 24 functions
...
You can check the log of each Lambda function with Amazon CloudWatch; the CDMBuilder and Merge functions each have their own log group. The file that triggered the Lambda is automatically removed if the conversion was successful.
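For example, with AWS CLI v2 you can tail a function's log group from a terminal (the function name below is a placeholder for whatever name you chose when publishing):
aws logs tail /aws/lambda/your-cdmbuilder-function --follow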
The ETL tool output also reports the total number of chunks, like below:
...
***** Chunk ids were generated and saved, total count=36 *****
...
and current progress, for example:
*** 6:17 AM | Running: lambda 7 local 0; Validating 0; Valid 16; Timeout 0; Invalid 0; Error 0; Total 23
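Once the chunks are merged, the gzipped CSV files under the cdmCSV prefix of the result bucket can be loaded into Redshift with a COPY command along these lines (the schema, table, and bucket names below are placeholders):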
copy "schema_name"."cdm_table"
from 's3://your_result_bucket/vendor_name/conversionId/cdmCSV/cdm_table/cdm_table.'
credentials 'aws_access_key_id=your_result_bucket_access_key_id;aws_secret_access_key=your_result_bucket_secret_access_key' DELIMITER ',' CSV QUOTE AS '"' GZIP
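Repeat the COPY for each CDM table produced under the cdmCSV folder, substituting the table name in both the target table and the S3 prefix.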