The carbon data lake guidance with sample code implements a foundational data lake (and ingestion and processing framework) using the AWS Cloud Development Kit (AWS CDK). The deployed asset provides the base infrastructure for customers and partners to build their carbon accounting use cases.
Note: This solution by itself will not make a customer compliant with any end to end carbon accounting solution. It provides the foundational infrastructure from which additional complementary solutions can be integrated.
The carbon data lake reduces the undifferentiated heavy lifting of ingesting, standardizing, transforming, and calculating greenhouse gas emission data in carbon dioxide equivalent (CO2eq). Customers can use this guidance with sample code to advance their starting point for building decarbonization reporting, forecasting, and analytics solutions and/or products. The carbon data lake includes a purpose-built data pipeline, data quality module, data lineage module, emissions calculator microservice, business intelligence services, prebuilt forecasting machine learning notebook and compute service, GraphQL API, and sample web application.
Customer emissions data (such as databases, historians, existing data lakes, internal/external APIs, images, CSVs, JSON, IoT/sensor data, and third-party applications including CRMs, ERPs, MES, and more) can be mapped to the standard CSV format to support centralization and processing of customer carbon data. Carbon data is ingested through the carbon data lake landing zone, and can be ingested from any service within or connected to the AWS Cloud. The emissions calculator can be deployed with a sample emissions factor model, or it can be modified or augmented with your own standards lookup tables and calculator logic.
This guidance with sample code provides core functionality to accelerate data ingestion, processing, calculation, storage, analytics and insights. The following list outlines the current capabilities of the carbon data lake. Please submit a PR to request additional capabilities and features. We appreciate your feedback as we continue to improve this offering.
The following list covers current capabilities as of today:
Deploying this repository with default parameters builds the following carbon data lake environment in the AWS Cloud.
Figure 1: Solution Architecture Diagram
As shown in Figure 1: Solution Architecture Diagram, this guidance with sample code sets up the following application stacks:
The shared resource stack deploys all cross-stack referenced resources, such as S3 buckets and Lambda functions, that are built as dependencies.
Review the Shared Resources Stack and Stack Outputs
The carbon data lake data pipeline is an event-driven Step Functions Workflow triggered by each upload to the carbon data lake landing zone S3 bucket. The data pipeline performs the following functions:
Review the Data Pipeline Stack, README, and Stack Outputs
The carbon emissions calculator microservice comes with a pre-seeded Amazon DynamoDB reference table. This data model directly references the sample emissions factor model provided for development purposes. The sample data model is adapted from the World Resources Institute (WRI) GHG Protocol Guidance. Please consult the WRI guidance to confirm the most up-to-date information and versions.
The sample provided is for development purposes only, and it is recommended that carbon data lake users modify this JSON document and/or create their own using a similar format. Please modify the provided data model when deploying your own application using the instructions found in the Setup section.
A pre-built AWS AppSync GraphQL API provides flexible querying for application integration. This GraphQL API is authorized using Amazon Cognito User Pools and comes with a predefined Admin and Basic User role. This GraphQL API is used for integration with the carbon data lake AWS Amplify Sample Web Application.
Review the AppSync GraphQL API Stack, Documentation, and Stack Outputs
An AWS Amplify application can optionally be deployed and hosted via Amazon CloudFront and AWS Amplify. The AWS Amplify Web Application depends on the core carbon data lake components, so complete a successful carbon data lake application deployment before you review the web application deployment steps.
Review the Web Application Stack and Stack Outputs.
An Amazon QuickSight stack can be deployed optionally with pre-built visualizations for Scope 1, 2, and 3 emissions. This stack requires additional manual setup in the AWS console detailed in this guide.
Review the Amazon QuickSight Stack
A pre-built machine learning notebook is deployed on an Amazon SageMaker Notebook EC2 instance with .ipynb and pre-built prompts and functions.
Review the SageMaker Notebook Instance Stack.
The carbon data lake guidance with sample code comes with sample data for testing successful deployment of the application and can be found in the sample-data directory.
You are responsible for the cost of the AWS services used while running this reference deployment. There is no additional cost for using this guidance with sample code.
The AWS CDK stacks for this repository include configuration parameters that you can customize. Some of these settings, such as instance type, affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you use. Prices are subject to change.
Tip: After you deploy the repository, create AWS Cost and Usage Reports to track costs associated with the guidance with sample code. These reports deliver billing metrics to an S3 bucket in your account. They provide cost estimates based on usage throughout each month and aggregate the data at the end of the month. For more information, see What are AWS Cost and Usage Reports?
This application doesn't require any software license or AWS Marketplace subscription.
You can deploy the carbon data lake guidance with sample code through the manual setup process using AWS CDK. We recommend using an AWS Cloud9 instance in your AWS account, or VS Code and the AWS CLI. We also generally recommend a fresh AWS account that can be integrated with your existing infrastructure using AWS Organizations.
The aws-cli must be installed -and- configured with an AWS account on the deployment machine (see https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html for instructions on how to do this on your preferred development platform).
This project requires Node.js. To make sure you have it available on your machine, try running the following command.
node -v
For best experience we recommend installing CDK globally:
npm install -g aws-cdk
This repository has been developed using architectural and security best practices as defined by AwsSolutions CDK Nag Pack. CDK Nag provides integrated tools for automatically reviewing infrastructure for common security, business, and architectural best practices.
This repository comes with AwsSolutions CDK Nag Pack pre-configured and enabled by default. This means that any changes to existing code or deployments will be automatically checked for architectural and development best practices as defined by the AwsSolutions CDK Nag Pack. You can disable this feature in cdk.context.json by switching the nagEnabled flag to false.
As part of the shared responsibility model for security we recommend taking additional steps within your AWS account to secure this application. We recommend you implement the following AWS services once your application is in production:
Navigate to the desired parent directory and clone the repository
git clone #insert-http-or-ssh-for-this-repository
Navigate to the repository directory
cd <insert path to parent repository>/guidance-for-carbon-lake-on-aws
aws configure
- Copy cdk.context.template.json to cdk.context.json, or remove .template from the file name
- Enter your parameters in cdk.context.json (see Context Parameters below)
- To use your own emissions factor model, update the generateItem method and the IDdbEmissionFactor interface found in the Calculator Construct

Before deployment navigate to cdk.context.json and update the required context parameters, which include adminEmail and repoBranch. Review the optional and required context variables below.
- adminEmail: The email address for the administrator of the app
- repoBranch: The branch to deploy in your pipeline (default is /main)
- quicksightUserName: Username for access to the carbon emissions dataset and dashboard.
- deployQuicksightStack: Determines whether this stack is deployed. Default is false, change to true if you want to deploy this stack.
- deploySagemakerStack: Determines whether this stack is deployed. Default is false, change to true if you want to deploy this stack.
- deployWebStack: Determines whether this stack is deployed. Default is false, change to true if you want to deploy this stack.
- nagEnabled: Enables cdk_nag audit tool. Default is true. Change to false if you want to disable.

QuickSight Note: If you choose to deploy the optional QuickSight Module make sure you review QuickSight setup instructions
Web Application Note: If you choose to deploy the optional Web Module make sure you review web application setup instructions
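The context parameters above can be checked before running a deployment. The following is a minimal sketch, not part of the repository: the CdlContext interface and the checkContext function are hypothetical names, the boolean types are an assumption about how the flags are stored in cdk.context.json, and the rule that quicksightUserName is needed when the QuickSight stack is enabled is also an assumption.

```typescript
// Hypothetical shape of cdk.context.json, based only on the parameters listed above.
interface CdlContext {
  adminEmail?: string;
  repoBranch?: string;
  quicksightUserName?: string;
  deployQuicksightStack?: boolean;
  deploySagemakerStack?: boolean;
  deployWebStack?: boolean;
  nagEnabled?: boolean;
}

// Returns a list of human-readable problems; an empty list means the context looks usable.
function checkContext(ctx: CdlContext): string[] {
  const problems: string[] = [];
  // Required parameter: an "@" must appear after at least one character.
  if (!ctx.adminEmail || ctx.adminEmail.indexOf("@") < 1) {
    problems.push("adminEmail must be set to an email address");
  }
  // Required parameter.
  if (!ctx.repoBranch) {
    problems.push("repoBranch must be set (for example /main)");
  }
  // Assumption: the QuickSight stack needs a user name when it is enabled.
  if (ctx.deployQuicksightStack === true && !ctx.quicksightUserName) {
    problems.push("quicksightUserName is needed when deployQuicksightStack is true");
  }
  return problems;
}

// Example: adminEmail is missing and the QuickSight stack has no user name,
// so two problems are reported.
const exampleProblems = checkContext({ repoBranch: "/main", deployQuicksightStack: true });
```

A check like this could run in a pre-deployment script so that missing values surface before cdk synth.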
npm ci
npm run build
aws sts get-caller-identity
cdk bootstrap aws://ACCOUNT-NUMBER/REGION
or
cdk bootstrap # if you are authenticated through `aws configure`
cdk synth
cdk deploy --all
If you are reading this it is because you deployed the carbon data lake guidance with sample code Web Application by setting deployWebStack: true in the cdk.context.json file. Your application is already up and running in the AWS Cloud and there are a few simple steps to begin working with and editing your application.
Visit the AWS Amplify Console by navigating to the AWS Console and searching for Amplify. Make sure you are in the same region that you just selected to deploy your application.
Visit your live web application by clicking the link in the Amplify console. When you open the web application in your browser you should see a Cognito login page with input fields for an email address and password. Enter your email address and the temporary password sent to your email when you created your carbon data lake guidance with sample code CDK Application. After changing your password, you should be able to sign in successfully.
NOTE: The sign-up functionality is disabled intentionally to help secure your application. You may change this and add the UI elements back, or manually add the necessary users in the Cognito console while following the principle of least privilege (recommended).
Learn more about working with AWS Amplify CLI or the AWS Amplify Console.
Make the web application your own and let us know what you choose to do with it.
Success! At this point, you should successfully have the Amplify app working.
If you choose to deploy the Amazon Quicksight business intelligence stack it will include prebuilt data visualizations that leverage Amazon Athena to query your processed data. If you elect to deploy this stack you will need to remove the comments.
Before you proceed you need to set up your QuickSight account and user. This needs to be done manually in the console, so please open this link and follow the instructions here.
To deploy this stack navigate to cdk.context.json and change the deployQuicksightStack value to true, then redeploy the application by running cdk deploy --all
The forecast stack includes a pre-built SageMaker notebook instance running an .ipynb with embedded machine learning tools and prompts.
To deploy this stack navigate to cdk.context.json and change the deploySagemakerStack value to true, then redeploy the application by running cdk deploy --all
You can destroy all stacks included in carbon data lake guidance with sample code with cdk destroy --all. You can destroy individual stacks with cdk destroy StackName. By default, cdk destroy will destroy EVERYTHING. Use this with caution! We strongly recommend that you modify this functionality by applying no-delete defaults within your CDK constructs. Some stacks and constructs that we recommend revising include:
The CDK stacks by default export all stack outputs to cdk-outputs.json at the top level of the directory. You can disable this feature by removing "outputsFile": "cdk-outputs.json" from cdk.json, but we recommend leaving this feature, as it is a requirement for some other features. By default this file is ignored via .gitignore, so any outputs will not be committed to a version control repository. Below is a guide to the standard outputs.
Shared resource stack outputs include:
- cdlAwsRegion: Region of CDK Application AWS Deployment.
- cdlEnrichedDataBucket: Enriched data bucket with outputs from calculator service.
- cdlEnrichedDataBucketUrl: Url for enriched data bucket with outputs from calculator service
- cdlDataLineageBucket: Data lineage S3 bucket
- cdlDataLineageBucketUrl: Data lineage S3 bucket URL

- cdluserPoolId: Cognito user pool ID for authentication
- CLQidentityPoolId: Cognito Identity pool ID for authentication
- cdluserPoolClientId: Cognito user pool client ID for authentication
- cdlcdlAdminUserRoleOutput: Admin user role output
- cdlcdlStandardUserRoleOutput: Standard user role output
- cdlApiEndpoint: GraphQL API endpoint
- cdlApiUsername: GraphQL API admin username
- cdlGraphQLTestQueryURL: GraphQL Test Query URL (takes you to AWS console if you are signed in).

- LandingBucketName: S3 Landing Zone bucket name for data ingestion to carbon data lake guidance with sample code Data Pipeline.
- cdlLandingBucketUrl: S3 Landing Zone bucket URL for data ingestion to carbon data lake guidance with sample code Data Pipeline.
- cdlGlueDataBrewURL: URL for Glue Data Brew in AWS Console.
- cdlDataPipelineStateMachineUrl: URL to open cdl state machine to view step functions workflow status.

- cdlWebAppRepositoryLink: Amplify Web Application codecommit repository link.
- cdlWebAppId: Amplify Web Application ID.
- cdlAmplifyLink: Amplify Web Application AWS Console URL.
- cdlWebAppDomain: Amplify Web Application live web URL.

- QuickSightDataSource: ID of QuickSight Data Source Connector Athena Emissions dataset. Use this connector to create additional QuickSight datasets based on Athena dataset.
- QuickSightDataSet: ID of pre-created QuickSight DataSet, based on Athena Emissions dataset. Use this pre-created dataset to create new dynamic analyses and dashboards.
- QuickSightDashboard: ID of pre-created QuickSight Dashboard, based on Athena Emissions dataset. Embed this pre-created dashboard directly into your user facing applications.
- cdlQuicksightUrl: URL of Quicksight Dashboard.

- cdlSagemakerRepository: Codecommit repository of sagemaker notebook.
- cdlSagemakerNotebookUrl: AWS console URL for Sagemaker Notebook ML Instance.

- e2eTestLambdaFunctionName: Name of carbon data lake lambda test function.
- e2eTestLambdaConsoleLink: URL to open and invoke calculator test function in the AWS Console.
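The outputs listed above can also be read from cdk-outputs.json by a script. The sketch below assumes the usual layout of a CDK outputs file, where each top-level key is a stack name that maps output keys to values. The findOutput helper, the stack names, and the values in sampleOutputs are hypothetical and only illustrate the lookup.

```typescript
// Layout of a CDK outputs file: stack name -> output key -> value.
type CdkOutputs = Record<string, Record<string, string>>;

// Finds an output by key in any stack; returns undefined when the key is absent.
function findOutput(outputs: CdkOutputs, key: string): string | undefined {
  for (const stackName of Object.keys(outputs)) {
    const stack = outputs[stackName];
    if (stack && Object.prototype.hasOwnProperty.call(stack, key)) {
      return stack[key];
    }
  }
  return undefined;
}

// Hypothetical content; real stack names and values come from your own deployment.
const sampleOutputs: CdkOutputs = {
  ExampleSharedResourcesStack: { cdlAwsRegion: "us-east-1" },
  ExampleApiStack: { cdlApiEndpoint: "https://example.invalid/graphql" },
};

// Looks up the GraphQL endpoint without needing to know which stack exported it.
const apiEndpoint = findOutput(sampleOutputs, "cdlApiEndpoint");
```

In practice the object would come from parsing the file contents with JSON.parse rather than from a literal.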
Time to get started using carbon data lake guidance with sample code! Follow the steps below to see if everything is working and get familiar with this solution.
In your command line shell you should see confirmation of all resources deploying. Did they deploy successfully? Any errors or issues? If all is successful you should see indication that CDK deployed. You can also verify this by navigating to the CloudFormation service in the AWS console. Visually check the series of stacks that all begin with CLQS to see that they deployed successfully. You can also search for the tag:
"application": "carbon-data-lake"
Time to test some data out and see if everything is working. This section assumes basic prerequisite knowledge of how to manually upload an object to S3 with the AWS console. For more on this please review how to upload an object to S3.
- Find the S3 bucket named cdlpipelinestack-cdllandingbucket with a unique identifier appended to it
- Find the Step Functions state machine named cdlPipeline with an appended uuid

Figure. In progress step function workflow
Figure. Completed step function workflow
The calculator writes the emissions calculator outputs referenced in the data model section below. Outputs are written to Amazon DynamoDB and Amazon S3. You can review the outputs using the AWS console or AWS CLI:
- DataBase and a table called Table
- BucketName. This bucket contains all calculator outputs.

You can also query this data using the GraphQL API detailed below.
This one will get all of the records (with a default limit of 10)
query MyQuery {
  all {
    items {
      activity
      activity_event_id
      asset_id
      category
      emissions_output
      geo
      origin_measurement_timestamp
      raw_data
      units
      source
      scope
    }
  }
}
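To run the query above from your own application, the request can be assembled as a plain HTTP POST. This is a minimal sketch under stated assumptions: the endpoint is the value of the cdlApiEndpoint stack output, the token is a JWT obtained by signing in to the Cognito user pool, and the API accepts that token in the Authorization header. The buildGraphQLRequest function and the GraphQLRequest interface are hypothetical names, and the sketch only builds the request without sending it.

```typescript
// The query from the example above, kept as a plain string.
const allRecordsQuery = `query MyQuery {
  all {
    items {
      activity activity_event_id asset_id category emissions_output geo
      origin_measurement_timestamp raw_data units source scope
    }
  }
}`;

// Hypothetical container for the pieces an HTTP client needs.
interface GraphQLRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the request; sending it is left to the HTTP client of your choice.
function buildGraphQLRequest(apiEndpoint: string, jwtToken: string, query: string): GraphQLRequest {
  return {
    url: apiEndpoint,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Assumption: the Cognito-issued JWT is passed in the Authorization header.
      Authorization: jwtToken,
    },
    // GraphQL over HTTP carries the query text in a JSON body.
    body: JSON.stringify({ query: query }),
  };
}

// Placeholder endpoint and token for illustration only.
const exampleRequest = buildGraphQLRequest("https://example.invalid/graphql", "example-jwt", allRecordsQuery);
```

Keeping the request builder separate from the network call makes it easy to inspect the body and headers before pointing it at a real endpoint.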
Did that all work? Continue...
If you have not yet done so, this is a great time to deploy the sample web application. Once you've run some data through the pipeline you should see it populating in the application. Remember that to deploy the web application you will need to set "deployWebStack": "true" in cdk.context.json.
This application currently includes unit tests, infrastructure tests, and deployment tests. We are working on an end to end testing solution as well. Read on for the test details:
For GitLab users only: the GitLab CI runs each time you commit to remote and/or merge to main. This runs automatically and does the following:
- npm ci installs all dependencies from package-lock.json
- npm run build builds the JavaScript from TypeScript and makes sure everything works!
- cdk synth synthesizes all CDK stacks in the application
- Success
You can run several of these tests manually on your local machine to check that everything is working as expected.
- sh test-deployment.sh runs CDKitten locally using your assumed AWS role
- sh test-e2e.sh runs an end to end test by dropping data into the pipeline and querying the GraphQL api output. If the test is successful it returns Success
- npm run lint tests your code locally with the prebuilt linter configuration

If you are looking to utilize existing features of carbon data lake while integrating your own features, modules, or applications this section provides details for how to ingest your data to the carbon data lake data pipeline, how to connect data outputs, how to integrate other applications, and how to integrate other existing AWS services. As we engage with customers this list of recommendations will grow with customer use-cases. Please feel free to submit issues that describe use-cases you would like to be documented.
To ingest data into carbon data lake you can use various inputs to get data into the carbon data lake landing zone S3 bucket. This bucket can be found via AWS Console or AWS CLI under the name bucketName. It can also be accessed as a public readonly stack output via props stackOutputName. There are several methods for bringing data into an S3 bucket to start an event-driven pipeline. This article is a helpful resource as you explore options. Once your data is in S3 it will kick off the pipeline and the data quality check will begin.
To add additional features to carbon data lake we recommend developing your own stack that integrates with the existing carbon data lake stack inputs and outputs. We recommend starting by reviewing the concepts of application, stack, and construct in AWS CDK. Adding a stack is the best way to add functionality to carbon data lake.
Start by adding your own stack directory to lib/stacks
mkdir lib/stacks/stack-title
Add a stack file to this directory
touch lib/stacks/stack-title/stack-title.ts
Use basic CDK stack starter code to formulate your own stack. See example below:
import * as cdk from 'aws-cdk-lib'
import { Construct } from 'constructs'
// import * as sqs from 'aws-cdk-lib/aws-sqs';

export class ExampleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props)

    // The code that defines your stack goes here

    // example resource
    // const queue = new sqs.Queue(this, 'ExampleStackQueue', {
    //   visibilityTimeout: cdk.Duration.seconds(300)
    // });
  }
}
We recommend using a single stack, and integrating additional submodular components as constructs. Constructs are logical groupings of AWS resources with "sane" defaults. In many cases the CDK team has already created a reusable construct and you can simply work with that. But in specific cases you may want to create your own. You can create a construct using the commands and example below:
mkdir lib/constructs/construct-title
touch lib/constructs/construct-title/title-construct.ts
If you have integrated your stack successfully you should see it build when you run cdk synth. For development purposes we recommend deploying your stack in isolation before you deploy with the full application. You can run cdk deploy YourStackName to deploy in isolation.
Integrate your stack with the full application by importing it to bin/main.ts and bin/cicd.ts if you have chosen to deploy it.
# open the file main.ts
open bin/main.ts
// Import your stack at the top of the file
import { YourStackName } from '../lib/stacks/stack-title/stack-title'
// Now create a new stack to deploy within the application
const stackName = new YourStackName(app, 'YourStackTitle', {
  // these are props that serve as an input to your stack
  // these are optional, but could include things like S3 bucket names or other outputs of other stacks.
  // For more on this see the stack output section above.
  yourStackProp1: prop1,
  yourStackProp2: prop2,
  env: appEnv, // be sure to include this environment prop
})
You can access the outputs of application stacks by adding them as props to your stack inputs. For example, you can access the myVpc output by adding networkStack.myVpc as props to your own stack. It is best practice to add this as props at the application level, and then as an interface at the stack level. Finally, you can access it via props.myVpc (or whatever you call it) within your stack. Below is an example.
// Imports needed by the fragments below
import { StackProps } from 'aws-cdk-lib'
import * as ec2 from 'aws-cdk-lib/aws-ec2'

// Start by passing it as a prop when you instantiate your stack
new MyFirstStack(app, 'MyFirstStack', {
  vpc: networkStack.myVpc,
});

// Now declare it in the props interface for that stack
export interface MyFirstStackProps extends StackProps {
  vpc: ec2.IVpc
}

// Now access it as a prop where you need it within the stack
this.myStackObject = new ec2.SecurityGroup(this, 'ec2SecurityGroup', {
  vpc: props.vpc,
  allowAllOutbound: true,
});
The above is a theoretical example. We recommend reviewing the CDK documentation and the existing stacks to see more examples.
- npm run build: compile typescript to js
- npm run watch: watch for changes and compile
- npm run test: perform the jest unit tests
- cdk diff: compare deployed stack with current state
- cdk synth: emits the synthesized CloudFormation template
- cdk deploy --all: deploy this stack to your default AWS account/region w/o the CICD pipeline
- npm run deploy:cicd: deploy this application CI/CD stack and then link your repo for automated pipeline

The model below describes the required schema for input to the carbon data lake calculator microservice. This is the Calculator Data Input Model.
{
  "activity_event_id": "customer-carbon-data-lake-12345",
  "asset_id": "vehicle-1234",
  "geo": {
    "lat": 45.5152,
    "long": 122.6784
  },
  "origin_measurement_timestamp": "2022-06-26 02:31:29",
  "scope": 1,
  "category": "mobile-combustion",
  "activity": "Diesel Fuel - Diesel Passenger Cars",
  "source": "company_fleet_management_database",
  "raw_data": 103.45,
  "units": "gal"
}
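Records can be checked against this input model before they are sent to the pipeline. The sketch below mirrors the field names and types of the sample document above. The CalculatorInput interface and the isCalculatorInput function are hypothetical names, and the restriction of scope to 1, 2, or 3 is an assumption based on the Scope 1, 2, and 3 emissions mentioned in this guide.

```typescript
// Field names and types mirror the sample input document above.
interface CalculatorInput {
  activity_event_id: string;
  asset_id: string;
  geo: { lat: number; long: number };
  origin_measurement_timestamp: string;
  scope: number;
  category: string;
  activity: string;
  source: string;
  raw_data: number;
  units: string;
}

// Type guard: returns true only when every field is present with the expected type.
function isCalculatorInput(value: unknown): value is CalculatorInput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = [
    "activity_event_id", "asset_id", "origin_measurement_timestamp",
    "category", "activity", "source", "units",
  ];
  for (const field of stringFields) {
    if (typeof v[field] !== "string") return false;
  }
  const geo = v["geo"] as Record<string, unknown> | null | undefined;
  if (typeof geo !== "object" || geo === null) return false;
  if (typeof geo["lat"] !== "number" || typeof geo["long"] !== "number") return false;
  // Assumption: scope is limited to 1, 2, or 3.
  const scope = v["scope"];
  if (scope !== 1 && scope !== 2 && scope !== 3) return false;
  const rawData = v["raw_data"];
  return typeof rawData === "number" && isFinite(rawData);
}
```

A guard like this is useful at the edge of an ingestion script, where rows that fail the check can be logged and skipped instead of failing later in the data quality step.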
The model below describes the standard output model from the carbon data lake emissions calculator microservice.
{
  "activity_event_id": "customer-CarbonLake-12345",
  "asset_id": "vehicle-1234",
  "activity": "Diesel Fuel - Diesel Passenger Cars",
  "category": "mobile-combustion",
  "scope": 1,
  "emissions_output": {
    "calculated_emissions": {
      "co2": {
        "amount": 0.024,
        "unit": "tonnes"
      },
      "ch4": {
        "amount": 0.00001,
        "unit": "tonnes"
      },
      "n2o": {
        "amount": 0.00201,
        "unit": "tonnes"
      },
      "co2e": {
        "ar4": {
          "amount": 0.2333,
          "unit": "tonnes"
        },
        "ar5": {
          "amount": 0.2334,
          "unit": "tonnes"
        }
      }
    },
    "emissions_factor": {
      "ar4": {
        "amount": 8.812,
        "unit": "kgCO2e/unit"
      },
      "ar5": {
        "amount": 8.813,
        "unit": "kgCO2e/unit"
      }
    }
  },
  "geo": {
    "lat": 45.5152,
    "long": 122.6784
  },
  "origin_measurement_timestamp": "2022-06-26 02:31:29",
  "raw_data": 103.45,
  "source": "company_fleet_management_database",
  "units": "gal"
}
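The units in the output model suggest how an activity amount relates to a CO2e figure: a factor in kgCO2e per unit multiplied by the amount gives kilograms, which divided by 1000 gives tonnes. The sketch below shows only that unit arithmetic and is an assumption, not the calculator's confirmed implementation. The Amount interface and both function names are hypothetical. The amounts in the sample document above do not follow from this formula, so treat them as illustrative values.

```typescript
// Mirrors the { amount, unit } pairs used in the output model above.
interface Amount {
  amount: number;
  unit: string;
}

// Assumption: tonnes CO2e = activity amount x factor (kgCO2e per unit) / 1000.
function toTonnesCo2e(rawData: number, factorKgCo2ePerUnit: number): number {
  return (rawData * factorKgCo2ePerUnit) / 1000;
}

// Applies the conversion to each factor variant (for example ar4 and ar5).
function co2eByVariant(rawData: number, factors: Record<string, Amount>): Record<string, Amount> {
  const result: Record<string, Amount> = {};
  for (const variant of Object.keys(factors)) {
    const factor = factors[variant];
    if (factor) {
      result[variant] = { amount: toTonnesCo2e(rawData, factor.amount), unit: "tonnes" };
    }
  }
  return result;
}

// Round numbers chosen for illustration: 100 units at 10 kgCO2e/unit is 1 tonne,
// and 100 units at 20 kgCO2e/unit is 2 tonnes.
const exampleCo2e = co2eByVariant(100, {
  ar4: { amount: 10, unit: "kgCO2e/unit" },
  ar5: { amount: 20, unit: "kgCO2e/unit" },
});
```

If you replace the sample emissions factor model, confirm the factor units first, because this conversion only holds when factors are expressed in kilograms of CO2e per activity unit.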
The json document below is a sample emissions factor model for testing and development purposes only. To use this solution or develop your own related solution please customize and update your own emissions factor models to represent your reporting requirements.
Sample Emissions Factor Model. This is the lookup table used for coefficient inputs to the calculator microservice.
Calculation methodologies reflected in this solution are aligned with the sample model, and this calculator stack may require modification if a new model is applied. To review calculation methodology and lookup tables please review the carbon data lake Emissions Calculator Stack.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.
For users with an Apple M1 chip, you may run into the following error when executing npm commands: "no matching version found for node-darwin-amd64@16.4.0" or similar terminal error output depending on the version of node you are running. If this happens, execute the following commands from your terminal in order (this fix assumes you have node version manager (nvm) installed). In this example, we will use node version 16.4.0. Replace the node version in these commands with the version you are running:
nvm uninstall 16.4.0
arch -x86_64 zsh
nvm install 16.4.0
nvm alias default 16.4.0