Below is the code to generate Snowflake data; as part of this issue, I will need to inject this SQL code directly into the backend codebase.
select file_project, date_trunc('month', timestamp) as month, count(*) as num_downloads
from
(
    select country_code, project, project_type, file_project, file_version, file_type, details_installer_name, details_installer_version,
        CONCAT(
            DETAILS,
            DETAILS_INSTALLER, DETAILS_INSTALLER_NAME, DETAILS_INSTALLER_VERSION,
            DETAILS_PYTHON,
            DETAILS_IMPLEMENTATION_NAME, DETAILS_IMPLEMENTATION_VERSION,
            DETAILS_DISTRO, DETAILS_DISTRO_NAME, DETAILS_DISTRO_VERSION, DETAILS_DISTRO_ID, DETAILS_DISTRO_LIBC, DETAILS_DISTRO_LIBC_LIB, DETAILS_DISTRO_LIBC_VERSION,
            DETAILS_SYSTEM, DETAILS_SYSTEM_NAME, DETAILS_SYSTEM_RELEASE,
            DETAILS_CPU,
            DETAILS_OPENSSL_VERSION,
            DETAILS_SETUPTOOLS_VERSION,
            DETAILS_RUSTC_VERSION
        ) as details_all,
        CASE
            WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'
            WHEN details_installer_name in ('bandersnatch', 'pip') THEN details_installer_name
            ELSE 'other'
        END AS download_type,
        to_timestamp(timestamp) as timestamp
    from imaging.pypi.downloads
    where download_type = 'pip'
    and project_type = 'plugin'
    order by timestamp desc
)
group by 1, 2
order by NUM_DOWNLOADS desc, MONTH, FILE_PROJECT
1: Add workspace variables on Terraform: happy-sci-imaging -> Workspaces -> Variables. We want to utilize this process to add sensitive Snowflake credentials such as user and password here.
2: Add code changes to set up Snowflake access (credentials, CloudWatch cadence, etc.) in the backend codebase
3: Pull PyPI downloads data nightly (write data to S3 without writing the file locally); see the sketch below
Note: No additional lambda will be created. This work will be captured in the existing backend lambda.
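As a rough illustration of step 3, here is a minimal sketch of what the nightly pull could look like in the backend lambda, assuming the snowflake-connector-python and boto3 packages. The environment variable names, the bucket name, the object key, and the pypi_downloads.sql file (holding the query above) are all hypothetical placeholders, not decided values:

import csv
import io
import os

import boto3
import snowflake.connector

def pull_pypi_downloads() -> None:
    # The query from the top of this issue, stored alongside the lambda code
    # (hypothetical file name).
    with open("pypi_downloads.sql") as f:
        query = f.read()

    # Credentials are assumed to arrive via environment variables populated by
    # Terraform; the variable names are placeholders.
    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
    )
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        rows = cursor.fetchall()
    finally:
        conn.close()

    # Serialize into an in-memory buffer so nothing is written to local disk.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file_project", "month", "num_downloads"])
    writer.writerows(rows)

    boto3.client("s3").put_object(
        Bucket=os.environ["DOWNLOADS_BUCKET"],  # placeholder bucket variable
        Key="pypi/monthly_downloads.csv",       # placeholder object key
        Body=buffer.getvalue(),
    )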
@klai95 One minor update. Change...
WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' THEN 'ci usage'
...to...
WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'
This classifies any GCP installs as CI usage.
I have added snowflake_user and snowflake_password on https://si.prod.tfe.czi.technology/app/sci-imaging-infra/workspaces/prod-happy-napari-hub.
@potating-potato will help apply the change on https://github.com/chanzuckerberg/sci-imaging-infra/blob/28ecdd11f35edc431bba52420b3b9220b1e9dd82/terraform/modules/happy-napari-hub/main.tf#L31-L60 in the additional_secrets section.
Pinpointed the necessary files to make Terraform changes - will look into how to make those changes on my own for now:
https://github.com/chanzuckerberg/sci-imaging-infra/terraform/envs/prod/happy-napari-hub/variables.tf
https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/envs/prod/happy-napari-hub/main.tf
https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/variables.tf
https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/main.tf
@klai95 Thanks for the update, Kevin! Let's make sure to configure our dev and staging environments as well, and test on dev or staging first.
Tagging @manasaV3 as I think she'll be reaching out to you to pair on this!
Based on discussions from our code pairing session, @klai95 and I decided to use Secrets Manager to store the credentials for Snowflake. The reason for this was to increase the security of the credentials by preventing them from being readable as environment variables of the lambda.
This will be achieved by having the lambda fetch the credentials from Secrets Manager before making the Snowflake query.
Steps we have identified to implement this:
Currently we are planning on creating the secret manually with aws/secretsmanager as the encryption key. The credentials will be added to it as key/value pairs. We will be naming this something that can be extended to support other credentials in the environment if needed in the future. Proposed structure of the secret:
{"snowflake_user":"XXXXXXX", "snowflake_password":"YYYYYYY"}
We could also explore adding this using terraform in the future, as we could leverage the terraform variables for it. This would allow for having a single place to manage secrets.
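For illustration only, the manual creation could also be scripted with boto3; the secret name below is a placeholder (the actual name still needs to be decided), and omitting KmsKeyId makes Secrets Manager use the default aws/secretsmanager key:

import json

import boto3

client = boto3.client("secretsmanager")
client.create_secret(
    Name="napari-hub/credentials",  # hypothetical name, to be decided
    SecretString=json.dumps(
        {"snowflake_user": "XXXXXXX", "snowflake_password": "YYYYYYY"}
    ),
)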
The environment config is currently being maintained as happy/env-<env-name>-config in Secrets Manager. The data is stored as JSON, and we can add additional information on the secrets as follows:
...
"secrets": {
    "credentials": {"name": "foo/bar", "arn": "arn:aws:secretsmanager:us-west-2:12345678910:secret:foo/bar-assw7uY" }
},
...
This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.
We can access the secret's details from the env-config secret that is already being fetched and decoded. We can assign it to the locals here:
credentials_secrets_arn = try(local.secret["secrets"]["credentials"]["arn"], "")
credentials_secrets_name = try(local.secret["secrets"]["credentials"]["name"], "")
To allow the lambda to fetch secret values from Secrets Manager, we have to update the existing execution role to add a policy allowing the lambda to call GetSecretValue. The current execution role definition can be found here. The proposed policy addition:
statement {
  actions = [
    "secretsmanager:GetSecretValue",
  ]
  resources = [
    local.credentials_secrets_arn,
  ]
}
Reference to IAM policy documentation: https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access_examples.html#auth-and-access_examples_read
To prevent hard-coding the secret name and to allow for environment-specific secrets, we could use environment variables in the lambda. We have to update the Terraform here to add that.
We can use local.credentials_secrets_name to add the secret name as an environment variable.
Before making the call to Snowflake as referenced here, we can call Secrets Manager to fetch the username and password.
Reference to api: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/secretsmanager.html#SecretsManager.Client.get_secret_value
We could have the Secrets Manager calls be a separate class, to allow for it to be used in other use cases in the future. Proposed method signature:
# Takes a secret name as input and returns a dictionary of the key values in the secret
def get_secrets_key_pair(secret_name: str) -> dict:
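A minimal sketch of that helper under these assumptions: boto3 is available in the lambda, the secret stores its values as a JSON string, and the secret name reaches the lambda through a hypothetical CREDENTIALS_SECRET_NAME environment variable populated from local.credentials_secrets_name:

import json
import os

import boto3

def get_secrets_key_pair(secret_name: str) -> dict:
    # Takes a secret name as input and returns a dictionary of the key/value
    # pairs stored in the secret.
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

# Example usage before making the Snowflake query; the environment variable
# name is a placeholder.
credentials = get_secrets_key_pair(os.environ["CREDENTIALS_SECRET_NAME"])
snowflake_user = credentials["snowflake_user"]
snowflake_password = credentials["snowflake_password"]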
This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.
@manasaV3 It's definitely worth verifying, but I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.
Would it be possible to follow the current process of setting credentials via Terraform, and then proposing a change as a separate issue? I'm hesitant to have different variables be set in different places, and want to keep this ticket as narrowly scoped as possible. Our infra team can also help with setting up the variables according to our current process if needed.
I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.
This is good to know.
My reasoning for suggesting the use of Secrets Manager in code over setting them via Terraform was that it didn't feel like best practice to have the DB credentials easily accessible. But I do see your point about having variables sourced from different places. Also, not having clarity on the sourcing of the env-config secret might take some more time to unblock. What are your thoughts on making this a separate issue that not only tackles cleaning up the environment variables, but also removes the duplication of credentials across Secrets Manager and Terraform?
P.S. As we are going the route of having the credentials surface in lambda env variables, we should ensure that the Snowflake user is specific to our team's workflows, so that rotating its password at a later point wouldn't impact any other workflows.
Broke out the credentials piece into its own issue - https://github.com/chanzuckerberg/napari-hub/issues/695 - so that we can focus specifically on the python/cron job piece in this ticket!
In https://github.com/chanzuckerberg/napari-hub/issues/638, we introduced API endpoints that connect to PyPI downloads data. As part of that work, we did a one-time pull of data from Snowflake. Now, let's make that a recurring cron job that runs once every 24 hours.
Additional details
Earlier notes from Kevin: connecting Snowflake and S3 in our existing infrastructure enables automatic updates. Furthermore, additional data sources can be integrated seamlessly as well, should we ever need to pull from new external sources.