chanzuckerberg / napari-hub

Discover, install, and share napari plugins
MIT License

[Activity dashboard] Pull PyPI downloads data nightly #664

Closed klai95 closed 2 years ago

klai95 commented 2 years ago

In https://github.com/chanzuckerberg/napari-hub/issues/638, we introduced API endpoints that connect to PyPI downloads data. As part of that work, we did a one-time pull of data from Snowflake. Now, let's make that a recurring cron job that runs once every 24 hours.

Additional details

Earlier notes from Kevin

klai95 commented 2 years ago

Below is the SQL that generates the Snowflake data; as part of this issue, I will need to embed this SQL directly into the backend codebase.

select file_project, date_trunc('month', timestamp) as month, count(*) as num_downloads
from
(
  select country_code, project, project_type, file_project, file_version, file_type, details_installer_name, details_installer_version,
      CONCAT(
        DETAILS,
        DETAILS_INSTALLER, DETAILS_INSTALLER_NAME, DETAILS_INSTALLER_VERSION,
        DETAILS_PYTHON,
        DETAILS_IMPLEMENTATION_NAME, DETAILS_IMPLEMENTATION_VERSION,
        DETAILS_DISTRO, DETAILS_DISTRO_NAME, DETAILS_DISTRO_VERSION, DETAILS_DISTRO_ID, DETAILS_DISTRO_LIBC, DETAILS_DISTRO_LIBC_LIB, DETAILS_DISTRO_LIBC_VERSION,
        DETAILS_SYSTEM, DETAILS_SYSTEM_NAME, DETAILS_SYSTEM_RELEASE,
        DETAILS_CPU,
        DETAILS_OPENSSL_VERSION,
        DETAILS_SETUPTOOLS_VERSION,
        DETAILS_RUSTC_VERSION
      ) as details_all,
      CASE
          WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'
          WHEN details_installer_name in ('bandersnatch', 'pip') THEN details_installer_name
          ELSE 'other'
      END AS download_type,
      to_timestamp(timestamp) as timestamp
  from imaging.pypi.downloads
  where download_type = 'pip'
  and project_type = 'plugin'
  order by timestamp desc
)
group by 1,2
order by NUM_DOWNLOADS desc, MONTH, FILE_PROJECT
klai95 commented 2 years ago

1. Add workspace variables in Terraform (happy-sci-imaging -> Workspaces -> Variables). We want to use this process to add sensitive Snowflake credentials such as user and password.
2. Add code changes to set up Snowflake access (credentials, CloudWatch cadence, etc.) in the backend codebase.
3. Pull PyPI downloads data nightly (write the data to S3 without writing the file locally).

Note: No additional lambda will be created; this work will live in the existing backend lambda.
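The steps above could be sketched roughly as follows. This is a hedged sketch, not the actual napari-hub code: the account identifier, bucket environment variable, S3 key layout, and the `MONTHLY_DOWNLOADS_SQL` constant are all placeholder assumptions, and it presumes the `snowflake-connector-python` and `boto3` packages are available in the lambda.

```python
import csv
import io
import os
from datetime import datetime, timezone

# Placeholder: in practice this would hold the full query from the comment above.
MONTHLY_DOWNLOADS_SQL = "select file_project, date_trunc('month', timestamp) ..."

def s3_key_for(run_date: datetime) -> str:
    """Build a date-stamped S3 key so each nightly pull lands in its own object.
    (Hypothetical key layout -- adjust to the bucket's real structure.)"""
    return f"activity_dashboard_data/{run_date:%Y-%m-%d}/downloads.csv"

def pull_pypi_downloads(event=None, context=None):
    """Lambda handler sketch: run the Snowflake query and stream the results
    to S3 without writing a local file (step 3 above)."""
    # Deferred imports so the module still loads where these deps are absent.
    import boto3
    import snowflake.connector

    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],      # injected via Terraform (step 1)
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account="example-account",              # placeholder account identifier
        database="IMAGING",
        schema="PYPI",
    )
    try:
        with conn.cursor() as cursor:
            rows = cursor.execute(MONTHLY_DOWNLOADS_SQL).fetchall()
    finally:
        conn.close()

    # Serialize in memory and upload straight to S3 -- no temp file on disk.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["FILE_PROJECT", "MONTH", "NUM_DOWNLOADS"])
    writer.writerows(rows)
    boto3.client("s3").put_object(
        Bucket=os.environ["BUCKET"],            # hypothetical env var name
        Key=s3_key_for(datetime.now(timezone.utc)),
        Body=buffer.getvalue(),
    )
```

A CloudWatch/EventBridge schedule on the existing backend lambda (per the note above that no additional lambda will be created) would then invoke this handler once every 24 hours.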

bnelson-czi commented 2 years ago

@klai95 One minor update. Change

WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' THEN 'ci usage'

to

WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'

This classifies any GCP installs as CI usage.

klai95 commented 2 years ago

I have added snowflake_user and snowflake_password on https://si.prod.tfe.czi.technology/app/sci-imaging-infra/workspaces/prod-happy-napari-hub.

@potating-potato will help apply the change on https://github.com/chanzuckerberg/sci-imaging-infra/blob/28ecdd11f35edc431bba52420b3b9220b1e9dd82/terraform/modules/happy-napari-hub/main.tf#L31-L60 in the additional_secrets section.

klai95 commented 2 years ago

Pinpointed the necessary files to make the Terraform changes:

- https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/envs/prod/happy-napari-hub/variables.tf
- https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/envs/prod/happy-napari-hub/main.tf
- https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/variables.tf
- https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/main.tf

Will look into how to make those changes on my own for now.

richaagarwal commented 2 years ago

@klai95 Thanks for the update, Kevin! Let's make sure to configure our dev and staging environments as well, and test on dev or staging first.

Tagging @manasaV3 as I think she'll be reaching out to you to pair on this!

manasaV3 commented 2 years ago

Based on discussions from our code pairing session, @klai95 and I decided to use Secrets Manager to store the credentials for Snowflake. The reason was to increase the security of the credentials by preventing them from being readable as environment variables of the lambda.

This will be achieved by having the lambda fetch the credentials from the secret manager before making the snowflake query.

Steps we have identified to implement this:

1. Create a new secret to allow for storing credentials

Currently we are planning on creating the secret manually with aws/secretsmanager as the encryption key. The credentials will be added to it as key/value pairs. We will name it something that can be extended to support other credentials in the environment if needed in the future. Proposed structure of the secret:

{"snowflake_user":"XXXXXXX", "snowflake_password":"YYYYYYY"}

We could also explore adding this using terraform in the future, as we could leverage the terraform variables for it. This would allow for having a single place to manage secrets.

2. Manually add the secret name to the existing environment config

The environment config is currently being maintained as happy/env-<env-name>-config in secrets manager. The data is stored as json, and we can add additional information on the secrets as follows:

...
"secrets": {
"credentials": {"name": "foo/bar", "arn": "arn:aws:secretsmanager:us-west-2:12345678910:secret:foo/bar-assw7uY" }
},
...

This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.

3. Accessing the secret's details in terraform

We can access the secret's details from the env-config secret that is already being fetched and decoded, and assign them to the locals here:

credentials_secrets_arn = try(local.secret["secrets"]["credentials"]["arn"], "")
credentials_secrets_name = try(local.secret["secrets"]["credentials"]["name"], "")

4. Updating the backend lambda access policy

To allow the lambda to fetch the secret values from Secrets Manager, we have to update the existing execution role with a policy allowing the lambda to call secretsmanager:GetSecretValue. The current execution role definition can be found here. The proposed policy addition:

  statement {
    actions = [
        "secretsmanager:GetSecretValue"
    ]
    resources = [
        local.credentials_secrets_arn
    ]
  }

Reference to IAM policy documentation: https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access_examples.html#auth-and-access_examples_read

5. Add the secret name to the lambda environment variables

To prevent hard-coding the secret name and to allow for environment-specific secrets, we could use the lambda's environment variables. We have to update the terraform here to add that.

We can use local.credentials_secrets_name to add the secret name as an environment variable.

6. Fetching credentials from secrets manager in code

Before making the call to Snowflake as referenced here, we can call Secrets Manager to fetch the username and password.

Reference to api: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/secretsmanager.html#SecretsManager.Client.get_secret_value

We could have the Secrets Manager calls live in a separate class, so they can be reused for other use cases in the future. Proposed method signature:

# Takes a secret name as input and returns a dictionary of the key values in the secret
def get_secrets_key_pair(secret_name: str) -> dict:
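That signature could be implemented along these lines. A minimal sketch assuming boto3: the region name is a placeholder, and the JSON-decoding step is split into its own helper; the surrounding class wrapper is omitted for brevity.

```python
import json

def parse_secret_string(secret_string: str) -> dict:
    """Decode the SecretString JSON payload (as shown in step 1) into a dict."""
    return json.loads(secret_string)

def get_secrets_key_pair(secret_name: str) -> dict:
    """Takes a secret name as input and returns a dictionary of the
    key/value pairs stored in the secret (proposed signature above)."""
    # Deferred import so the module still loads where boto3 is absent.
    import boto3

    client = boto3.client("secretsmanager", region_name="us-west-2")  # placeholder region
    response = client.get_secret_value(SecretId=secret_name)
    return parse_secret_string(response["SecretString"])
```

The lambda could then call `get_secrets_key_pair(os.environ["SNOWFLAKE_SECRET_NAME"])` (the environment-variable name from step 5 is hypothetical) just before opening the Snowflake connection.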
richaagarwal commented 2 years ago

This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.

@manasaV3 It's definitely worth verifying, but I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.

Would it be possible to follow the current process of setting credentials via Terraform, and propose a change as a separate issue? I'm hesitant to have different variables be set in different places, and want to keep this ticket as narrowly scoped as possible. Our infra team can also help with setting up the variables according to our current process if needed.

manasaV3 commented 2 years ago

I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.

This is good to know.

My reasoning for suggesting the use of Secrets Manager in code over Terraform was that it didn't feel like best practice to have the DB credentials easily accessible. But I do see your point about having variables sourced from different places. Also, not having clarity on the sourcing of the env-config secret might take some more time to unblock. What are your thoughts on making this a separate issue that not only tackles cleaning up the environment variables, but also removes the duplication of credentials across Secrets Manager and Terraform?

P.S. As we are going the route of having the credentials surface in lambda env variables, we should ensure that the Snowflake user is specific to our team's workflows, so that rotating its password at a later point wouldn't impact any other workflows.

richaagarwal commented 2 years ago

Broke out the credentials piece into its own issue - https://github.com/chanzuckerberg/napari-hub/issues/695 - so that we can focus specifically on the python/cron job piece in this ticket!