dfe-analytical-services / dfeR

Common R tasks in the Department for Education (DfE)
https://dfe-analytical-services.github.io/dfeR/
GNU General Public License v3.0
11 stars 2 forks source link

extend create_project to include .secrets, optional pre-commit to detect secrets #83

Open holmes-alex opened 1 month ago

holmes-alex commented 1 month ago

Is your feature request related to a problem? Please describe.

For some workstreams a key or token may be needed and there is always a risk of individuals pushing keys/tokens to repo's.

Describe the solution you'd like

Include under create_project a .secrets file (or .env) which is untracked by git / included in the .gitignore as either standard or by optional toggle. Optionally, this could be extended to include a pre-commit hook which searches a repo for token/keys before a commit.

Describe alternatives you've considered

Users should use environment variables and not push secrets, guidance should assist with this but having it as part of the default project template with a back-up pre-commit hook can help limit this risk.

Additional context

Similar functionality is provided in the govcookie cutter repo but the pre-commits and structure are a bit buggy for R as this was mainly python focused originally: https://github.com/best-practice-and-impact/govcookiecutter

cjrace commented 1 month ago

Like this suggestion - think there's a really easy win here by giving a standard file users can have that is in the .gitignore by default.

I think I'd leave the commit hooks out of it for now, just as they can be more effort and also start to add more dependencies too, but definitely something that could be an extension afterwards.

Also worth noting that GitHub has some inbuilt secret scanning too, which we recommend having on for all repos, this can be pretty neat and requires little effort - I think it's worth us adding some notes on anything we add here, as well as the GitHub / DevOps specific approaches to think about how to make use of this too, nto the project vignette in #76.

Main question for this issue is then probably what to call the file?

Interested in any other views on this! Also, any suggestions of good real world examples to use in the documentation / as a test case?

holmes-alex commented 1 month ago

I think you're right .Renviron is probably the best native approach here as it's R specific, although might be a little less familiar for newer users.

From what I can tell there is a profile version and a .Rproj specific version. I'd assume we'd want to create a .Renviron file, have guidance on how users should use this (maybe pre-populate this with some examples), and how to call this from their code.

Example:

  1. Create a .Renviron file on project initialisation with some default placeholders included:
    • API_KEY=StoreYourKeyHere
    • DB_KEY=StoreYourKeyHere
  2. Guidance for user on how to unhide hidden files and how to update this with their key or new variables
  3. Guidance on how to call these environment variable in their code using Sys.getenv, e.g. Sys.getenv("API_KEY") returns "StoreYourKeyHere" from the example. Sys.getenv() can also be used to return all environment variables of the profile, not just the project so i'm unsure how duplicates are handled if for instance an API_KEY was already present on the system.
  4. Make sure .Renviron files are untracked by default in .gitignore

One thing to note is as far as I can tell, RStudios checks for .Renvirons only once on start up of the project and i'm unsure if there's a way to force this to happen once an .Renviron file is updated. Not ideal if users have to reopen a project after putting in their key but something to include in guidance if there's no fix.

cjrace commented 1 month ago

Nice - that all sounds good to me, happy to go with .Renviron. Looks like we have a clear plan now so will just need someone to pick this up 😄 (tagging you for sight @jen-machin)

JT-39 commented 1 month ago

I like this, and it pushed people towards that best practice of sharing as much as possible but doing so in a safe manner.

For reference, this is how I have used .Renviron and config.yml to connect to an internal SQL database in development and production stages (deployed to Posit Connect):

DB_UID = StoreYourDBUsernameHere
DB_PWD = StoreYourDBPasswordHere
default:
  cred_db:
    driver: "ODBC Driver 17 for SQL Server"
    server: "YourSQLdbServerName"
    database: "YourSQLdbName"
    uid: ""
    pwd: ""
    trusted: "Yes"
    encoding: ""

production:
  cred_db:
    driver: "ODBC Driver 17 for SQL Server"
    server: "YourSQLdbServerName"
    database: "YourSQLdbName"
    uid: !expr Sys.getenv("DB_UID")
    pwd: !expr Sys.getenv("DB_PWD")
    trusted: "No"

And then as a little bonus, this function I use to connect to SQL using the config.yml:

#' Connect to SQL Server Database
#'
#' This function takes a YAML key as input,
#' retrieves the database credentials from the YAML file,
#' and establishes a connection to the SQL Server database using the retrieved
#' credentials.
#' The database connection is then returned.
#'
#' @param yml_key A character string representing the YAML key for the
#' database credentials.
#'
#' @return A DBIConnection object representing the database connection.
#'
#' @examples
#' db_conn <- connect_sql_db("cred_db")
connect_sql_db <- function(yml_key) {
  # Get database credentials
  db_creds <- config::get(yml_key)

  # connect to the database
  db_conn <- DBI::dbConnect(
    odbc::odbc(),
    Driver = db_creds$driver,
    Server = db_creds$server,
    Database = db_creds$database,
    UID = db_creds$uid,
    PWD = db_creds$pwd,
    Trusted_connection = db_creds$trusted
  )

  db_conn
}

May be useful in providing a use case for the .Renviron most people will need when starting a project.