bcgov / wqbench

R package to download and compile data from the EPA ECOTOX database
Apache License 2.0

`wqb_create_epa_ecotox()` table creation #4

Open · aylapear opened this issue 1 year ago

aylapear commented 1 year ago

Currently, the tables are created by passing a list of the table names and their corresponding primary keys:

db_tbl_core_structure <- function() {
  # names are table names, values are primary keys 
  list(
    "chemical_carriers" = "carrier_id",
    "dose_response_details" = "dose_resp_detail_id",
    "dose_response_links" = "dose_resp_link_id",
    "dose_responses" = "dose_resp_id",
    "doses" = "dose_id",
    "media_characteristics" = "result_id",
    "results" = "result_id",
    "tests" = "test_id"
  )
}
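
For illustration (not package code), the list is keyed by table name, so the primary key of any core table can be looked up directly:

    # look up a core table's primary key by name
    core <- db_tbl_core_structure()
    names(core)        # the eight core table names
    core[["results"]]  # "result_id", the primary key of the results table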

### there are about 40 reference/validation tables, but not all of them have primary keys assigned
db_tbl_validation_structure <- function() {
  # names are table names, values are primary keys 
  list(
    "application_frequency_codes" = "code",
    "application_type_codes" = "code",
    "chemical_analysis_codes" = "code",
    "chemical_formulation_codes" = NULL,
    "chemical_grade_codes" = NULL,
    "chemicals" = "cas_number",
    "concentration_type_codes" = NULL,
    "concentration_unit_codes" = NULL,
    "control_type_codes" = NULL,
    "dose_stat_method_codes" = NULL,
    "duration_unit_codes" = NULL, 
    "effect_codes" = "code",
    "endpoint_assigned_codes" = NULL,
    "endpoint_codes" = NULL,
    "exposure_type_codes" = "code",
    "field_study_type_codes" = "code",
    "gender_codes" = "code",
    "geographic_codes" = NULL,
    "habitat_codes" = NULL,
    "ion_codes" = NULL,
    "lifestage_codes" = "code",
    "measurement_codes" = "code",
    "media_char_unit_codes" = NULL,
    "media_type_codes" = "code",
    "organic_matter_type_codes" = NULL,
    "organism_source_codes" = "code",
    "radio_label_codes" = NULL,
    "references" = "reference_number", # make sure tbl name isnt issue
    "response_site_codes" = "code",
    "sample_size_unit_codes" = NULL,
    "season_codes" = NULL,
    "species_synonyms" = NULL, 
    "species" = "species_number",
    "statistical_significance_codes" = NULL,
    "substrate_codes" = NULL,
    "test_location_codes" = "code",
    "test_method_codes" = "code",
    "test_type_codes" = "code",
    "trend_codes" = "code",
    "weight_unit_codes" = NULL
  )
}
aylapear commented 1 year ago

This could be improved so that every validation table has a primary key, though it is known that the species_synonyms table no longer has a valid primary key because of some kind of duplication in the table.

Only a few tables have been reviewed and assigned a primary key, as they are the most used tables at this point. It could be worth taking the time to go through each remaining table and determine a primary key for the rest.
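
A small helper along these lines could speed up that review (an untested sketch; `is_valid_pk` is not part of the package). A column can only serve as a primary key if it is fully populated and contains no duplicates:

    # hypothetical helper: a candidate primary key column must have no
    # missing values and no duplicated values
    is_valid_pk <- function(dt, column) {
      x <- dt[[column]]
      !anyNA(x) && !anyDuplicated(x)
    }

For example, the duplication in species_synonyms should show up as `is_valid_pk(species_synonyms_data, "species_number")` returning FALSE (the data frame and column name here are assumptions for illustration).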

aylapear commented 1 year ago

Only the primary keys are being set. The list could be expanded into a table that also records the data type of each column, the foreign keys, and more, but it would take extensive time to compile that information in a form that could be used in an automated way. A rough shape for such a table is sketched after the snippet below.

    # for each validation table (data in dt), pre-create the table with its
    # primary key when one is defined in validation_data, then append the data
    if (!is.null(validation_data[[i]])) {
      # create the table explicitly so the PRIMARY KEY constraint is set;
      # only column names are listed, so no data types are declared
      query <- paste0(
        "CREATE TABLE [", names(validation_data[i]),
        "] (", paste(colnames(dt), collapse = ", "),
        ", PRIMARY KEY (", validation_data[[i]], "))"
      )
      DBI::dbExecute(con, query)
    }
    # if the table wasn't pre-created, dbWriteTable creates it (and infers
    # column types); otherwise it just appends the rows
    DBI::dbWriteTable(con,
      names(validation_data[i]), value = dt, append = TRUE, row.names = FALSE
    )
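
As a rough idea of what that expanded structure could look like (purely hypothetical; the function name and rows below are made up for illustration), the named list could become a table with one row per column:

    # hypothetical schema table: one row per column, so data types, primary
    # keys, and foreign keys can all be recorded in one place
    db_tbl_validation_schema <- function() {
      data.frame(
        tbl         = c("species", "species", "species_synonyms"),
        col         = c("species_number", "common_name", "species_number"),
        type        = c("INTEGER", "TEXT", "INTEGER"),
        primary_key = c(TRUE, FALSE, FALSE),
        foreign_key = c(NA, NA, "species.species_number")
      )
    }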

Tables created with a primary key don't have the data type of each column set, but tables created without a primary key do, because DBI::dbWriteTable infers the column types when it creates the table itself.
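
One possible way to close that gap (an untested sketch, assuming the same `dt`, `con`, and `validation_data` as the snippet above) is to ask DBI for the SQL type of each column and declare those types in the CREATE statement:

    # sketch: map each R column to its SQL type via DBI::dbDataType() and
    # include the declared types in the CREATE TABLE statement
    col_types <- vapply(dt, function(x) DBI::dbDataType(con, x), character(1))
    col_defs <- paste(colnames(dt), col_types, collapse = ", ")
    query <- paste0(
      "CREATE TABLE [", names(validation_data[i]),
      "] (", col_defs,
      ", PRIMARY KEY (", validation_data[[i]], "))"
    )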

[Screenshot: 2023-02-14, 8:46 AM]
aylapear commented 1 year ago

Since the database is created in its entirety each time, rather than being added to, this should be fine, but it could be a future improvement.