Closed Lysbethk closed 4 months ago
Create lists of datasets we want to merge, and specify how to merge them and by what columns.
List of 'overlapping' variables:
Groupings:
date, weight, growth_pct: ksf_clam_growth_data_tidied, ksf_oyster_cylinder_growth_data_tidied
date: ksf_compiled_data_tidied, tidal_data_tidied, weather_data_tidied
date, location, depth: water_samples_data_tidied, profiles_data_tidied
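A minimal sketch of how one of the groupings above could be turned into a merge, assuming base R `merge()` with a full outer join (`all = TRUE`) so that rows without a match are kept. The toy data frames, their values, and the `merge_by()` helper are hypothetical stand-ins, not the project's tidied datasets.

```r
# Toy stand-ins for two tidied datasets (all values are made up).
water_samples <- data.frame(
  date = as.Date(c("2023-11-28", "2023-12-21")),
  location = c("A", "A"),
  depth = c(1, 1),
  phosphate = c(0.10, 0.12)
)
profiles <- data.frame(
  date = as.Date(c("2023-11-28", "2024-01-09")),
  location = c("A", "A"),
  depth = c(1, 1),
  salinity = c(34.1, 34.3)
)

# Hypothetical helper: full outer join of a list of data frames on shared keys.
merge_by <- function(dfs, keys) {
  Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), dfs)
}

# Grouping "date, location, depth" from the list above.
merged <- merge_by(list(water_samples, profiles), c("date", "location", "depth"))
print(merged)
```

With `all = TRUE`, dates that appear in only one dataset survive the merge with `NA` in the other dataset's columns, which is what later imputation would need to fill.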
We want to run a correlational analysis across time, so we'll use conditional imputation based on the available data: linear interpolation for continuous variables and mode imputation for categorical variables. To do this, we first need to revert our initial replacement of NA values with zeros, because linear interpolation assumes the data points reflect actual measurements, and the placeholder zeros could skew the results.
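The imputation plan above could look roughly like the sketch below. It assumes base R only (`approx()` from the stats package for the linear interpolation) and a toy data frame with made-up values; `interpolate_linear()` and `impute_mode()` are hypothetical helper names, and the sketch assumes a zero never represents a real measurement in the column being restored.

```r
# Toy daily series (made-up values); zeros stand in for values that were NA.
df <- data.frame(
  date = as.Date("2024-01-01") + 0:4,
  salinity = c(34.0, 0, 0, 35.5, 36.0),
  wind_direction = c("NE", "NE", NA, "E", "NE"),
  stringsAsFactors = FALSE
)

# Step 1: turn the placeholder zeros back into NA.
# Assumption: zero is never a genuine salinity reading in this toy data.
df$salinity[df$salinity == 0] <- NA

# Step 2: linear interpolation over time for a continuous variable.
# approx() ignores NA pairs and interpolates at the requested dates;
# with its default rule it returns NA outside the observed date range.
interpolate_linear <- function(x, dates) {
  approx(x = as.numeric(dates), y = x, xout = as.numeric(dates))$y
}

# Step 3: mode imputation for a categorical variable.
impute_mode <- function(x) {
  counts <- table(x)  # table() drops NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

df$salinity <- interpolate_linear(df$salinity, df$date)
df$wind_direction <- impute_mode(df$wind_direction)
print(df)
```

Interpolating against the date rather than the row index matters here because the sampling dates are unevenly spaced.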
Although the date columns of each dataset fall within a specific date range, the collection dates do not line up across datasets, which makes the datasets hard to merge.
> unique(ksf_clam_growth_data_tidied$date)
[1] "2023-10-17" "2023-12-06"
[3] "2023-12-12" "2024-01-02"
[5] "2024-01-10" "2024-01-24"
[7] "2024-01-31" "2024-02-08"
[9] "2024-02-13"
> unique(ksf_oyster_cylinder_growth_data_tidied$date)
[1] "2023-11-20" "2023-11-27"
[3] "2023-12-08" "2023-12-11"
[5] "2023-12-18" "2023-12-28"
[7] "2024-01-01" "2024-01-05"
[9] "2024-01-08" "2024-01-12"
[11] "2024-01-17" "2024-01-23"
[13] "2024-02-14"
> unique(second_merge$date)
[1] "2023-11-28" "2023-12-21"
[3] "2024-01-09" "2024-01-30"
[5] "2024-02-20" "2023-11-20"
[7] "2023-11-21" "2023-11-22"
[9] "2023-11-23" "2023-11-24"
[11] "2023-11-25" "2023-11-26"
[13] "2023-11-27" "2023-11-29"
[15] "2023-11-30" "2023-12-01"
[17] "2023-12-02" "2023-12-03"
[19] "2023-12-04" "2023-12-05"
[21] "2023-12-06" "2023-12-07"
[23] "2023-12-08" "2023-12-09"
[25] "2023-12-10" "2023-12-11"
[27] "2023-12-12" "2023-12-13"
[29] "2023-12-14" "2023-12-15"
[31] "2023-12-16" "2023-12-17"
[33] "2023-12-18" "2023-12-19"
[35] "2023-12-20" "2023-12-22"
[37] "2023-12-23" "2023-12-24"
[39] "2023-12-25" "2023-12-26"
[41] "2023-12-27" "2023-12-28"
[43] "2023-12-29" "2023-12-30"
[45] "2023-12-31" "2024-01-01"
[47] "2024-01-02" "2024-01-03"
[49] "2024-01-04" "2024-01-05"
[51] "2024-01-06" "2024-01-07"
[53] "2024-01-08" "2024-01-10"
[55] "2024-01-11" "2024-01-12"
[57] "2024-01-13" "2024-01-14"
[59] "2024-01-15" "2024-01-16"
[61] "2024-01-17" "2024-01-18"
[63] "2024-01-19" "2024-01-20"
[65] "2024-01-21" "2024-01-22"
[67] "2024-01-23" "2024-01-24"
[69] "2024-01-25" "2024-01-26"
[71] "2024-01-27" "2024-01-28"
[73] "2024-01-29" "2024-01-31"
[75] "2024-02-01" "2024-02-02"
[77] "2024-02-03" "2024-02-04"
[79] "2024-02-05" "2024-02-06"
[81] "2024-02-07" "2024-02-08"
[83] "2024-02-09" "2024-02-10"
[85] "2024-02-11" "2024-02-12"
[87] "2024-02-13" "2024-02-14"
[89] "2024-02-15" "2024-02-16"
[91] "2024-02-17" "2024-02-18"
[93] "2024-02-19"
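The mismatch visible in the listings above can also be checked directly. This is a minimal sketch using the clam and oyster dates copied from the output above; the nearest-date gap is only one possible way to quantify the mismatch and is an assumption, not the approach used in this project.

```r
# Dates copied from the unique() output above.
clam_dates <- as.Date(c(
  "2023-10-17", "2023-12-06", "2023-12-12", "2024-01-02", "2024-01-10",
  "2024-01-24", "2024-01-31", "2024-02-08", "2024-02-13"
))
oyster_dates <- as.Date(c(
  "2023-11-20", "2023-11-27", "2023-12-08", "2023-12-11", "2023-12-18",
  "2023-12-28", "2024-01-01", "2024-01-05", "2024-01-08", "2024-01-12",
  "2024-01-17", "2024-01-23", "2024-02-14"
))

# Dates present in both datasets; an exact join on date keeps only these.
shared <- as.Date(intersect(as.character(clam_dates), as.character(oyster_dates)))

# For each clam date, the gap in days to the closest oyster date.
oyster_num <- as.numeric(oyster_dates)
gap_days <- sapply(as.numeric(clam_dates), function(d) min(abs(oyster_num - d)))

print(length(shared))
print(data.frame(clam_date = clam_dates, gap_days = gap_days))
```

The two growth datasets share no exact dates, so an inner join on `date` would return no rows; a full join followed by imputation, or matching within a tolerance, would be needed instead.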
> names(second_merge)
[1] "date"
[2] "round"
[3] "location"
[4] "depth"
[5] "water_temperature"
[6] "dissolved_oxygen"
[7] "salinity"
[8] "ksf_rdo_concentration"
[9] "ksf_rdo_saturation"
[10] "ksf_oxygen_partial_pressure"
[11] "ksf_actual_conductivity"
[12] "ksf_specific_conductivity"
[13] "ksf_salinity"
[14] "ksf_density"
[15] "ksf_total_dissolved_solids"
[16] "ksf_chlorophyll_a_fluorescence"
[17] "ksf_ammonium"
[18] "ksf_ammonium_m_v"
[19] "ksf_barometric_pressure"
[20] "outdoor_temperature"
[21] "wind_speed_mph"
[22] "hourly_rain_inch_hr"
[23] "wind_direction"
[24] "time"
[25] "pred"
[26] "high_low"
Removed the tidal data because it has few observations and its dates are not close to those in the profiles and water samples datasets.
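A sketch of how the tidal columns could be dropped from the merged data, assuming the tidal columns are the non-key columns of `tidal_data_tidied` (`time`, `pred`, `high_low`). The toy `second_merge` below holds made-up values and only a few of the real columns, and `second_merge_no_tide` is a hypothetical name.

```r
# Toy stand-in for second_merge (values are made up, most columns omitted).
second_merge <- data.frame(
  date = as.Date(c("2023-11-28", "2023-12-21")),
  salinity = c(34.1, 34.3),
  time = c("06:12", NA),
  pred = c(1.8, NA),
  high_low = c("H", NA)
)

tidal_cols <- c("time", "pred", "high_low")

# Share of missing values per tidal column, to see how sparse they are.
print(colMeans(is.na(second_merge[tidal_cols])))

# Keep every column that is not a tidal column.
keep <- !(names(second_merge) %in% tidal_cols)
second_merge_no_tide <- second_merge[, keep, drop = FALSE]
print(names(second_merge_no_tide))
```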
[ ] ksf_clam_growth_data_tidied
[1] "date" "days_btwn_sort" "color"
[4] "stage" "count" "lbs"
[7] "avg_per_lbs" "growth_in_lbs" "growth_pct"
[10] "sr"
[ ] ksf_compiled_data_tidied
[1] "date" "rdo_concentration"
[3] "rdo_saturation" "oxygen_partial_pressure"
[5] "actual_conductivity" "specific_conductivity"
[7] "salinity" "density"
[9] "total_dissolved_solids" "chlorophyll_a_fluorescence"
[11] "ammonium" "ammonium_m_v"
[13] "barometric_pressure"
[ ] ksf_oyster_cylinder_growth_data_tidied
[1] "date" "oyster_chlorophyll" "oyster_size"
[4] "weight" "gain"
[ ] water_samples_data_tidied
[1] "sample_id" "nomilo_id"
[3] "round" "date"
[5] "location" "depth"
[7] "chlorophyll_a" "phosphate"
[9] "silicate" "nitrate_nitrite"
[11] "ammonia" "heterotrophic_bacteria"
[13] "large_phytoplankton" "synechococcus_population_1"
[15] "synechococcus_population_2" "prochlorococcus"
[17] "lysbeths_mystery_cells_events" "tube_name"
[ ] tidal_data_tidied
[1] "date" "time" "pred" "high_low"
[ ] weather_data_tidied
[1] "date" "outdoor_temperature" "wind_speed_mph"
[4] "hourly_rain_inch_hr" "wind_direction"
[ ] profiles_data_tidied
[1] "depth" "water_temperature" "dissolved_oxygen"
[4] "salinity" "conductivity" "visibility"
[7] "location" "date"
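The candidate merge keys can be derived from the column listings above rather than read off by eye. This is a minimal sketch with the column names for four of the datasets copied from the checklist; the `cols` list and the pairwise comparison are illustrative assumptions about how one might check this.

```r
# Column names copied from the checklist above.
cols <- list(
  water_samples_data_tidied = c(
    "sample_id", "nomilo_id", "round", "date", "location", "depth",
    "chlorophyll_a", "phosphate", "silicate", "nitrate_nitrite", "ammonia",
    "heterotrophic_bacteria", "large_phytoplankton",
    "synechococcus_population_1", "synechococcus_population_2",
    "prochlorococcus", "lysbeths_mystery_cells_events", "tube_name"
  ),
  profiles_data_tidied = c(
    "depth", "water_temperature", "dissolved_oxygen", "salinity",
    "conductivity", "visibility", "location", "date"
  ),
  weather_data_tidied = c(
    "date", "outdoor_temperature", "wind_speed_mph",
    "hourly_rain_inch_hr", "wind_direction"
  ),
  tidal_data_tidied = c("date", "time", "pred", "high_low")
)

# Shared column names for every pair of datasets.
pairs <- combn(names(cols), 2, simplify = FALSE)
shared <- lapply(pairs, function(p) intersect(cols[[p[1]]], cols[[p[2]]]))
names(shared) <- sapply(pairs, paste, collapse = " & ")
print(shared)
```

In a live session the same check could run on `names()` of the actual data frames instead of the hard-coded vectors.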