Closed kyle-messier closed 5 months ago
Developing functions to:
Direct all downloaded data to /ddn/gs1/group/SET/Projects/NRT-AP-model/input/data/covariates/
Buffer radius distances:
(If resolution is larger than any of the buffers, then only compute pixel value and buffers larger than variable resolution)
@Spatiotemporal-Exposures-and-Toxicology I think we did not talk about the locations at which the covariates are calculated. Would they be the unique sites in 2018-2022?
@sigmafelix that sounds right to me. We can also discuss on Monday to have everyone on the same page.
@MAKassien Sounds great. Thank you!
For covariate calculations, what is the expected number of unique AQS sites between 2018-01-01 and 2022-12-31? The filter_unique_sites function returns 1058 unique site identification codes.
> source("input/Rinput/processing_functions/filter_unique_sites.R")
> sites <- filter_unique_sites(include_time = FALSE,
+ date_start = "2018-01-01",
+ date_end = "2022-12-31")
> length(unique(sites$site_id))
[1] 1058
However, MODIS covariates calculated by @sigmafelix contain 1060 unique site identification codes.
> mod <- readRDS(
+ "/Volumes/SET/Projects/NRT-AP-Model/output/NRTAP_Covar_MCD19A2_AOD047.rds"
+ )
> length(unique(mod$site_id))
[1] 1060
@mitchellmanware Sorry for the confusion. Earlier this week, I realized that two sites in Mexico (along the US-Mexico border) with state code "80", which does not exist in actual census FIPS codes, were included in the first version of filter_unique_sites(). The covariates I am in charge of will be filtered when they are joined with all other covariates.
FYI: @Spatiotemporal-Exposures-and-Toxicology @MAKassien @dawranadeep @eva0marques @dzilber
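The filtering mentioned above could look roughly like the following. This is only a sketch: it assumes a hypothetical site table with a two-character `state_code` column, which may not match the columns actually returned by filter_unique_sites(), and the site identifiers are made up.

```r
# Hypothetical site table; state_code "80" marks the two sites in Mexico.
sites <- data.frame(
  site_id = c("A", "B", "C", "D"),
  state_code = c("37", "80", "06", "80"),
  stringsAsFactors = FALSE
)

# Drop rows whose state code is not a real census FIPS state code.
# Only "80" is excluded here because it is the code named in the thread.
invalid_state_codes <- c("80")
sites_us <- sites[!sites$state_code %in% invalid_state_codes, ]

nrow(sites_us)
```

Applying the same filter to each covariate table before joining would keep the site counts consistent across files.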
@eva0marques @MAKassien @dawranadeep I am happy to help with any of the processing/calculation functions or covariate unit tests if y'all have lots of other commitments. Feel free to message me on Teams or post here.
Hi @mitchellmanware, I am assigned to the extraction of the elevation covariate (#207). I planned to work on it later (I am currently back on the testdata task), but if you want to help, you are welcome on this task (but really: no obligation)!
@mitchellmanware
Inventory of calculated, ongoing, and missing covariates
Status of covariates based on contents of /ddn/gs1/group/set/Projects/NRT-AP-Model/output
> nrow(covariates)
[1] 316
> doBy::summary_by(covariates, name~code, FUN = length)
code name.length
1 Dummy 124
2 GEOS-CF 55
3 Land Use 21
4 Meteorological 30
5 MODIS 80
6 Wildfire 3
7 Population 3
>
To be calculated:
MERRA2 covariates will be removed from Excel spreadsheet based on previous discussions.
Things to consider:
Post-hoc analysis:
Loss function choice - a mean versus extreme
Second model, a logistic model that trains on above/below NAAQS
@mitchellmanware are you using a specific script to generate the NRTAP_Covars_* files?
I am using saveRDS() to save the data.frame of covariates that is returned from the calc_* function.
I did not use a specific script to read the .rds files and create the inventory.
@mitchellmanware @sigmafelix There is a file "covar_nlcd.rds". Is it the calculation of NLCD at monitor locations? It has 5,290 rows (in comparison, NRTAP_Covars_Ecoregion.rds has 1,058 🤔). I'm looking for the author so that I can figure out whether I have to calculate NLCD again!
Looks like the file creator is Insang. I believe the rows reflect a yearly value for each monitor location.
[manwareme@cn040603/triton output]$ ll
total 839552
-rw-r--r-- 1 songi2 grpkmessier 187660 Feb 21 14:25 covar_nlcd.rds
-rw-r--r-- 1 songi2 grpkmessier 7496476 Feb 21 16:42 covar_tri.rds
-rw-rw-r-- 1 songi2 grpkmessier 7380 Dec 11 10:57 NRTAP_Covars_Ecoregion.rds
-rw-r--r-- 1 manwareme grpkmessier 344328218 Dec 29 07:55 NRTAP_Covars_GEOS.rds
-rw-r--r-- 1 manwareme grpkmessier 45226 Jan 25 15:10 NRTAP_Covars_GMTED.rds
-rw-r--r-- 1 manwareme grpkmessier 4857765 Jan 12 12:01 NRTAP_Covars_HMS.rds
-rw-rw-r-- 1 songi2 grpkmessier 4841 Jan 24 14:34 NRTAP_Covars_Koppen_Geiger_AE_binary.rds
-rw-rw-r-- 1 songi2 grpkmessier 130295628 Jan 21 11:16 NRTAP_Covars_MCD19A2_2018_2022_new.rds
-rw-rw-r-- 1 songi2 grpkmessier 26837380 Jan 21 11:04 NRTAP_Covars_MOD06L2_2018_2022_new.rds
-rw-rw-r-- 1 songi2 grpkmessier 146477732 Jan 21 11:07 NRTAP_Covars_MOD09GA_2018_2022_new.rds
-rw-rw-r-- 1 songi2 grpkmessier 23969460 Jan 21 11:07 NRTAP_Covars_MOD11A1_2018_2022_new.rds
-rw-rw-r-- 1 songi2 grpkmessier 1420756 Jan 21 11:07 NRTAP_Covars_MOD13A2_2018_2022_new.rds
-rw-r--r-- 1 manwareme grpkmessier 153541644 Dec 29 07:55 NRTAP_Covars_NARR.rds
-rw-r--r-- 1 manwareme grpkmessier 24096 Feb 15 14:27 NRTAP_Covars_SEDAC_Pop.rds
-rw-rw-r-- 1 songi2 grpkmessier 69064 Dec 11 10:42 NRTAP_Covars_Timedummies.rds
-rw-rw-r-- 1 songi2 grpkmessier 19504920 Jan 21 11:08 NRTAP_Covars_VNP46A2_2018_2022_new.rds
> n <- readRDS("covar_nlcd.rds")
> colnames(n)
[1] "site_id" "lon" "lat"
[4] "time" "LDU_TWATR_0_00100" "LDU_TDVOS_0_00100"
[7] "LDU_TDVLO_0_00100" "LDU_TDVMI_0_00100" "LDU_TDVHI_0_00100"
[10] "LDU_TBARN_0_00100" "LDU_TDFOR_0_00100" "LDU_TEFOR_0_00100"
[13] "LDU_TMFOR_0_00100" "LDU_TSHRB_0_00100" "LDU_THERB_0_00100"
[16] "LDU_TPAST_0_00100" "LDU_TPLNT_0_00100" "LDU_TWDWT_0_00100"
[19] "LDU_THWEM_0_00100" "LDU_TUNCL_0_01000" "LDU_TWATR_0_01000"
[22] "LDU_TDVOS_0_01000" "LDU_TDVLO_0_01000" "LDU_TDVMI_0_01000"
[25] "LDU_TDVHI_0_01000" "LDU_TBARN_0_01000" "LDU_TDFOR_0_01000"
[28] "LDU_TEFOR_0_01000" "LDU_TMFOR_0_01000" "LDU_TSHRB_0_01000"
[31] "LDU_THERB_0_01000" "LDU_TPAST_0_01000" "LDU_TPLNT_0_01000"
[34] "LDU_TWDWT_0_01000" "LDU_THWEM_0_01000" "LDU_TUNCL_0_10000"
[37] "LDU_TWATR_0_10000" "LDU_TDVOS_0_10000" "LDU_TDVLO_0_10000"
[40] "LDU_TDVMI_0_10000" "LDU_TDVHI_0_10000" "LDU_TBARN_0_10000"
[43] "LDU_TDFOR_0_10000" "LDU_TEFOR_0_10000" "LDU_TMFOR_0_10000"
[46] "LDU_TSHRB_0_10000" "LDU_THERB_0_10000" "LDU_TPAST_0_10000"
[49] "LDU_TPLNT_0_10000" "LDU_TWDWT_0_10000" "LDU_THWEM_0_10000"
> unique(n$time)
[1] 2018 2019 2020 2021 2022
> length(unique(n$site_id))
[1] 1058
>
You're a better detective than I am 😂 thanks
@eva0marques @mitchellmanware I'm sorry to be late to check in. The file names are set as I am preparing a pipeline configuration file where all expected covariates (RDS files for both AQS sites and prediction grid points) are listed. As @mitchellmanware pointed out, the NLCD RDS file has 5,290=1,058*5 rows. 2019 NLCD was used in 2018-2020 and 2021 NLCD was used in 2021-2022.
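The year assignment and row count described above can be written out as a short check. This is a sketch only; `nlcd_year_for()` is a hypothetical helper name, not a function from the project.

```r
# Map each covariate year to the NLCD product year described above:
# 2018-2020 use NLCD 2019, 2021-2022 use NLCD 2021.
nlcd_year_for <- function(year) {
  ifelse(year <= 2020L, 2019L, 2021L)
}

years <- 2018:2022
nlcd_lookup <- data.frame(year = years, nlcd_year = nlcd_year_for(years))
nlcd_lookup

# Expected row count: one row per site per year.
n_sites <- 1058L
n_sites * length(years)
```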
@sigmafelix We can discuss that in the meeting. Perhaps compressed file types are needed, as in the fst package?
@Spatiotemporal-Exposures-and-Toxicology I will look at fst. fst only exports data.frame objects, but this restriction would not matter for our covariates as we have coordinates and time (either date or year) in them. It would be best if the fst writing function is applied in the pipeline.
Road density is calculated as (Total road length [km]) / (Buffer area [sq km]) and the results are stored at ./output.
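As a rough illustration of that formula, the sketch below assumes a circular buffer whose radius is given in meters. `road_density()` is a hypothetical helper, and the input values are made up.

```r
# Road density = total road length [km] / buffer area [sq km].
# Assumes a circular buffer whose radius is given in meters.
road_density <- function(total_road_length_km, buffer_radius_m) {
  buffer_area_sqkm <- pi * (buffer_radius_m / 1000)^2
  total_road_length_km / buffer_area_sqkm
}

# Example: 12.5 km of road inside a 1,000 m buffer.
road_density(12.5, 1000)
```

If the buffers are clipped (for example at a coastline or border), the actual polygon area would need to replace the circle area used here.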
Based on contents of /ddn/.../NRT-AP-model/output/, all covariates have been calculated at AQS monitor locations.
> nrow(covariates)
[1] 3838
> doBy::summary_by(covariates, name~category, FUN = length)
category name.length
1 DUM 124
2 EMI 3468
3 GEO 55
4 GRD 6
5 LDU 68
6 MET 30
7 MOD 80
8 OTH 6
9 TRF 1
I have cleaned up the column names for all of the covariate .rds files in /ddn/.../NRT-AP-model/output/. Main changes:
$time
$time follows "YYYY-MM-DD" format
$year
$year is 4 digit integer ranging from 2018 - 2022
$nei_year
NRTAP_Covars_NEI.rds $time or $year column
/renamed_columns
.rds files which I did not create, so the original NEI, NLCD, and TRI files have been moved to renamed_columns folder
@sigmafelix I suspect a small problem on MOD11A1 covariate, the dates of year 2021 are duplicated and 2020 is missing ("../output/NRTAP_Covars_MOD11A1_2018_2022_new.rds").
I'm sorry for the issues in the MOD11 covariates. I will investigate the issues when I can access the ddn.
Hey @sigmafelix, no hurry at all. For now I am just removing these duplicates and waiting for year 2020. Sorry to bother you on your day off!
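A temporary workaround like the one described could be sketched as follows. The table here is made up for illustration (one site, with 2021 repeated and 2020 absent), and the real MOD11A1 file may have a different structure.

```r
# Hypothetical covariate table: 2021 appears twice and 2020 is absent.
covar <- data.frame(
  site_id = rep("A", 5),
  time = as.Date(c("2018-01-01", "2019-01-01", "2021-01-01",
                   "2021-01-01", "2022-01-01")),
  value = c(1, 2, 3, 3, 4)
)

# Keep the first occurrence of each site_id/time pair.
covar_unique <- covar[!duplicated(covar[, c("site_id", "time")]), ]

# Report which expected years have no rows at all.
years_present <- as.integer(format(covar_unique$time, "%Y"))
missing_years <- setdiff(2018:2022, years_present)
missing_years
```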
I would like to open AQS data with site_id and time as stored in the NRTAP_Covars_*.rds files. As far as I understand, the files in the input/aqs folder do not contain the site_id column? Where did you format this index covariate? Maybe we should think about tidying the data storage folder on the ddn too. @sigmafelix and @mitchellmanware I feel quite lost right now on many points about the organization you chose. Here is a bunch of questions I would like to clarify:
I do not know where to find code or a vignette showing how and where you created the data, from raw files to the output folder. It is very complicated for me to follow you!
We should make an effort to be transparent about the choices we make, and sometimes I feel like information is complicated to access through the project board. Working with such a big team is really challenging :)
@eva0marques @sigmafelix @mitchellmanware @dzilber @dawranadeep We need to update and document these protocols. @eva0marques I apologize for not keeping up documentation with the pace of changes and the new approaches/tools that we are learning and developing. We will discuss at meeting on 3/28, 4/1, and/or 4/4. Thanks!
Sorry for not making these points clearer during the process - I will explain what I can.
output/
The output/ folder was a decision made at the beginning when the two packages were still one. The data was output from our functions and put in this folder, but I agree that we can create a more intuitive folder structure.
site_id
The site_id index was created as part of the original download_aqs function to create unique identifiers for the AQS monitoring sites at the national level. Monitor site numbers from the raw AQS files are county specific and therefore repeat, so this site_id allows us to have a single ID column for the monitor sites. This process is now in process_aqs as part of the splitting of download_aqs.
site_id
The filter_unique_sites() function takes the downloaded AQS files (tests/testdata/daily_88101_2018-2022.rds) and returns a data frame with each monitor location included in this period, the site_id, and the coordinates. I used this function to calculate covariates, but it has not been migrated to amadeus. I will add this to the processing functions in amadeus.
output/renamed_columns/ and NRT_Covar...new.rds
The .rds files had variations in column naming conventions (discussed in detail in this post https://github.com/NIEHS/beethoven/issues/186#issuecomment-1981152026). I cleaned them up to ensure the time-related columns have consistent names. I was not the creator of these files so I was unable to overwrite the original files or remove them from the shared repo. I created the /NRT_Covar...new.rds files to indicate which had the new column names and moved the old ones to the renamed_columns/ folder.
CAT_COVID_X_XXXXX
For example, the Meteorological: Accumulated Snow covariate with 0 lag days and 0 m buffer would be MET_ACSNW_0_00000.
@eva0marques @sigmafelix @kyle-messier @dawranadeep @dzilber
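The naming pattern above can be illustrated with a small helper. This is a sketch under assumptions: `make_covariate_name()` is hypothetical and not part of the project, and it reads the pattern as category, covariate code, lag in days, and buffer radius in meters zero-padded to five digits.

```r
# Build a covariate name following the CAT_COVID_X_XXXXX pattern.
# make_covariate_name() is a hypothetical helper, not a project function.
make_covariate_name <- function(category, covariate, lag_days, buffer_m) {
  sprintf("%s_%s_%d_%05d", category, covariate,
          as.integer(lag_days), as.integer(buffer_m))
}

# Accumulated snow, 0 lag days, 0 m buffer.
met_name <- make_covariate_name("MET", "ACSNW", 0, 0)
met_name

# A land-use column with a 100 m buffer, assuming the same pattern applies.
ldu_name <- make_covariate_name("LDU", "TWATR", 0, 100)
ldu_name
```

The first call reproduces the MET_ACSNW_0_00000 example given above.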
Thanks a lot @Mitchell for taking the time to reply :) @kyle-messier I think it is a good idea to take time at the next meeting to talk about a (better?) way to ensure that we keep track of all the big decisions we make!
@eva0marques I'm sorry for the confusion that I have caused. I think we discussed most points in issues or discussions.
Site.ID was used in our very first development of AQS data ingestion, and we shifted to the now-used site_id in favor of all-lowercase lat, lon, and time in stdt. I should have clarified the choice of the name.
For 2020-2021 MOD11 covariates, I found that 2021 MOD11 covariates were calculated twice and will fix it as soon as I have the corrected calculations.
Thanks @sigmafelix for your replies! I admit that I have some difficulty going back to discussions in past issues, because I do not always know where to look. Conversations can sometimes diverge until they're no longer on the same topic as the issue title.
@eva0marques Thanks all for bringing this up. As a large group project, things have changed as we learn on the fly. Additionally, some protocols in the beginning have changed and haven't been well documented - my apologies. For example, the "local" repo on the ddn is not staying up to date even though that is where our data for the analysis lives. My suggestion is going to be to develop an internal README for the project. It could exist as part of the Quarto markdown for the project manuscript.
@eva0marques Thank you for your patience. Newly calculated NRTAP_Covars_MOD11A2_2018_2022.rds is copied to ./output. I changed the file permission of the covariate files I created for reading and writing by
chmod 660 [filename]
Please feel free to move the files if you want per your suggestion for ./input/extracted
@eva0marques @sigmafelix @mitchellmanware The issues brought up here are being addressed with new issues.