NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Geographic covariate list to develop #186

Closed kyle-messier closed 5 months ago

kyle-messier commented 9 months ago
MAKassien commented 9 months ago

Developing functions to:

  1. Download covariates
  2. Calculate relevant quantities at specified locations (first will be for AQS locations, later used to calculate at prediction grid locations)

MAKassien commented 9 months ago

Direct all downloaded data to /ddn/gs1/group/SET/Projects/NRT-AP-model/input/data/covariates/

MAKassien commented 9 months ago

Buffer radius distances:

(If a covariate's resolution is larger than any of the buffers, compute only the pixel value and the buffers larger than the covariate's resolution.)
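
A minimal sketch of the buffer rule above, assuming all distances are in meters. The helper name `select_buffers` is hypothetical, and the radii 100, 1000, and 10000 are placeholders borrowed from the NLCD column suffixes later in this thread, not a confirmed buffer list. A value of 0 stands for the pixel value at the point itself.

```r
# Hypothetical helper: decide which buffer radii to compute for one covariate.
select_buffers <- function(radii_m, resolution_m) {
  # The rule only applies when the resolution exceeds at least one buffer
  if (any(resolution_m > radii_m)) {
    # Keep the pixel value (0) plus buffers strictly larger than the resolution
    return(c(0, radii_m[radii_m > resolution_m]))
  }
  radii_m
}

radii <- c(100, 1000, 10000)            # placeholder radii (assumption)
fine <- select_buffers(radii, 30)       # fine raster: all buffers kept
coarse <- select_buffers(radii, 1000)   # coarse raster: pixel value + 10 km
```

Whether a buffer exactly equal to the resolution should be kept is not stated in the thread; the sketch drops it, following the "larger than" wording.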

sigmafelix commented 9 months ago

@Spatiotemporal-Exposures-and-Toxicology I think we did not talk about the locations at which the covariates are calculated. Would they be the unique sites in 2018-2022?

MAKassien commented 9 months ago

@sigmafelix that sounds right to me. We can also discuss on Monday to have everyone on the same page.

sigmafelix commented 9 months ago

@MAKassien Sounds great. Thank you!

mitchellmanware commented 9 months ago

For covariate calculations, what is the expected number of unique AQS sites between 2018-01-01 and 2022-12-31? The filter_unique_sites function returns 1058 unique site identification codes.

> source("input/Rinput/processing_functions/filter_unique_sites.R")
> sites <- filter_unique_sites(include_time = FALSE,
+                              date_start = "2018-01-01",
+                              date_end = "2022-12-31")
> length(unique(sites$site_id))
[1] 1058

However, MODIS covariates calculated by @sigmafelix contain 1060 unique site identification codes.

> mod <- readRDS(
+   "/Volumes/SET/Projects/NRT-AP-Model/output/NRTAP_Covar_MCD19A2_AOD047.rds"
+ )
> length(unique(mod$site_id))
[1] 1060

sigmafelix commented 9 months ago

@mitchellmanware Sorry for the confusion. Earlier this week, I realized that two sites in Mexico (along the US-Mexico border) were included in the first version of filter_unique_sites(). Their state code, "80", does not exist in actual census FIPS codes. The covariates I am responsible for will be filtered when they are joined with all other covariates.

FYI: @Spatiotemporal-Exposures-and-Toxicology @MAKassien @dawranadeep @eva0marques @dzilber
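
A minimal sketch of how sites with state code "80" could be dropped before a join. It assumes that site_id is a character string whose first two characters are the state code; that layout is an assumption for illustration, and the IDs below are made up.

```r
# Toy site table; the IDs and the "state code first" layout are assumptions.
sites <- data.frame(
  site_id = c("01073002388101", "06037400288101", "80002001488101"),
  stringsAsFactors = FALSE
)

# Take the leading two characters as the state code
state_code <- substr(sites$site_id, 1, 2)

# Keep only sites whose state code is not the nonexistent "80"
sites_us <- sites[state_code != "80", , drop = FALSE]
nrow(sites_us)
```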

mitchellmanware commented 8 months ago

@eva0marques @MAKassien @dawranadeep I am happy to help with any of the processing/calculation functions or covariate unit tests if y'all have lots of other commitments. Feel free to message me on Teams or post here.

eva0marques commented 7 months ago

Hi @mitchellmanware, I have been assigned the extraction of the elevation covariate (#207). I planned to work on it later (I am currently back on the testdata task), but if you want to help you are welcome on this task (but really: no obligation)!

mitchellmanware commented 7 months ago

@mitchellmanware

Inventory of calculated, ongoing, and missing covariates

mitchellmanware commented 7 months ago

Covariate Update 02/15/2024

Status of covariates based on contents of /ddn/gs1/group/set/Projects/NRT-AP-Model/output

> nrow(covariates)
[1] 316
> doBy::summary_by(covariates, name~code, FUN = length)
            code name.length
1          Dummy         124
2        GEOS-CF          55
3       Land Use          21
4 Meteorological          30
5          MODIS          80
6       Wildfire           3
7     Population           3
>

To be calculated:

MERRA2 covariates will be removed from Excel spreadsheet based on previous discussions.

kyle-messier commented 6 months ago

Things to consider:

  1. Training to large events (extremes of the distribution)
  2. Indicator for exceeding NAAQS
  3. A general mean model, but not too smooth.
  4. Localized variables for fine S-T roughness

Post-hoc analysis:

  1. Understanding the impact of S-T smoothing on capturing local events

kyle-messier commented 6 months ago

Loss function choice: mean versus extreme. Second model: a logistic model that trains on above/below NAAQS.

eva0marques commented 6 months ago

@mitchellmanware are you using a specific script to generate NRTAPCovars.rds in the output/ folder?

mitchellmanware commented 6 months ago

I am using saveRDS() to save the data.frame of covariates that is returned from the calc_* function.

I did not use a specific script to read the .rds files and create the inventory.
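
A minimal sketch of the save step described above, assuming a small stand-in data.frame in place of the real output of a calc_* function. The file name and the temporary directory are placeholders, not the project's actual output path.

```r
# Stand-in for the data.frame returned by a calc_* function (assumption)
covar <- data.frame(
  site_id = c("A", "B"),
  MET_ACSNW_0_00000 = c(0.0, 1.2),
  stringsAsFactors = FALSE
)

# Write to a temporary location rather than the shared output folder
path <- file.path(tempdir(), "NRTAP_Covars_example.rds")
saveRDS(covar, path)

# Read the file back and compare it with the original object
restored <- readRDS(path)
all.equal(covar, restored)
```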

eva0marques commented 6 months ago

@mitchellmanware @sigmafelix There is a file "covar_nlcd.rds". Is it the calculation of NLCD at monitor locations? It has 5,290 rows (in comparison, NRTAP_Covars_Ecoregion.rds has 1,058 🤔). I'm looking for the author so that I can figure out if I have to calculate NLCD again!

mitchellmanware commented 6 months ago

Looks like the file creator is Insang. I believe the rows reflect a yearly value for each monitor location.

[manwareme@cn040603/triton output]$ ll
total 839552
-rw-r--r-- 1 songi2    grpkmessier    187660 Feb 21 14:25 covar_nlcd.rds
-rw-r--r-- 1 songi2    grpkmessier   7496476 Feb 21 16:42 covar_tri.rds
-rw-rw-r-- 1 songi2    grpkmessier      7380 Dec 11 10:57 NRTAP_Covars_Ecoregion.rds
-rw-r--r-- 1 manwareme grpkmessier 344328218 Dec 29 07:55 NRTAP_Covars_GEOS.rds
-rw-r--r-- 1 manwareme grpkmessier     45226 Jan 25 15:10 NRTAP_Covars_GMTED.rds
-rw-r--r-- 1 manwareme grpkmessier   4857765 Jan 12 12:01 NRTAP_Covars_HMS.rds
-rw-rw-r-- 1 songi2    grpkmessier      4841 Jan 24 14:34 NRTAP_Covars_Koppen_Geiger_AE_binary.rds
-rw-rw-r-- 1 songi2    grpkmessier 130295628 Jan 21 11:16 NRTAP_Covars_MCD19A2_2018_2022_new.rds
-rw-rw-r-- 1 songi2    grpkmessier  26837380 Jan 21 11:04 NRTAP_Covars_MOD06L2_2018_2022_new.rds
-rw-rw-r-- 1 songi2    grpkmessier 146477732 Jan 21 11:07 NRTAP_Covars_MOD09GA_2018_2022_new.rds
-rw-rw-r-- 1 songi2    grpkmessier  23969460 Jan 21 11:07 NRTAP_Covars_MOD11A1_2018_2022_new.rds
-rw-rw-r-- 1 songi2    grpkmessier   1420756 Jan 21 11:07 NRTAP_Covars_MOD13A2_2018_2022_new.rds
-rw-r--r-- 1 manwareme grpkmessier 153541644 Dec 29 07:55 NRTAP_Covars_NARR.rds
-rw-r--r-- 1 manwareme grpkmessier     24096 Feb 15 14:27 NRTAP_Covars_SEDAC_Pop.rds
-rw-rw-r-- 1 songi2    grpkmessier     69064 Dec 11 10:42 NRTAP_Covars_Timedummies.rds
-rw-rw-r-- 1 songi2    grpkmessier  19504920 Jan 21 11:08 NRTAP_Covars_VNP46A2_2018_2022_new.rds
> n <- readRDS("covar_nlcd.rds")
> colnames(n)
 [1] "site_id"           "lon"               "lat"              
 [4] "time"              "LDU_TWATR_0_00100" "LDU_TDVOS_0_00100"
 [7] "LDU_TDVLO_0_00100" "LDU_TDVMI_0_00100" "LDU_TDVHI_0_00100"
[10] "LDU_TBARN_0_00100" "LDU_TDFOR_0_00100" "LDU_TEFOR_0_00100"
[13] "LDU_TMFOR_0_00100" "LDU_TSHRB_0_00100" "LDU_THERB_0_00100"
[16] "LDU_TPAST_0_00100" "LDU_TPLNT_0_00100" "LDU_TWDWT_0_00100"
[19] "LDU_THWEM_0_00100" "LDU_TUNCL_0_01000" "LDU_TWATR_0_01000"
[22] "LDU_TDVOS_0_01000" "LDU_TDVLO_0_01000" "LDU_TDVMI_0_01000"
[25] "LDU_TDVHI_0_01000" "LDU_TBARN_0_01000" "LDU_TDFOR_0_01000"
[28] "LDU_TEFOR_0_01000" "LDU_TMFOR_0_01000" "LDU_TSHRB_0_01000"
[31] "LDU_THERB_0_01000" "LDU_TPAST_0_01000" "LDU_TPLNT_0_01000"
[34] "LDU_TWDWT_0_01000" "LDU_THWEM_0_01000" "LDU_TUNCL_0_10000"
[37] "LDU_TWATR_0_10000" "LDU_TDVOS_0_10000" "LDU_TDVLO_0_10000"
[40] "LDU_TDVMI_0_10000" "LDU_TDVHI_0_10000" "LDU_TBARN_0_10000"
[43] "LDU_TDFOR_0_10000" "LDU_TEFOR_0_10000" "LDU_TMFOR_0_10000"
[46] "LDU_TSHRB_0_10000" "LDU_THERB_0_10000" "LDU_TPAST_0_10000"
[49] "LDU_TPLNT_0_10000" "LDU_TWDWT_0_10000" "LDU_THWEM_0_10000"
> unique(n$time)
[1] 2018 2019 2020 2021 2022
> length(unique(n$site_id))
[1] 1058
> 

eva0marques commented 6 months ago

You're a better detective than I am 😂 thanks

sigmafelix commented 6 months ago

@eva0marques @mitchellmanware Sorry for checking in late. The file names are set as I am preparing a pipeline configuration file in which all expected covariates (RDS files for both AQS sites and prediction grid points) are listed. As @mitchellmanware pointed out, the NLCD RDS file has 5,290 = 1,058 × 5 rows. The 2019 NLCD was used for 2018-2020 and the 2021 NLCD was used for 2021-2022.
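
The year mapping and the row count in this comment can be sketched as follows. The function name `nlcd_year_for` is hypothetical; the cutoffs and the site count come from the comment above.

```r
# Hypothetical helper: NLCD product year used for each covariate year
nlcd_year_for <- function(year) ifelse(year <= 2020, 2019L, 2021L)

years <- 2018:2022
mapping <- data.frame(year = years, nlcd_year = nlcd_year_for(years))

# One row per site per year: 1,058 sites over 5 years
n_sites <- 1058L
expected_rows <- n_sites * length(years)
expected_rows  # 5290
```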

kyle-messier commented 6 months ago

@sigmafelix We can discuss that in the meeting. Perhaps compressed file types are needed as in the fst package?

sigmafelix commented 6 months ago

@Spatiotemporal-Exposures-and-Toxicology I will look at fst. fst only exports data.frame objects, but this restriction should not matter for our covariates because they already carry coordinates and time (either date or year) as columns. It would be best if the fst writing function were applied in the pipeline.


Road density is calculated as (Total road length [km]) / (Buffer area [sq km]) and the results are stored at ./output.
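
A worked example of the road density formula above, assuming a circular buffer so that the buffer area is π r². The function name and the input values are illustrative only.

```r
# Road density = total road length [km] / buffer area [sq km]
# Assumes a circular buffer of the given radius (in meters).
road_density <- function(total_length_km, radius_m) {
  buffer_area_sqkm <- pi * (radius_m / 1000)^2
  total_length_km / buffer_area_sqkm
}

# 12.5 km of road inside a 1 km buffer: 12.5 / pi, roughly 3.98 km per sq km
d <- road_density(total_length_km = 12.5, radius_m = 1000)
```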

mitchellmanware commented 6 months ago

Based on contents of /ddn/.../NRT-AP-model/output/, all covariates have been calculated at AQS monitor locations.

> nrow(covariates)
[1] 3838
> doBy::summary_by(covariates, name~category, FUN = length)
  category name.length
1      DUM         124
2      EMI        3468
3      GEO          55
4      GRD           6
5      LDU          68
6      MET          30
7      MOD          80
8      OTH           6
9      TRF           1

mitchellmanware commented 6 months ago

I have cleaned up the column names for all of the covariate .rds files in /ddn/.../NRT-AP-model/output/. Main changes:

  1. $time
    • "time" column for covariates calculated at daily values
    • $time follows "YYYY-MM-DD" format
    • Includes GEOS, HMS, MODIS, NARR, Time Dummies
  2. $year
    • "year" column for covariates with raw data at annual temporal resolution
    • $year is a 4-digit integer ranging from 2018 to 2022
    • Includes NEI, NLCD, TRI
  3. $nei_year
    • "nei_year" column removed from NRTAP_Covars_NEI.rds
  4. Time-less
    • Covariates with single value for entire temporal range do not have $time or $year column
    • Includes Ecoregion, GMTED, Koppen Geiger, SEDAC gRoads, SEDAC Population
  5. /renamed_columns
    • I cannot overwrite .rds files which I did not create, so the original NEI, NLCD, and TRI files have been moved to renamed_columns folder
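
A minimal sketch of a check against the column conventions listed above. The function name `check_time_columns` and its return labels are hypothetical; only the "YYYY-MM-DD" format and the 2018 to 2022 range come from the list.

```r
# Hypothetical check: classify a covariate table by its time-related column.
check_time_columns <- function(df) {
  if ("time" %in% names(df)) {
    # Daily covariates: $time must look like "YYYY-MM-DD"
    ok <- all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", as.character(df$time)))
    return(if (ok) "daily" else "invalid")
  }
  if ("year" %in% names(df)) {
    # Annual covariates: $year must fall within 2018 to 2022
    ok <- all(df$year %in% 2018:2022)
    return(if (ok) "annual" else "invalid")
  }
  # Neither column present: a single value for the whole period
  "time-less"
}

check_time_columns(data.frame(site_id = "A", time = "2018-01-01"))
check_time_columns(data.frame(site_id = "A", year = 2019L))
check_time_columns(data.frame(site_id = "A"))
```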

eva0marques commented 5 months ago

@sigmafelix I suspect a small problem with the MOD11A1 covariate: the dates for 2021 are duplicated and 2020 is missing ("../output/NRTAP_Covars_MOD11A1_2018_2022_new.rds").
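
A minimal sketch of how duplicated dates and a missing year could be detected, using a toy table rather than the real MOD11A1 file. The column names site_id and time follow the conventions described earlier in this thread; the data are made up.

```r
# Toy covariate table: 2021 dates appear twice and 2020 is absent
dates <- c(
  seq(as.Date("2019-12-30"), as.Date("2019-12-31"), by = "day"),
  rep(seq(as.Date("2021-01-01"), as.Date("2021-01-02"), by = "day"), 2)
)
covar <- data.frame(site_id = "A", time = format(dates),
                    stringsAsFactors = FALSE)

# Rows that repeat an earlier (site_id, time) pair
dup_rows <- duplicated(covar[, c("site_id", "time")])

# Years expected in the toy range but absent from the table
years_present <- unique(as.integer(substr(covar$time, 1, 4)))
missing_years <- setdiff(2019:2021, years_present)

# Drop the repeats, keeping the first occurrence of each pair
covar_clean <- covar[!dup_rows, , drop = FALSE]
```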

sigmafelix commented 5 months ago

I'm sorry for the issues in the MOD11 covariates. I will investigate the issues when I can access the ddn.


From: Eva Marques. Sent: Wednesday, March 27, 2024 11:41:36 AM. To: NIEHS/beethoven. Cc: Insang Song; Mention. Subject: Re: [NIEHS/beethoven] Geographic covariate list to develop (Issue #186)

eva0marques commented 5 months ago

Hey @sigmafelix, no hurry at all. For now I am just removing these duplicates and will wait for year 2020. Sorry to bother you on your day off!

eva0marques commented 5 months ago

I would like to open the AQS data with site_id and time as stored in the NRTAP_Covars_*.rds files. As far as I understand, the files in the input/aqs folder do not contain the site_id column? Where was this index column created? Maybe we should think about tidying the data storage folder on the ddn too. @sigmafelix and @mitchellmanware, I feel quite lost right now on many points about the organization you chose. Here are some questions I would like to clarify:

I do not know where to find code or a vignette showing how and where you processed the raw files into the output folder. It is very hard for me to follow!

We should make an effort to be transparent about the choices we make. Sometimes I feel that information is hard to find through the project board. Working with such a big team is really challenging :)

kyle-messier commented 5 months ago

@eva0marques @sigmafelix @mitchellmanware @dzilber @dawranadeep We need to update and document these protocols. @eva0marques I apologize for not keeping the documentation up with the pace of changes and the new approaches/tools that we are learning and developing. We will discuss this at the meetings on 3/28, 4/1, and/or 4/4. Thanks!

mitchellmanware commented 5 months ago

Sorry for not making these points clearer during the process. I will explain what I can.

  1. output/
    • Putting the processed covariate data into the output/ folder was a decision made at the beginning when the two packages were still one. The data was output from our functions and put in this folder, but I agree that we can create a more intuitive folder structure.
  2. site_id
    • The site_id index was created as part of the original download_aqs function to provide unique identifiers for the AQS monitoring sites at the national level. Monitor site numbers from the raw AQS files are county specific and therefore repeat, so this site_id allows us to have a single ID column for the monitor sites. This process is now in process_aqs as part of the splitting of download_aqs.
    • https://github.com/NIEHS/amadeus/blob/394197fccb685ad573294025092de538f14d538c/R/process.R#L969
  3. Sensor metadata and site_id
    • We do not (to my knowledge) have a single file with all monitor metadata, but this is a good point and we should have one.
    • The filter_unique_sites() function takes the downloaded AQS files (tests/testdata/daily_88101_2018-2022.rds) and returns a data frame with each monitor location included in this period, the site_id, and the coordinates. I used this function to calculate covariates, but it has not been migrated to amadeus. I will add this to the processing functions in amadeus.
    • https://github.com/NIEHS/beethoven/blob/main/input/Rinput/processing_functions/filter_unique_sites.R
  4. output/renamed_columns/ and NRT_Covar...new.rds
    • Covariate .rds files had variations in column naming conventions (discussed in detail in this post https://github.com/NIEHS/beethoven/issues/186#issuecomment-1981152026). I cleaned them up to ensure the time-related columns have consistent names. I was not the creator of these files, so I was unable to overwrite the original files or remove them from the shared repo. I created the /NRT_Covar...new.rds files to indicate which files had the new column names and moved the old ones to the renamed_columns/ folder.
  5. Covariate column nomenclature
    • The original covariate naming format was discussed (I cannot find the discussion post), and was stored in the covariates Excel file on Microsoft Teams (NIEHS-SET/Group-AP-Model/Files/NRT-AP-Model_Covariates.xlsx).
    • Naming convention was 3 letter category + 5 letter covariate ID + 1 digit lag days + 5 digit padded buffer radius (CAT_COVID_X_XXXXX). For example, the Meteorological: Accumulated Snow covariate with 0 lag days and 0 m buffer would be MET_ACSNW_0_00000.
    • This may not have been followed, and the individual 5-letter covariate IDs were not discussed, so we should review what we have and update the Excel file with long and short covariate names, codes, and descriptions. The Excel file does not reflect the additional NEI or TRI variables.
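
The naming convention in point 5 (CAT_COVID_X_XXXXX) can be sketched as a small formatter. The function name `covariate_name` and its input checks are hypothetical; the two calls reproduce names that already appear in this thread.

```r
# Hypothetical formatter for the CAT_COVID_X_XXXXX convention:
# 3-letter category, 5-letter covariate ID, 1-digit lag, 5-digit padded buffer.
covariate_name <- function(category, id, lag_days, buffer_m) {
  stopifnot(
    nchar(category) == 3, nchar(id) == 5,
    lag_days >= 0, lag_days <= 9,
    buffer_m >= 0, buffer_m <= 99999
  )
  sprintf("%s_%s_%d_%05d", category, id,
          as.integer(lag_days), as.integer(buffer_m))
}

covariate_name("MET", "ACSNW", 0, 0)    # "MET_ACSNW_0_00000"
covariate_name("LDU", "TWATR", 0, 100)  # "LDU_TWATR_0_00100"
```

Validating names with one function like this would make it easier to spot columns that drifted from the convention.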

@eva0marques @sigmafelix @kyle-messier @dawranadeep @dzilber

eva0marques commented 5 months ago

Thanks a lot @Mitchell for taking the time to reply :) @kyle-messier I think it is a good idea to take time at the next meeting to talk about a (better?) way to ensure that we keep track of all the big decisions we make!

sigmafelix commented 5 months ago

@eva0marques I'm sorry for the confusion I have caused. I think we discussed most points in issues or discussions.

For the 2020-2021 MOD11 covariates, I found that the 2021 MOD11 covariates were calculated twice. I will fix this as soon as I have the corrected calculations.

eva0marques commented 5 months ago

Thanks @sigmafelix for your replies! I admit that I have some difficulty going back to discussions in past issues, because I do not always know where to look. Conversations can sometimes diverge until they're no longer on the same topic as the issue title.

kyle-messier commented 5 months ago

@eva0marques Thanks all for bringing this up. As a large group project, things have changed as we learn on the fly. Additionally, some protocols in the beginning have changed and haven't been well documented - my apologies. For example, the "local" repo on the ddn is not staying up to date even though that is where our data for the analysis lives. My suggestion is going to be to develop an internal README for the project. It could exist as part of the Quarto markdown for the project manuscript.

sigmafelix commented 5 months ago

@eva0marques Thank you for your patience. The newly calculated NRTAP_Covars_MOD11A2_2018_2022.rds has been copied to ./output. I changed the permissions of the covariate files I created so that the owner and group can read and write them:

chmod 660 [filename]

Please feel free to move the files to ./input/extracted, per your suggestion, if you want.

kyle-messier commented 5 months ago

@eva0marques @sigmafelix @mitchellmanware The issues brought up here are being addressed with new issues.