MAKassien closed this issue 6 months ago
@sigmafelix thank you for working on the download function! Can you remind me where it is located in the repository so I can take a look? I wasn't able to find it on my own just now.
@MAKassien The download function is in my branch for now, but I will make a pull request before Friday.
@MAKassien The main branch includes a function named calc_tri (link: https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/6acbb8f65fd8bec1a739e1bc5a88ed091fd8bf78/R/calculate_covariates.R#L763-L847), which is a draft for calculating TRI variables from the source datasets. It is not fully implemented and is still untested, so please review the function and revise or rewrite it in part or in its entirety if necessary. Thank you!
@sigmafelix thank you for working on this, I'll review and make necessary changes this week!
TRI covariate update:
@sigmafelix Some comments on the calc_tri function:
line 817: I think we should change some of the columns selected. This is my list of the columns that I think would be relevant for us:
[1] "X1..YEAR" "X12..LATITUDE"
[3] "X13..LONGITUDE" "X14..HORIZONTAL.DATUM"
[5] "X19..INDUSTRY.SECTOR.CODE" "X20..INDUSTRY.SECTOR"
[7] "X34..CHEMICAL" "X36..TRI.CHEMICAL.COMPOUND.ID"
[9] "X47..UNIT.OF.MEASURE" "X48..5.1...FUGITIVE.AIR"
[11] "X49..5.2...STACK.AIR"
We can discuss these further (@Spatiotemporal-Exposures-and-Toxicology may want to weigh in too), but for the record the indices for these are c(1, 12, 13, 14, 19, 20, 34, 36, 47, 48, 49).
line 821: here is some code for column name readjustment that we could add in:
colnames(csvs_tri) <- sub(".*?\\.\\.", "", colnames(csvs_tri)) # Get rid of first string of codes
colnames(csvs_tri) <- sub("^[^A-Za-z]*", "", colnames(csvs_tri)) # Get rid of second string of codes
Applying these two lines makes the column names above look like this instead, which I think is sufficient clean-up, but feel free to weigh in:
[1] "YEAR" "LATITUDE" "LONGITUDE"
[4] "HORIZONTAL.DATUM" "INDUSTRY.SECTOR.CODE" "INDUSTRY.SECTOR"
[7] "CHEMICAL" "TRI.CHEMICAL.COMPOUND.ID" "UNIT.OF.MEASURE"
[10] "FUGITIVE.AIR" "STACK.AIR"
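As a quick check of the two substitutions, here is a minimal base R sketch applied to the eleven raw names listed above. The vector raw_names is typed by hand for illustration; in calc_tri the names would come from colnames(csvs_tri).

```r
# Raw column names as listed above (typed by hand for illustration)
raw_names <- c(
  "X1..YEAR", "X12..LATITUDE", "X13..LONGITUDE", "X14..HORIZONTAL.DATUM",
  "X19..INDUSTRY.SECTOR.CODE", "X20..INDUSTRY.SECTOR", "X34..CHEMICAL",
  "X36..TRI.CHEMICAL.COMPOUND.ID", "X47..UNIT.OF.MEASURE",
  "X48..5.1...FUGITIVE.AIR", "X49..5.2...STACK.AIR"
)

# Step 1: drop everything up to and including the first ".."
clean_names <- sub(".*?\\.\\.", "", raw_names)
# Step 2: drop any leading non-letter characters (e.g. "5.1...")
clean_names <- sub("^[^A-Za-z]*", "", clean_names)

print(clean_names)
```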
line 830: I'm not sure if this is relevant, but the data's original datum is NAD83, which would be EPSG:4269 (right now it is set for EPSG:4326 which is WGS84)
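If the datum does turn out to matter, one way to handle it is to declare the points as NAD83 and then reproject to WGS84. The sketch below uses the sf package and a toy table with invented values; it assumes every row is recorded in NAD83, which would need to be checked against the HORIZONTAL.DATUM column.

```r
library(sf)

# Toy facility table; coordinates and emissions are invented for illustration
tri <- data.frame(
  LONGITUDE = c(-124.1, -123.9),
  LATITUDE  = c(42.1, 43.3),
  STACK.AIR = c(10, 0)
)

# Declare the coordinates as NAD83 (EPSG:4269); assumes all rows share this datum
tri_nad83 <- st_as_sf(tri, coords = c("LONGITUDE", "LATITUDE"), crs = 4269)

# Reproject to WGS84 (EPSG:4326)
tri_wgs84 <- st_transform(tri_nad83, 4326)

print(st_crs(tri_wgs84)$epsg)
```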
I have not reviewed the last part for the buffers since I don't have an appropriate "sites" file. @sigmafelix can you point me to the data you used when you were developing the function?
Thanks!
@MAKassien Thank you for the feedback!
Starting from the last question about how to get the sites object:
https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/6acbb8f65fd8bec1a739e1bc5a88ed091fd8bf78/input/Rinput/processing_functions/filter_unique_sites.R
You will need to load R/manipulate_spacetime_data.R to load the function without errors. The function converts non-WGS84 coordinates into WGS84 coordinates.
I will edit the function in my working branch isong-calc-covars and will ask you for comments.
@MAKassien
I guess the TRI variables are supposed to be summarized by compound, am I correct? If we do it that way, the intermediate product looks like this:
# A tibble: 107,840 × 1,159
YEAR LONGITUDE LATITUDE FUGITIVE_AIR_0000050000 FUGITIVE_AIR_0007664417
<int> <dbl> <dbl> <dbl> <dbl>
1 2018 -124. 42.1 NA NA
2 2018 -124. 42.1 NA NA
3 2018 -124. 43.3 NA NA
4 2018 -124. 43.5 NA NA
5 2018 -124. 40.7 NA NA
6 2018 -124. 43.5 NA NA
7 2018 -124. 43.2 NA NA
8 2018 -124. 40.7 NA 0
9 2018 -124. 41.8 NA NA
10 2018 -124. 40.6 NA NA
# ℹ 107,830 more rows
# ℹ 1,154 more variables: FUGITIVE_AIR_N590 <dbl>,
# FUGITIVE_AIR_0000091203 <dbl>, FUGITIVE_AIR_0007439921 <dbl>,
# FUGITIVE_AIR_0007782505 <dbl>, FUGITIVE_AIR_0000071432 <dbl>,
# FUGITIVE_AIR_0000095636 <dbl>, FUGITIVE_AIR_0000100414 <dbl>,
# FUGITIVE_AIR_0000108883 <dbl>, FUGITIVE_AIR_0000110543 <dbl>,
# FUGITIVE_AIR_0000110827 <dbl>, FUGITIVE_AIR_0001330207 <dbl>, …
There are 1,156 variables, even though most of them will have many NAs. Each compound may have different properties and a different influence on PM2.5, so I will keep the variables separate for now.
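For reference, the long-to-wide step that produces one column per compound can be sketched in base R with reshape(). This is only an illustration on a toy table with invented values and a hypothetical CHEM_ID column; the actual function may use a different approach.

```r
# Toy long table: one row per facility location and chemical (values invented)
tri_long <- data.frame(
  YEAR = 2018L,
  LONGITUDE = c(-124.1, -124.1, -123.9),
  LATITUDE = c(42.1, 42.1, 43.3),
  CHEM_ID = c("0000050000", "0007664417", "0007664417"),
  FUGITIVE_AIR = c(1.5, 0, 2.0),
  STACK_AIR = c(3.0, 4.0, 5.0)
)

# One row per location, one column per emission type and chemical.
# Chemicals not reported at a location become NA.
tri_wide <- reshape(
  tri_long,
  idvar = c("YEAR", "LONGITUDE", "LATITUDE"),
  timevar = "CHEM_ID",
  direction = "wide",
  sep = "_"
)

print(names(tri_wide))
```

The second location reports only one chemical, so its columns for the other chemical are NA, which is the same pattern seen in the tibble above.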
The up-to-date calc_tri is pushed to the init_spinoff branch along with its tests. The current version considers stack and fugitive air by compound, and the units (pounds or grams) were unified to kilograms.
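The unit unification can be illustrated with a small lookup of conversion factors. This is a sketch on invented values; the unit labels "Pounds" and "Grams" and the dotted column names are assumptions about how the cleaned table looks, not taken from calc_tri itself.

```r
# Conversion factors to kilograms (1 lb = 0.45359237 kg by definition)
to_kg <- c(Pounds = 0.45359237, Grams = 0.001)

# Toy table; unit labels and column names are assumptions for illustration
tri <- data.frame(
  UNIT.OF.MEASURE = c("Pounds", "Grams", "Pounds"),
  STACK.AIR = c(100, 250, 0),
  FUGITIVE.AIR = c(10, 500, 2)
)

# Look up the factor for each row and fail loudly on an unexpected unit label
factor_kg <- unname(to_kg[tri$UNIT.OF.MEASURE])
stopifnot(!anyNA(factor_kg))

tri$STACK.AIR <- tri$STACK.AIR * factor_kg
tri$FUGITIVE.AIR <- tri$FUGITIVE.AIR * factor_kg
tri$UNIT.OF.MEASURE <- "Kilograms"

print(tri)
```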
Great, thanks Insang! I will review the new version and give you my thoughts. And yes, I think by chemical is the right way to split the data.
Do we have a list of TRI covariate names? We are using the covariate names in the NRT-AP-Model_Covariates spreadsheet, where TRI covariates were not specified. Since the chemical codes are strings of varying lengths, we will use "long names" for these covariates. My initial thought is to use "EMI_STACK/FUGTV_[chemical_code]_[buffer_size]".
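To make the proposed pattern concrete, here is a minimal sketch that generates names of the form EMI_[STACK or FUGTV]_[chemical_code]_[buffer_size]. The three chemical codes are taken from the column names shown earlier; the buffer values and their plain-integer format are placeholders, since the naming details are not settled yet.

```r
# Example inputs; buffer values and their format are placeholders
chem_codes <- c("0000050000", "0007664417", "N590")
buffers <- c(1L, 10L, 50L)
emission_types <- c("STACK", "FUGTV")

# All combinations of type, chemical code, and buffer size
grid <- expand.grid(
  buffer = buffers,
  code = chem_codes,
  type = emission_types,
  stringsAsFactors = FALSE
)

covar_names <- sprintf("EMI_%s_%s_%d", grid$type, grid$code, grid$buffer)

print(head(covar_names))
```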
We don't have a list yet, but the naming system you are proposing sounds reasonable to me. Since we're talking on the order of >1000 different chemicals, maybe we can make a separate reference spreadsheet with chemical code, name, and chemical category, in case we decide to select only certain chemicals or to group them by category later.
@MAKassien Yes, I think it is time to update the covariate list. I will add a worksheet to the covariate list with TRI chemical codes and full names as they are recorded. We could talk more about classification later.
I've finished reviewing all the TRI functions and everything checks out. Thanks for the hard work @sigmafelix! We should be good to close this issue and proceed to covariate calculation 😃
I computed the TRI variables (3,468 variables from 1, 10, and 50 kilometer radii) and saved them in our team project output directory. How to treat NA values needs to be discussed in the next meeting. Thank you for your feedback @MAKassien!
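For anyone who wants to see the idea behind the radius-based variables, the sketch below sums emissions from facilities within 1, 10, and 50 km of each site. It is not how calc_tri does it: it swaps in a plain haversine great-circle distance in base R instead of a spatial-library buffer, and all coordinates, emissions, and column names are invented.

```r
# Great-circle (haversine) distance in km from one point to a vector of points
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Toy inputs (invented): four facilities east of a single site on the equator
facilities <- data.frame(
  lon = c(0.005, 0.05, 0.3, 1),
  lat = 0,
  stack_kg = c(1, 10, 100, 1000)
)
sites <- data.frame(site_id = "A", lon = 0, lat = 0)
radii_km <- c(1, 10, 50)

# For each radius, sum emissions of facilities within that distance of each site
buffer_sums <- vapply(radii_km, function(r) {
  vapply(seq_len(nrow(sites)), function(i) {
    d <- haversine_km(sites$lon[i], sites$lat[i], facilities$lon, facilities$lat)
    sum(facilities$stack_kg[d <= r])
  }, numeric(1))
}, numeric(nrow(sites)))

# Rows are sites, columns are radii
buffer_sums <- matrix(
  buffer_sums,
  nrow = nrow(sites),
  dimnames = list(sites$site_id, paste0("STACK_", radii_km, "km"))
)

print(buffer_sums)
```

Three radii applied to 1,156 per-compound variables gives 3 × 1,156 = 3,468 columns, which matches the count reported above.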
@sigmafelix @MAKassien We can discuss NA/zeros, but briefly:
@Spatiotemporal-Exposures-and-Toxicology @MAKassien
Most (I believe all) of the NAs arise because each emission source reports a different set of chemicals. process_tri transforms the input long table into a wide table, which produces many NAs because the long table records only the chemicals emitted at each location. Basically, we can treat the NAs as true zeros, as they were not recorded in the original EPA data.
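If we agree to treat these NAs as zeros, the replacement could look like the sketch below. It only touches emission columns, selected here by a name pattern that is an assumption based on the column names shown earlier; the toy values are invented.

```r
# Toy wide table with NAs where a chemical was not reported (values invented)
tri_wide <- data.frame(
  YEAR = 2018L,
  LONGITUDE = c(-124.1, -123.9),
  LATITUDE = c(42.1, 43.3),
  FUGITIVE_AIR_0000050000 = c(NA, 1.2),
  STACK_AIR_0007664417 = c(0, NA)
)

# Select emission columns only, so identifiers and coordinates are never altered
emi_cols <- grep("^(FUGITIVE|STACK)_AIR_", names(tri_wide), value = TRUE)

# Replace NA with 0 in each emission column
tri_wide[emi_cols] <- lapply(
  tri_wide[emi_cols],
  function(x) replace(x, is.na(x), 0)
)

print(tri_wide)
```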
@MAKassien @Spatiotemporal-Exposures-and-Toxicology
Download function and its test are completed. Do we have a list of TRI variables and buffer radius for spatial join? I didn't find that on our covariate list in Teams. Things to think about include: