Download and process TRI data

sigmafelix commented 7 months ago

@MAKassien @Spatiotemporal-Exposures-and-Toxicology

Download function and its test are completed. Do we have a list of TRI variables and buffer radius for spatial join? I didn't find that on our covariate list in Teams. Things to think about include:

Buffer radius?
By chemicals (593 in total) or the total amount?
Do we unify the units of measure across chemicals (they are reported in different units, e.g., pounds or grams) if we calculate covariates chemical-wise?

MAKassien commented 7 months ago

@sigmafelix thank you for working on the download function! Can you remind me where it is located in the repository so I can take a look? I wasn't able to find it on my own just now.

We don't have a list of desired chemicals yet. The simplest option conceptually would be to use all chemicals with air releases (either "stack air" or "fugitive air"); given the relative sparsity of the data, that shouldn't be too computationally intensive.
Calculating separately by chemical makes more sense to me, since different chemicals are released at different sites. A "total" may not be a comparable metric between air quality stations if they are located near entirely different chemical releases.
All the air releases may be in the same units, but if not I think it's a good idea to homogenize to whichever unit is the most frequent.
We have a specific formula to calculate these buffers that's based on exponential decay dependent on distance, for which we may have some old code from Lara to recycle (is that right @Spatiotemporal-Exposures-and-Toxicology?)
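
The unit homogenization suggested in the third point above could look roughly like the sketch below. It is only an illustration: the helper name `to_common_unit`, the toy values, and the assumption that `UNIT.OF.MEASURE` contains only "Pounds" and "Grams" are mine and should be checked against the actual TRI files.

```r
# Hypothetical helper: convert release amounts to the most frequent unit.
# Assumption: the unit column only ever contains "Pounds" or "Grams".
to_common_unit <- function(amount, unit) {
  unit <- as.character(unit)
  grams_per <- c(Pounds = 453.59237, Grams = 1)  # grams in one unit
  stopifnot(all(unit %in% names(grams_per)))
  target <- names(which.max(table(unit)))        # most frequent unit
  converted <- amount * grams_per[unit] / grams_per[[target]]
  list(amount = unname(converted), unit = target)
}

# Toy data with made-up amounts, for illustration only
tri <- data.frame(
  CHEMICAL = c("Toluene", "Benzene", "Dioxin and dioxin-like compounds"),
  UNIT.OF.MEASURE = c("Pounds", "Pounds", "Grams"),
  STACK.AIR = c(120, 35.5, 453.59237)
)
res <- to_common_unit(tri$STACK.AIR, tri$UNIT.OF.MEASURE)
tri$STACK.AIR <- res$amount
tri$UNIT.OF.MEASURE <- res$unit
```

Converting to the most frequent unit keeps most rows untouched, which makes the result easier to compare against the raw files.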

sigmafelix commented 7 months ago

@MAKassien Download function is in my branch for now, but I will make a pull request before Friday. (download function)

sigmafelix commented 7 months ago

@MAKassien The main branch includes a function named calc_tri (link: https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/6acbb8f65fd8bec1a739e1bc5a88ed091fd8bf78/R/calculate_covariates.R#L763-L847), which is a draft for calculating TRI variables from the source datasets. It is not fully implemented and still untested, so please review the function and revise/rewrite in part or in entirety if necessary. Thank you!

MAKassien commented 7 months ago

@sigmafelix thank you for working on this, I'll review and make necessary changes this week!

kyle-messier commented 7 months ago

TRI covariate update:

[ ] @larapclark share the general SEDC code with Insang
[ ] @sigmafelix update calc_tri
[ ] @MAKassien review/check for chemicals, physics, etc.

sigmafelix commented 7 months ago

For reference, my SEDC function: (https://github.com/Spatiotemporal-Exposures-and-Toxicology/Scalable_GIS/blob/498dbd47e6e120876678f385f0ff815bc84fb3bd/R/processing.R#L379-L479)
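
For readers who cannot open the link, the general idea of a sum of exponentially decaying contributions (SEDC) can be sketched as below. This is not the linked function: the name `calc_sedc_sketch`, the decay term `exp(-3 * d / decay_range)`, the optional cutoff, and the use of planar coordinates in meters are all assumptions made for illustration.

```r
# Toy SEDC sketch: each source contributes its emission, down-weighted
# by an exponential function of its distance to the site.
# Assumptions (not from the linked code): planar coordinates in meters,
# decay term exp(-3 * d / decay_range), optional hard distance cutoff.
calc_sedc_sketch <- function(site_xy, source_xy, emission,
                             decay_range, cutoff = Inf) {
  stopifnot(ncol(site_xy) == 2, ncol(source_xy) == 2,
            nrow(source_xy) == length(emission), decay_range > 0)
  vapply(seq_len(nrow(site_xy)), function(i) {
    # Euclidean distance from site i to every source
    d <- sqrt((source_xy[, 1] - site_xy[i, 1])^2 +
              (source_xy[, 2] - site_xy[i, 2])^2)
    keep <- d <= cutoff
    sum(emission[keep] * exp(-3 * d[keep] / decay_range))
  }, numeric(1))
}

# Made-up coordinates and emissions, for illustration only
sites <- cbind(x = c(0, 5000), y = c(0, 0))
sources <- cbind(x = c(0, 1000), y = c(0, 0))
emission <- c(10, 20)
sedc <- calc_sedc_sketch(sites, sources, emission, decay_range = 1000)
```

Computing this per chemical, as discussed above, would mean calling the function once per chemical with that chemical's sources and emissions.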

MAKassien commented 7 months ago

@sigmafelix Some comments on the calc_tri function:

line 817: I think we should change some of the columns selected. This is my list of the columns that I think would be relevant for us:

 [1] "X1..YEAR"                      "X12..LATITUDE"                
 [3] "X13..LONGITUDE"                "X14..HORIZONTAL.DATUM"        
 [5] "X19..INDUSTRY.SECTOR.CODE"     "X20..INDUSTRY.SECTOR"         
 [7] "X34..CHEMICAL"                 "X36..TRI.CHEMICAL.COMPOUND.ID"
 [9] "X47..UNIT.OF.MEASURE"          "X48..5.1...FUGITIVE.AIR"      
[11] "X49..5.2...STACK.AIR"

We can discuss these further (@Spatiotemporal-Exposures-and-Toxicology may want to weigh in too), but for the record, the indices for these columns are c(1, 12, 13, 14, 19, 20, 34, 36, 47, 48, 49).

line 821: here is some code for column name readjustment that we could add in:

# Get rid of first string of codes
colnames(csvs_tri) <- sub(".*?\\.\\.", "", colnames(csvs_tri))
# Get rid of second string of codes
colnames(csvs_tri) <- sub("^[^A-Za-z]*", "", colnames(csvs_tri))

Applying these two lines makes the column names above look like this instead, which I think is sufficient clean-up, but feel free to weigh in:

 [1] "YEAR"                     "LATITUDE"                 "LONGITUDE"               
 [4] "HORIZONTAL.DATUM"         "INDUSTRY.SECTOR.CODE"     "INDUSTRY.SECTOR"         
 [7] "CHEMICAL"                 "TRI.CHEMICAL.COMPOUND.ID" "UNIT.OF.MEASURE"         
[10] "FUGITIVE.AIR"             "STACK.AIR"

line 830: I'm not sure if this is relevant, but the data's original datum is NAD83, which is EPSG:4269. The function currently sets EPSG:4326, which is WGS84.
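
One way to handle the datum, sketched here under assumptions: declare the coordinates as NAD83 first and then transform to WGS84, rather than labeling them as WGS84 directly. The toy coordinates are made up, and the sketch assumes every row is NAD83; if the HORIZONTAL.DATUM column contains other datums, rows would need to be split by datum before this step.

```r
library(sf)

# Made-up facility locations, for illustration only
tri <- data.frame(
  LONGITUDE = c(-78.9, -95.4),
  LATITUDE = c(36.0, 29.8),
  STACK.AIR = c(120, 35.5)
)

# Declare the source CRS as NAD83 (EPSG:4269) ...
tri_nad83 <- st_as_sf(tri, coords = c("LONGITUDE", "LATITUDE"), crs = 4269)
# ... then reproject to WGS84 (EPSG:4326) to match the sites
tri_wgs84 <- st_transform(tri_nad83, 4326)
```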

I have not reviewed the last part for the buffers since I don't have an appropriate "sites" file, @sigmafelix can you point me to what data you used when you were developing the function?

Thanks!

sigmafelix commented 7 months ago

@MAKassien Thank you for the feedback!

Starting with your last question, about how to get the sites object: https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/6acbb8f65fd8bec1a739e1bc5a88ed091fd8bf78/input/Rinput/processing_functions/filter_unique_sites.R

You will need to load R/manipulate_spacetime_data.R first so that the function loads without errors. The function converts non-WGS84 coordinates into WGS84 coordinates.

I will edit the function in my working branch isong-calc-covars and will ask you for comments.