NIEHS / amadeus

https://niehs.github.io/amadeus/

Download and process TRI data #4

Closed MAKassien closed 6 months ago

sigmafelix commented 7 months ago

@MAKassien @Spatiotemporal-Exposures-and-Toxicology

Download function and its test are completed. Do we have a list of TRI variables and buffer radius for spatial join? I didn't find that on our covariate list in Teams. Things to think about include:

MAKassien commented 7 months ago

@sigmafelix thank you for working on the download function! Can you remind me where it is located in the repository so I can take a look? I wasn't able to find it on my own just now.

sigmafelix commented 7 months ago

@MAKassien Download function is in my branch for now, but I will make a pull request before Friday. (download function)

sigmafelix commented 7 months ago

@MAKassien The main branch includes a function named calc_tri (link: https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/6acbb8f65fd8bec1a739e1bc5a88ed091fd8bf78/R/calculate_covariates.R#L763-L847), which is a draft for calculating TRI variables from the source datasets. It is not fully implemented and still untested, so please review the function and revise/rewrite in part or in entirety if necessary. Thank you!

MAKassien commented 7 months ago

@sigmafelix thank you for working on this, I'll review and make necessary changes this week!

kyle-messier commented 7 months ago

TRI covariate update:

sigmafelix commented 7 months ago

For reference, my SEDC function: (https://github.com/Spatiotemporal-Exposures-and-Toxicology/Scalable_GIS/blob/498dbd47e6e120876678f385f0ff815bc84fb3bd/R/processing.R#L379-L479)

MAKassien commented 7 months ago

@sigmafelix Some comments on the calc_tri function:

line 817: I think we should change some of the columns selected. This is my list of the columns that I think would be relevant for us:

 [1] "X1..YEAR"                      "X12..LATITUDE"                
 [3] "X13..LONGITUDE"                "X14..HORIZONTAL.DATUM"        
 [5] "X19..INDUSTRY.SECTOR.CODE"     "X20..INDUSTRY.SECTOR"         
 [7] "X34..CHEMICAL"                 "X36..TRI.CHEMICAL.COMPOUND.ID"
 [9] "X47..UNIT.OF.MEASURE"          "X48..5.1...FUGITIVE.AIR"      
[11] "X49..5.2...STACK.AIR"   

We can discuss these further (@Spatiotemporal-Exposures-and-Toxicology may want to weigh in too), but for the record the indices for these are c(1, 12, 13, 14, 19, 20, 34, 36, 47, 48, 49).

line 821: here is some code for column name readjustment that we could add in:

colnames(csvs_tri) <- sub(".*?\\.\\.", "", colnames(csvs_tri)) # Get rid of first string of codes
colnames(csvs_tri) <- sub("^[^A-Za-z]*", "", colnames(csvs_tri)) # Get rid of second string of codes

applying these two lines will result in the column names above looking like this instead, which I think is sufficient clean-up but feel free to weigh in:

 [1] "YEAR"                     "LATITUDE"                 "LONGITUDE"               
 [4] "HORIZONTAL.DATUM"         "INDUSTRY.SECTOR.CODE"     "INDUSTRY.SECTOR"         
 [7] "CHEMICAL"                 "TRI.CHEMICAL.COMPOUND.ID" "UNIT.OF.MEASURE"         
[10] "FUGITIVE.AIR"             "STACK.AIR" 
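
A minimal sketch of the clean-up described above, applied to a few of the raw column names from the list. The vector `raw_names` is only a stand-in for `colnames(csvs_tri)`; this is an illustration, not the code in `calc_tri`.

```r
# Stand-in for colnames(csvs_tri); names copied from the list above.
raw_names <- c(
  "X1..YEAR",
  "X14..HORIZONTAL.DATUM",
  "X36..TRI.CHEMICAL.COMPOUND.ID",
  "X48..5.1...FUGITIVE.AIR",
  "X49..5.2...STACK.AIR"
)

# Step 1: drop everything up to and including the first ".."
# (the "X<n>.." prefix that read.csv creates from "<n>. NAME" headers).
step1 <- sub(".*?\\.\\.", "", raw_names)

# Step 2: drop any remaining leading non-letter characters
# (for example the "5.1..." section code before FUGITIVE.AIR).
clean_names <- sub("^[^A-Za-z]*", "", step1)

print(clean_names)
```

Running the two substitutions in this order matters for names such as `X48..5.1...FUGITIVE.AIR`, where a numeric section code survives the first step.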

line 830: I'm not sure if this is relevant, but the data's original datum is NAD83, which would be EPSG:4269 (right now it is set for EPSG:4326 which is WGS84)

I have not reviewed the last part for the buffers since I don't have an appropriate "sites" file, @sigmafelix can you point me to what data you used when you were developing the function?

Thanks!

sigmafelix commented 7 months ago

@MAKassien Thank you for the feedback!

Starting from the last question about the way to get sites object: https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/6acbb8f65fd8bec1a739e1bc5a88ed091fd8bf78/input/Rinput/processing_functions/filter_unique_sites.R

You will need to load R/manipulate_spacetime_data.R to load the function without errors. The function converts non-WGS84 coordinates into WGS84 coordinates.

I will edit the function in my working branch isong-calc-covars and will ask you for comments.

sigmafelix commented 6 months ago

@MAKassien

I guess the TRI variables are supposed to be summarized by compound, am I correct? If we do it that way, the intermediate product looks like this:

# A tibble: 107,840 × 1,159
    YEAR LONGITUDE LATITUDE FUGITIVE_AIR_0000050000 FUGITIVE_AIR_0007664417
   <int>     <dbl>    <dbl>                   <dbl>                   <dbl>
 1  2018     -124.     42.1                      NA                      NA
 2  2018     -124.     42.1                      NA                      NA
 3  2018     -124.     43.3                      NA                      NA
 4  2018     -124.     43.5                      NA                      NA
 5  2018     -124.     40.7                      NA                      NA
 6  2018     -124.     43.5                      NA                      NA
 7  2018     -124.     43.2                      NA                      NA
 8  2018     -124.     40.7                      NA                       0
 9  2018     -124.     41.8                      NA                      NA
10  2018     -124.     40.6                      NA                      NA
# ℹ 107,830 more rows
# ℹ 1,154 more variables: FUGITIVE_AIR_N590 <dbl>,
#   FUGITIVE_AIR_0000091203 <dbl>, FUGITIVE_AIR_0007439921 <dbl>,
#   FUGITIVE_AIR_0007782505 <dbl>, FUGITIVE_AIR_0000071432 <dbl>,
#   FUGITIVE_AIR_0000095636 <dbl>, FUGITIVE_AIR_0000100414 <dbl>,
#   FUGITIVE_AIR_0000108883 <dbl>, FUGITIVE_AIR_0000110543 <dbl>,
#   FUGITIVE_AIR_0000110827 <dbl>, FUGITIVE_AIR_0001330207 <dbl>, …

There are 1,156 variables, although most of them will have many NAs. Each compound may have different properties and a different influence on PM2.5, so I will keep the variables separate for now.

sigmafelix commented 6 months ago

The up-to-date calc_tri is pushed to the init_spinoff branch along with its tests. The current version considers stack and fugitive air by compound, and the units (pounds or grams) were unified to kilograms.
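
As an illustration of the unit unification mentioned above, here is a minimal sketch in base R. The helper name `to_kilograms` and the unit labels `"Pounds"` and `"Grams"` are assumptions made for the example; the actual handling in calc_tri may differ.

```r
# Hypothetical helper: convert reported amounts to kilograms.
# Assumes the unit column holds exactly "Pounds" or "Grams".
to_kilograms <- function(amount, unit) {
  factor_kg <- c(Pounds = 0.45359237, Grams = 0.001)
  unknown <- setdiff(unique(unit), names(factor_kg))
  if (length(unknown) > 0) {
    stop("Unexpected unit(s): ", paste(unknown, collapse = ", "))
  }
  # Look up the conversion factor per row and scale the amount.
  amount * unname(factor_kg[unit])
}

# Toy records, not real TRI data.
tri_toy <- data.frame(
  STACK.AIR = c(100, 500),
  UNIT.OF.MEASURE = c("Pounds", "Grams"),
  stringsAsFactors = FALSE
)
tri_toy$STACK.AIR.KG <- to_kilograms(tri_toy$STACK.AIR, tri_toy$UNIT.OF.MEASURE)
print(tri_toy)
```

Stopping on an unknown unit, rather than silently returning NA, keeps a new unit label in a future data release from being mistaken for a missing emission.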

MAKassien commented 6 months ago

Great, thanks Insang! I will review the new version and give you my thoughts. And yes, I think by chemical is the right way to split the data.

sigmafelix commented 6 months ago

Do we have a list of TRI covariate names? We are using the covariate names in the NRT-AP-Model_Covariates spreadsheet, where TRI covariates were not specified. Since the chemical codes are strings of varying lengths, we will use "long names" for these covariates. My initial thought is to use "EMI_STACK/FUGTV_[chemical_code]_[buffer_size]".
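
To make the proposed pattern concrete, here is a minimal sketch of a name builder. The function `tri_covariate_name` is hypothetical, and writing the buffer size as zero-padded meters is an assumption; the thread does not specify how `[buffer_size]` is formatted.

```r
# Hypothetical helper following "EMI_STACK/FUGTV_[chemical_code]_[buffer_size]".
# Assumption: buffer size is expressed in meters and zero-padded to 5 digits.
tri_covariate_name <- function(type = c("STACK", "FUGTV"), chemical_code, buffer_m) {
  type <- match.arg(type)
  sprintf("EMI_%s_%s_%05d", type, chemical_code, as.integer(buffer_m))
}

# Chemical codes taken from the column names shown earlier in the thread.
codes <- c("0000050000", "N590")
print(tri_covariate_name("STACK", codes, 1000))
print(tri_covariate_name("FUGTV", codes, 50000))
```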

MAKassien commented 6 months ago

We don't have a list yet, but the naming system you are proposing sounds reasonable to me. Since we're talking on the order of more than 1,000 different chemicals, maybe we can make a separate reference spreadsheet with chemical code, name, and chemical category, in case we decide to select only certain chemicals or to group them by category later.

sigmafelix commented 6 months ago

@MAKassien Yes, I think it is time to update the covariate list. I will add a worksheet to the covariate list with TRI chemical codes and full names as they are recorded. We could talk more about classification later.

MAKassien commented 6 months ago

I've finished reviewing all the TRI functions and everything checks out. Thanks for the hard work @sigmafelix! We should be good to close this issue and proceed to covariate calculation 😃

sigmafelix commented 6 months ago

I computed the TRI variables (3,468 variables from 1, 10, and 50 kilometer radii) and saved them in our team project output directory. How to treat NAs needs to be discussed in the next meeting. Thank you for your feedback @MAKassien !

kyle-messier commented 6 months ago

@sigmafelix @MAKassien We can discuss NA/zeros, but briefly:

  1. TRI variables based on SEDC should not have an NA because distance is always defined. If there are NA in SEDC then we need to look at the distance matrix and make sure that nothing weird is happening there.
  2. Other variables should be handled on a case-by-case basis. Typically, if fewer than 5% of values are NA, then a simple linear interpolation is reasonable.
  3. We should see if there are cases for real zeros - those are of course fine to include
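
The first two points can be sketched as follows. Both helpers are hypothetical and are not the project's functions. The sketch assumes SEDC takes the common form of a sum of emissions weighted by exp(-3 * d / range), uses planar coordinates in kilometers, and interpolates along the vector's index order; all three are assumptions for illustration only.

```r
# Hypothetical SEDC-style sum: every site-source distance is defined,
# so the result has no NA as long as the inputs have none.
sedc_sketch <- function(sites_xy, sources_xy, emissions, range_km) {
  vapply(seq_len(nrow(sites_xy)), function(i) {
    d <- sqrt((sources_xy[, 1] - sites_xy[i, 1])^2 +
              (sources_xy[, 2] - sites_xy[i, 2])^2)
    sum(emissions * exp(-3 * d / range_km))
  }, numeric(1))
}

# Hypothetical NA policy: interpolate linearly only when the NA share
# is below the threshold; otherwise return the vector untouched for review.
fill_sparse_na <- function(x, max_na_frac = 0.05) {
  na_frac <- mean(is.na(x))
  if (na_frac == 0 || na_frac >= max_na_frac) {
    return(x)
  }
  idx <- seq_along(x)
  # rule = 2 carries the nearest observed value to leading/trailing gaps.
  approx(idx[!is.na(x)], x[!is.na(x)], xout = idx, rule = 2)$y
}

# Toy inputs, not real data.
sites <- matrix(c(0, 0, 100, 0), ncol = 2, byrow = TRUE)
sources <- matrix(c(0, 0, 10, 0), ncol = 2, byrow = TRUE)
sedc_values <- sedc_sketch(sites, sources, emissions = c(2, 1), range_km = 10)
print(anyNA(sedc_values))

series <- as.numeric(1:30)
series[11] <- NA
print(fill_sparse_na(series)[11])
```

A site far from every source gets a value close to zero rather than NA, which is why an NA in an SEDC column points to a problem upstream, such as the distance matrix.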

sigmafelix commented 6 months ago

@Spatiotemporal-Exposures-and-Toxicology @MAKassien Most (I believe all) of the NAs come from different emission sources reporting different chemicals. process_tri transforms the input long table into a wide table, which produces many NAs because the long table only records the chemicals emitted at each location. Basically, we can treat these NAs as true zeros, since they were not recorded in the original EPA data.
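
A minimal sketch of how the long-to-wide step creates these NAs, using base R `reshape()` on toy records. The column names `CHEM_ID` and `FUGITIVE_AIR` and the values are made up for illustration; this is not the process_tri implementation.

```r
# Toy long table: one row per (location, chemical) that was reported.
tri_long <- data.frame(
  YEAR = 2018L,
  LONGITUDE = c(-124.1, -124.1, -123.9),
  LATITUDE = c(42.1, 42.1, 40.7),
  CHEM_ID = c("0000050000", "0007664417", "0007664417"),
  FUGITIVE_AIR = c(1.5, 0, 2.0),
  stringsAsFactors = FALSE
)

# Wide table: one row per location, one column per chemical.
# Chemicals a location never reported become NA.
tri_wide <- reshape(
  tri_long,
  idvar = c("YEAR", "LONGITUDE", "LATITUDE"),
  timevar = "CHEM_ID",
  v.names = "FUGITIVE_AIR",
  direction = "wide",
  sep = "_"
)

# Treat "not reported" as zero emission, leaving reported zeros as they are.
emission_cols <- grep("^FUGITIVE_AIR_", names(tri_wide), value = TRUE)
tri_filled <- tri_wide
tri_filled[emission_cols] <- lapply(
  tri_filled[emission_cols],
  function(v) replace(v, is.na(v), 0)
)

print(tri_wide)
print(tri_filled)
```

In the toy data the second location never reported chemical 0000050000, so that cell is NA after reshaping and 0 after filling, while the zero reported at the first location is untouched.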