Drexel-UHC / NETS

Files used to wrangle National Establishment Time Series datasets for neighborhood public health research

BEDDN data transfer to Peter James #30

Closed kam642 closed 6 months ago

kam642 commented 6 months ago

Email from Jana Hirsch to Peter James 4/17/2024:

• Kari spoke with Ana (license holder), who said she is comfortable sharing the data!!
• Stevie (Francisco) will be your point of contact for getting you the classified, individual data (without sharing NETS). He also has all the code documentation and classification schema.
• Steve (Melly) is included here because of his role and interest in the rasters you will ultimately create, but will take more of a hands-off role in this data transfer/sharing for the moment.
• Stephen (Dickinson) is included here for project management, if needed.

What we can send:

To decide:

stfran22 commented 6 months ago

@kam642 I thought we'd be giving Peter the data in the format in which we currently have it - in a database format, with individual tables that can be joined and reformatted (pivoted) to produce analytic files with 0/1 category indicators per location-year. Our workflow for MESA would be a little different than Peter's, since we are linking to participant address locations and not producing raster surfaces, so if we intend to give him an analytic file it would involve a decent bit of work that we would only be doing for him.
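
For reference, the pivot into 0/1 category indicators per location-year could look roughly like the sketch below. It assumes the joined tables have already been reduced to (AddressID, Year, category) records; the actual database table names and layout are not shown in this thread, and the address IDs are made up for illustration.

```python
from collections import defaultdict


def pivot_to_indicators(rows, categories):
    """Turn long (address_id, year, category) records into one dict of
    0/1 indicators per (address_id, year) pair."""
    present = defaultdict(set)
    for address_id, year, category in rows:
        present[(address_id, year)].add(category)
    return {
        key: {cat: int(cat in cats) for cat in categories}
        for key, cats in present.items()
    }


# Hypothetical records; 'AAL' and 'CMM' are category codes named in this thread.
rows = [
    (101, 1999, "AAL"),
    (101, 1999, "CMM"),
    (102, 1999, "CMM"),
    (101, 2000, "AAL"),
]
indicators = pivot_to_indicators(rows, categories=["AAL", "CMM"])
```

Each key is a location-year, and each value holds one indicator per category, which is the shape an analytic file would need.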

Do you have time for a quick call on this today or sometime this week? Perhaps with Jana so we can talk about hierarchy as well?

kam642 commented 6 months ago

@stfran22 Ok. We just need to make sure that we don't send any information beyond what I wrote in https://github.com/Drexel-UHC/NETS/issues/30#issue-2250649309. But we can include SIC code if needed (I think I didn't list that above).

I have a 1-on-1 with Jana already scheduled for this Thursday at 1:30. Can you join at that time?

stfran22 commented 6 months ago

@kam642 We can definitely avoid that. That time works great for me!

stfran22 commented 6 months ago

@kam642 @sjmelly Alright so I started creating a script to test how Peter's group would be joining the tables together.... and just ended up processing the data myself. It was surprisingly fast and the files are not massive. By tomorrow morning, Z:\UHC_Data\NETS_UHC\NETS2022\Data\JamesPeter should be populated with folders for every year, containing files for every category (CCC_YYYY.csv, where CCC = category, such as 'AAL', and YYYY = year). These files can be individually linked to BEDDN_addressid_latlong.txt by AddressID, after converting BEDDN_addressid_latlong into a spatial file. I'll write up a short document with this info tomorrow so we can send all of this off soon. My code is here: https://github.com/Drexel-UHC/BEDDN/blob/main/requests/james_p/process_for_rasters.py

CCC_YYYY.csv
• DunsYearId: unique id for each DunsNumber-Year, made up at Drexel to de-identify raw data
• AddressID: unique id for each address in BEDDN; can be used to link to spatial dataset
• Year
• Category: this variable will read BaseGroup or HighLevel, depending on whether the file contains establishments that are categorized as Base Groups or as Combination Categories/Thematic Constructs (hierarchy versions)

BEDDN_addressid_latlong.txt
• AddressID: unique id for each address in BEDDN; can be used to link to spatial dataset
• DisplayX: longitude (WGS84)
• DisplayY: latitude (WGS84)

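
A minimal sketch of the AddressID link, for anyone picking these files up later. It uses only the Python standard library and writes GeoJSON points (longitude first, WGS84) as one possible "spatial file"; the sample rows are invented, and the category column header is assumed to be BaseGroup, so adjust both to the real files.

```python
import csv
import io
import json

# Hypothetical sample contents standing in for one CCC_YYYY.csv file
# and the address lookup; the real files live on the project server.
CATEGORY_CSV = """DunsYearId,AddressID,Year,BaseGroup
9001,101,2022,AAL
9002,102,2022,AAL
9003,999,2022,AAL
"""
LATLONG_CSV = """AddressID,DisplayX,DisplayY
101,-75.19,39.95
102,-75.16,39.96
"""


def load_coordinates(handle):
    """Map AddressID -> (longitude, latitude) from the lookup file."""
    return {
        row["AddressID"]: (float(row["DisplayX"]), float(row["DisplayY"]))
        for row in csv.DictReader(handle)
    }


def to_geojson(category_handle, coords):
    """Join category records to coordinates by AddressID.

    Returns a GeoJSON FeatureCollection plus the DunsYearIds that had
    no matching address, so unmatched records are not silently dropped.
    """
    features, unmatched = [], []
    for row in csv.DictReader(category_handle):
        xy = coords.get(row["AddressID"])
        if xy is None:
            unmatched.append(row["DunsYearId"])
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [xy[0], xy[1]]},
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}, unmatched


coords = load_coordinates(io.StringIO(LATLONG_CSV))
collection, unmatched = to_geojson(io.StringIO(CATEGORY_CSV), coords)
geojson_text = json.dumps(collection)
```

Reporting the unmatched ids separately makes it easy to spot an AddressID that is missing from the lookup before any rasters are built.
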
kam642 commented 6 months ago

@stfran22 Just checking - for Category, is this just one column that has the 3-letter code for the hierarchy base group that the business belongs to? How are the ComboCats (😸) and Thematic Constructs included in here? Sorry - I'm a little confused about how this is set up.

stfran22 commented 6 months ago

@kam642 yes - both the Year and Category (really 'BaseGroup' or 'HighLevel', depending on the type of category) columns will contain the same value within any of these files (e.g., in the CMM_1999.csv file, all records will have BaseGroup = 'CMM', and Year = 1999). It's a bit redundant, but serves as a check that the intended file is being used as Peter's group processes further. Base Groups, Combo Cats, and Thematic Constructs are all included in the files, and high level category establishment groupings were constructed by aggregating hierarchy versions of Base Group establishment groupings.
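
The redundancy check described above could be sketched like this. It assumes the CCC_YYYY.csv naming pattern and a category column headed either BaseGroup or HighLevel; the function name and sample rows are hypothetical.

```python
import csv
import io


def check_category_file(filename, handle):
    """Return the line numbers whose Year or category value disagrees
    with the CCC_YYYY.csv file name (an empty list means consistent)."""
    stem = filename.rsplit(".", 1)[0]
    expected_category, expected_year = stem.rsplit("_", 1)
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    # Assumption: the column is headed BaseGroup or HighLevel.
    category_column = "BaseGroup" if "BaseGroup" in fieldnames else "HighLevel"
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row[category_column] != expected_category or row["Year"] != expected_year:
            problems.append(line_number)
    return problems


good = "DunsYearId,AddressID,Year,BaseGroup\n1,101,1999,CMM\n2,102,1999,CMM\n"
bad = "DunsYearId,AddressID,Year,BaseGroup\n1,101,1999,CMM\n2,102,2000,CMM\n"
good_problems = check_category_file("CMM_1999.csv", io.StringIO(good))
bad_problems = check_category_file("CMM_1999.csv", io.StringIO(bad))
```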

Example from AAL_2022

image

sjmelly commented 6 months ago

@stfran22 I zipped the files on the server together. Can I go ahead and send them to Peter and Will and we will send the documentation separately?

stfran22 commented 6 months ago

@sjmelly sounds good. I'll send the catalogue with the documentation.

stfran22 commented 6 months ago

Documentation sent. BEDDN request link below. https://github.com/Drexel-UHC/BEDDN/tree/main/requests/james_p

sjmelly commented 6 months ago

@stfran22 I sent BEDDN.zip to Peter and William. BEDDN_addressid_latlong.txt is comma delimited, not tab delimited. Should the GitHub readme be changed? Its first column is an integer with no variable name. SAS doesn't have a problem with this and calls it VAR1, but in the future it would be best for all variables to have names. Mysteriously, when I add it to ArcGIS Pro, additional columns AddressID_X and AddressID_Y appear with values that don't make sense. Any ideas why this happens?

stfran22 commented 6 months ago

@sjmelly I just uploaded a version that is .csv and without the first index column. I also included a schema.ini that resolves the weird AddressID_X bug. Apparently Arc was picking up the AddressID column as coordinates: https://gis.stackexchange.com/questions/303402/prevent-arcmap-from-adding-field-x-and-field-y-fields-to-csv. Let me know if you want me to just send the schema and tell them to remove column 1 or if you'd rather just re-send.
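
For anyone who would rather fix their existing copy, a rough sketch of both steps follows. The first function drops an unnamed leading index column; the second builds a schema.ini in the Microsoft text-driver format that declares AddressID as an integer field. The schema actually uploaded is not reproduced in this thread, so the field types and the .csv file name used here are assumptions.

```python
import csv
import io


def drop_unnamed_index(src, dst):
    """Copy a CSV, skipping the first column when its header is blank."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    start = 1 if header and header[0].strip() == "" else 0
    writer.writerow(header[start:])
    for row in reader:
        writer.writerow(row[start:])


def schema_ini(filename):
    """Build schema.ini text declaring explicit column types.

    Assumed types: AddressID as Long, coordinates as Double.
    """
    lines = [
        f"[{filename}]",
        "Format=CSVDelimited",
        "ColNameHeader=True",
        "Col1=AddressID Long",
        "Col2=DisplayX Double",
        "Col3=DisplayY Double",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical two-line sample with the blank-named index column.
original = ",AddressID,DisplayX,DisplayY\n0,101,-75.19,39.95\n"
out = io.StringIO()
drop_unnamed_index(io.StringIO(original), out)
cleaned = out.getvalue()
schema_text = schema_ini("BEDDN_addressid_latlong.csv")  # assumed file name
```

The schema.ini has to sit in the same folder as the CSV it describes for the text driver to pick it up.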

stfran22 commented 6 months ago

@sjmelly also feel free to delete the old version (BEDDN_addressid_latlong.txt). I think it won't let me because you may still have it open in ArcGIS or another program.

sjmelly commented 6 months ago

I re-sent the zipped files in 10-year batches, and Peter was able to open them.
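
In case the batching needs to be repeated, grouping the year folders could be sketched as below. The 1990 to 2022 range and the archive naming are placeholders, not the actual contents of the transfer.

```python
def ten_year_batches(years, size=10):
    """Split the sorted years into consecutive groups of at most `size`."""
    ordered = sorted(years)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical year range and archive names, for illustration only.
batches = ten_year_batches(range(1990, 2023))
archive_names = [f"BEDDN_{b[0]}_{b[-1]}.zip" for b in batches]
```

Each batch's year folders could then be written to their own archive, for example with the standard library's zipfile module.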