NSAPH / moretrees2


Daily PM2.5, temperature, humidity, & ozone data at zipcode resolution #2

Closed · emgthomas closed this 4 years ago

emgthomas commented 4 years ago

Hi @mbsabath -

Sorry it's been a while since I've worked on this! I got waylaid by some challenges with the methodology, and as a result we decided to switch directions a bit (a lot, actually, with respect to the data!).

Our new approach is to apply the methods I've been working on to look at the short-term effect of PM2.5 on urgent/emergent hospital admissions for cardiovascular disease. This means we need a different dataset. I created a new branch of this repo (called cvd_outcomes_data) and started writing some code to extract the data. The file code/1_get_admissions.R extracts the relevant hospitalizations for the study period and converts the ICD-9 codes to CCS codes. Next, I want to merge in environmental data - daily PM2.5, temperature, humidity, and ozone measurements for each zipcode - which I plan to do in code/1_merge_enviro_vars.R.
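(For reference, the ICD-9 to CCS step looks roughly like the sketch below; the crosswalk file and column names are placeholders, not the actual names used in the repo.)

# Minimal sketch of the ICD-9 -> CCS conversion in code/1_get_admissions.R.
# The crosswalk file, the icd9/ccs column names, and DIAG1 are all placeholders.
library(data.table)

admissions <- fread("admissions_extract.csv")   # hypothetical extract of hospitalizations
ccs_map    <- fread("ccs_crosswalk.csv")        # hypothetical ICD-9 -> CCS crosswalk

# Left join keeps admissions whose ICD-9 code has no CCS match (ccs will be NA)
admissions <- merge(admissions, ccs_map,
                    by.x = "DIAG1", by.y = "icd9", all.x = TRUE)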

At this point I have a couple of questions I'm hoping you can help with:

  1. I'm not sure where the daily/zipcode-level environmental variables are stored, or how to create a symbolic link to the correct directory. Would you be able to help out?
  2. One thing I'm unsure of is how to ensure I am extracting only urgent/emergent hospital admissions and excluding scheduled/planned admissions. Any suggestions here?

After the above are sorted out, I will finish writing the code to extract the dataset I need. If possible, it would be great to quickly run through this with you sometime so I can make sure what I've done makes sense (and is roughly consistent with what others have done).

Please let me know if the above makes sense and if you have any questions!

Thanks,

Emma

mbsabath commented 4 years ago

Hi Emma,

To answer your first question, the commands to create symbolic links for things are as follows:

Daily PM2.5 at the zipcode level (2000-2016):

ln -s /nfs/nsaph_ci3/ci3_exposure/pm25/whole_us/daily/zipcode/qd_predictions_ensemble/ywei_aggregation daily_pm

That directory has files for each day in .rds and .csv format, plus a file named all_days_PM.csv. The all-days file combines the data for every day into a single file, but it's about 7 GB, so even though it's probably the easiest to code with, it'll take a fair amount of memory to use; I'd also recommend breaking it into pieces for your purposes. The other option is the per-day files, which are named yyyymmdd.csv and can be looped over in the directory. Happy to talk more about making a plan to work through the data here.
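For example, a minimal sketch of stacking one year of the per-day files through the daily_pm symlink; only the yyyymmdd.csv naming is taken from the description above, so check one file for its actual column names first:

# Minimal sketch: stack the per-day PM2.5 files for a single year.
library(data.table)

files_2005 <- list.files("daily_pm", pattern = "^2005\\d{4}\\.csv$", full.names = TRUE)

pm_2005 <- rbindlist(lapply(files_2005, function(f) {
  dt <- fread(f)
  dt[, date := as.Date(substr(basename(f), 1, 8), format = "%Y%m%d")]  # date from file name
  dt
}))

Reading all_days_PM.csv instead avoids the loop, but needs correspondingly more memory.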

Daily Ozone Data (2000-2012):

ln -s /nfs/nsaph_ci3/ci3_exposure/ozone/whole_us/daily/zcta/qd_predictions/neighbor_weight/ ozone_data

This has QD's daily ozone predictions aggregated to the zipcode level, stored as an .rds file. I haven't personally worked with these data on a project, but if there are any issues, I'm happy to help.

Daily Temperature and Humidity Data (2000-2017):

ln -s /nfs/nsaph_ci3_ci3_confounders/data_for_analysis/earth_engine/temperature/temperature_daily_zipcode_combined.csv daily_temperature.csv

That file has the daily max temperature (tmmx) and max relative humidity (rmax) for each zipcode for 2000-2017. This doc has more info on potential data available.
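As a rough sketch of pulling tmmx and rmax into your admissions table (all key column names below are assumptions; check names() on both sides first):

# Minimal sketch: merge daily tmmx/rmax onto admissions by zipcode and date.
# The admissions file and every key column name here are placeholders.
library(data.table)

temps      <- fread("daily_temperature.csv")    # via the symlink above
admissions <- fread("admissions_extract.csv")   # hypothetical admissions table

# Make sure the date columns are the same type (e.g. both Date) before merging
merged <- merge(admissions, temps,
                by.x = c("zipcode", "date"),
                by.y = c("zip", "date"),
                all.x = TRUE)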

mbsabath commented 4 years ago

Looking at your second question, there is a variable ADM_TYPE in the admissions data that indicates the type of admission.

It is coded as follows (the full list of codes is in the dictionary file mentioned below):

There is also a variable on admission source (ADM_SOURCE) that may be useful. If there are other variables you might want, all of the variables are described in the file /nfs/nsaph_ci3/ci3_health_data/medicare/gen_admission/1999_2016/targeted_conditions/condition_dictionary.md. The lists of codes for both admission variables are in that file as well.
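As a hedged sketch of the filtering step (the specific ADM_TYPE values for emergency and urgent admissions are an assumption here; confirm them in condition_dictionary.md before relying on this):

# Minimal sketch: restrict to urgent/emergent admissions using ADM_TYPE.
# The values 1 (emergency) and 2 (urgent) are assumed from common Medicare
# claims coding and must be verified against condition_dictionary.md.
library(data.table)

admissions <- fread("admissions_extract.csv")   # hypothetical extract
urgent_emergent <- admissions[ADM_TYPE %in% c(1L, 2L)]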

mbsabath commented 4 years ago

@emgthomas just checking in here, is there any other info or help you need with this? When would be a good time for a code review?

emgthomas commented 4 years ago

Thanks for checking in @mbsabath! I worked on this over the weekend and hit some roadblocks. Are you available later today or tomorrow to discuss?

mbsabath commented 4 years ago

Want to hop on a call at 4 East coast today?

emgthomas commented 4 years ago

Yes, that works perfectly- talk to you on Skype at 4PM EST!

mbsabath commented 4 years ago

As a heads up, I have the new ozone data available at

ln -s /nfs/nsaph_ci3/ci3_exposure/ozone/whole_us/daily/requaia_predictions/ywei_ensemble ozone

The format should match the one for PM data. Please let me know if you have any questions.

emgthomas commented 4 years ago

Great, thanks so much Ben! I'll work on this and let you know if I run into any problems.

emgthomas commented 4 years ago

Hi @mbsabath - a quick question about your functions to reverse the zipcodes. I'm getting this warning when I apply zip_int_to_str to a vector of zipcodes:

[screenshot of the warning message]

Still seems to work though. Should I be using sapply() or something like that?

emgthomas commented 4 years ago

Hi again,

An update on my previous post: using sapply() gets rid of the warning, but let me know if you suggest a different approach.
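For reference, here's the pattern I'm using now, plus a vectorized alternative in case you'd rather I use that:

# Applying the helper element-wise with sapply() instead of passing the whole
# vector at once (which is what triggered the warning).
zips_int <- c(2138L, 90210L, 60614L)            # example zipcodes
zips_str <- sapply(zips_int, zip_int_to_str)

# If the function only zero-pads to five characters, a vectorized alternative is:
# zips_str <- sprintf("%05d", zips_int)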

I've run into another problem, which is that I'm not able to access the new ozone data. First, the path to the new ozone data you provided might be missing a component? I think I found the ozone data here:

/nfs/nsaph_ci3/ci3_exposure/ozone/whole_us/daily/zipcode/requaia_predictions/ywei_aggregation

However, I found that when I create a symbolic link to this directory as follows:

ln -s /nfs/nsaph_ci3/ci3_exposure/ozone/whole_us/daily/zipcode/requaia_predictions/ywei_aggregation ozone

I can't access the directory because I get a "Permission denied" message. Could you please help?

Thanks so much!!

-Emma

mbsabath commented 4 years ago

Looks like the group ownership of the data got messed up when I uploaded it to the RCE. It should be fixed now if you try again.

emgthomas commented 4 years ago

I'm still getting a "Permission denied" message!

mbsabath commented 4 years ago

Try one more time! Looks like the RCE is treating the ywei_aggregation directory differently than most other directories.

emgthomas commented 4 years ago

Nope, still getting permission denied.

emgthomas commented 4 years ago

@mbsabath just checking in- I'm still getting that permission denied message! Anything we can do?

mbsabath commented 4 years ago

Tried force changing the group and permissions one more time. If it doesn't work now, I'd say open a ticket with the RCE staff.

emgthomas commented 4 years ago

I contacted the RCE staff and they were able to fix the problem :)

mbsabath commented 4 years ago

Awesome! Did they say what the issue was?

emgthomas commented 4 years ago

They said permissions got changed somehow- I don't know more than that unfortunately.

emgthomas commented 4 years ago

@mbsabath - quick question: I am wondering where I can find documentation on the PM2.5, temperature/humidity, and ozone data in the following directories?

/nfs/nsaph_ci3/ci3_exposure/pm25/whole_us/daily/zipcode/qd_predictions_ensemble/ywei_aggregation

/nfs/nsaph_ci3_ci3_confounders/data_for_analysis/earth_engine/temperature/temperature_daily_zipcode_combined.csv

/nfs/nsaph_ci3/ci3_exposure/ozone/whole_us/daily/zipcode/requaia_predictions/ywei_aggregation

Just trying to make sure I am saying the correct things about these datasets in my paper.

Thanks!

mbsabath commented 4 years ago

Summary of the earth engine data I prepared is here.

For the PM and ozone data, I'd reach out to Yaguang at weiyg@g.harvard.edu for more detailed information than the readmes that are in the directory /nfs/nsaph_ci3/ci3_exposure/<pm25 or ozone>/whole_us/daily/zipcode/requaia_predictions.

emgthomas commented 4 years ago

Thanks Ben! What about the PM2.5 data in /nfs/nsaph_ci3/ci3_exposure/pm25/whole_us/daily/zipcode/qd_predictions_ensemble/ywei_aggregation? I did look in NSAPH/data_documentation but I'm not exactly sure where I should be looking within the repo...

mbsabath commented 4 years ago

For exposure data, apart from things we've produced ourselves, that repo is more of a listing of what we have. I'd reach out to Yaguang (I can tag him in the documentation repo) for more specific details on exactly what was done.

My understanding is that QD ran an ensemble model using a neural net, a random forest, and a gradient boosting model to predict PM at the grid locations, then Yaguang did an area weighting to the zipcode level. I'm sure he can provide a better explanation though. QD also has a paper on the work coming out soon that would be good to cite, depending on when your paper is published.

mbsabath commented 4 years ago

Yaguang provided a summary here

emgthomas commented 4 years ago

Great, this is super helpful, many thanks to you and Yaguang!

daniellebraun commented 4 years ago

Great, @mbsabath can you also add it to the RCE, thanks!!

mbsabath commented 4 years ago

Put a copy in each directory containing data produced from the new models!

daniellebraun commented 4 years ago

thanks!

mbsabath commented 4 years ago

@emgthomas as a heads up, we found an issue in the sharding process for the hospitalization data for 2006. I've corrected the error in the source data, but you'll need to rerun your processing code for that year to correct your data. Some embedded bad characters in the source .csv likely caused some observations to be excluded from the data for that year.

emgthomas commented 4 years ago

Yes I've been following that issue. Thanks for the heads up.

emgthomas commented 4 years ago

@mbsabath a quick question about zip codes— is there standard code for identifying zip codes belonging to a particular state/region? I'm hoping to test my model by running it on data for the northeast region only.

mbsabath commented 4 years ago

Unfortunately, not to my knowledge. The only codes that aren't directly state codes but still let you identify the state for sure are 5-digit FIPS codes, where the first two digits indicate the state.

emgthomas commented 4 years ago

Ok no worries. I had some trouble finding an easy way to map zip codes to FIPS codes, so I ended up using the mapping available here: https://www.unitedstateszipcodes.org/zip-code-database/

Hopefully this source is sufficiently reliable. Just sharing in case someone else raises this question later!
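In case it helps anyone later, a rough sketch of what this looks like; the zip and state column names are my assumption about the downloaded CSV, so double-check against your copy:

# Rough sketch: restrict to zipcodes in the Census Northeast region using the
# downloaded zip code database. File and column names (zip, state) are assumptions.
library(data.table)

zip_db    <- fread("zip_code_database.csv")
northeast <- c("CT", "ME", "MA", "NH", "NJ", "NY", "PA", "RI", "VT")
ne_zips   <- sprintf("%05d", as.integer(zip_db[state %in% northeast, zip]))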

mbsabath commented 4 years ago

Heads up that state code should also be in the original admissions data (you can run fst.metadata on a given fst file to get a list of variable names in the data).
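For example (the file name is just a placeholder):

# Print the metadata (row count, column names and types) of one admissions shard.
library(fst)

fst.metadata("medicare_admissions_2006.fst")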

mbsabath commented 4 years ago

Yup, if you add SSA_STATE_CODE to the list of variables in this line you should be able to easily bring in states. You will need to re-run your pipeline though.

emgthomas commented 4 years ago

Noted, thanks Ben!

emgthomas commented 4 years ago

@mbsabath - looks like the SSA_STATE_CODE variable is a number. Just want to check how I can map these numbers to states - does this look like the correct reference?

mbsabath commented 4 years ago

Yup, that looks right to me!
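If it helps, the mapping is just a small merge against a crosswalk built from that reference; the file and column names below are placeholders:

# Minimal sketch: map numeric SSA state codes to state abbreviations using a
# two-column crosswalk (ssa_code, state) built from the reference above.
library(data.table)

ssa_xwalk  <- fread("ssa_state_codes.csv")      # hypothetical crosswalk file
admissions <- fread("admissions_extract.csv")   # hypothetical admissions table

admissions <- merge(admissions, ssa_xwalk,
                    by.x = "SSA_STATE_CODE", by.y = "ssa_code", all.x = TRUE)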

emgthomas commented 4 years ago

Hi @mbsabath - I'm having some issues with requesting memory to run some models on the RCE. Sorry, not sure where to post about this, so just asking in this thread!

I'm having two issues:

  1. When I look at the "RCE Cluster Resources Used" tool, I can't see my name listed, so I'm unsure whether I'm requesting too much memory. Is there another way I can check how much memory my jobs are using?
  2. I currently can't seem to get even smallish amounts of memory to start a new job. It looks like there are a few people right now requesting huge amounts of memory and using only a fraction of it. Is there something we can do about this? I probably need < 50 GB, and there are several people with hundreds of GB unused....

-Emma

mbsabath commented 4 years ago

So the used-resources utility that's available to all users excludes the NSAPH servers (because not everyone has access to them). I've worked with the RCE staff to create a version that works just for our servers, so you can see what you're using there.

You can run it with ./ ~/shared_space/nsaph_common/nsaph-info.sh -t used to see what you're using. Removing the -t used will just show what's available.

Looking at it now, it seems like there are around 200 GB of memory free. Something else I noticed is that a lot of your batch jobs seem to have a large amount of unused memory. You could likely reduce the memory you're requesting by a fair amount and still get your work completed.

[Screenshot (2020-03-06) showing memory usage of Emma's batch jobs]

Looking at your jobs, it seems like each job is only maxing out at ~20 GB. You could probably request 30 GB of memory to give yourself a comfortable ceiling.

emgthomas commented 4 years ago

Thanks Ben, this is super helpful! Yes I wondered if I was requesting more memory than I needed. I've been trying to reduce it, but some of my older jobs where I probably requested too much memory were still running.

One thing- when I try to run ./ ~/shared_space/nsaph_common/nsaph-info.sh -t used I get this message:

./: is a directory

Is there a command missing at the start of the line?

Also, how were you able to determine how much memory each of my jobs was using, as opposed to all my jobs together? I'm running models on different datasets, so they might need quite different amounts of memory.

mbsabath commented 4 years ago

Yup, sorry, ./ only works if you're not doing tilde expansion. Try . ~/shared_space/nsaph_common/nsaph-info.sh -t used instead.

And if you run condor_q ethomas you can see all of your jobs listed. There's a size column that roughly says how much memory each job is using, in MB. I tend to assume it's a bit of an underestimate for safety purposes, but that's a good way to see what you're actually using.

emgthomas commented 4 years ago

Thanks Ben! When I run that command though, I'm not getting the breakdown by user:

[screenshot of the nsaph-info.sh output]

Also, sorry to pester you with so many questions, but when I run condor_q ethomas I get a job listed that I'm pretty sure I'm no longer running. Job 80196.0 no longer exists as far as I can tell- all my jobs are using RStudio on interactive nodes, and I only have three jobs open at the moment. Any idea how this could happen?

[screenshot of condor_q output listing job 80196.0]

mbsabath commented 4 years ago

Try running "attach all jobs" from the RCE utilities menu. Sometimes there are jobs that are left running but for some reason or another aren't listed. You can also run condor_rm 80196 to try to get rid of it if you're sure there's nothing you need there.

Regarding your first question, that's weird. When I run the -t used command, it works fine?

mbsabath commented 4 years ago

Looking at the script there's nothing user specific in the used section. Just in case there's a typo, this is what I'm inputting:

. ~/shared_space/nsaph_common/nsaph-info.sh -t used

emgthomas commented 4 years ago

Yes that's exactly the same input that I'm using- it still only gives me the amount of memory available on each machine!

mbsabath commented 4 years ago

Yeah, it looks like for whatever reason you're not getting the secondary inputs when you run it. I have no idea why that could be. Can you try seeing what happens if you run rce-info.sh -t used?

emgthomas commented 4 years ago

That seems to work, but I guess it doesn't show the NSAPH servers?

[screenshot of rce-info.sh -t used output]

mbsabath commented 4 years ago

Yeah, try putting in nonsense after the -t in nsaph-info.sh? Or maybe try moving your current directory to nsaph_common and running just ./nsaph-info.sh -t used.

mbsabath commented 4 years ago

Are you running it in a normal terminal?