Closed Mujingrui closed 3 years ago
Hi @Mujingrui ,
I got this data from here. This is a tedious task, because the page offers only a drop-down list of individual states: to get data for the whole USA you must open this list many times, and each selection redirects you to a flat file. Based on those files I've prepared a shapefile for analysis (read those txt files with pandas, convert them to a spatial dataframe with GeoPandas, and store the transformed data as a shapefile).
Here's a screenshot of the webpage block with the state selection list for population centroids:
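A minimal sketch of the read-and-convert step described above. The column names, the sample rows, and the coordinate reference system are assumptions made for illustration and are not taken from the actual downloaded files, so adjust them to match what the flat files really contain.

```python
import io

import geopandas as gpd
import pandas as pd

# Hypothetical sample standing in for one downloaded state file.
# The header and the values are made up for this example.
sample_txt = io.StringIO(
    "STATEFP,COUNTYFP,POPULATION,LATITUDE,LONGITUDE\n"
    "01,001,698,32.464812,-86.486527\n"
    "01,003,1214,30.482391,-87.686912\n"
)

# Read identifier columns as strings so leading zeros are preserved
df = pd.read_csv(sample_txt, dtype={"STATEFP": str, "COUNTYFP": str})

# Build point geometries from the longitude/latitude columns
centroids = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["LONGITUDE"], df["LATITUDE"]),
    crs="EPSG:4269",  # assumed CRS (NAD83); check the source metadata
)

# centroids.to_file("population_centroids.shp")  # uncomment to write a shapefile
```

For the whole country, the same steps could be run per state file and the resulting frames combined with `pd.concat` before writing the shapefile.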
Hi, @szymon-datalions
Thank you for your nice reply! I have downloaded the U.S. 2010 census block group level data, and I found that the northeastern data is a little different from your cancer_population_base, as shown in the attachments. I suspect you allocated the population-at-risk data to a grid of cells based on the 2010 census blocks. I tried to construct a similar dataset, but it did not work. Would you mind telling me your method or any software you used? Thank you for your time and reply!
Best,
@Mujingrui thanks for the thought-provoking questions!
I've checked the whole data preparation process, and at some point I started using files from here. These are shapefiles with population estimates per block. With those I've:
A. Calculated the population size at the centroid of each polygon:
import os
import geopandas as gpd
import pandas as pd

base_path = 'directory_with_census_block_files'
dirs = os.listdir(base_path)
geodataframes_files = []

# Select census block files
for di in dirs:
    if di.endswith('pophu'):
        new_dir = os.path.join(base_path, di)
        new_dirs = os.listdir(new_dir)
        for f in new_dirs:
            if f.endswith('.shp'):
                new_path = os.path.join(new_dir, f)
                geodataframes_files.append(new_path)

# Read the first file as the base frame
gdf_t = gpd.read_file(geodataframes_files[0])
core = gdf_t[['POP10', 'geometry']].copy()  # only those two columns are important

# Append areas
for f in geodataframes_files[1:]:
    gdf = gpd.read_file(f)
    gdf = gdf[['POP10', 'geometry']]
    core = pd.concat([core, gdf], ignore_index=True)

# Get centroids
core['centroid'] = core.centroid

# Drop polygon geometry
generated_pop_blocks = core.drop('geometry', axis=1)

# Rename columns - now 'centroid' becomes 'geometry'
generated_pop_blocks.columns = ['POP10', 'geometry']
generated_pop_blocks = gpd.GeoDataFrame(generated_pop_blocks, geometry='geometry')
generated_pop_blocks.to_file('population_centroids.shp', encoding='utf-8')
B. Created a hexbin map over the area of the North-Eastern US in QGIS, then aggregated the population over each hex in GeoPandas:
import geopandas as gpd
import pandas as pd
pts = gpd.read_file('repro_base_points.shp')
hexes = gpd.read_file('hexbin_raw.shp')
ndf = gpd.sjoin(pts, hexes, how='left', predicate='within')
grouped_pts = ndf[['id', 'POP10']].groupby('id', as_index=False).sum()
df = pd.merge(hexes, grouped_pts, how='outer', left_on='id', right_on='id')
final = df[~df['POP10'].isna()].copy()
final.to_file('hexes_POP10.shp')
C. Converted the hexgrid to centroids (also in QGIS). This is the step where the population centroids used for the analysis are created.
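Step C was done in QGIS, but the same conversion can probably be done in GeoPandas as well. The sketch below is not the method used for the original dataset; the two cells, their values, and the CRS are hypothetical stand-ins for the contents of 'hexes_POP10.shp'.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two hypothetical cells standing in for the hex grid
# (squares keep the example short; hexagons are handled the same way)
cells = gpd.GeoDataFrame(
    {"id": [1, 2], "POP10": [120, 45]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),
    ],
    crs="EPSG:3857",  # assumed projected CRS, so centroids are planar
)

# Replace each polygon with its centroid while keeping the attributes
cell_centroids = cells.copy()
cell_centroids["geometry"] = cells.geometry.centroid

# cell_centroids.to_file("hex_centroids.shp")  # hypothetical output name
```

Each output point carries the aggregated POP10 value of its cell, which matches the idea of one population centroid per hex.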
Hi, @szymon-datalions
Thank you for such a nice reply!
I have created the hex bin map with Python and QGIS following your suggestion. I found that if the NA population values are set to 0 instead of being dropped in final = df[~df['POP10'].isna()].copy(), the map becomes clearer, especially for Canada, since no people live in Northern Ontario.
Red bins mark areas where no people live. Thank you for your help!
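A small sketch of that variation, assuming the goal is to keep every hex and set the population of empty ones to 0. Plain pandas frames with made-up ids and counts stand in for the hex layer and the grouped points; the same `merge` and `fillna` calls should also work when `hexes` is a GeoDataFrame.

```python
import pandas as pd

# Hypothetical stand-ins for the hex layer and the aggregated points
hexes = pd.DataFrame({"id": [1, 2, 3]})
grouped_pts = pd.DataFrame({"id": [1, 3], "POP10": [100, 50]})

# A left join keeps every hex, including those without any points
df = hexes.merge(grouped_pts, how="left", on="id")

# Hexes with no matching points get a population of 0 instead of NaN
df["POP10"] = df["POP10"].fillna(0)
final = df.copy()
```

Compared with dropping the NaN rows, this keeps empty cells in the output so they can be styled separately on the map.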
Best,
Terrific :)
Thanks,
(PS. I'm closing this issue for now).
Hi @szymon-datalions
Thank you for reading this message. I am a little confused about how you got the population centroid data based on the 2010 U.S. census block level data. Thank you for your help. Many thanks :)