Closed aaraney closed 2 years ago
@jarq6c
If the data source is inherently spatial, would it make sense for `get` to return a `geopandas.GeoDataFrame`? The canonical standard already includes a `geometry` column for this purpose.
> If the data source is inherently spatial, would it make sense for `get` to return a `geopandas.GeoDataFrame`? The canonical standard already includes a `geometry` column for this purpose.
Thanks for asking! I was planning on bringing this up as I got further along with this PR. My thinking is: as a user, I want the option to include or exclude the geospatial data. The geospatial data column contributes more to memory usage, and I would like the ability to limit memory consumption if my problem does not require geospatial data.
However, as I read back over what I just wrote and consider the meaning of "canonical," I think I am persuading myself toward your viewpoint. We have a canonical format, and I think it makes sense to include all fields from the canonical format when a dataset has those fields. Do you share that view? And what are your thoughts on having one method that returns data including the geospatial data and another that excludes it?
Emphasis on *inherently* spatial. If the original data source is a spatial format like a Shapefile, GeoJSON, or GeoCSV, then I think we'd want to include a `geometry` column and return the data as a `GeoDataFrame`. There are other data sources like NWM NetCDF or NWIS JSON/WaterML that include spatial data but aren't necessarily inherently spatial data formats (i.e. they lack explicit vector geometry). These formats might return `latitude` and `longitude` columns in a `pandas.DataFrame` instead.
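To illustrate the distinction, `latitude`/`longitude` columns from a non-spatial source can still be promoted to an explicit `geometry` column downstream; a sketch (the site code and coordinates here are made-up example values):

```python
import pandas as pd

# hypothetical canonical-style rows from a non-spatial source (e.g. NWIS JSON)
df = pd.DataFrame({
    "usgs_site_code": ["01646500"],
    "value": [1210.0],
    "latitude": [38.949],
    "longitude": [-77.128],
})

try:
    # if geopandas is available, promote lat/lon to explicit vector geometry
    import geopandas as gpd
    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
        crs="EPSG:4326",
    )
except ImportError:
    # otherwise, keep the plain DataFrame with lat/lon columns
    gdf = df
```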
For this work, I agree with you and I'll make the change to only return `geopandas.GeoDataFrame`s from the `get` method. However, the design decision to let a data source's format dictate our output format is not something I'm overly keen on. I don't think that's exactly what you meant, but I just wanted to say it more plainly. My opinion is that we should try to ensure a common interface for a data source, even if that data source's format and composition changes.
I think we have some leeway since objects like `geopandas.GeoDataFrame` and `dask.dataframe.DataFrame` inherit from `pandas.DataFrame` and can be coerced into the canonical format. If we wanted to adhere to a strictly `pandas`-based standard, you could include a switch parameter as you suggested.
For example, the `nwm_client_new` `get` method includes an analogous boolean switch called `compute` that returns a `pandas.DataFrame` if `True` (default behavior) or a `dask.dataframe.DataFrame` if `False`.
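A minimal sketch of what such a boolean switch could look like — this is not the actual `nwm_client_new` implementation, and the lazy branch assumes dask is installed:

```python
import pandas as pd

def get(compute: bool = True):
    """Sketch of a boolean switch like nwm_client_new's `compute` parameter.

    Hypothetical body: returns an eager pandas.DataFrame by default,
    or a lazy dask.dataframe.DataFrame when compute=False.
    """
    data = pd.DataFrame({"value": [1.0, 2.0]})
    if compute:
        # default behavior: eager, in-memory result
        return data
    # lazy path, only reachable when dask is installed
    import dask.dataframe as dd
    return dd.from_pandas(data, npartitions=1)
```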
CDC_Social_Vulnerability_Index_2018 (FeatureServer) https://services3.arcgis.com/ZvidGQkLaDJxRSJ2/arcgis/rest/services/CDC_Social_Vulnerability_Index_2018/FeatureServer
Looking back at this work, I think it would be helpful for me moving forward if we could decide which SVI fields our "flagship" API will support. I would also like to support a "raw" SVI API for users who want to use our tools to just get SVI data in dataframe format. This comment should provide background and enough information to get us started. I'll include my thoughts/opinions in a separate comment below.
To, hopefully, start the conversation: the SVI has been released 5 times (2000, 2010, 2014, 2016, and 2018). The SVI calculates a relative percentile ranking in four theme categories for a given geographic extent:
The values used to calculate the percentile ranking for each of the four themes are summed, for each record, to calculate an overall percentile ranking.
Rankings are calculated relative to a given state or to the entire United States. For all editions of the SVI, rankings are computed at the census tract spatial scale. Meaning, if you were to retrieve the 2018 SVI at the census tract scale with state coverage for the state of Alabama, you would receive 1180 records (the number of census tracts in AL in the 2010 census) where each percentile rank is calculated relative to the census tracts in Alabama. From 2014 onward, the SVI is also offered at the county scale in both the state and U.S. coverage products. The state coverage products allow intra-state comparison and the U.S. coverage products allow national comparison at the census tract or county scale (2014 onward).
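For intuition, relative percentile ranking under the two coverages can be sketched with pandas (illustrative made-up values only — this is not the CDC's exact ranking formula):

```python
import pandas as pd

# made-up summed theme values for a handful of records
df = pd.DataFrame({
    "state": ["AL", "AL", "AL", "WY"],
    "value_svi": [10.0, 20.0, 30.0, 5.0],
})

# state coverage: each record is ranked against records in its own state
df["rank_state"] = df.groupby("state")["value_svi"].rank(pct=True)

# U.S. coverage: each record is ranked against every record nationally
df["rank_us"] = df["value_svi"].rank(pct=True)
```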
Luckily, the facets used to calculate each theme are included in all datasets. A facet is one of the fields contributing to the calculation of a theme's value (e.g. per capita income). In the 2018 edition, there are 124 column headers. This number has fluctuated (mainly increased) with new releases of the SVI:
number of cols for each release:
The main question I would like to address through discussion is: what fields should be included in the canonical output of the flagship SVI client's `get` method? The question boils down to which location metadata fields and SVI fields we think should be included.
Below are links for SVI documentation broken down by year:
- 2000: https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2000Documentation-H.pdf
- 2010: https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI-2010-Documentation-H.pdf
- 2014: https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2014Documentation_01192022.pdf
- 2016: https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2016Documentation_01192022.pdf
- 2018: https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2018Documentation_01192022_1.pdf
My thought is that this flagship `get` API should include a limited subset of fields from the SVI. If a user wants all the raw fields, they can use a `get_all_fields` method (I am very open to a more intuitive method name). For starters, IMO the included fields should be:
- `state_fips`: State FIPS code
- `state_name`: State name
- `county_name`: County name
- `fips`: Census tract or county FIPS code
- `svi_edition`: year corresponding to the SVI release (this assumes two SVIs will not be released in the same year in the future)
- `geometry`: County or census tract simple features geometry
- `rank_theme_1`: Socioeconomic
- `rank_theme_2`: Household Composition / Disability
- `rank_theme_3`: Minority Status / Language
- `rank_theme_4`: Housing Type / Transportation
- `rank_svi`: aggregated overall percentile ranking
- `value_theme_1`: Socioeconomic
- `value_theme_2`: Household Composition / Disability
- `value_theme_3`: Minority Status / Language
- `value_theme_4`: Housing Type / Transportation
- `value_svi`: aggregated overall value; sum of values from themes 1, 2, 3, and 4
As always, this is a conversation, and I hope we can find what's best for hydrotools users together! I would really value some opinions on this topic!
Pinging @jarq6c.
That synopsis is extremely useful! I may share that widely. Thanks a lot.
I have no objections to this initial subset for `get`. I think the only thing that irks me is the wide format. Each dataframe row is a unique county or census tract, so we end up with really wide dataframes. In the future, we might consider a `long_format` parameter to generate dataframes more similar to the canonical format.
Something like this:
```
   state_fips state_name county_name  fips svi_edition geometry    value_theme  value
0        0000         ZZ         foo  1111        2018  POLYGON  Socioeconomic   99.9
1        4444         XX         bar  2222        2018  POLYGON            All   26.3
```
As for additional metadata, you might consider:

- `get` returning a wide dataframe based on some parameter
- `get` returning multiple dataframes based on some parameter
- methods separate from `get` (i.e. `get_metadata`)

Thanks for putting this together!
> That synopsis is extremely useful! I may share that widely. Thanks a lot.
Great! Please do!
Right, I see that we share the same irk. There are just sooo many columns. I like your proposed solution. I think long term `get` should return a long-formatted dataframe. Alternatively, the shape of this data makes it a good candidate for pandas MultiIndexes; however, I am hesitant. I've not seen many libraries that return multi-index dataframes "in the wild," as they are a little niche. If you disagree, please interject.
Moving forward, I think we are on the same page about next steps for this PR. I will work on `get` returning a wide-formatted dataframe that follows the column schema I listed above. I am still learning details about the SVI, so I think it makes sense to stay in a wide format during this "discovery" phase. Then, as the PR moves forward and hopefully after a first review, we can revisit and formalize the long format we want to implement, and release `0.0.1` with the long format and any other relevant `get_x_metadata` methods. Thoughts?
I use MultiIndexes frequently for intermediate processing, but often have to break out of MultiIndexes due to compatibility issues, mostly when attempting to write these dataframes to disk. The only methods I know of that return MultiIndex dataframes by default are `groupby` methods.
I agree, I think sticking to a more data-source-native wide format is the best way to go in the short term.
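The `groupby` behavior mentioned above, and the usual escape hatch before writing to disk, in a quick sketch (column names here are illustrative, loosely following the proposed SVI schema):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AL", "AL", "WY"],
    "theme": ["svi", "socioeconomic", "svi"],
    "rank": [0.5, 0.4, 0.7],
})

# grouping on multiple keys yields a MultiIndexed result by default
grouped = df.groupby(["state", "theme"])["rank"].mean()
assert isinstance(grouped.index, pd.MultiIndex)

# reset_index flattens back to ordinary columns, e.g. before writing to disk
flat = grouped.reset_index()
```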
Would it make sense to extend the `get` API using some `kwargs`? For example:
```python
# get the basic data (default fields, "wide format")
basic_data = client.get(arg1, arg2)

# get the basic data (default fields, "long format")
long_data = client.get(arg1, arg2, long_form=True)

all_fields = client.available_fields(year)

# get all data
all_data = client.get(arg1, arg2, fields=all_fields)

# get a single field
single_data = client.get(arg1, arg2, fields=all_fields[0])

# get data with a multi index
data = client.get(arg1, arg2, multi_index=True)
```
The backend can be organized around some common functions/transformations, and the `get` API always returns data, but takes into account some of the ways a user may want to use it.
> Would it make sense to extend the `get` API using some `kwargs`? [...]
Yeah, this makes sense to me.
> Would it make sense to extend the `get` API using some `kwargs`? [...]
I am not keen on using kwargs to change the output format of the `get` method. It would differ from the convention of the other hydrotools `get` methods. I would rather just use a long format by default and provide an example in the documentation for going from the long format back to a wide format.
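For reference, the long-to-wide round trip mentioned here is a one-liner with `pivot`; a sketch using a hypothetical slice of the proposed schema with made-up values:

```python
import pandas as pd

# toy long-format slice using two of the proposed svi_client columns
long_df = pd.DataFrame({
    "fips": ["01001", "01001", "01003", "01003"],
    "theme": ["svi", "socioeconomic", "svi", "socioeconomic"],
    "rank": [0.55, 0.43, 0.71, 0.62],
})

# back to wide: one row per location, one column per theme
wide_df = long_df.pivot(index="fips", columns="theme", values="rank").reset_index()
```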
Alright, so I am getting close to something that is reviewable. Over the weekend I made a fair amount of progress and decided to switch the data source I was getting SVI values from to a CDC-hosted ESRI feature service (this is the same feature service @jarq6c suggested above). The source I am now using powers this map, for context. This data source was a lot easier and faster to use. In the future it could support more complex SQL queries to retrieve data if we want to support that as well (i.e. "get me SVI > 0.9 in the state of Kansas"). There are unknowns with this data source (when is maintenance conducted? will it vanish overnight?), so I've baked in a straightforward path to roll back to other data sources if necessary. This feature has not been introduced yet, but the mechanisms and most of the work have been completed to add this functionality.
However, the one downside is that this data source does not have SVI rankings calculated at the state geographic context, only the national context (that's not completely true, but good enough for conversation). So for now, the SVI client will only support SVI rankings calculated at the national geographic context. All years are supported, and this data source has calculated SVI at the county and census tract geographic scales for all years (this is great; county scale SVI data for 2000 and 2010 is not even available from the CDC download tool).
I decided to go with a long dataframe format as requested by @jarq6c. Thanks for the suggestion! I'm happy with how it turned out. Below are the fields I went with:
```
"state_name"
"state_abbreviation"
"county_name"
"state_fips"
"county_fips"
"fips"
"theme"
"rank"
"value"
"svi_edition"
"geometry"
```
Likewise, here are the theme names I decided on:

```
socioeconomic                  # theme 1
household_comp_and_disability  # theme 2
minority_status_and_lang       # theme 3
housing_type_and_trans         # theme 4
svi                            # all
```
These are certainly up for debate, so please provide your opinions. My thinking was to remove spaces (so someone who wants to easily go back to a wide format doesn't have to deal with spaces) and to shorten the theme names slightly. I am not settled on this, so please let me know what you think.
I've added a list of TODOs below for myself. Once I add exceptions to the code, this should be reviewable. Obviously, unit tests and integration tests will be added before we merge.
TODOs:
@jarq6c and @hellkite500, the code itself is now in working order and is ready for review. There are still a few non-code things that I would like to include in the PR, but my thinking is that the reviewing effort and my effort adding other things are not really related and can be done async.
A note on the state of the code: I think there is some module renaming and refactoring that should be done in the future. I think it makes sense to rename/refactor modules when the SVI provider rollback functionality is added.
I added integration tests for the `SVIClient.get` method. They appear to be failing, so I will look into that tomorrow.
edit: It looks like the subpackage is not being installed. I'll sort this out tomorrow.
The `svi_client` was not being installed in the existing GitHub Actions. I pushed up a change that should fix that issue.
@aaraney older versions can use `from typing_extensions import Literal`.
See here for backports: https://pypi.org/project/typing-extensions/
Also note: Python 3.7 is EOL 2023-06. Should we write a test/time bomb for that?
Wow there are a lot of tests.
So far, I'm getting four failed tests with output similar to the following:
```
src/hydrotools/svi_client/clients.py:96: in get
    df = gpd.GeoDataFrame.from_features(request.json())  # type: ignore
env/lib/python3.8/site-packages/forge/_revision.py:328: in inner
    return callable(*mapped.args, **mapped.kwargs)
env/lib/python3.8/site-packages/hydrotools/_restclient/async_helpers.py:55: in wrap
    return self._add_to_loop(coro(*args, **kwargs))
env/lib/python3.8/site-packages/hydrotools/_restclient/async_helpers.py:23: in _add_to_loop
    return self._loop.run_until_complete(coro)
/usr/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <ClientResponse(https://services3.arcgis.com/ZvidGQkLaDJxRSJ2/ArcGIS/rest/services/Overall_2016_Tracts/FeatureServer/0... 'X-Amz-Cf-Pop': 'IAD66-C2', 'X-Amz-Cf-Id': 'ii7DXKcsjxbgUbtPvvW9ZR6sI8JK5VvNGM1a09vsOPLl7uF5uGsMWQ==', 'Age': '180')>

    async def json(
        self,
        *,
        encoding: Optional[str] = None,
        loads: JSONDecoder = DEFAULT_JSON_DECODER,
        content_type: Optional[str] = "application/json",
    ) -> Any:
        """Read and decodes JSON response."""
        if self._body is None:
            await self.read()
        if content_type:
            ctype = self.headers.get(hdrs.CONTENT_TYPE, "").lower()
            if not _is_expected_content_type(ctype, content_type):
>               raise ContentTypeError(
                    self.request_info,
                    self.history,
                    message=(
                        "Attempt to decode JSON with " "unexpected mimetype: %s" % ctype
                    ),
                    headers=self.headers,
                )
E   aiohttp.client_exceptions.ContentTypeError: 0, message='Attempt to decode JSON with unexpected mimetype: text/plain', url=URL('https://services3.arcgis.com/ZvidGQkLaDJxRSJ2/ArcGIS/rest/services/Overall_2016_Tracts/FeatureServer/0/query?returnExceededLimitFeatures=true&where=1%3D1&f=pgeojson&outFields=STATE,ST_ABBR,COUNTY,ST,STCNTY,FIPS,RPL_THEME1,RPL_THEME2,RPL_THEME3,RPL_THEME4,RPL_THEMES,SPL_THEME1,SPL_THEME2,SPL_THEME3,SPL_THEME4,SPL_THEMES&returnGeometry=true')
env/lib/python3.8/site-packages/aiohttp/client_reqrep.py:1097: ContentTypeError
```
I thought I might need to install `pytest-aiohttp`, but I get this error:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
hydrotools-restclient 3.0.5 requires aiohttp<=3.7.4.post0, but you have aiohttp 3.8.1 which is incompatible.
```
> @aaraney older versions can use `from typing_extensions import Literal`.
>
> See here for backports: https://pypi.org/project/typing-extensions/
Thanks for pointing me in the direction of this package. I am now pulling it in as a dep for the `svi_client` and using it instead of `typing.Literal`. The minimum python version is now 3.7 again. I also added an inline code note to switch to `typing.Literal` once the minimum supported python version is 3.8.
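For reference, the conditional import can look like this (the `GeographicScale` alias is just an illustration, not the actual `svi_client` code):

```python
import sys

# use the stdlib Literal on 3.8+, fall back to the backport on 3.7
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal

# hypothetical alias for illustration
GeographicScale = Literal["census_tract", "county"]
```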
> Also note: Python 3.7 is EOL 2023-06. Should we write a test/time bomb for that?
It's a shame there is not a 😂 emoji as a GitHub reaction. Yeah, I think that is a good idea. I think it makes sense to add that sort of test or alert at the top namespace level. So, I would propose adding a new dir at the repo root, `tests`, that would hold package-wide scheduled notifications, if you will. Thoughts? I can throw in a PR that adds that if we are in agreement.
> Wow there are a lot of tests.
Yeah, sorry about that. I've been using `pytest-xdist` locally to speed up testing. It allows you to run `pytest` tests in parallel (e.g. `pytest -n auto`). I think it makes sense to change the GitHub Actions that run our tests to do this to increase testing speed.
> So far, I'm getting four failed tests with output similar to the following: [...]
> I thought I might need to install `pytest-aiohttp`, but I get this error: [...]
Yeah, I should have mentioned that in a comment. The four failing integration tests are requesting census-tract-scale SVI data for the entire country in GeoJSON from an ESRI feature service. I haven't looked into why the requests are failing, but after verifying that you can retrieve census tract data for the entire country in the ESRI protobuf format, I think this is likely a data serialization and response timeout issue on the ESRI feature server's side.
On Friday, I verified that I can request census tract data for the entire country in ESRI pbf and that I can de-serialize the response using this protobuf schema. Moving forward, I think it makes sense to use the ESRI protobuf format to get around this issue. What this means is I will need to generate python bindings for the aforementioned pbf schema and use the bindings to go from pbf to JSON to generate the dataframe. I think in general this is a good idea, as it will reduce download time and size for users.
While this sounds like a large undertaking, it really shouldn't be that bad, and I don't foresee needing to change that much code, so I think the review can proceed while I work to add this feature.
> So, I would propose adding a new dir at the repo root, `tests`, that would hold package-wide scheduled notifications, if you will. Thoughts?
Sounds like a good idea to me.
@jarq6c, having started down the road of using ESRI protobufs, I found I would need to implement more code than I initially thought. So, instead, I decided to use the `restclient`'s `mget` method to make concurrent requests rather than fooling around with protobufs. I think protobufs would make the tool faster overall, but for now I think it's a bit of over-engineering given this tool just retrieves static data.
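The concurrency `mget` provides boils down to gathering many requests at once; a generic asyncio sketch of the pattern (the `fetch` stub stands in for a real HTTP call — this is not the `_restclient` API):

```python
import asyncio

async def fetch(url: str) -> str:
    # stand-in for a real HTTP call (e.g. via aiohttp)
    await asyncio.sleep(0)
    return f"response from {url}"

async def mget(urls):
    # issue every request concurrently and collect responses in order
    return await asyncio.gather(*(fetch(url) for url in urls))

responses = asyncio.run(mget(["https://example.com/a", "https://example.com/b"]))
```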
I verified that the tests pass locally, so once I update the `README` and the PR description, and we are comfortable merging, we should be all set.
All tests passed locally for me as well. Note the deprecation warnings from `geopandas` have been resolved: https://github.com/geopandas/geopandas/issues/2083
Additionally noting the `FutureWarning` when calling `to_file` has also been resolved.
You might consider using categories for some values. Categories reduced the memory footprint from 15.0 MB to 1.0 MB in this example.
```python
from hydrotools.svi_client import SVIClient

client = SVIClient()
gdf = client.get("AL", "census_tract", "2018")

print("BEFORE CATEGORIZATION")
print(gdf.info(memory_usage="deep"))

str_cols = gdf.select_dtypes(include=object).columns
gdf[str_cols] = gdf[str_cols].astype("category")

print("AFTER CATEGORIZATION")
print(gdf.info(memory_usage="deep"))
```
```
BEFORE CATEGORIZATION
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 29500 entries, 0 to 29499
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   state_name          29500 non-null  object
 1   state_abbreviation  29500 non-null  object
 2   county_name         29500 non-null  object
 3   state_fips          29500 non-null  object
 4   county_fips         29500 non-null  object
 5   fips                29500 non-null  object
 6   theme               29500 non-null  object
 7   rank                29500 non-null  float64
 8   value               29500 non-null  float64
 9   svi_edition         29500 non-null  object
 10  geometry            29500 non-null  geometry
dtypes: float64(2), geometry(1), object(8)
memory usage: 15.0 MB
None
AFTER CATEGORIZATION
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 29500 entries, 0 to 29499
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   state_name          29500 non-null  category
 1   state_abbreviation  29500 non-null  category
 2   county_name         29500 non-null  category
 3   state_fips          29500 non-null  category
 4   county_fips         29500 non-null  category
 5   fips                29500 non-null  category
 6   theme               29500 non-null  category
 7   rank                29500 non-null  float64
 8   value               29500 non-null  float64
 9   svi_edition         29500 non-null  category
 10  geometry            29500 non-null  geometry
dtypes: category(8), float64(2), geometry(1)
memory usage: 1.0 MB
None
```
I'm having some trouble understanding the output, even after looking at the documentation. Here's an example:
```python
from hydrotools.svi_client import SVIClient

client = SVIClient()
gdf = client.get(
    location="WY",
    geographic_scale="county",
    year="2016",
    geographic_context="national"
)
gdf.to_file("wyoming_svi_county.geojson", driver="GeoJSON")
print(client.svi_documentation_url(2016))
```
Once I've got `wyoming_svi_county.geojson`, I load it up into QGIS and view the attribute table.
Notice there are five entries for Albany County for the `svi` theme. The rank is the same, but the values are different. This pattern repeats for the other themes. Is this expected output?
> Notice there are five entries for Albany County for the `svi` theme. The rank is the same, but the values are different. This pattern repeats for the other themes. Is this expected output?
Thanks for finding this issue, @jarq6c! The issue was how I was melting the dataframes. I've pushed a change that resolves the issue you brought to light. Thanks!
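For context on the melt step in question: keeping the location identifier in `id_vars` is what ties each long-format row back to its county or tract. A toy sketch with made-up values:

```python
import pandas as pd

# toy wide-format slice with made-up values
wide = pd.DataFrame({
    "fips": ["01001", "01003"],
    "rank_theme_1": [0.40, 0.60],
    "rank_svi": [0.55, 0.71],
})

# keeping the location identifier in id_vars ties each melted
# row back to its county or census tract
long_df = wide.melt(id_vars=["fips"], var_name="theme", value_name="rank")
```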
> You might consider using categories for some values. Categories reduced the memory footprint from 15.0 MB to 1.0 MB in this example. [...]
Awesome! Thanks for suggesting this! I've pushed a change that now casts string columns to categories!
It looks like pip removed `--use-feature=in-tree-build` in `22.1`.
```
Usage:
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] <requirement specifier> [package-index-options] ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] -r <requirements file> [package-index-options] ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] [-e] <vcs project url> ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] [-e] <local project path> ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] <archive url/path> ...

option --use-feature: invalid choice: 'in-tree-build' (choose from '2020-resolver', 'fast-deps')
```
I removed this option when installing subpackages in all our GitHub Actions.
Also, it seems like pytest does not like it when two test files have the same name, even if they are in separate directories. See this GH Actions log.
Alright, so with the additions I made today, this is pretty much ready to be merged. I just need to add documentation in the form of a README and add a description to the PR. However, the tool is now functional! I'll get that done tomorrow! Have a great weekend.
@jarq6c, I just updated the README and PR description. Once you and others are satisfied with the code, we should be ready to merge and release!
We'll pin a release before and after merging this in. I'll pin a release after #191 is merged.
Great find regarding `to_file`'s incompatibility. Do you think it merits mentioning in the documentation somewhere, @jarq6c?
> Great find regarding `to_file`'s incompatibility. Do you think it merits mentioning in the documentation somewhere, @jarq6c?
Might be a good idea to at least add it to the `README.md` example.
Live on pypi.
Thanks! I was going to take this chance to look over the other packages before pinning a HydroTools v3.0.0 release with the `svi_client`. If you've got any pet issues you want resolved before the next major release, feel free to mention them.
> Thanks! I was going to take this chance to look over the other packages before pinning a HydroTools v3.0.0 release with the `svi_client`. [...]
Sorry if I jumped the gun. I should have communicated with you before pushing to PyPI.
As for pet issues: I would like to update the `svi_client`'s README as you suggested above. The other changes I would like to include are a little larger in scope, so maybe they can be added in a future minor package release.
This PR adds a client for programmatically accessing the Centers for Disease Control and Prevention's (CDC) Social Vulnerability Index (SVI).
The SVI has been released 5 times (2000, 2010, 2014, 2016, and 2018) and calculates a relative percentile ranking in four theme categories, plus an overall ranking, at a given geographic context and geographic scale. The themes are:
Rankings are calculated relative to a geographic context: a single state or all states (the United States). Meaning, for example, a ranking calculated for some location at the United States geographic context is relative to all other locations in the United States where rankings were calculated. Similarly, SVI rankings are calculated at two geographic scales: the census tract and county scales. Meaning, each ranking corresponds to either a county or a census tract. For completeness: if you were to retrieve the 2018 SVI at the census tract scale, at the state context for the state of Alabama, you would receive 1180 records (the number of census tracts in AL in the 2010 census) where each percentile rank is calculated relative to the census tracts in Alabama. The tool released in this PR only supports querying for rankings calculated at the United States geographic context. Future work will add support for retrieving rankings calculated at the state geographic context.
Documentation for each yearly release of the SVI is located below:
Example
Additions
Testing
Todos
Checklist