NOAA-OWP / hydrotools

Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.

Add Social Vulnerability Index (SVI) subpackage #169

Closed aaraney closed 2 years ago

aaraney commented 2 years ago

This PR adds a client for programmatically accessing the Centers for Disease Control and Prevention's (CDC) Social Vulnerability Index (SVI).

"Social vulnerability refers to the potential negative effects on communities caused by external stresses on human health. Such stresses include natural or human-caused disasters, or disease outbreaks. Reducing social vulnerability can decrease both human suffering and economic loss." [source]

The SVI has been released 5 times (2000, 2010, 2014, 2016, and 2018). It calculates a relative percentile ranking in four theme categories, plus an overall ranking, for a given geographic context and geographic scale. The themes are:

  1. Socioeconomic
  2. Household Composition & Disability
  3. Minority Status & Language
  4. Housing Type & Transportation

Rankings are calculated relative to a geographic context: either a single state or all states (the United States). This means, for example, that a ranking calculated for some location at the United States geographic context is relative to all other locations in the United States where rankings were calculated. Similarly, SVI rankings are calculated at two geographic scales, census tract and county, meaning each ranking corresponds to either a census tract or a county. For completeness, if you were to retrieve the 2018 SVI at the census tract scale, at the state context for Alabama, you would receive 1180 records (the number of census tracts in AL in the 2010 census) where each ranked percentile is calculated relative to the other census tracts in Alabama. The tool released in this PR only supports querying for rankings calculated at the United States geographic context. Future work will add support for retrieving rankings calculated at the state geographic context.

Documentation for each yearly release of the SVI is linked below:

Example

from hydrotools.svi_client import SVIClient

client = SVIClient()
df = client.get(
    location="AL", # state / nation name (i.e. "alabama" or "United States") also accepted. case insensitive
    geographic_scale="census_tract", # "census_tract" or "county"
    year="2018", # 2000, 2010, 2014, 2016, or 2018
    geographic_context="national" # only "national" supported. "state" will be supported in the future
    )
print(df)
      state_name state_abbreviation  ... svi_edition                                           geometry
0        alabama                 al  ...        2018  POLYGON ((-87.21230 32.83583, -87.20970 32.835...
1        alabama                 al  ...        2018  POLYGON ((-86.45640 31.65556, -86.44864 31.655...
...          ...                ...  ...         ...                                                ...
29498    alabama                 al  ...        2018  POLYGON ((-85.99487 31.84424, -85.99381 31.844...
29499    alabama                 al  ...        2018  POLYGON ((-86.19941 31.80787, -86.19809 31.808...

Additions

Testing

  1. Integration tests are included that test all valid SVI queries currently supported by the tool.
  2. Some unit tests are included for utility functions.

Todos

Checklist

aaraney commented 2 years ago

@jarq6c

jarq6c commented 2 years ago

If the data source is inherently spatial, would it make sense for get to return a geopandas.GeoDataFrame? The canonical standard already includes a geometry column for this purpose.

aaraney commented 2 years ago

> If the data source is inherently spatial, would it make sense for get to return a geopandas.GeoDataFrame? The canonical standard already includes a geometry column for this purpose.

Thanks for asking! I was planning on bringing this up as I got further along with this PR. My thinking is that, as a user, I want the option to include or exclude the geospatial data. The geospatial data column will contribute more to memory usage, and I would like the ability to limit memory consumption when my problem does not require geospatial data.

However, as I read back over what I just wrote and consider the meaning of "canonical," I think I am persuading myself towards your viewpoint. We have a canonical format, and I think it makes sense to include all fields from the canonical format when a dataset has those fields. Do you share that view? And, what are your thoughts on having a method that returns data that includes the geospatial data and one that excludes it?
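
For illustration, here is a user-side sketch of excluding the geometry column (assuming get returns a geopandas.GeoDataFrame as proposed; re-wrapping into a plain pandas.DataFrame is the only extra step):

import pandas as pd

from hydrotools.svi_client import SVIClient

client = SVIClient()
gdf = client.get(
    location="AL",
    geographic_scale="census_tract",
    year="2018",
    geographic_context="national",
)

# Dropping the geometry column and re-wrapping yields a plain
# pandas.DataFrame, releasing the memory held by the vector geometries.
df = pd.DataFrame(gdf.drop(columns="geometry"))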

jarq6c commented 2 years ago

> If the data source is inherently spatial, would it make sense for get to return a geopandas.GeoDataFrame? The canonical standard already includes a geometry column for this purpose.
>
> Thanks for asking! [...] What are your thoughts on having a method that returns data that includes the geospatial data and one that excludes it?

Emphasis on inherently spatial. If the original data source is a spatial format like a Shapefile, GeoJSON, or GeoCSV then I think we'd want to include a geometry column and return the data as a GeoDataFrame. There are other data sources like NWM NetCDF or NWIS JSON/WaterML that include spatial data but aren't necessarily inherently spatial data formats (i.e. lacking in explicit vector geometry). These formats might return latitude and longitude columns in a pandas.DataFrame instead.

aaraney commented 2 years ago

> Emphasis on inherently spatial. If the original data source is a spatial format like a Shapefile, GeoJSON, or GeoCSV then I think we'd want to include a geometry column and return the data as a GeoDataFrame. [...] These formats might return latitude and longitude columns in a pandas.DataFrame instead.

For this work, I agree with you, and I'll make the change so that only geopandas.GeoDataFrames are returned from the get method. However, the design decision to let a data source's format dictate our output format is not something I'm overly keen on. I don't think that's exactly what you meant, but I just wanted to say it more plainly. My opinion is that we should try to ensure a common interface for a data source, even if that data source's format and composition changes.

jarq6c commented 2 years ago

I think we have some leeway since objects like geopandas.GeoDataFrame and dask.dataframe.DataFrame inherit from pandas.DataFrame and can be coerced into the canonical format. If we wanted to adhere to a strictly pandas-based standard, you could include a switch parameter as you suggested.

For example, the nwm_client_new get method includes an analogous boolean switch called compute that returns a pandas.DataFrame if True (default behavior) or a dask.dataframe.DataFrame if False.

jarq6c commented 2 years ago

CDC_Social_Vulnerability_Index_2018 (FeatureServer) https://services3.arcgis.com/ZvidGQkLaDJxRSJ2/arcgis/rest/services/CDC_Social_Vulnerability_Index_2018/FeatureServer

aaraney commented 2 years ago

Looking back at this work, I think it would be helpful, moving forward, if we could decide which SVI fields our "flagship" API will support. I would also like to support a "raw" SVI API for users who want to use our tools to just get SVI data in a dataframe format. This comment should provide background and enough information to get us started. I'll include my thoughts / opinions in a separate comment below.

Background

To, hopefully, start the conversation: the SVI has been released 5 times (2000, 2010, 2014, 2016, and 2018). The SVI calculates a relative percentile ranking in four theme categories for a given geographic extent:

  1. Socioeconomic
  2. Household Composition & Disability
  3. Minority Status & Language
  4. Housing Type & Transportation

The values used to calculate the percentile ranking for each of the four themes are summed, for each record, to calculate an overall percentile ranking.
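
As a toy illustration of that scheme (the column names here are made up, and pandas' rank(pct=True) approximates, but does not exactly match, the CDC's percentile formula):

import pandas as pd

# Toy theme values for five records (not real SVI data)
df = pd.DataFrame({
    "theme_1_value": [1.2, 3.4, 2.2, 5.0, 0.7],
    "theme_2_value": [0.9, 1.1, 4.0, 2.5, 3.3],
})

# Each theme gets a relative percentile ranking of its value...
df["theme_1_rank"] = df["theme_1_value"].rank(pct=True)
df["theme_2_rank"] = df["theme_2_value"].rank(pct=True)

# ...and the per-record sum of theme values is ranked the same way
# to produce the overall percentile ranking.
df["overall_value"] = df["theme_1_value"] + df["theme_2_value"]
df["overall_rank"] = df["overall_value"].rank(pct=True)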

Rankings are calculated relative to a given state or the entire United States. For all editions of the SVI, rankings are computed at the census tract spatial scale. This means that if you were to retrieve the 2018 SVI at the census tract scale, at the state coverage for Alabama, you would receive 1180 records (the number of census tracts in AL in the 2010 census) where each ranked percentile is calculated relative to the other census tracts in Alabama. From 2014 onward, the SVI is also offered at the county scale in both the state and U.S. coverage products. The state coverage products allow intra-state comparison, and the U.S. coverage products allow national comparison, at the census tract or county scales (2014 onward).

Luckily, the facets used to calculate each theme are included in all datasets. A facet is one of the fields contributing to the calculation of a theme's value (e.g. per capita income). In the 2018 edition, there are 124 column headers. This number has fluctuated (mainly increased) with new releases of the SVI:

Number of columns for each release:

Question

The main question I would like to address through discussion is: which fields should be included in the canonical output of the flagship SVI client's get method? The question boils down to which location metadata fields and SVI fields we think should be included.

Additional Information

Below are links for SVI documentation broken down by year:

"2000": "https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2000Documentation-H.pdf" "2010": "https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI-2010-Documentation-H.pdf" "2014": "https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2014Documentation_01192022.pdf" "2016": "https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2016Documentation_01192022.pdf" "2018": "https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2018Documentation_01192022_1.pdf"

aaraney commented 2 years ago

My thought is that this flagship get API should include a limited subset of fields from the SVI. If a user wants all the raw fields, they can use a get_all_fields method (I am very open to a more intuitive method name). For starters, IMO, the included fields should be:

state_fips: State FIPS code
state_name: State name
county_name: County name
fips: Census tract or county FIPS code
svi_edition: year corresponding to the SVI release (this assumes two SVIs will not be released in a given year in the future)
geometry: County or census tract simple features geometry
rank_theme_1: Socioeconomic
rank_theme_2: Household Composition / Disability
rank_theme_3: Minority Status / Language
rank_theme_4: Housing Type / Transportation
rank_svi: aggregated overall percentile ranking
value_theme_1: Socioeconomic
value_theme_2: Household Composition / Disability
value_theme_3: Minority Status / Language
value_theme_4: Housing Type / Transportation
value_svi: aggregated overall value; sum of values from themes 1, 2, 3, 4

As always, this is a conversation, and I hope we can find what is best for hydrotools users together! I would really value some opinions on this topic!

Pinging @jarq6c.

jarq6c commented 2 years ago

That synopsis is extremely useful! I may share that widely. Thanks a lot.

I have no objections to this initial subset for get. I think the only thing that irks me is the wide format. Each dataframe row is a unique county or census tract, so we end up with really wide dataframes. In the future, we might consider a long_format parameter to generate dataframes more similar to the canonical format.

Something like this:

  state_fips state_name county_name  fips svi_edition geometry    value_theme value
0       0000         ZZ         foo  1111        2018  POLYGON  Socioeconomic  99.9
1       4444         XX         bar  2222        2018  POLYGON            All  26.3

As for additional metadata, you might consider:

  1. get returning a wide dataframe based on some parameter
  2. get returning multiple dataframes based on some parameter
  3. punting retrieval of metadata to a separate method with an interface similar to get (i.e. get_metadata)

Thanks for putting this together!

aaraney commented 2 years ago

> That synopsis is extremely useful! I may share that widely. Thanks a lot.

Great! Please do!

Right, I see that we share the same irk. There are just sooo many columns. I like your proposed solution; I think long term get should return a long-formatted dataframe. Alternatively, the shape of this data makes it a good candidate for pandas MultiIndexes; however, I am hesitant. I've not seen many libraries that return MultiIndex dataframes "in the wild", as they are a little niche. If you disagree, please interject.

Moving forward, I think we are on the same page about next steps for this PR. I will work on get returning a wide-formatted dataframe that follows the column schema I listed above. I am still learning details about the SVI, so I think it makes sense to stay in a wide format during this "discovery" phase. Then, as the PR moves forward and hopefully after a first review, we can revisit and formalize the long format we want to implement, and release 0.0.1 with the long format and any other relevant get_x_metadata methods. Thoughts?

jarq6c commented 2 years ago

I use MultiIndexes frequently for intermediate processing, but often have to break out of MultiIndexes due to compatibility issues, mostly when attempting to write these dataframes to disk. The only methods I know of that return MultiIndex dataframes by default are groupby methods.
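
For illustration, the groupby case and the usual escape hatch (a sketch with toy data):

import pandas as pd

df = pd.DataFrame({
    "state": ["AL", "AL", "WY"],
    "theme": ["svi", "socioeconomic", "svi"],
    "rank": [0.50, 0.61, 0.20],
})

# Grouping by multiple keys returns a MultiIndexed result by default...
grouped = df.groupby(["state", "theme"])["rank"].mean()

# ...which is commonly flattened with reset_index before writing to disk.
flat = grouped.reset_index()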

I agree, I think sticking to a more data-source-native wide format is the best way to go in the short term.

hellkite500 commented 2 years ago

Would it make sense to extend the get api using some kwargs? For example

#get the basic data (default fields, "wide format")
basic_data = client.get(arg1, arg2)
#get the basic data (default fields, "long format")
long_data = client.get(arg1, arg2, long_form=True)
all_fields = client.available_fields(year)
#get all data
all_data = client.get(arg1, arg2, fields=all_fields)
#get a single field
single_data = client.get(arg1, arg2, fields=all_fields[0])
#get data with multi index
data = client.get(arg1, arg2, multi_index=True)

The backend can be organized around some common functions/transformations, and the get api always returns data, but takes into account some of the ways a user may want to use it.

jarq6c commented 2 years ago

> Would it make sense to extend the get api using some kwargs? [...] The backend can be organized around some common functions/transformations, and the get api always returns data, but takes into account some of the ways a user may want to use it.

Yeah, this makes sense to me.

aaraney commented 2 years ago

> Would it make sense to extend the get api using some kwargs? [...] The backend can be organized around some common functions/transformations, and the get api always returns data, but takes into account some of the ways a user may want to use it.

I am not keen on using kwargs to change the output format in the get method. It would differ from the convention of other hydrotools get methods. I would rather just use a long format by default and provide an example in the documentation for going from the long format back to a wide format.
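
Such a documentation example might look like this (a sketch with toy data following the proposed long schema):

import pandas as pd

# Hypothetical long-format frame: one row per (location, theme) pair
long_df = pd.DataFrame({
    "fips": ["01001", "01001", "01003", "01003"],
    "theme": ["socioeconomic", "svi", "socioeconomic", "svi"],
    "rank": [0.61, 0.55, 0.24, 0.31],
})

# Long -> wide: one row per location, one rank column per theme
wide_df = long_df.pivot(index="fips", columns="theme", values="rank")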

aaraney commented 2 years ago

Alright, so I am getting close to something that is reviewable. Over the weekend I made a fair amount of progress and decided to switch the data source I was getting SVI values from to a CDC-hosted ESRI feature service (the same feature service @jarq6c suggested above). For context, the source I am now using powers this map. This data source was a lot easier and faster to use. In the future it could support more complex SQL-like queries to retrieve data if we want to support that as well (e.g. get me SVI > 0.9 in the state of Kansas; see the sketch below). There are unknowns with this data source (when is maintenance conducted? will it vanish overnight?), so I've baked in a straightforward path to roll back to other data sources if necessary. This feature has not been introduced yet, but the mechanisms and most of the work to add it have been completed.
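
A sketch of that kind of query against the feature service (the layer index and parameters here are illustrative, the field names are drawn from the service's schema, and none of this is part of the client yet):

import requests

url = (
    "https://services3.arcgis.com/ZvidGQkLaDJxRSJ2/arcgis/rest/services/"
    "CDC_Social_Vulnerability_Index_2018/FeatureServer/0/query"
)
params = {
    "where": "ST_ABBR = 'KS' AND RPL_THEMES > 0.9",  # SVI > 0.9 in Kansas
    "outFields": "FIPS,RPL_THEMES",
    "f": "geojson",
    "returnGeometry": "true",
}
features = requests.get(url, params=params).json()["features"]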

However, the one downside is that this data source does not have SVI rankings calculated at the state geographic context, only the national context (that's not completely true, but it's good enough for conversation). So, for now, the SVI client will only support SVI rankings calculated at the national geographic context. All years are supported, and this data source has SVI calculated at the county and census tract geographic scales for all years (this is great; county-scale SVI data for 2000 and 2010 is not even available from the CDC download tool).

I decided to go with a long dataframe format as requested by @jarq6c. Thanks for the suggestion! I'm happy with how it turned out. Below are the fields I went with:

"state_name"
"state_abbreviation"
"county_name"
"state_fips"
"county_fips"
"fips"
"theme"
"rank"
"value"
"svi_edition"
"geometry"

Likewise here are theme names I decided on:

socioeconomic # theme 1
household_comp_and_disability # theme 2
minority_status_and_lang # theme 3
housing_type_and_trans # theme 4
svi # all

These are certainly up for debate, so please provide your opinions. My thinking was to shorten the theme names slightly and remove the spaces, so that anyone going back to a wide format would not have to deal with spaces in column names. I am not settled on this, so please let me know what you think.

I've added a list of TODOs below for myself. Once I add exceptions to the code, this should be reviewable. Obviously, unit tests and integration tests will be added before we merge.

TODOs:

aaraney commented 2 years ago

@jarq6c and @hellkite500, the code itself is now in working order and ready for review. There are still a few non-code things I would like to include in this PR, but my thinking is that the review effort and the effort of adding those other things are not really related and can happen asynchronously.

A note on the state of the code: I think there is some module renaming and refactoring that should be done in the future. It makes sense to rename / refactor modules when the SVI provider rollback functionality is added.

aaraney commented 2 years ago

I added integration tests for the SVIClient.get method. They appear to be failing, so I will look into that tomorrow.

edit: It looks like the subpackage is not being installed. I'll sort this out tomorrow.

aaraney commented 2 years ago

The svi_client was not being installed by the existing GitHub Actions workflows. I pushed up a change that should fix that issue.

jarq6c commented 2 years ago

@aaraney older versions can use from typing_extensions import Literal.

See here for backports: https://pypi.org/project/typing-extensions/

jarq6c commented 2 years ago

Also note: Python 3.7 is EOL 2023-06. Should we write a test/time bomb for that?

jarq6c commented 2 years ago

Wow there are a lot of tests.

jarq6c commented 2 years ago

So far, I'm getting four failed tests with output similar to the following:

src/hydrotools/svi_client/clients.py:96: in get
    df = gpd.GeoDataFrame.from_features(request.json())  # type: ignore
env/lib/python3.8/site-packages/forge/_revision.py:328: in inner
    return callable(*mapped.args, **mapped.kwargs)
env/lib/python3.8/site-packages/hydrotools/_restclient/async_helpers.py:55: in wrap
    return self._add_to_loop(coro(*args, **kwargs))
env/lib/python3.8/site-packages/hydrotools/_restclient/async_helpers.py:23: in _add_to_loop
    return self._loop.run_until_complete(coro)
/usr/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ClientResponse(https://services3.arcgis.com/ZvidGQkLaDJxRSJ2/ArcGIS/rest/services/Overall_2016_Tracts/FeatureServer/0... 'X-Amz-Cf-Pop': 'IAD66-C2', 'X-Amz-Cf-Id': 'ii7DXKcsjxbgUbtPvvW9ZR6sI8JK5VvNGM1a09vsOPLl7uF5uGsMWQ==', 'Age': '180')>

    async def json(
        self,
        *,
        encoding: Optional[str] = None,
        loads: JSONDecoder = DEFAULT_JSON_DECODER,
        content_type: Optional[str] = "application/json",
    ) -> Any:
        """Read and decodes JSON response."""
        if self._body is None:
            await self.read()

        if content_type:
            ctype = self.headers.get(hdrs.CONTENT_TYPE, "").lower()
            if not _is_expected_content_type(ctype, content_type):
>               raise ContentTypeError(
                    self.request_info,
                    self.history,
                    message=(
                        "Attempt to decode JSON with " "unexpected mimetype: %s" % ctype
                    ),
                    headers=self.headers,
                )
E               aiohttp.client_exceptions.ContentTypeError: 0, message='Attempt to decode JSON with unexpected mimetype: text/plain', url=URL('https://services3.arcgis.com/ZvidGQkLaDJxRSJ2/ArcGIS/rest/services/Overall_2016_Tracts/FeatureServer/0/query?returnExceededLimitFeatures=true&where=1%3D1&f=pgeojson&outFields=STATE,ST_ABBR,COUNTY,ST,STCNTY,FIPS,RPL_THEME1,RPL_THEME2,RPL_THEME3,RPL_THEME4,RPL_THEMES,SPL_THEME1,SPL_THEME2,SPL_THEME3,SPL_THEME4,SPL_THEMES&returnGeometry=true')

env/lib/python3.8/site-packages/aiohttp/client_reqrep.py:1097: ContentTypeError

I thought I might need to install pytest-aiohttp, but I get this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
hydrotools-restclient 3.0.5 requires aiohttp<=3.7.4.post0, but you have aiohttp 3.8.1 which is incompatible.

aaraney commented 2 years ago

> @aaraney older versions can use from typing_extensions import Literal.
>
> See here for backports: https://pypi.org/project/typing-extensions/

Thanks for pointing me in the direction of this package. I am now pulling it in as a dependency for the svi_client and using it instead of typing.Literal. The minimum Python version is now 3.7 again. I also added an inline code note to switch to typing.Literal once the minimum supported Python version is 3.8.
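
For reference, an alternative, version-conditional import pattern (a sketch only; the PR itself depends on typing_extensions unconditionally, and the type alias below is illustrative):

import sys

if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal

GeographicScale = Literal["census_tract", "county"]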

aaraney commented 2 years ago

> Also note: Python 3.7 is EOL 2023-06. Should we write a test/time bomb for that?

It's a shame there is not a 😂 emoji as a GitHub reaction. Yeah, I think that is a good idea. I think it makes sense to add that sort of test or alert at the top namespace level. So, I would propose adding a new dir at the repo root, tests, that would hold package-wide scheduled notifications, if you will. Thoughts? I can throw in a PR that adds that if we are in agreement.
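
A minimal sketch of such a time-bomb test (the cutoff date and assertion policy are assumptions):

import sys
from datetime import date


def test_drop_python_37_after_eol():
    # Python 3.7 reaches end-of-life in June 2023; once that date passes,
    # fail the suite on 3.7 so maintainers remember to drop support.
    if date.today() >= date(2023, 7, 1):
        assert sys.version_info[:2] >= (3, 8), (
            "Python 3.7 is EOL; remove it from setup.cfg and CI matrices."
        )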

aaraney commented 2 years ago

> Wow there are a lot of tests.

Yeah, sorry about that. I've been using pytest-xdist locally to speed up testing; it lets you run tests in parallel (e.g. pytest -n auto). I think it makes sense to make this change in the GitHub Actions workflows that run tests, to increase testing speed.

aaraney commented 2 years ago

> So far, I'm getting four failed tests with output similar to the following: [...]
>
> I thought I might need to install pytest-aiohttp, but I get this error: [...]

Yeah, I should have mentioned that in a comment. The four failing integration tests request census tract scale SVI data, for the entire country, as GeoJSON from an ESRI feature service. I haven't looked into why the requests are failing, but after verifying that census tract data for the entire country can be retrieved in the ESRI protobuf format, I think this is likely a data serialization and response timeout issue on the ESRI feature server's side.

On Friday, I verified that I can request census tract data for the entire country in ESRI pbf and that I can deserialize the response using this protobuf schema. Moving forward, I think it makes sense to use the ESRI protobuf format to get around this issue. What this means is that I will need to generate Python bindings for the aforementioned pbf schema and use the bindings to go from pbf to JSON to build the dataframe. I think this is a good idea in general, as it will reduce download time and size for users.

While this sounds like a large undertaking, it really shouldn't be that bad, and I don't foresee needing to change that much code, so I think the review can proceed while I work to add this feature.

jarq6c commented 2 years ago

> Also note: Python 3.7 is EOL 2023-06. Should we write a test/time bomb for that?
>
> Yeah, I think that is a good idea. [...] I can throw in a PR that adds that if we are in agreement.

Sounds like a good idea to me.

aaraney commented 2 years ago

> Yeah, I should have mentioned that in a comment. The four failing integration tests request census tract scale SVI data, for the entire country, as GeoJSON from an ESRI feature service. [...]
>
> While this sounds like a large undertaking, it really shouldn't be that bad, and I don't foresee needing to change that much code, so I think the review can proceed while I work to add this feature.

@jarq6c, so, having started down the road of using ESRI protobufs, I found I would need to implement more code than I initially thought. Instead, I decided to use the restclient's mget method to make concurrent requests rather than fooling around with protobufs. I think protobufs would make the tool faster overall, but for now I think it's a bit of over-engineering, given this tool just retrieves static data.

I verified that the tests pass locally, so once I update the README and PR description, and we are comfortable merging, we should be all set.

jarq6c commented 2 years ago

All tests passed locally for me as well. Note that the deprecation warnings from geopandas have been resolved:

https://github.com/geopandas/geopandas/issues/2083

Additionally, the FutureWarning when calling to_file has also been resolved:

https://github.com/geopandas/geopandas/pull/2322

jarq6c commented 2 years ago

You might consider using categories for some values. Categories reduced the memory footprint from 15.0 MB to 1.0 MB in this example.

from hydrotools.svi_client import SVIClient

client = SVIClient()
gdf = client.get("AL", "census_tract", "2018")

print("BEFORE CATEGORIZATION")
print(gdf.info(memory_usage="deep"))

str_cols = gdf.select_dtypes(include=object).columns
gdf[str_cols] = gdf[str_cols].astype("category")

print("AFTER CATEGORIZATION")
print(gdf.info(memory_usage="deep"))

BEFORE CATEGORIZATION
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 29500 entries, 0 to 29499
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   state_name          29500 non-null  object  
 1   state_abbreviation  29500 non-null  object  
 2   county_name         29500 non-null  object  
 3   state_fips          29500 non-null  object  
 4   county_fips         29500 non-null  object  
 5   fips                29500 non-null  object  
 6   theme               29500 non-null  object  
 7   rank                29500 non-null  float64 
 8   value               29500 non-null  float64 
 9   svi_edition         29500 non-null  object  
 10  geometry            29500 non-null  geometry
dtypes: float64(2), geometry(1), object(8)
memory usage: 15.0 MB
None

AFTER CATEGORIZATION
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 29500 entries, 0 to 29499
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   state_name          29500 non-null  category
 1   state_abbreviation  29500 non-null  category
 2   county_name         29500 non-null  category
 3   state_fips          29500 non-null  category
 4   county_fips         29500 non-null  category
 5   fips                29500 non-null  category
 6   theme               29500 non-null  category
 7   rank                29500 non-null  float64 
 8   value               29500 non-null  float64 
 9   svi_edition         29500 non-null  category
 10  geometry            29500 non-null  geometry
dtypes: category(8), float64(2), geometry(1)
memory usage: 1.0 MB
None

jarq6c commented 2 years ago

I'm having some trouble understanding the output, even after looking at the documentation. Here's an example:

from hydrotools.svi_client import SVIClient

client = SVIClient()
gdf = client.get(
    location="WY",
    geographic_scale="county",
    year="2016",
    geographic_context="national"
    )

gdf.to_file("wyoming_svi_county.geojson", driver="GeoJSON")

print(client.svi_documentation_url(2016))

Once I've got wyoming_svi_county.geojson I load it up into QGIS and view the attribute table: [screenshot of the attribute table]

Notice there are five entries for Albany County for the svi theme. The rank is the same, but the values are different. This pattern repeats for the other themes. Is this expected output?

aaraney commented 2 years ago

> I'm having some trouble understanding the output, even after looking at the documentation. [...] Notice there are five entries for Albany County for the svi theme. The rank is the same, but the values are different. This pattern repeats for the other themes. Is this expected output?

Thanks for finding this issue, @jarq6c! The problem was in how I was melting the dataframes. I've pushed a change that resolves the issue you brought to light. Thanks!
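
For context, a sketch of the kind of wide-to-long melt involved (toy data; the precise bug is not detailed here):

import pandas as pd

# Hypothetical wide frame: one rank column per theme, one row per location
wide = pd.DataFrame({
    "fips": ["01001", "01003"],
    "rank_theme_1": [0.61, 0.24],
    "rank_svi": [0.55, 0.31],
})

# Every per-location column must be carried in id_vars; otherwise melted
# rows no longer line up with the correct location, producing duplicate
# entries like those seen in the QGIS attribute table above.
long = wide.melt(id_vars=["fips"], var_name="theme", value_name="rank")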

aaraney commented 2 years ago

> You might consider using categories for some values. Categories reduced the memory footprint from 15.0 MB to 1.0 MB in this example. ...

Awesome! Thanks for suggesting this! I've pushed a change that now casts string columns to categories!

aaraney commented 2 years ago

It looks like pip removed --use-feature=in-tree-build in 22.1.

Usage:   
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] <requirement specifier> [package-index-options] ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] -r <requirements file> [package-index-options] ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] [-e] <vcs project url> ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] [-e] <local project path> ...
  /opt/hostedtoolcache/Python/3.9.12/x64/bin/python3 -m pip install [options] <archive url/path> ...

option --use-feature: invalid choice: 'in-tree-build' (choose from '2020-resolver', 'fast-deps')

I removed this option from the subpackage installation steps in all our GitHub Actions workflows.

aaraney commented 2 years ago

Also, it seems pytest does not like it when two test files have the same name, even if they are in separate directories. See this GitHub Actions log.

aaraney commented 2 years ago

Alright, so with the additions I made today, this is pretty much ready to be merged. I just need to add documentation in the form of a README and add a description to the PR. However, the tool is now functional! I'll get that done tomorrow. Have a great weekend!

aaraney commented 2 years ago

@jarq6c, just updated the README and PR description. Once you and others are satisfied with the code, we should be ready to merge and release!

jarq6c commented 2 years ago

We'll pin a release before and after merging this in. I'll pin a release after #191 is merged.

aaraney commented 2 years ago

Great find regarding to_file's incompatibility. Do you think it merits mentioning that in documentation somewhere, @jarq6c?

jarq6c commented 2 years ago

> Great find regarding to_file's incompatibility. Do you think it merits mentioning that in documentation somewhere, @jarq6c?

Might be a good idea to at least add to the README.md example.

aaraney commented 2 years ago

Live on PyPI.

jarq6c commented 2 years ago

> Live on PyPI.

Thanks! I was going to take this chance to look over the other packages before pinning a HydroTools v3.0.0 release with the svi_client. If you've got any pet issues you want resolved before the next major release feel free to mention them.

aaraney commented 2 years ago

> Live on PyPI.
>
> Thanks! I was going to take this chance to look over the other packages before pinning a HydroTools v3.0.0 release with the svi_client. If you've got any pet issues you want resolved before the next major release feel free to mention them.

Sorry if I jumped the gun. I should have communicated with you before pushing to PyPI.

As for pet issues: I would like to update the svi_client's README as you suggested above. The other changes I would like to include are a little larger in scope, so maybe they can be added in a future minor package release.