cenpy-devs / cenpy

Explore and download data from Census APIs

`place` string matches are oddly case sensitive #44

Closed knaaptime closed 5 years ago

knaaptime commented 5 years ago

Place queries appear oddly case sensitive. The following queries work perfectly:

from cenpy import products

la = products.ACS(2015).from_place('Los Angeles, CA', level='tract',
                                   variables=['B00002*', 'B01002H_001E'])
chi = products.ACS(2015).from_place('Chicago, IL', level='tract',
                                    variables=['B00002*', 'B01002H_001E'])

return as they should:

Matched: Los Angeles, CA to Los Angeles city within layer Incorporated Places
Matched: Chicago, IL to Chicago city within layer Incorporated Places

but these return errors from deep in pandas:

la2 = products.ACS(2015).from_place('los angeles, ca', level='tract', 
                                   variables=['B00002*', 'B01002H_001E'])
chi2 = products.ACS(2015).from_place('chicago, il', level='tract', 
                                   variables=['B00002*', 'B01002H_001E'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-5f21b1c0ef2e> in <module>
      1 la2 = products.ACS(2015).from_place('los angeles, ca', level='tract', 
----> 2                                    variables=['B00002*', 'B01002H_001E'])

~/anaconda3/lib/python3.7/site-packages/cenpy/products.py in from_place(self, place, variables, level, strict_within, return_bounds)
    386                                   .from_place(place, variables=variables, level=level,
    387                                               strict_within=strict_within,
--> 388                                               return_bounds=return_bounds)
    389         variables['GEOID'] = variables.GEO_ID.str.split('US').apply(lambda x: x[1])
    390         return_table = geoms[['GEOID', 'geometry']]\

~/anaconda3/lib/python3.7/site-packages/cenpy/products.py in from_place(self, place, variables, level, geometry_precision, strict_within, return_bounds)
     84         name, state = place.split(',')
     85         place_ix, placematch = _fuzzy_match(name.strip(),
---> 86                                 _places.query('STATE == "{}"'.format(state.strip()))
     87                                        .TARGETNAME)
     88         placerow = _places.loc[place_ix]

~/anaconda3/lib/python3.7/site-packages/cenpy/products.py in _fuzzy_match(matchtarget, matchlist)
    415             ixmax, rowmax = _break_ties(matchtarget, table)
    416         else:
--> 417             ixmax = table.score.idxmax()
    418             rowmax = table.loc[ixmax]
    419         return ixmax, rowmax

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in idxmax(self, axis, skipna, *args, **kwargs)
   1949         """
   1950         skipna = nv.validate_argmax_with_skipna(skipna, args, kwargs)
-> 1951         i = nanops.nanargmax(com.values_from_object(self), skipna=skipna)
   1952         if i == -1:
   1953             return np.nan

~/anaconda3/lib/python3.7/site-packages/pandas/core/nanops.py in _f(*args, **kwargs)
     71             if any(self.check(obj) for obj in obj_iter):
     72                 msg = 'reduction operation {name!r} not allowed for this dtype'
---> 73                 raise TypeError(msg.format(name=f.__name__.replace('nan', '')))
     74             try:
     75                 with np.errstate(invalid='ignore'):

TypeError: reduction operation 'argmax' not allowed for this dtype

ljwolf commented 5 years ago

Yeah, they're not matching because the two strings differ in case, so they're literally using different characters.

Can change the string match to be against lowercased strings, which would make matches case insensitive.

That'd be in the lambda in the _fuzzy_match() function.
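
A rough sketch of what lowercasing both sides of the match could look like. This is not cenpy's actual `_fuzzy_match()`: the function name `fuzzy_match_ci` is made up for illustration, and it scores with the standard library's `difflib.SequenceMatcher` instead of fuzzywuzzy so the snippet runs on its own.

```python
from difflib import SequenceMatcher


def fuzzy_match_ci(target, candidates):
    """Return (position, candidate) of the best case-insensitive match.

    Hypothetical helper, not part of cenpy. Both the target and each
    candidate are lowercased before scoring, so 'chicago' and 'Chicago'
    are treated as the same string.
    """
    if not candidates:
        # Fail early with a readable message instead of a deep library error.
        raise ValueError("no candidates to match against")
    target_lc = target.lower()
    scores = [
        SequenceMatcher(None, target_lc, candidate.lower()).ratio()
        for candidate in candidates
    ]
    # Position of the highest similarity score.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, candidates[best]


names = ["Los Alamitos city", "Los Angeles city", "Long Beach city"]
print(fuzzy_match_ci("los angeles", names))
```

Lowercasing only for the comparison means the returned name keeps its original capitalization.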

knaaptime commented 5 years ago

ah, i thought fuzzywuzzy would just penalize the case mismatch

ljwolf commented 5 years ago

it does, but the issue is with the way place filters by state, because there're a ton of identically-named places across the US. I'm using the state component to filter the products._places dataframe first, then match on the place within the state. Because il wasn't matching IL in products._places.query(), we were sending an empty dataframe down to _fuzzy_match.

Now, this sends a filtered dataframe if a state is provided, and searches states in a case insensitive way.
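
For anyone reading later, here is a minimal sketch of the failure and of a case-insensitive state filter. The `places` table is a tiny stand-in that reuses the `STATE` and `TARGETNAME` column names from the traceback, and `filter_by_state` is a hypothetical helper, not the code that went into cenpy.

```python
import pandas as pd

# Toy stand-in for products._places; only the two columns used here.
places = pd.DataFrame({
    "STATE": ["CA", "CA", "IL"],
    "TARGETNAME": ["Los Angeles city", "Long Beach city", "Chicago city"],
})

# The exact-match query is case sensitive, so 'il' selects no rows.
exact = places.query('STATE == "il"')
print(len(exact))


def filter_by_state(table, state):
    """Return rows whose STATE matches `state`, ignoring case and padding.

    Hypothetical helper for illustration only.
    """
    wanted = state.strip().upper()
    subset = table[table["STATE"].str.upper() == wanted]
    if subset.empty:
        # Surface the real problem before any scoring happens downstream.
        raise ValueError("no places found for state {!r}".format(wanted))
    return subset


print(filter_by_state(places, " il ").TARGETNAME.tolist())
```

Raising on an empty subset is a design choice for the sketch: it reports the unmatched state directly rather than letting an empty frame reach the scoring step.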

ljwolf commented 5 years ago

as always, thanks :)