cenpy-devs / cenpy

Explore and download data from Census APIs

getting B00001_001E when not requested #103

Open dfolch opened 4 years ago

dfolch commented 4 years ago

B00001_001E is being returned when not requested.

tucson = products.ACS(2017).from_place('Tucson, AZ', level='tract', variables=['B00002*'])

tucson

GEOID        geometry                                           B00001_001E  B00002_001E  state  county  tract
04019000100  POLYGON ((-12353986.95 3791891.58, -12353934.6...  100.0        68.0         04     019     000100
04019002602  POLYGON ((-12352400.09 3798883.47, -12352399.9...  211.0        134.0        04     019     002602
04019001600  POLYGON ((-12350387.1 3795258.49, -12350376.19...  322.0        171.0        04     019     001600

ronnie-llamado commented 3 years ago

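This is the intended behavior. re is the default matching engine, so B00002* is read as a regular expression rather than a shell-style wildcard. In a regular expression the * applies only to the preceding 2 and means "zero or more", so the pattern needs only B0000 to match, which both B00001_001E and B00002_001E satisfy.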

To fix this, either:

1) Change your search string to B00002.*
2) Restructure your code so that you can pass fnmatch or a custom function as the engine in the search. See the documentation for cenpy.products.ACS.filter_variables.

I'd recommend going with option 1 in this case.
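
The two readings of B00002* can be reproduced with the standard library alone. This is a minimal sketch, assuming the search behaves like re.search over variable names; the names list is a made-up sample (not an API response), and regex_hits and glob_hits are hypothetical helpers, not part of cenpy.

```python
import fnmatch
import re

# Hypothetical sample of variable names; not fetched from the Census API.
names = ["B00001_001E", "B00002_001E", "B01001_001E"]


def regex_hits(pattern):
    """Names in which the regular expression matches anywhere (re.search)."""
    return [name for name in names if re.search(pattern, name)]


def glob_hits(pattern):
    """Names that match a shell-style wildcard pattern (fnmatch)."""
    return fnmatch.filter(names, pattern)


# As a regex, `2*` means "zero or more 2s", so only "B0000" is required.
print(regex_hits("B00002*"))   # ['B00001_001E', 'B00002_001E']

# `.*` means "any characters", so the full "B00002" prefix is required.
print(regex_hits("B00002.*"))  # ['B00002_001E']

# As a shell-style wildcard, `*` alone already means "any characters".
print(glob_hits("B00002*"))    # ['B00002_001E']
```

If most of your patterns are written wildcard-style, the fnmatch reading is probably the one you expect; otherwise switching to B00002.* is the smaller change.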

dfolch commented 3 years ago

Thank you for clarifying this @ronnie-llamado. Some thoughts on this.

Since this is not really a bug, maybe we just update the examples (i.e., Notebooks) and docs, e.g.: https://github.com/cenpy-devs/cenpy/blob/fde2ad6c71d4a81aab6d8950d76c39b36b9d377c/cenpy/tools.py#L33 https://github.com/cenpy-devs/cenpy/blob/fde2ad6c71d4a81aab6d8950d76c39b36b9d377c/cenpy/tools.py#L42

I noticed that the ^P004 style syntax works, when it would seem that it shouldn't under standard re rules.

Currently this note is in the code acknowledging some weirdness. https://github.com/cenpy-devs/cenpy/blob/fde2ad6c71d4a81aab6d8950d76c39b36b9d377c/cenpy/remote.py#L288-L291

ronnie-llamado commented 3 years ago

@dfolch, do you have any suggestions on which string pattern would be the best (most intuitive) for examples/docs?

Here's a quick snippet showing off some potential possibilities:

import cenpy

conn = cenpy.remote.APIConnection("ACSDT5Y2017")

# Unintended variables returned: as a regex, `2*` means "zero or more 2s".
print("0", list(conn.varslike("B00002*").index))        # original
print()

# Intended variables returned.
# Raw strings (r"...") keep `\w` from being treated as an invalid string escape.
print("1", list(conn.varslike(r"B00002.*").index))      # B00002.*
print("2", list(conn.varslike(r"B00002\w+").index))     # B00002\w+
print("3", list(conn.varslike(r"B00002").index))        # B00002
print("4", list(conn.varslike(r"^B00002").index))       # ^B00002
print("5", list(conn.varslike(r"^B00002\w+$").index))   # ^B00002\w+$

Returns:

0 ['B00001_001E', 'B00002_001E']

1 ['B00002_001E']
2 ['B00002_001E']
3 ['B00002_001E']
4 ['B00002_001E']
5 ['B00002_001E']

Option 3 (B00002) is the friendliest, but doesn't fully utilize re. Since the Census variables are already formatted and cenpy just searches for a substring within the variable, this works but may not be as intuitive.
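
To make the substring behavior concrete, here is a small sketch using only the re module. It assumes the search is equivalent to re.search on each variable name; the names list is invented for illustration and hits is a hypothetical helper.

```python
import re

# Hypothetical sample of variable names, chosen to show the edge case.
names = ["B00001_001E", "B00002_001E", "B10002_001E"]


def hits(pattern):
    """Names in which the pattern matches anywhere (re.search semantics)."""
    return [name for name in names if re.search(pattern, name)]


# A full table ID is specific enough that substring matching is harmless.
print(hits("B00002"))   # ['B00002_001E']

# A shorter fragment can match in the middle of unrelated names.
print(hits("0002"))     # ['B00002_001E', 'B10002_001E']

# Anchoring with ^ restricts the match to the start of the name.
print(hits("^B00002"))  # ['B00002_001E']
```

The anchored form costs one extra character and removes the ambiguity, which may matter once users start passing shorter fragments.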

dfolch commented 3 years ago

Your point is well taken that there is some mystery with Option 3. I didn't realize this query would return 166 items: conn.varslike('1002').

Since Option 0 is not a great re example and it's not simple substring matching, I think it should be changed.

Option 3 covers most use cases and doesn't require people to even think about re, so I would make it the standard in the docs and examples. Maybe insert an example somewhere showing that more elaborate regular expressions are possible. For example, getting just the variables for females (B01001_026 to B01001_049) from table B01001. There are also some tables with a Puerto Rico specific counterpart (e.g., B05001 vs. B05001PR), which could make an interesting re example.
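
A rough sketch of what those two examples could look like, using only the re module. The variable names are generated locally rather than fetched, the helper select is hypothetical, and the naming scheme (a trailing E for estimates, PR inserted before the underscore) is an assumption to check against the real variable list.

```python
import re

# Locally generated stand-ins for table B01001 estimate variables (001 to 049).
b01001 = [f"B01001_{i:03d}E" for i in range(1, 50)]

# Assumed naming for a table and its Puerto Rico counterpart.
b05001 = ["B05001_001E", "B05001PR_001E", "B05002_001E"]


def select(pattern, names):
    """Names in which the pattern matches anywhere (re.search semantics)."""
    return [name for name in names if re.search(pattern, name)]


# Females only: 026-029 via 2[6-9], and 030-049 via [34] followed by any digit.
female = select(r"^B01001_0(2[6-9]|[34]\d)E$", b01001)
print(female[0], female[-1], len(female))

# Table with or without its Puerto Rico counterpart: (PR)? makes "PR" optional.
print(select(r"^B05001(PR)?_", b05001))

# Requiring the underscore right after the table ID excludes the PR variant.
print(select(r"^B05001_", b05001))
```

The E$ anchor keeps only names ending in E; if the docs example should also return other suffixes, that part of the pattern would need loosening.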