Open dfolch opened 4 years ago
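This is the intended behavior. `re` is the default matching engine, so the pattern `B00002*` matches both `B00001_001E` and `B00002_001E`: in a regular expression, `*` applies to the preceding `2`, so the pattern means `B0000` followed by zero or more `2`s.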
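The matching behavior can be reproduced with the standard library alone. This is a minimal sketch that assumes the search uses `re.search`-style matching against each variable name; the two names are the ones from this issue, and nothing here calls cenpy itself.

```python
import re

names = ["B00001_001E", "B00002_001E"]

# "*" quantifies the preceding character, so "B00002*" only requires "B0000".
loose = [n for n in names if re.search("B00002*", n)]

# "B00002.*" requires the literal "B00002", then allows any trailing characters.
strict = [n for n in names if re.search("B00002.*", n)]

print(loose)   # ['B00001_001E', 'B00002_001E']
print(strict)  # ['B00002_001E']
```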
To fix this, either:
1) Change your search string to `B00002.*`
2) Reformat your code to allow you to pass in `fnmatch` or a custom function as your engine in the search. See here for documentation: `cenpy.products.ACS.filter_variables`
I'd recommend going with option 1 in this case.
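For comparison, here is a rough sketch of how the choice of engine changes the result of option 2. `filter_names` is a hypothetical helper written for illustration, not cenpy's implementation, and the `(pattern, string)` argument order for a custom function is an assumption that should be checked against the `filter_variables` documentation.

```python
import fnmatch
import re


def filter_names(names, pattern, engine="re"):
    """Hypothetical stand-in for an engine-aware name filter (not cenpy code)."""
    if engine == "re":
        matches = lambda p, s: re.search(p, s) is not None
    elif engine == "fnmatch":
        # fnmatchcase takes (name, pattern); wrap it to accept (pattern, name).
        matches = lambda p, s: fnmatch.fnmatchcase(s, p)
    elif callable(engine):
        matches = engine  # assumed to accept (pattern, string) and return a bool
    else:
        raise ValueError(f"unknown engine: {engine!r}")
    return [n for n in names if matches(pattern, n)]


names = ["B00001_001E", "B00002_001E"]

as_regex = filter_names(names, "B00002*", engine="re")
as_glob = filter_names(names, "B00002*", engine="fnmatch")
as_prefix = filter_names(names, "B00002", engine=lambda p, s: s.startswith(p))

print(as_regex)   # ['B00001_001E', 'B00002_001E']
print(as_glob)    # ['B00002_001E']
print(as_prefix)  # ['B00002_001E']
```

With glob-style matching, `*` means "any run of characters", which is what the original `B00002*` query expected.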
Thank you for clarifying this @ronnie-llamado. Some thoughts on this.
Since this is not really a bug, maybe we just update the examples (i.e., Notebooks) and docs, e.g.: https://github.com/cenpy-devs/cenpy/blob/fde2ad6c71d4a81aab6d8950d76c39b36b9d377c/cenpy/tools.py#L33 https://github.com/cenpy-devs/cenpy/blob/fde2ad6c71d4a81aab6d8950d76c39b36b9d377c/cenpy/tools.py#L42
I noticed that the `^P004` style syntax works, when it would seem that it shouldn't under standard `re` rules.
There is currently a note in the code acknowledging some weirdness: https://github.com/cenpy-devs/cenpy/blob/fde2ad6c71d4a81aab6d8950d76c39b36b9d377c/cenpy/remote.py#L288-L291
@dfolch, do you have any suggestions on which string pattern would be the best (most intuitive) for examples/docs?
Here's a quick snippet showing off some potential possibilities:
import cenpy

conn = cenpy.remote.APIConnection("ACSDT5Y2017")

# unintended variables returned
print('0', list(conn.varslike('B00002*').index))  # original
print('')

# intended variables returned
# (raw strings keep "\w" from being treated as an invalid string escape)
print('1', list(conn.varslike('B00002.*').index))  # B00002.*
print('2', list(conn.varslike(r'B00002\w+').index))  # B00002\w+
print('3', list(conn.varslike('B00002').index))  # B00002
print('4', list(conn.varslike('^B00002').index))  # ^B00002
print('5', list(conn.varslike(r'^B00002\w+$').index))  # ^B00002\w+$
Returns:
0 ['B00001_001E', 'B00002_001E']

1 ['B00002_001E']
2 ['B00002_001E']
3 ['B00002_001E']
4 ['B00002_001E']
5 ['B00002_001E']
Option 3 (`B00002`) is the friendliest, but doesn't fully utilize `re`. Since the Census variables are already formatted and `cenpy` just searches for a substring within the variable, this works but may not be as intuitive.
Your point is well taken that there is some mystery with Option 3. I didn't realize this query would return 166 items: `conn.varslike('1002')`.
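The surprise comes from matching anywhere in the name. A small sketch, assuming `re.search`-style matching; the identifiers below are illustrative names chosen to contain `1002` at different positions, not results fetched from the API.

```python
import re

# Illustrative identifiers only; not pulled from the Census API.
sample = ["B00002_001E", "B01002_001E", "B10020_001E", "B11002_001E", "B21002_001E"]

# An unanchored pattern matches wherever "1002" occurs in the name.
anywhere = [n for n in sample if re.search("1002", n)]

# Anchoring at the start and including the table prefix narrows the result.
anchored = [n for n in sample if re.search(r"^B01002_", n)]

print(len(anywhere))  # 4
print(anchored)       # ['B01002_001E']
```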
Since Option 0 is not a great `re` example and it's not simple substring matching, I think it should be changed.
Option 3 covers most use cases and doesn't require people to even think about `re`, so I would make this the standard in the docs and examples. Maybe insert an example somewhere showing that fancier `re` patterns are possible. For example, getting just the variables for females (`B01001_026` to `B01001_049`) from table `B01001`. There are some tables with a Puerto Rico specific counterpart (e.g., `B05001` vs. `B05001PR`) which could make an interesting `re` example.
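As a starting point for such a docs example, the two cases might look like the sketch below. It uses only the standard library on locally built names. The `E` estimate suffix, the assumption that table `B01001` runs from `_001` to `_049`, and the `B05001PR_001E` name format are illustrative assumptions, not verified against the API.

```python
import re

# Build candidate names locally: B01001_001E ... B01001_049E (assumed range).
b01001 = [f"B01001_{i:03d}E" for i in range(1, 50)]

# Females: 026-029 via "2[6-9]", 030-049 via "[34]\d".
female = re.compile(r"^B01001_0(2[6-9]|[34]\d)E$")
female_vars = [n for n in b01001 if female.search(n)]

# Tables with a Puerto Rico counterpart (name format assumed).
tables = ["B05001_001E", "B05001PR_001E"]
without_pr = [n for n in tables if re.search(r"^B05001_", n)]
only_pr = [n for n in tables if re.search(r"^B05001PR_", n)]
either = [n for n in tables if re.search(r"^B05001(PR)?_", n)]

print(female_vars[0], female_vars[-1], len(female_vars))  # B01001_026E B01001_049E 24
print(without_pr)  # ['B05001_001E']
print(only_pr)     # ['B05001PR_001E']
print(either)      # ['B05001_001E', 'B05001PR_001E']
```

The same patterns could then be passed to `varslike` or `filter_variables` once the name formats are confirmed.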
`B00001_001E` is being returned when not requested.

from cenpy import products

tucson = products.ACS(2017).from_place('Tucson, AZ', level='tract', variables=['B00002*'])
tucson