djsutherland / pummeler

Utilities to analyze ACS PUMS files, especially for distribution regression / ecological inference
MIT License

Subset using pd.DataFrame.query syntax #11

Closed flaxter closed 7 years ago

flaxter commented 8 years ago

In order to make predictions for demographic subgroups, we need the embeddings for those subgroups. So, e.g.:

python pummel featurize --subset "SEX == 2" regions regions/embedding_SEX

This will use pandas.DataFrame.query("SEX == 2") to restrict the featurization to just women.
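For anyone unfamiliar with the query syntax, here is a minimal sketch of what that filter does on a toy DataFrame. The column names mirror PUMS fields, but the values are made up for illustration, and this is not pummeler's actual code path.

```python
import pandas as pd

# Toy stand-in for PUMS person records; the values are invented.
df = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "AGEP": [34, 51, 27, 63],
    "PWGTP": [10, 20, 15, 5],
})

# Keep only the rows where the expression evaluates to True.
women = df.query("SEX == 2")
print(women)
```

Any expression that DataFrame.query accepts would work the same way, e.g. "SEX == 2 and AGEP >= 65".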

Note that this is a very time-consuming approach (e.g. in the case above, it'll take about half as long as the original featurization, and then you'll probably want to rerun it for SEX == 1), but I'm not sure there's a better way to do it without loading everything into memory.

djsutherland commented 8 years ago

I was thinking of doing a bunch of subsets at once, by passing different weights into the embedding. That would save re-doing the IO, at least. I like the idea of using pandas queries, though. Let me think about it a bit.
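A rough sketch of the weights idea, assuming the embedding is a weighted mean of per-person feature vectors: read each chunk once, then for every subset zero out the weights of rows that do not match. The function name `subset_weighted_means` and the toy data are hypothetical and are not taken from pummeler.

```python
import numpy as np
import pandas as pd

def subset_weighted_means(df, feats, weight_col, subsets):
    """Weighted mean of `feats` rows for each query in `subsets`.

    Hypothetical helper: `feats` is an (n_rows, n_features) array aligned
    with `df`, and `subsets` is a list of pandas query expressions.
    """
    w = df[weight_col].to_numpy(dtype=float)
    out = {}
    for q in subsets:
        # Boolean mask for this subset; the data is only read once.
        mask = df.eval(q).to_numpy(dtype=bool)
        sw = w * mask  # weights outside the subset become zero
        total = sw.sum()
        if total > 0:
            out[q] = (sw @ feats) / total
        else:
            out[q] = np.full(feats.shape[1], np.nan)
    return out

# Toy data; the values are invented for illustration.
df = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "AGEP": [34, 51, 27, 63],
    "PWGTP": [10, 20, 15, 5],
})
feats = df[["AGEP"]].to_numpy(dtype=float)
means = subset_weighted_means(df, feats, "PWGTP", ["SEX == 2", "SEX == 1"])
print(means)
```

The point of this design is that the expensive part (loading and featurizing each row) happens once, and each extra subset only costs a mask and a weighted sum.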

djsutherland commented 7 years ago

Multi-subset version done in 3db5c82; pass e.g. --subset "SEX == 2, SEX == 1" to get both men and women at the same time.
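As an illustration of the comma-separated form, one way such a string could be split into individual queries is shown below. This is an assumption about the parsing, not a description of what 3db5c82 actually does, and `split_subsets` is a hypothetical name.

```python
def split_subsets(arg):
    """Split a comma-separated --subset string into separate query strings."""
    # Naive split: a query that itself contains a comma,
    # such as "SEX in [1, 2]", would be broken apart by this.
    return [part.strip() for part in arg.split(",") if part.strip()]

queries = split_subsets("SEX == 2, SEX == 1")
print(queries)
```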